Finger tracking

Last updated
Finger tracking of two pianists' fingers playing the same piece (slow motion, no sound) [1]

In the field of gesture recognition and image processing, finger tracking is a high-resolution technique developed in 1969 that is employed to know the consecutive position of the fingers of the user and hence represent objects in 3D. In addition to that, the finger tracking technique is used as a tool of the computer, acting as an external device in our computer, similar to a keyboard and a mouse.

Contents

Introduction

The finger tracking system is focused on user-data interaction, where the user interacts with virtual data, by handling through the fingers the volumetric of a 3D object that we want to represent. This system was born based on the human-computer interaction problem. The objective is to allow the communication between them and the use of gestures and hand movements to be more intuitive, Finger tracking systems have been created. These systems track in real time the position in 3D and 2D of the orientation of the fingers of each marker and use the intuitive hand movements and gestures to interact.

Types of tracking

There are many options for the implementation of finger tracking, principally those used with or without an interface.

Tracking with interface

This system mostly uses inertial and optical motion capture systems.

Inertial motion capture gloves

Inertial motion capture systems are able to capture finger motion by reading the rotation of each finger segment in 3D space. Applying these rotations to kinematic chain, the whole human hand can be tracked in real time, without occlusion and wireless.

Hand inertial motion capture systems, like for example Synertial mocap gloves, use tiny IMU based sensors, located on each finger segment. Precise capture requires at least 16 sensors to be used. There are also mocap glove models with less sensors (13 / 7 sensors) for which the rest of the finger segments is interpolated (proximal segments) or extrapolated (distal segments). The sensors are typically inserted into textile glove which makes the use of the sensors more comfortable.

Inertial sensors can capture movement in all 3 directions, which means finger and thumb flexion, extension and abduction can be detected.

Hand skeleton

Since inertial sensors track only rotations, the rotations have to be applied to some hand skeleton in order to get proper output. To get precise output (for example to be able to touch the fingertips), the hand skeleton has to be properly scaled to match the real hand. For this purpose manual measurement of the hand or automatic measurement extraction can be used.

Hand position tracking

On the top of finger tracking, many users require positional tracking for the whole hand in space. Multiple methods can be used for this purpose:

  • Capturing the whole body using an inertial mocap system (hand skeleton is attached at the end of the body skeleton kinematic chain). The position of the palm is determined from the body.
  • Capturing position of the palm (forearm) using optical mocap system.
  • Capturing position of the palm (forearm) using other position tracking method, widely used in VR headsets (for example HTC Vive Lighthouse).
Disadvantages of inertial motion capture systems

Inertial sensors have two main disadvantages connected with finger tracking:

  • Problems capturing the absolute position of the hand in space.
  • Magnetic interference
  • The metal materials use to interfere with sensors. This problem may be noticeable mainly because hands are often in contact with different things, often made of metal. Current generations of motion capture gloves are able to withstand magnetic interference. The degree to which they are immune to magnetic interference depends on manufacturer, price range and number of sensors used in the mocap glove. Notably, stretch sensors are silicone-based capacitors that are completely unaffected by magnetic interference.

Optical motion capture systems

a tracking of the location of the markers and patterns in 3D is performed, the system identifies them and labels each marker according to the position of the user’s fingers. The coordinates in 3D of the labels of these markers are produced in real time with other applications.

Markers

Some of the optical systems, like Vicon or ART, are able to capture hand motion through markers. In each hand we have a marker per each “operative” finger. Three high-resolution cameras are responsible for capturing each marker and measure its positions. This will be only produced when the camera is able to see them. The visual markers, usually known as rings or bracelets, are used to recognize user gesture in 3D. In addition, as the classification indicates, these rings act as an interface in 2D.

Occlusion as an interaction method

The visual occlusion is a very intuitive method to provide a more realistic viewpoint of the virtual information in three dimensions. The interfaces provide more natural 3D interaction techniques over base 6.

Marker functionality

Markers operate through interaction points, which are usually already set and we have the knowledge about the regions. Because of that, it is not necessary to follow each marker all the time; the multipointers can be treated in the same way when there is only one operating pointer. To detect such pointers through an interaction, we enable ultrasound infrared sensors. The fact that many pointers can be handled as one, problems would be solved. In the case when we are exposed to operate under difficult conditions like bad illumination, motion blurs, malformation of the marker or occlusion. The system allows following the object, even though if some markers are not visible. Because of the spatial relationships of all the markers are known, the positions of the markers that are not visible can be computed by using the markers that are known. There are several methods for marker detection like border marker and estimated marker methods.

  • The Homer technique includes ray selection with direct handling: An object is selected and then its position and orientation are handled like if it was connected directly to the hand.
  • The Conner technique presents a set of 3D widgets that permit an indirect interaction with the virtual objects through a virtual widget that acts as an intermediary.
Fusing data with optical motion capture systems

Because of marker occlusion during capture, tracking fingers is the most challenging part for optical motion capture systems (like Vicon, Optitrack, ART, ..). Users of optical mocap systems claims that the most post-process work is usually due to finger capture. As the inertial mocap systems (if properly calibrated) are mostly without the need for post-process, the typical use for high end mocap users is to fuse data from inertial mocap systems (fingers) with optical mocap systems (body + position in space).
The process of fusing mocap data is based on matching time codes of each frame for inertial and optical mocap system data source. This way any 3rd party software (for example MotionBuilder, Blender) can apply motions from two sources, independently of the mocap method used.

Stretch sensor finger tracking

Stretch sensor enabled motion capture systems use flexible parallel plate capacitors to detect differences in capacitance when the sensors stretch, bend, shear or are subjected to pressure. Stretch sensors are commonly silicone-based, which means they are unaffected by magnetic interference, occlusion or positional drift (common in inertial systems). The robust and flexible qualities of these sensors leads to high fidelity finger tracking and feature in mocap gloves produced by StretchSense. [2]

Articulated hand tracking

Articulated hand tracking is simpler and less expensive than many methods because it only needs one camera. This simplicity results in less precision. It provides a new base for new interactions in the modeling, the control of the animation and the added realism. It uses a glove composed of a set of colors which are assigned according to the position of the fingers. This color test is limited to the vision system of the computers and based on the capture function and the position of the color, the position of the hand is known.

Tracking without interface

In terms of visual perception, the legs and hands can be modeled as articulated mechanisms, system of rigid bodies that are connected between them to articulations with one or more degrees of freedom. This model can be applied to a more reduced scale to describe hand motion and based on a wide scale to describe a complete body motion. A certain finger motion, for example, can be recognized from its usual angles and it does not depend on the position of the hand in relation to the camera.

Many tracking systems are based on a model focused on a problem of sequence estimation, where a sequence of images is given and a model of changing, we estimate the 3D configuration for each photo. All the possible hand configurations are represented by vectors on a state space, which codes the position of the hand and the angles of the finger’s joint. Each hand configuration generates a set of images through the detection of the borders of the occlusion of the finger’s joint. The estimation of each image is calculated by finding the state vector that better fits to the measured characteristics. The finger joints have the added 21 states more than the rigid body movement of the palms; this means that the cost computational of the estimation is increased. The technique consists of label each finger joint links is modeled as a cylinder. We do the axes at each joint and bisector of this axis is the projection of the joint. Hence we use 3 DOF, because there are only 3 degrees of movement.

In this case, it is the same as in the previous typology as there is a wide variety of deployment thesis on this subject. Therefore, the steps and treatment technique are different depending on the purpose and needs of the person who will use this technique. Anyway, we can say that a very general way and in most systems, you should carry out the following steps:

Other tracking techniques

It is also possible to perform active tracking of fingers. The Smart Laser Scanner is a marker-less finger tracking system using a modified laser scanner/projector developed at the University of Tokyo in 2003-2004. It is capable of acquiring three-dimensional coordinates in real time without the need of any image processing at all (essentially, it is a rangefinder scanner that instead of continuously scanning over the full field of view, restricts its scanning area to a very narrow window precisely the size of the target). Gesture recognition has been demonstrated with this system. The sampling rate can be very high (500 Hz), enabling smooth trajectories to be acquired without the need of filtering (such as Kalman).

Application

Definitely, the finger tracking systems are used to represent a virtual reality. However its application has gone to professional level 3D modeling, companies and projects directly in this case overturned. Thus such systems rarely have been used in consumer applications due to its high price and complexity. In any case, the main objective is to facilitate the task of executing commands to the computer via natural language or interacting gesture.

The objective is centered on the following idea computers should be easier in terms of usage if there is a possibility to operate through natural language or gesture interaction. The main application of this technique is to highlight the 3D design and animation, where software like Maya and 3D StudioMax employ these kinds of tools. The reason is to allow a more accurate and simple control of the instructions that we want to execute. This technology offers many possibilities, where the sculpture, building and modeling in 3D in real time through the use of a computer is the most important.

See also

Related Research Articles

Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.

<span class="mw-page-title-main">Computer mouse</span> Pointing device used to control a computer

A computer mouse is a hand-held pointing device that detects two-dimensional motion relative to a surface. This motion is typically translated into the motion of the pointer on a display, which allows a smooth control of the graphical user interface of a computer.

<span class="mw-page-title-main">Pointing device</span> Human interface device for computers

A pointing device is a human interface device that allows a user to input spatial data to a computer. CAD systems and graphical user interfaces (GUI) allow the user to control and provide data to the computer using physical gestures by moving a hand-held mouse or similar device across the surface of the physical desktop and activating switches on the mouse. Movements of the pointing device are echoed on the screen by movements of the pointer and other visual changes. Common gestures are point and click and drag and drop.

<span class="mw-page-title-main">Motion capture</span> Process of recording the movement of objects or people

Motion capture is the process of recording the movement of objects or people. It is used in military, entertainment, sports, medical applications, and for validation of computer vision and robots. In filmmaking and video game development, it refers to recording actions of human actors and using that information to animate digital character models in 2D or 3D computer animation. When it includes face and fingers or captures subtle expressions, it is often referred to as performance capture. In many fields, motion capture is sometimes called motion tracking, but in filmmaking and games, motion tracking usually refers more to match moving.

<span class="mw-page-title-main">Touchscreen</span> Input and output device

A touchscreen or touch screen is the assembly of both an input and output (display) device. The touch panel is normally layered on the top of an electronic visual display of an electronic device.

<span class="mw-page-title-main">Gesture recognition</span> Topic in computer science and language technology

Gesture recognition is an area of research and development in computer science and language technology concerned with the recognition and interpretation of human gestures. A subdiscipline of computer vision, it employs mathematical algorithms to interpret gestures. Gestures can originate from any bodily motion or state, but commonly originate from the face or hand. One area of the field is emotion recognition derived from facial expressions and hand gestures. Users can make simple gestures to control or interact with devices without physically touching them. Many approaches have been made using cameras and computer vision algorithms to interpret sign language, however, the identification and recognition of posture, gait, proxemics, and human behaviors is also the subject of gesture recognition techniques. Gesture recognition is a path for computers to begin to better understand and interpret human body language, previously not possible through text or unenhanced graphical (GUI) user interfaces.

In visual effects, match moving is a technique that allows the insertion of 2D elements, other live action elements or CG computer graphics into live-action footage with correct position, scale, orientation, and motion relative to the photographed objects in the shot. It also allows for the removal of live action elements from the live action shot. The term is used loosely to describe several different methods of extracting camera motion information from a motion picture. Sometimes referred to as motion tracking or camera solving, match moving is related to rotoscoping and photogrammetry. Match moving is sometimes confused with motion capture, which records the motion of objects, often human actors, rather than the camera. Typically, motion capture requires special cameras and sensors and a controlled environment. Match moving is also distinct from motion control photography, which uses mechanical hardware to execute multiple identical camera moves. Match moving, by contrast, is typically a software-based technology, applied after the fact to normal footage recorded in uncontrolled environments with an ordinary camera.

Video tracking is the process of locating a moving object over time using a camera. It has a variety of uses, some of which are: human-computer interaction, security and surveillance, video communication and compression, augmented reality, traffic control, medical imaging and video editing. Video tracking can be a time-consuming process due to the amount of data that is contained in video. Adding further to the complexity is the possible need to use object recognition techniques for tracking, a challenging problem in its own right.

<span class="mw-page-title-main">Wired glove</span> Input device for human–computer interaction

A wired glove is an input device for human–computer interaction worn like a glove.

Facial motion capture is the process of electronically converting the movements of a person's face into a digital database using cameras or laser scanners. This database may then be used to produce computer graphics (CG), computer animation for movies, games, or real-time avatars. Because the motion of CG characters is derived from the movements of real people, it results in a more realistic and nuanced computer character animation than if the animation were created manually.

Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts using image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observation with pose, and because of the variety of situations in which it would be useful.

Visual servoing, also known as vision-based robot control and abbreviated VS, is a technique which uses feedback information extracted from a vision sensor to control the motion of a robot. One of the earliest papers that talks about visual servoing was from the SRI International Labs in 1979.

A Hand-Over is a term used in the animation industry to refer to the process of adding finger and hand motion capture data to the pre-existing full-body motion capture data, using a hand motion capture device.

In computing, 3D interaction is a form of human-machine interaction where users are able to move and perform interaction in 3D space. Both human and machine process information where the physical position of elements in the 3D space is relevant.

iClone is a real-time 3D animation and rendering software program. Real-time playback is enabled by using a 3D videogame engine for instant on-screen rendering.

<span class="mw-page-title-main">Inertial measurement unit</span> Accelerometer-based navigational device

An inertial measurement unit (IMU) is an electronic device that measures and reports a body's specific force, angular rate, and sometimes the orientation of the body, using a combination of accelerometers, gyroscopes, and sometimes magnetometers. When the magnetometer is included, IMUs are referred to as IMMUs.

Project Digits is a Microsoft Research Project under Microsoft's computer science laboratory at the University of Cambridge; researchers from Newcastle University and University of Crete are also involved in this project. Project is led by David Kim a Microsoft Research PhD and also a PhD Student in computer science at Newcastle University. Digits is an input device which can be mounted on the wrist of human hand and it captures and displays a complete 3D graphical representation of the user's hand on screen without using any external sensing device or hand covering material like data gloves. This project aims to make gesture controlled interfaces completely hands free with greater mobility and accuracy. It allows user to interact with whatever hardware while moving from room to room or walking down the street without any line of sight connection with the hardware.

<span class="mw-page-title-main">Motion capture suit</span> Garment that records the body movements of the wearer

A motion capture suit is a wearable device that records the body movements of the wearer. Some of these suits also function as haptic suits.

<span class="mw-page-title-main">Pose tracking</span>

In virtual reality (VR) and augmented reality (AR), a pose tracking system detects the precise pose of head-mounted displays, controllers, other objects or body parts within Euclidean space. Pose tracking is often referred to as 6DOF tracking, for the six degrees of freedom in which the pose is often tracked.

References

  1. Goebl, W.; Palmer, C. (2013). Balasubramaniam, Ramesh (ed.). "Temporal Control and Hand Movement Efficiency in Skilled Music Performance". PLOS ONE. 8 (1): e50901. Bibcode:2013PLoSO...850901G. doi: 10.1371/journal.pone.0050901 . PMC   3536780 . PMID   23300946.
  2. "The world's leading motion capture glove". StretchSense. Retrieved 2020-11-24.