Facial motion capture

Facial motion capture is the process of electronically converting the movements of a person's face into a digital database using cameras or laser scanners. This database may then be used to produce computer graphics (CG) animation for movies, games, or real-time avatars. Because the motion of CG characters is derived from the movements of real people, the result is more realistic and nuanced computer character animation than if the animation were created manually.

A facial motion capture database describes the coordinates or relative positions of reference points on the actor's face. The capture may be in two dimensions, in which case the capture process is sometimes called "expression tracking", or in three dimensions. Two-dimensional capture can be achieved using a single camera and capture software. This produces less sophisticated tracking, and is unable to fully capture three-dimensional motions such as head rotation. Three-dimensional capture is accomplished using multi-camera rigs or laser marker systems. Such systems are typically far more expensive, complicated, and time-consuming to use. Two predominant technologies exist: marker-based and markerless tracking systems.
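
As a rough illustration of the two-dimensional case, the sketch below follows corner-like facial feature points from frame to frame using OpenCV's pyramidal Lucas-Kanade optical flow. This is a minimal sketch, not a production pipeline: the input file name is hypothetical, and a real system would restrict points to the face and handle tracking loss.

```python
# A minimal sketch of single-camera 2D expression tracking: corner-like
# feature points are detected once, then followed frame to frame with
# pyramidal Lucas-Kanade optical flow. "face.mp4" is a hypothetical
# input clip; note that 2D tracking cannot recover head rotation.
import cv2

cap = cv2.VideoCapture("face.mp4")
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Detect trackable points (in practice, restricted to the face region).
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=80,
                                 qualityLevel=0.01, minDistance=7)

trajectory = [points.reshape(-1, 2)]  # the 2D "capture database"
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Estimate where each point moved between consecutive frames.
    new_points, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, points, None)
    points = new_points[status.flatten() == 1].reshape(-1, 1, 2)
    trajectory.append(points.reshape(-1, 2))
    prev_gray = gray
```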

Facial motion capture is related to body motion capture, but is more challenging due to the higher resolution required to detect and track the subtle expressions produced by small movements of the eyes and lips. These movements are often less than a few millimeters, requiring even greater resolution and fidelity and different filtering techniques than those usually used in full-body capture. The additional constraints of the face also provide more opportunities for using models and rules.
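
The filtering trade-off can be sketched as follows, assuming marker trajectories stored as NumPy arrays in millimetres (the data here is synthetic): aggressive smoothing suppresses measurement noise, but settings appropriate for full-body capture would also erase the sub-millimetre lip and eye motion that facial capture must preserve.

```python
# A sketch of the filtering trade-off, assuming marker trajectories as
# (frames x 3) NumPy arrays in millimetres. The data is synthetic: a
# 2 mm lip movement plus 0.1 mm of measurement noise.
import numpy as np

def exponential_smooth(trajectory, alpha=0.6):
    """Exponentially smooth a trajectory; higher alpha preserves more
    of the high-frequency (subtle) motion."""
    out = np.empty_like(trajectory)
    out[0] = trajectory[0]
    for t in range(1, len(trajectory)):
        out[t] = alpha * trajectory[t] + (1 - alpha) * out[t - 1]
    return out

time = np.linspace(0.0, 1.0, 120)[:, None]
marker = np.hstack([2.0 * np.sin(2 * np.pi * time), 0 * time, 0 * time])
noisy = marker + np.random.normal(scale=0.1, size=marker.shape)

gentle = exponential_smooth(noisy, alpha=0.6)   # facial-scale filtering
heavy = exponential_smooth(noisy, alpha=0.1)    # body-scale filtering,
                                                # flattens subtle motion
```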

Facial expression capture is similar to facial motion capture. It is the process of using visual or mechanical means to manipulate computer-generated characters with input from human faces, or to recognize emotions from a user.

History

One of the first papers discussing performance-driven animation was published by Lance Williams in 1990. There, he describes "a means of acquiring the expressions of real faces, and applying them to computer-generated faces".[1]

Technologies

Marker-based

Traditional marker-based systems apply up to 350 markers to the actor's face and track the marker movement with high-resolution cameras. This approach has been used on movies such as The Polar Express and Beowulf to allow an actor such as Tom Hanks to drive the facial expressions of several different characters. Unfortunately, this is relatively cumbersome, and the smoothing and filtering applied to the marker data can make the actor's expressions appear over-driven. Next-generation systems such as CaptiveMotion utilize offshoots of the traditional marker-based system with higher levels of detail.
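
One step of such a pipeline, locating marker centroids in a single camera image, can be sketched simply: under infrared lighting, retro-reflective markers appear as bright blobs that can be thresholded and reduced to 2D coordinates. In the sketch below a synthetic frame stands in for real camera data.

```python
# Simplified sketch of marker detection in one camera frame: bright
# blobs (retro-reflective markers under IR lighting) are thresholded
# and reduced to centroid coordinates. A synthetic frame stands in for
# real camera data.
import cv2
import numpy as np

frame = np.zeros((480, 640), dtype=np.uint8)
for x, y in [(100, 200), (320, 240), (500, 150)]:   # fake marker positions
    cv2.circle(frame, (x, y), 5, 255, -1)

_, binary = cv2.threshold(frame, 220, 255, cv2.THRESH_BINARY)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)

centroids = []
for c in contours:
    m = cv2.moments(c)
    if m["m00"] > 0:                                 # skip degenerate blobs
        centroids.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
# A multi-camera rig would triangulate matching centroids into 3D.
```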

Active LED marker technology is currently being used to drive facial animation in real time and provide user feedback.

Markerless

Markerless technologies track natural features of the face, such as the nostrils, the corners of the lips and eyes, and wrinkles. This technology is discussed and demonstrated at CMU,[2] IBM,[3] the University of Manchester (where much of this work started with Tim Cootes,[4] Gareth Edwards and Chris Taylor) and elsewhere, using active appearance models, principal component analysis, eigen-tracking, deformable surface models and other techniques to track the desired facial features from frame to frame. This technology is much less cumbersome, and allows greater freedom of expression for the actor.
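
The statistical shape model at the core of this family of techniques can be sketched as follows: principal component analysis over a set of aligned landmark configurations yields a mean face shape plus a few modes of variation, and tracking then searches for the mode parameters that best explain each frame. In the sketch below, random data stands in for real annotated training shapes.

```python
# A sketch of a statistical shape model (the basis of active appearance
# models): PCA over aligned landmark sets gives a mean face shape plus
# a few principal modes of variation. Random data stands in for real,
# annotated training shapes (e.g. 68 landmarks per face).
import numpy as np

rng = np.random.default_rng(0)
shapes = rng.normal(size=(200, 2 * 68))   # flattened (x, y) landmarks

mean_shape = shapes.mean(axis=0)
centered = shapes - mean_shape
# Principal modes of variation via singular value decomposition.
_u, _s, components = np.linalg.svd(centered, full_matrices=False)

k = 10  # keep only the strongest modes

def reconstruct(params):
    """Generate a face shape from k mode parameters."""
    return mean_shape + params @ components[:k]

# A tracker searches, per frame, for the params (plus pose) that best
# match the image; the low-dimensional model keeps the search robust.
fitted_shape = reconstruct(rng.normal(size=k))
```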

These vision-based approaches can also track pupil movement, eyelids, occlusion of the teeth by the lips, and the tongue, all of which are obvious problems in most computer-animated features. Typical limitations of vision-based approaches are resolution and frame rate, both of which are becoming less of an issue as high-speed, high-resolution CMOS cameras become available from multiple sources.

The technology for markerless face tracking is related to that of facial recognition systems, since a facial recognition system can in principle be applied sequentially to each frame of video, resulting in face tracking. For example, the Neven Vision system[5] (formerly Eyematics, now acquired by Google) allowed real-time 2D face tracking with no person-specific training; their system was also amongst the best-performing facial recognition systems in the U.S. Government's 2002 Facial Recognition Vendor Test (FRVT). On the other hand, some recognition systems do not explicitly track expressions, or even fail on non-neutral expressions, and so are not suitable for tracking. Conversely, systems such as deformable surface models pool temporal information to disambiguate and obtain more robust results, and thus cannot be applied to a single photograph.
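
The detection-as-tracking idea can be sketched with a stock detector: running it independently on every frame yields a crude tracker with no temporal pooling, which also illustrates why a detector that fails on non-neutral expressions makes a poor tracker. The sketch below uses OpenCV's bundled Haar cascade; the video file name is hypothetical.

```python
# A sketch of detection-as-tracking: a stock face detector run
# independently on every frame yields a crude tracker with no temporal
# pooling. Uses OpenCV's bundled Haar cascade; "performance.mp4" is a
# hypothetical input clip.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
cap = cv2.VideoCapture("performance.mp4")

track = []  # one bounding box per frame, or None where detection fails
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5)
    # A detector tuned for neutral frontal faces may return nothing on
    # extreme expressions, which is exactly why it makes a poor tracker.
    track.append(tuple(faces[0]) if len(faces) else None)
```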

Markerless face tracking has progressed to commercial systems such as Image Metrics, which has been applied in movies such as The Matrix sequels[6] and The Curious Case of Benjamin Button. The latter used the Mova system to capture a deformable facial model, which was then animated with a combination of manual and vision-based tracking.[7] Avatar was another prominent performance-capture movie; however, it used painted markers rather than being markerless. Dynamixyz is another commercial system currently in use.

Markerless systems can be classified according to several distinguishing criteria: whether the capture is two- or three-dimensional, whether it runs in real time, whether the system is fully automatic, whether it requires projected or hidden patterns, and whether it requires per-person training.

To date, no system is ideal with respect to all these criteria. For example, the Neven Vision system was fully automatic and required no hidden patterns or per-person training, but was 2D. The Face/Off system[8] is 3D, automatic, and real-time, but requires projected patterns.

Facial expression capture

Technology

Digital video-based methods are becoming increasingly preferred, as mechanical systems tend to be cumbersome and difficult to use.

Using digital cameras, the input user's expressions are processed to estimate the head pose, which allows the software to then locate the eyes, nose and mouth. The face is initially calibrated using a neutral expression. Then, depending on the architecture, the eyebrows, eyelids, cheeks, and mouth can be processed as differences from the neutral expression. This is done, for instance, by looking for the edges of the lips and recognizing them as a unique object. Often contrast-enhancing makeup or markers are worn, or some other method is used to make the processing faster. As with voice recognition, even the best techniques are good only about 90 percent of the time, requiring a great deal of tweaking by hand, or tolerance for errors.
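
A minimal sketch of the neutral-calibration step follows, with synthetic landmarks; the landmark count and brow indices are illustrative, not taken from any particular system.

```python
# A sketch of neutral-expression calibration: incoming landmarks are
# expressed as signed offsets from a stored neutral pose, so brow
# raises or mouth opening show up as differences. The landmark count
# and brow indices are illustrative, not from any particular system.
import numpy as np

rng = np.random.default_rng(1)
neutral = rng.uniform(0.0, 1.0, size=(68, 2))  # stand-in calibration frame

def expression_deltas(current, neutral):
    """Per-landmark displacement from the calibrated neutral face."""
    return current - neutral

# Simulate a brow raise: assumed brow landmarks move up (negative y in
# image coordinates).
current = neutral.copy()
current[17:27, 1] -= 0.05
deltas = expression_deltas(current, neutral)
brow_raised = deltas[17:27, 1].mean() < -0.02  # simple thresholded feature
```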

Since computer-generated characters do not actually have muscles, different techniques are used to achieve the same results. Some animators create bones or objects that are controlled by the capture software and move them accordingly, which, when the character is rigged correctly, gives a good approximation. Since faces are very elastic, this technique is often mixed with others, with the weights adjusted for skin elasticity and other factors depending on the desired expressions.
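
One common form of this is linear blendshapes: the capture software streams per-expression weights, and the character's vertices are computed as the neutral mesh plus a weighted sum of sculpted offsets. A toy sketch with synthetic vertex data:

```python
# A toy sketch of linear blendshapes: the character's vertices are the
# neutral mesh plus a weighted sum of sculpted target offsets, with the
# weights streamed per frame by the capture software. The 4-vertex
# "mesh" and target shapes here are synthetic stand-ins.
import numpy as np

neutral = np.zeros((4, 3))                       # neutral vertex positions
targets = {
    "smile": neutral + [[0.0, 0.2, 0.0]],        # sculpted extreme poses
    "jaw_open": neutral + [[0.0, -0.5, 0.0]],
}

def apply_blendshapes(neutral, targets, weights):
    """Vertices = neutral + sum_i w_i * (target_i - neutral)."""
    out = neutral.copy()
    for name, weight in weights.items():
        out += weight * (targets[name] - neutral)
    return out

# One captured frame: a 70% smile with the jaw slightly open.
mesh = apply_blendshapes(neutral, targets, {"smile": 0.7, "jaw_open": 0.2})
```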

Usage

Several commercial companies are developing products that have been used in production, but they are rather expensive.

It is expected that facial expression capture will become a major input method for computer games once the software is available in an affordable format, but the necessary hardware and software do not yet exist, despite fifteen years of research producing results that are almost usable.

See also

Computer animation
Visual effects
Motion capture
Facial recognition system
Gesture recognition
Computer facial animation
Virtual cinematography
Digital puppetry
Human image synthesis
PlayStation Eye
Articulated body pose estimation
Image Metrics
Finger tracking
iClone
History of computer animation
Hao Li
Faceware Technologies
Visage SDK
Visage Technologies AB
Live2D

References

  1. Williams, Lance (1990), "Performance-Driven Facial Animation", Computer Graphics, vol. 24, no. 4, August 1990
  2. AAM Fitting Algorithms, archived 2017-02-22 at the Wayback Machine, from the Carnegie Mellon Robotics Institute
  3. "Real World Real-time Automatic Recognition of Facial Expressions" (PDF), archived from the original (PDF) on 2015-11-19, retrieved 2015-11-17
  4. Modelling and Search Software, archived 2009-02-23 at the Wayback Machine ("This document describes how to build, display and use statistical appearance models.")
  5. Wiskott, Laurenz; J.-M. Fellous; N. Krüger; C. von der Malsburg (1997), "Face recognition by elastic bunch graph matching", Computer Analysis of Images and Patterns, Lecture Notes in Computer Science, vol. 1296, Springer, pp. 456–463, CiteSeerX 10.1.1.18.1256, doi:10.1007/3-540-63460-6_150, ISBN 978-3-540-63460-7
  6. Borshukov, George; D. Piponi; O. Larsen; J. Lewis; C. Tempelaar-Lietz (2003), "Universal Capture - Image-based Facial Animation for 'The Matrix Reloaded'", ACM SIGGRAPH
  7. Barba, Eric; Steve Preeg (2009), "The Curious Face of Benjamin Button", presentation at the Vancouver ACM SIGGRAPH Chapter, 18 March 2009
  8. Weise, Thibaut; H. Li; L. Van Gool; M. Pauly (2009), "Face/Off: Live Facial Puppetry", ACM Symposium on Computer Animation