Human image synthesis

In this morph target animation system four "expressions" have been defined as deformations of the geometry of the model. Any combination of these four expressions can be used to animate the mouth shape. Similar controls can be applied to animate an entire human-like model.

Human image synthesis is technology that can be applied to make believable and even photorealistic renditions [1] [2] of human likenesses, moving or still. It has effectively existed since the early 2000s. Many films using computer-generated imagery have featured synthetic images of human-like characters digitally composited onto real or other simulated film material. Towards the end of the 2010s, deep learning artificial intelligence began to be applied to synthesize images and video that look like humans, without need for human assistance once the training phase has been completed, whereas the older 7D reflectance-capture route required massive amounts of human work.

Timeline of human image synthesis

Key breakthrough to photorealism: reflectance capture

ESPER LightCage is an example of a spherical light stage with a multi-camera setup around the sphere, suitable for capturing a 7D reflectance model.

In 1999, Paul Debevec et al. of USC performed the first known reflectance capture of the human face with their extremely simple light stage. They presented their method and results at SIGGRAPH 2000. [4]

A bidirectional scattering distribution function (BSDF) for human skin likeness requires both the BRDF and a special case of the BTDF in which light enters the skin, is transmitted, and exits the skin.

The scientific breakthrough required finding the subsurface light component (the simulation models glow slightly from within), which can be found using the knowledge that light reflected from the oil-to-air layer retains its polarization while the subsurface light loses its polarization. Thus, equipped only with a movable light source, a movable video camera, two polarizers, and a computer program doing extremely simple math, the last piece required to reach photorealism was acquired. [4]

For a believable result, both light reflected from the skin (BRDF) and light scattered within the skin (a special case of BTDF), which together make up the BSDF, must be captured and simulated.
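The polarization trick described above reduces to very simple arithmetic. A minimal sketch, assuming ideal polarizers and the convention that depolarized subsurface light splits evenly between the two polarizer orientations (the function name and conventions are illustrative, not from the original paper):

```python
import numpy as np

def separate_reflectance(parallel, cross):
    """Separate skin reflectance into specular and subsurface parts.

    Specular reflection off the oil-to-air layer preserves polarization,
    while subsurface scattering depolarizes the light. A cross-polarized
    image therefore contains roughly half of the depolarized subsurface
    light only; a parallel-polarized image contains the specular component
    plus the other half of the subsurface light.
    """
    subsurface = 2.0 * cross       # depolarized light splits evenly between orientations
    specular = parallel - cross    # what remains is the polarization-preserving part
    return specular, subsurface
```

Given two aligned captures of the same face under the same light, this recovers the two components per pixel, which can then be simulated separately.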

Capturing

Synthesis

The whole process of making digital look-alikes, i.e. characters so lifelike and realistic that they can be passed off as pictures of humans, is a very complex task, as it requires photorealistically modeling, animating, cross-mapping, and rendering the soft-body dynamics of the human appearance.

Synthesis with an actor and suitable algorithms is applied using powerful computers. The actor's part in the synthesis is to mimic human expressions in still-picture synthesis and human movement in motion-picture synthesis. Algorithms are needed to simulate laws of physics and physiology and to map the models and their appearance, movements and interaction accordingly.

Often both physics/physiology-based modeling (i.e. skeletal animation) and image-based modeling and rendering are employed in the synthesis part. Hybrid models employing both approaches have shown the best results in realism and ease of use. Morph target animation reduces the workload by giving higher-level control, where different facial expressions are defined as deformations of the model, which allows facial expressions to be tuned intuitively. Morph target animation can then morph the model between different defined facial expressions or body poses without much need for human intervention.
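The morph target idea above amounts to adding weighted per-vertex offsets to a base mesh. A minimal sketch, assuming a mesh stored as an (N, 3) array of vertex positions; names are illustrative, not from any particular animation package:

```python
import numpy as np

def blend_morph_targets(base, targets, weights):
    """Blend a base mesh with morph targets ("blend shapes").

    Each target is an absolute (N, 3) vertex array; its deformation is the
    per-vertex offset from the base mesh. The result is the base plus a
    weighted sum of those offsets, so a weight of 0 leaves the base
    unchanged and a weight of 1 reaches the target exactly.
    """
    result = np.asarray(base, dtype=float).copy()
    for target, weight in zip(targets, weights):
        result += weight * (np.asarray(target, dtype=float) - result * 0 - np.asarray(base, dtype=float))
    return result
```

For example, blending a neutral face toward a "smile" target at weight 0.5 moves every vertex halfway to the smile pose; four such expressions with independent weights give the kind of control shown in the figure caption above.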

Displacement mapping plays an important part in getting a realistic result with fine detail of skin such as pores and wrinkles as small as 100 µm.
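Displacement mapping offsets each vertex of a finely tessellated mesh along its surface normal by a height sampled from a scanned detail map. A minimal sketch under that assumption (function and parameter names are illustrative):

```python
import numpy as np

def displace_vertices(positions, normals, heights, scale=1.0):
    """Offset each vertex along its unit normal by a sampled height.

    positions, normals: (N, 3) arrays; heights: (N,) values sampled from a
    displacement map, e.g. a facial scan capturing pores and wrinkles.
    scale converts map units to scene units (for 100 µm detail, a very
    small value in metres).
    """
    heights = np.asarray(heights, dtype=float)
    return np.asarray(positions, dtype=float) + scale * heights[:, None] * np.asarray(normals, dtype=float)
```

Unlike bump or normal mapping, this changes the actual geometry, so pores and wrinkles correctly affect silhouettes and self-shadowing.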

Machine learning approach

In the late 2010s, machine learning, and more precisely generative adversarial networks (GANs), were used by NVIDIA to produce random yet photorealistic human-like portraits. The system, named StyleGAN, was trained on a database of 70,000 images from the image repository website Flickr. The source code was made public on GitHub in 2019. [30] Outputs of the generator network from random input were made publicly available on a number of websites. [31] [32]

Similarly, since 2018, deepfake technology has allowed GANs to swap faces between actors; combined with the ability to fake voices, GANs can thus generate fake videos that seem convincing. [33]

Applications

Main applications fall within the domains of stock photography, synthetic datasets, virtual cinematography, computer and video games and covert disinformation attacks. [34] [32] Some facial-recognition AI use images generated by other AI as synthetic data for training. [35]

Furthermore, some research suggests that it can have therapeutic effects, as "psychologists and counselors have also begun using avatars to deliver therapy to clients who have phobias, a history of trauma, addictions, Asperger’s syndrome or social anxiety." [36] The strong memory imprint and brain-activation effects caused by watching a digital look-alike avatar of oneself are dubbed the Doppelgänger effect. [36] The doppelgänger effect can also aid healing when a covert disinformation attack is exposed as such to its targets.

Speech synthesis has been verging on being completely indistinguishable from a recording of a real human's voice since the 2016 introduction of the voice editing and generation software Adobe Voco, a prototype slated to be part of Adobe Creative Suite, and DeepMind WaveNet, a prototype from Google. [37] The ability to steal and manipulate other people's voices raises obvious ethical concerns. [38]

At the 2018 Conference on Neural Information Processing Systems (NeurIPS), researchers from Google presented the work 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis', which transfers learning from speaker verification to text-to-speech synthesis and can be made to sound almost like anybody from a speech sample of only five seconds. [39]

Sourcing images for AI training raises questions of privacy, as the people used for training did not consent. [40]

Digital sound-alike technology has found its way into the hands of criminals: in 2019, Symantec researchers knew of three cases in which the technology had been used for crime. [41] [42]

This, coupled with the fact that techniques allowing near real-time counterfeiting of facial expressions in existing 2D video have been believably demonstrated (as of 2016), increases the stress on the disinformation situation. [14]

Related Research Articles

Rendering (computer graphics): Process of generating an image from a model

Rendering or image synthesis is the process of generating a photorealistic or non-photorealistic image from a 2D or 3D model by means of a computer program. The resulting image is referred to as the render. Multiple models can be defined in a scene file containing objects in a strictly defined language or data structure. The scene file contains geometry, viewpoint, textures, lighting, and shading information describing the virtual scene. The data contained in the scene file is then passed to a rendering program to be processed and output to a digital image or raster graphics image file. The term "rendering" is analogous to the concept of an artist's impression of a scene. The term "rendering" is also used to describe the process of calculating effects in a video editing program to produce the final video output.

Computer animation: Art of creating moving images using computers

Computer animation is the process used for digitally generating moving images. The more general term computer-generated imagery (CGI) encompasses both still images and moving images, while computer animation only refers to moving images. Modern computer animation usually uses 3D computer graphics.

Paul Debevec: American computer graphics professional

Paul Ernest Debevec is a researcher in computer graphics at the University of Southern California's Institute for Creative Technologies. He is best known for his work in finding, capturing and synthesizing the bidirectional scattering distribution function utilizing the light stages his research team constructed to find and capture the reflectance field over the human face, high-dynamic-range imaging and image-based modeling and rendering.

High-dynamic-range rendering: Rendering of computer graphics scenes using lighting calculations done in high dynamic range

High-dynamic-range rendering, also known as high-dynamic-range lighting, is the rendering of computer graphics scenes by using lighting calculations done in high dynamic range (HDR). This allows preservation of details that may be lost due to limiting contrast ratios. Video games and computer-generated movies and special effects benefit from this as it creates more realistic scenes than with more simplistic lighting models.

Texture synthesis is the process of algorithmically constructing a large digital image from a small digital sample image by taking advantage of its structural content. It is an object of research in computer graphics and is used in many fields, amongst others digital image editing, 3D computer graphics and post-production of films.

Computer facial animation is primarily an area of computer graphics that encapsulates methods and techniques for generating and animating images or models of a character face. The character can be a human, a humanoid, an animal, a legendary creature or character, etc. Due to its subject and output type, it is also related to many other scientific and artistic fields from psychology to traditional animation. The importance of human faces in verbal and non-verbal communication and advances in computer graphics hardware and software have caused considerable scientific, technological, and artistic interests in computer facial animation.

Virtual cinematography: Cinematographic techniques performed in a computer graphics environment

Virtual cinematography is the set of cinematographic techniques performed in a computer graphics environment. It includes a wide variety of subjects like photographing real objects, often with stereo or multi-camera setup, for the purpose of recreating them as three-dimensional objects and algorithms for the automated creation of real and simulated camera angles. Virtual cinematography can be used to shoot scenes from otherwise impossible camera angles, create the photography of animated films, and manipulate the appearance of computer-generated effects.

Facial motion capture is the process of electronically converting the movements of a person's face into a digital database using cameras or laser scanners. This database may then be used to produce computer graphics (CG), computer animation for movies, games, or real-time avatars. Because the motion of CG characters is derived from the movements of real people, it results in a more realistic and nuanced computer character animation than if the animation were created manually.

Digital puppetry is the manipulation and performance of digitally animated 2D or 3D figures and objects in a virtual environment that are rendered in real-time by computers. It is most commonly used in filmmaking and television production but has also been used in interactive theme park attractions and live theatre.

Bidirectional scattering distribution function: Mathematical function

The definition of the BSDF is not well standardized. The term was probably introduced in 1980 by Bartell, Dereniak, and Wolfe. Most often it is used to name the general mathematical function which describes the way in which the light is scattered by a surface. However, in practice, this phenomenon is usually split into the reflected and transmitted components, which are then treated separately as BRDF and BTDF.

A virtual human, virtual persona, or digital clone is the creation or re-creation of a human being in image and voice using computer-generated imagery and sound, that is often indistinguishable from the real actor.

The history of computer animation began as early as the 1940s and 1950s, when people began to experiment with computer graphics – most notably John Whitney. It was only by the early 1960s, when digital computers had become widely established, that new avenues for innovative computer graphics blossomed. Initially, uses were mainly for scientific, engineering and other research purposes, but artistic experimentation began to make its appearance by the mid-1960s – most notably by Dr. Thomas Calvert. By the mid-1970s, many such efforts were beginning to enter into public media. Much computer graphics at this time involved 2D imagery, though increasingly, as computer power improved, efforts to achieve 3D realism became the emphasis. By the late 1980s, photo-realistic 3D was beginning to appear in films, and by the mid-1990s it had developed to the point where 3D animation could be used for entire feature film production.

Light stage: Equipment used for shape, texture, reflectance and motion capture

A light stage is an active illumination system used for shape, texture, reflectance and motion capture often with structured light and a multi-camera setup.

Hao Li: American computer scientist and university professor

Hao Li is a computer scientist, innovator, and entrepreneur from Germany, working in the fields of computer graphics and computer vision. He is co-founder and CEO of Pinscreen, Inc, as well as associate professor of computer vision at the Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI). He was previously a Distinguished Fellow at the University of California, Berkeley, an associate professor of computer science at the University of Southern California, and former director of the Vision and Graphics Lab at the USC Institute for Creative Technologies. He was also a visiting professor at Weta Digital and a research lead at Industrial Light & Magic / Lucasfilm.

Michael F. Cohen: American computer scientist

Michael F. Cohen is an American computer scientist and researcher in computer graphics. He is currently a Senior Fellow at Meta in their Generative AI Group. He was a senior research scientist at Microsoft Research for 21 years until he joined Facebook in 2015. In 1998, he received the ACM SIGGRAPH CG Achievement Award for his work in developing radiosity methods for realistic image synthesis. He was elected a Fellow of the Association for Computing Machinery in 2007 for his "contributions to computer graphics and computer vision." In 2019, he received the ACM SIGGRAPH Steven A. Coons Award for Outstanding Creative Contributions to Computer Graphics for “his groundbreaking work in numerous areas of research—radiosity, motion simulation & editing, light field rendering, matting & compositing, and computational photography”.

Deepfakes are synthetic media that have been digitally manipulated to replace one person's likeness convincingly with that of another. The term can also refer to computer-generated images of human subjects that do not exist in real life. While the act of creating fake content is not new, deepfakes leverage tools and techniques from machine learning and artificial intelligence, including facial recognition algorithms and artificial neural networks such as variational autoencoders (VAEs) and generative adversarial networks (GANs). In turn, the field of image forensics develops techniques to detect manipulated images.

Video manipulation: Editing of video content with malicious intent

Video manipulation is a type of media manipulation that targets digital video using video processing and video editing techniques. The applications of these methods range from educational videos to videos aimed at (mass) manipulation and propaganda, a straightforward extension of the long-standing possibilities of photo manipulation. This form of computer-generated misinformation has contributed to fake news, and there have been instances when this technology was used during political campaigns. Other uses are less sinister; entertainment purposes and harmless pranks provide users with movie-quality artistic possibilities.

Synthetic media is a catch-all term for the artificial production, manipulation, and modification of data and media by automated means, especially through the use of artificial intelligence algorithms, such as for the purpose of misleading people or changing an original meaning. Synthetic media as a field has grown rapidly since the creation of generative adversarial networks, primarily through the rise of deepfakes as well as music synthesis, text generation, human image synthesis, speech synthesis, and more. Though experts use the term "synthetic media," individual methods such as deepfakes and text synthesis are sometimes not referred to as such by the media but instead by their respective terminology. Significant attention arose towards the field of synthetic media starting in 2017 when Motherboard reported on the emergence of AI-altered pornographic videos used to insert the faces of famous actresses. Potential hazards of synthetic media include the spread of misinformation, further loss of trust in institutions such as media and government, the mass automation of creative and journalistic jobs and a retreat into AI-generated fantasy worlds. Synthetic media is an applied form of artificial imagination.

Identity replacement technology is any technology that is used to cover up all or parts of a person's identity, either in real life or virtually. This can include face masks, face authentication technology, and deepfakes on the Internet that spread fake editing of videos and images. Face replacement and identity masking are used by either criminals or law-abiding citizens. Identity replacement tech, when operated on by criminals, leads to heists or robbery activities. Law-abiding citizens utilize identity replacement technology to prevent government or various entities from tracking private information such as locations, social connections, and daily behaviors.

Virtual human: Computer simulation of a person

A virtual human is a software simulation of a fictional character or real human being. Virtual humans have been created as tools and artificial companions in simulation, video games, film production, human factors, ergonomic and usability studies in various industries, the clothing industry, telecommunications (avatars), medicine, etc. These applications require domain-dependent simulation fidelity. A medical application might require an exact simulation of specific internal organs; the film industry requires the highest aesthetic standards, natural movements, and facial expressions; ergonomic studies require faithful body proportions for a particular population segment and realistic locomotion with constraints, etc.

References

  1. Physics-based muscle model for mouth shape control on IEEE Xplore (requires membership)
  2. Realistic 3D facial animation in virtual space teleconferencing on IEEE Xplore (requires membership)
  3. "Images de synthèse : palme de la longévité pour l'ombrage de Gouraud". 14 September 2008.
  4. Debevec, Paul (2000). "Acquiring the reflectance field of a human face". Proceedings of the 27th annual conference on Computer graphics and interactive techniques - SIGGRAPH '00. ACM. pp. 145–156. doi:10.1145/344779.344855. ISBN 978-1581132083. S2CID 2860203. Retrieved 24 May 2017.
  5. Pighin, Frédéric. "Siggraph 2005 Digital Face Cloning Course Notes" (PDF). Retrieved 24 May 2017.
  6. "St. Andrews Face Transformer". Futility Closet. 30 January 2005. Retrieved 7 December 2020.
  7. West, Marc (4 December 2007). "Changing the face of science". Plus Magazine. Retrieved 7 December 2020.
  8. Goddard, John (27 January 2010). "The many faces of race research". thestar.com. Retrieved 7 December 2020.
  9. In this TED talk video at 00:04:59 you can see two clips, one with the real Emily shot with a real camera and one with a digital look-alike of Emily, shot with a simulation of a camera – Which is which is difficult to tell. Bruce Lawmen was scanned using USC light stage 6 in still position and also recorded running there on a treadmill. Many, many digital look-alikes of Bruce are seen running fluently and natural looking at the ending sequence of the TED talk video.
  10. ReForm – Hollywood's Creating Digital Clones (YouTube). The Creators Project. 24 May 2017.
  11. Debevec, Paul. "Digital Ira SIGGRAPH 2013 Real-Time Live". Archived from the original on 21 February 2015. Retrieved 24 May 2017.
  12. "Scanning and printing a 3D portrait of President Barack Obama". University of Southern California. 2013. Retrieved 24 May 2017.
  13. Giardina, Carolyn (25 March 2015). "'Furious 7' and How Peter Jackson's Weta Created Digital Paul Walker". The Hollywood Reporter. Retrieved 24 May 2017.
  14. Thies, Justus (2016). "Face2Face: Real-time Face Capture and Reenactment of RGB Videos". Proc. Computer Vision and Pattern Recognition (CVPR), IEEE. Retrieved 24 May 2017.
  15. Suwajanakorn, Supasorn; Seitz, Steven; Kemelmacher-Shlizerman, Ira (2017), Synthesizing Obama: Learning Lip Sync from Audio, University of Washington, retrieved 2 March 2018
  16. Roettgers, Janko (21 February 2018). "Porn Producers Offer to Help Hollywood Take Down Deepfake Videos". Variety. Retrieved 28 February 2018.
  17. Takahashi, Dean (21 March 2018). "Epic Games shows off amazing real-time digital human with Siren demo". VentureBeat. Retrieved 10 September 2018.
  18. Kuo, Lily (9 November 2018). "World's first AI news anchor unveiled in China". TheGuardian.com. Retrieved 9 November 2018.
  19. Hamilton, Isobel Asher (9 November 2018). "China created what it claims is the first AI news anchor — watch it in action here". Business Insider. Retrieved 9 November 2018.
  20. Harwell, Drew (30 December 2018). "Fake-porn videos are being weaponized to harass and humiliate women: 'Everybody is a potential target'". The Washington Post. Retrieved 14 March 2019. In September [of 2018], Google added "involuntary synthetic pornographic imagery" to its ban list
  21. "NVIDIA Open-Sources Hyper-Realistic Face Generator StyleGAN". Medium.com. 9 February 2019. Retrieved 3 October 2019.
  22. Paez, Danny (13 February 2019). "This Person Does Not Exist Is the Best One-Off Website of 2019". Inverse. Retrieved 5 March 2018.
  23. "New state laws go into effect July 1". 24 June 2019.
  24. "§ 18.2–386.2. Unlawful dissemination or sale of images of another; penalty". Virginia. Retrieved 1 January 2020.
  25. "Relating to the creation of a criminal offense for fabricating a deceptive video with intent to influence the outcome of an election". Texas. 14 June 2019. Retrieved 2 January 2020. In this section, "deep fake video" means a video, created with the intent to deceive, that appears to depict a real person performing an action that did not occur in reality
  26. Johnson, R.J. (30 December 2019). "Here Are the New California Laws Going Into Effect in 2020". KFI. iHeartMedia. Retrieved 1 January 2020.
  27. Mihalcik, Carrie (4 October 2019). "California laws seek to crack down on deepfakes in politics and porn". CNET. Retrieved 14 October 2019.
  28. "China seeks to root out fake news and deepfakes with new online content rules". Reuters. 29 November 2019. Retrieved 8 December 2019.
  29. Statt, Nick (29 November 2019). "China makes it a criminal offense to publish deepfakes or fake news without disclosure". The Verge. Retrieved 8 December 2019.
  30. Synced (9 February 2019). "NVIDIA Open-Sources Hyper-Realistic Face Generator StyleGAN". Synced. Retrieved 4 August 2020.
  31. StyleGAN public showcase website
  32. Porter, Jon (20 September 2019). "100,000 free AI-generated headshots put stock photo companies on notice". The Verge. Retrieved 7 August 2020.
  33. "What Is a Deepfake?". PCMAG.com. March 2020. Retrieved 8 June 2020.
  34. Harwell, Drew. "Dating apps need women. Advertisers need diversity. AI companies offer a solution: Fake people". Washington Post. Retrieved 4 August 2020.
  35. "Neural Networks Need Data to Learn. Even If It's Fake". Quanta Magazine. 11 December 2023. Retrieved 18 June 2023.
  36. Murphy, Samantha (2023). "Scientific American: Your Avatar, Your Guide" (PDF). Scientific American / Stanford University. Retrieved 11 December 2023.
  37. "WaveNet: A Generative Model for Raw Audio". Deepmind.com. 8 September 2016. Archived from the original on 27 May 2017. Retrieved 24 May 2017.
  38. "Adobe Voco 'Photoshop-for-voice' causes concern". BBC.com. BBC. 7 November 2016. Retrieved 5 July 2016.
  39. Jia, Ye; Zhang, Yu; Weiss, Ron J. (12 June 2018), "Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis", Advances in Neural Information Processing Systems, 31: 4485–4495, arXiv:1806.04558, Bibcode:2018arXiv180604558J
  40. Rachel Metz (19 April 2019). "If your image is online, it might be training facial-recognition AI". CNN. Retrieved 4 August 2020.
  41. "Fake voices 'help cyber-crooks steal cash'". bbc.com. BBC. 8 July 2019. Retrieved 16 April 2020.
  42. Drew, Harwell (16 April 2020). "An artificial-intelligence first: Voice-mimicking software reportedly used in a major theft". Washington Post. Retrieved 8 September 2019.