Articulated body pose estimation

Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts, using image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observations to pose, and because of the variety of situations in which it would be useful. [1] [2]

Description

Perception of human beings in their surrounding environment is an important capability that robots must possess. If a person uses gestures to point to a particular object, the interacting machine should be able to understand the situation in a real-world context. Pose estimation is thus an important and challenging problem in computer vision, and many algorithms have been applied to it over the last two decades. Many solutions involve training complex models on large data sets.

Pose estimation is a difficult problem and an active subject of research because the human body has 244 degrees of freedom with 230 joints. Although not all movements between joints are evident, the human body can be treated as composed of 10 large parts with 20 degrees of freedom. Algorithms must account for large variability introduced by differences in appearance due to clothing, body shape, size, and hairstyle. Additionally, the results may be ambiguous due to partial occlusions from self-articulation, such as a person's hand covering their face, or occlusions from external objects. Finally, most algorithms estimate pose from monocular (two-dimensional) images taken with a normal camera; these images lack the three-dimensional information of an actual body pose, leading to further ambiguities. Other issues include varying lighting and camera configurations, and the difficulties are compounded if there are additional performance requirements. There is recent work in this area wherein images from RGBD cameras provide information about color and depth. [3]

Sensors

The typical articulated body pose estimation system involves a model-based approach, in which the pose estimation is achieved by maximizing/minimizing a similarity/dissimilarity between an observation (input) and a template model. Different kinds of sensors have been explored for use in making the observation, including the following:

- Visible-wavelength imagery
- Long-wave thermal infrared imagery [4]
- Time-of-flight imagery
- Laser range scanner imagery

These sensors produce intermediate representations that are directly used by the model. The representations include the following:

- Image appearance
- Voxel (volume element) reconstruction
- 3D point clouds, and sum of Gaussian kernels [5]
- 3D surface meshes

Classical models

Part models

The basic idea of the part-based model can be attributed to the human skeleton. Any object having the property of articulation can be broken down into smaller parts wherein each part can take different orientations, resulting in different articulations of the same object. Different scales and orientations of the main object correspond to scales and orientations of its parts. To formulate the model in mathematical terms, the parts are connected to each other with springs; the model is therefore also known as a spring model. The degree of closeness between parts is accounted for by the compression and expansion of the springs. There are geometric constraints on the orientation of the springs. For example, the limbs of the legs cannot rotate 360 degrees, so parts cannot take such extreme orientations. This reduces the number of possible permutations. [6]

The spring model forms a graph G(V, E) where the nodes V correspond to the parts and the edges E represent the springs connecting neighboring parts. Each location in the image can be indexed by the $x$ and $y$ coordinates of the pixel. Let $\mathbf{p}_i(x, y)$ be the point at the $i$-th location. Then the cost associated with the spring joining the $i$-th and the $j$-th point is given by $s_{ij}(\mathbf{p}_i, \mathbf{p}_j) = s_{ij}(\mathbf{p}_i - \mathbf{p}_j)$. Hence the total cost associated with placing $l$ components at locations $\mathbf{P}_l$ is given by

$$S(\mathbf{P}_l) = \sum_{i=1}^{l} \sum_{j=1}^{l} s_{ij}(\mathbf{p}_i, \mathbf{p}_j)$$

The above equation simply represents the spring model used to describe body pose. To estimate pose from images, a cost or energy function must be minimized. This energy function consists of two terms: the first measures how well each component matches the image data, and the second measures how well the oriented (deformed) parts match one another, thus accounting for articulation along with object detection. [7]
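
To make the two-term energy concrete, the following Python sketch evaluates it for one candidate placement of parts. It is a minimal illustration with hand-rolled cost structures, not code from the cited papers:

```python
import numpy as np

# Illustrative evaluation of the two-term pictorial-structures energy:
#   E(P) = sum_i match(i, p_i) + sum_{(i,j) in E} deform(p_i, p_j)

def spring_energy(placements, match_cost, springs, rest_offsets, stiffness=1.0):
    """placements: dict part -> (x, y) pixel location of that part.
    match_cost: dict part -> 2D array of appearance mismatch per pixel.
    springs: list of (i, j) part pairs forming the edges E of G(V, E).
    rest_offsets: dict (i, j) -> ideal offset vector between the two parts."""
    energy = 0.0
    for part, (x, y) in placements.items():
        energy += match_cost[part][y, x]            # unary term: image evidence
    for i, j in springs:
        d = np.subtract(placements[j], placements[i]) - rest_offsets[(i, j)]
        energy += stiffness * float(np.dot(d, d))   # pairwise term: spring stretch
    return energy
```

In practice, when the graph G is a tree, this energy can be minimized efficiently over all placements by dynamic programming. [7]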

Part models, also known as pictorial structures, are one of the basic models on which other efficient models are built by slight modification. One such example is the flexible mixture model, which reduces the database of hundreds or thousands of deformed parts by exploiting the notion of local rigidity. [8]

Articulated model with quaternion

The kinematic skeleton is constructed by a tree-structured chain. [9] Each rigid body segment has its local coordinate system that can be transformed to the world coordinate system via a 4×4 transformation matrix $T_l$,

$$T_l = T_{\operatorname{par}(l)}\, R_l,$$

where $R_l$ denotes the local transformation from body segment $S_l$ to its parent $\operatorname{par}(S_l)$. Each joint in the body has 3 degrees of freedom (DoF) of rotation. Given a transformation matrix $T_l$, the joint position at the T-pose can be transferred to its corresponding position in world coordinates. In many works, the 3D joint rotation is expressed as a normalized quaternion due to its continuity, which facilitates gradient-based optimization in parameter estimation.
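
The following Python sketch illustrates this construction, assuming segments are numbered so that parents precede children; the (w, x, y, z) quaternion convention and function names are illustrative rather than taken from any particular paper:

```python
import numpy as np

def quat_to_transform(q):
    """Convert a quaternion (w, x, y, z) into a 4x4 homogeneous rotation matrix,
    normalizing first so gradient-based updates stay on valid rotations."""
    w, x, y, z = np.asarray(q, dtype=float) / np.linalg.norm(q)
    T = np.eye(4)
    T[:3, :3] = [
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ]
    return T

def world_transforms(parent, local_T):
    """Compose local 4x4 transforms down the tree-structured kinematic chain.
    parent[l] is the parent index of segment l (-1 for the root); segments
    are assumed ordered so that every parent precedes its children."""
    world = []
    for l, T in enumerate(local_T):
        world.append(T if parent[l] < 0 else world[parent[l]] @ T)
    return world
```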

Deep learning based models

Since about 2016, deep learning has emerged as the dominant method for performing accurate articulated body pose estimation. Rather than building an explicit model for the parts as above, the appearances of the joints and relationships between the joints of the body are learned from large training sets. Models generally focus on extracting the 2D positions of joints (keypoints), the 3D positions of joints, or the 3D shape of the body from either a single or multiple images.

Supervised

2D joint positions

The first deep learning models that emerged focused on extracting the 2D positions of human joints in an image. Such models take in an image and pass it through a convolutional neural network to obtain a series of heatmaps (one for each joint) which take on high values where joints are detected. [10] [11]
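
A minimal sketch of how joint locations might be read off such heatmaps follows; the peak-picking and thresholding scheme here is an illustrative assumption, not a fixed standard:

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps, threshold=0.1):
    """heatmaps: array of shape (num_joints, height, width) from the network.
    Returns one (x, y, confidence) triple per joint, or None for weak peaks."""
    joints = []
    for hm in heatmaps:
        y, x = np.unravel_index(np.argmax(hm), hm.shape)  # location of the peak
        score = float(hm[y, x])
        joints.append((int(x), int(y), score) if score >= threshold else None)
    return joints
```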

When there are multiple people in an image, two main techniques have emerged for grouping joints within each person. In the first, "bottom-up" approach, the neural network is trained to also generate "part affinity fields" which indicate the location of limbs. Using these fields, joints can be grouped limb by limb by solving a series of assignment problems, as sketched below. [11] In the second, "top-down" approach, an additional network is used to first detect people in the image, and the pose estimation network is then applied separately to each detected person. [12]
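
Each per-limb assignment problem can be solved with the Hungarian algorithm; the sketch below uses SciPy's linear_sum_assignment on a toy affinity matrix whose scores are made-up illustrative values:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy affinity matrix: scores[i, j] stands in for the part-affinity field
# integrated along the segment joining shoulder candidate i to elbow
# candidate j (higher = more likely to belong to the same person's limb).
scores = np.array([[0.9, 0.1, 0.0],
                   [0.2, 0.8, 0.1],
                   [0.0, 0.1, 0.7]])

# The Hungarian algorithm maximizes total affinity (negate to minimize).
rows, cols = linear_sum_assignment(-scores)
limbs = list(zip(rows.tolist(), cols.tolist()))  # [(0, 0), (1, 1), (2, 2)]
```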

3D joint positions

With the advent of multiple datasets with human pose annotated in multiple views, [13] [14] models which detect 3D joint positions became more popular. These again fall into two categories. In the first, a neural network is used to detect 2D joint positions from each view, and these detections are then triangulated to obtain 3D joint positions. [15] The 2D network may be refined to produce better detections based on the 3D data. [16] Furthermore, such approaches often apply filters in both 2D and 3D to refine the detected points. [17] [18] In the second, a neural network is trained end-to-end to predict 3D joint positions directly from a set of images, without intermediate 2D joint detections. Such approaches often project image features into a cube and then use a 3D convolutional neural network to predict a 3D heatmap for each joint. [19] [16] [20]
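
Triangulation of a joint from its 2D detections is commonly done with the direct linear transform (DLT). A minimal sketch, assuming known 3×4 projection matrices for each calibrated camera:

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d):
    """Direct linear transform: recover one 3D joint from 2D detections.
    proj_mats: list of 3x4 camera projection matrices, one per view.
    points_2d: list of (x, y) detections of the same joint in each view."""
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        rows.append(x * P[2] - P[0])  # each view contributes two
        rows.append(y * P[2] - P[1])  # linear constraints on X
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]                        # null vector = homogeneous solution
    return X[:3] / X[3]
```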

3D shape

Concurrently with the work above, scientists have been working on estimating the full 3D shape of a human or animal from a set of images. Most of the work is based on estimating the appropriate pose of the skinned multi-person linear (SMPL) model [21] within an image. Variants of the SMPL model for other animals have also been developed. [22] [23] [24] Generally, some keypoints and a silhouette are detected for each animal within the image, and the parameters of the 3D shape model are then fit to match the position of the keypoints and the silhouette.
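
A heavily simplified sketch of such model fitting follows. Here model_keypoints and project are placeholders for a real body model and camera, and the keypoint-only loss omits the silhouette term for brevity:

```python
import numpy as np
from scipy.optimize import minimize

def fit_body_model(params0, model_keypoints, project, detected_2d, w_kp=1.0):
    """Fit body-model parameters by minimizing keypoint reprojection error.
    model_keypoints(params) -> (J, 3) joint locations of the body model.
    project(points3d) -> (J, 2) image projection of those joints.
    Both callables are placeholders for a real body model and camera."""
    def loss(params):
        residual = project(model_keypoints(params)) - detected_2d
        return w_kp * float(np.sum(residual ** 2))
    return minimize(loss, params0, method="L-BFGS-B").x
```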

Unsupervised

The above algorithms all rely on annotated images, which can be time-consuming to produce. To address this issue, computer vision researchers have developed new algorithms which can learn 3D keypoints given only annotated 2D images from a single view or identify keypoints given videos without any annotations.

Applications

Assisted living

Personal care robots may be deployed in future assisted living homes. For these robots, high-accuracy human detection and pose estimation are necessary to perform a variety of tasks, such as fall detection. Additionally, this application has a number of performance constraints. [citation needed]

Character animation

Traditionally, character animation has been a manual process. However, poses can be synced directly to a real-life actor through specialized pose estimation systems. Older systems relied on markers or specialized suits. Recent advances in pose estimation and motion capture have enabled markerless applications, sometimes in real time. [25]

Intelligent driver assisting system

Car accidents account for about two percent of deaths globally each year, so an intelligent system that tracks driver pose may be useful for emergency alerts. [dubious] Along the same lines, pedestrian detection algorithms have been used successfully in autonomous cars, enabling the car to make smarter decisions. [citation needed]

Video games

Commercially, pose estimation has been used in the context of video games, popularized with the Microsoft Kinect sensor (a depth camera). These systems track the user to render their avatar in-game, in addition to performing tasks like gesture recognition to enable the user to interact with the game. As such, this application has a strict real-time requirement. [26]

Medical applications

Pose estimation has been used to detect postural issues such as scoliosis by analyzing abnormalities in a patient's posture, [27] in physical therapy, and in the study of the cognitive brain development of young children by monitoring motor functionality. [28]

Other applications

Other applications include video surveillance, animal tracking and behavior understanding, sign language detection, advanced human–computer interaction, and markerless motion capturing.

A commercially successful but specialized computer vision-based articulated body pose estimation technique is optical motion capture. This approach involves placing markers on the individual at strategic locations to capture the 6 degrees of freedom of each body part.

Research groups

A number of groups and companies are researching pose estimation, including groups at Brown University, Carnegie Mellon University, MPI Saarbruecken, Stanford University, the University of California, San Diego, the University of Toronto, the École Centrale Paris, ETH Zurich, National University of Sciences and Technology (NUST), [29] the University of California, Irvine and Polytechnic University of Catalonia.

Companies

At present, several companies are working on articulated body pose estimation.

References

  1. Moeslund, Thomas B.; Granum, Erik (2001-03-01). "A Survey of Computer Vision-Based Human Motion Capture". Computer Vision and Image Understanding. 81 (3): 231–268. doi:10.1006/cviu.2000.0897. ISSN   1077-3142.
  2. "Survey of Advances in Computer Vision-based Human Motion Capture (2006)". Archived from the original on 2008-03-02. Retrieved 2007-09-15.
  3. Droeschel, David, and Sven Behnke. "3D body pose estimation using an adaptive person model for articulated ICP." Intelligent Robotics and Applications. Springer Berlin Heidelberg, 2011. 157–167.
  4. Han, J.; Gaszczak, A.; Maciol, R.; Barnes, S.E.; Breckon, T.P. (September 2013). "Human Pose Classification within the Context of Near-IR Imagery Tracking" (PDF). In Zamboni, Roberto; Kajzar, Francois; Szep, Attila A.; Burgess, Douglas; Owen, Gari (eds.). Proc. SPIE Optics and Photonics for Counterterrorism, Crime Fighting and Defence. Optics and Photonics for Counterterrorism, Crime Fighting and Defence IX; and Optical Materials and Biomaterials in Security and Defence Systems Technology X. Vol. 8901. SPIE. pp. 89010E. CiteSeerX   10.1.1.391.380 . doi:10.1117/12.2028375. S2CID   17034080 . Retrieved 5 November 2013.
  5. M. Ding and G. Fan, "Generalized Sum of Gaussians for Real-Time Human Pose Tracking from a Single Depth Sensor" 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), Jan 2015
  6. Fischler, Martin A., and Robert A. Elschlager. "The representation and matching of pictorial structures." IEEE Transactions on Computers 1 (1973): 67–92.
  7. Felzenszwalb, Pedro F., and Daniel P. Huttenlocher. "Pictorial structures for object recognition." International Journal of Computer Vision 61.1 (2005): 55–79.
  8. Yang, Yi, and Deva Ramanan. "Articulated pose estimation with flexible mixtures-of-parts." Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
  9. M. Ding and G. Fan, "Articulated and Generalized Gaussian Kernel Correlation for Human Pose Estimation" IEEE Transactions on Image Processing, Vol. 25, No. 2, Feb 2016
  10. Insafutdinov, Eldar; Pishchulin, Leonid; Andres, Bjoern; Andriluka, Mykhaylo; Schiele, Bernt (2016), "DeeperCut: A Deeper, Stronger, and Faster Multi-person Pose Estimation Model", Computer Vision – ECCV 2016, Lecture Notes in Computer Science, Cham: Springer International Publishing, vol. 9910, pp. 34–50, arXiv: 1605.03170 , doi:10.1007/978-3-319-46466-4_3, ISBN   978-3-319-46465-7, S2CID   6736694 , retrieved 2021-06-30
  11. Cao, Zhe; Simon, Tomas; Wei, Shih-En; Sheikh, Yaser (July 2017). "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields". 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE. pp. 1302–1310. arXiv: 1611.08050 . doi:10.1109/cvpr.2017.143. ISBN   978-1-5386-0457-1. S2CID   16224674.
  12. Fang, Hao-Shu; Xie, Shuqin; Tai, Yu-Wing; Lu, Cewu (October 2017). "RMPE: Regional Multi-person Pose Estimation". 2017 IEEE International Conference on Computer Vision (ICCV). IEEE. pp. 2353–2362. arXiv: 1612.00137 . doi:10.1109/iccv.2017.256. ISBN   978-1-5386-1032-9. S2CID   6529517.
  13. Ionescu, Catalin; Papava, Dragos; Olaru, Vlad; Sminchisescu, Cristian (July 2014). "Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments". IEEE Transactions on Pattern Analysis and Machine Intelligence. 36 (7): 1325–1339. doi:10.1109/tpami.2013.248. ISSN   0162-8828. PMID   26353306. S2CID   4244548.
  14. Sigal, Leonid; Balan, Alexandru O.; Black, Michael J. (2009-08-05). "HumanEva: Synchronized Video and Motion Capture Dataset and Baseline Algorithm for Evaluation of Articulated Human Motion". International Journal of Computer Vision. 87 (1–2): 4–27. doi:10.1007/s11263-009-0273-6. ISSN   0920-5691. S2CID   11279201.
  15. Nath, Tanmay; Mathis, Alexander; Chen, An Chi; Patel, Amir; Bethge, Matthias; Mathis, Mackenzie Weygandt (2018-11-24). "Using DeepLabCut for 3D markerless pose estimation across species and behaviors". bioRxiv: 476531. doi:10.1101/476531. S2CID   92206469 . Retrieved 2021-06-30.
  16. Iskakov, Karim; Burkov, Egor; Lempitsky, Victor; Malkov, Yury (October 2019). "Learnable Triangulation of Human Pose". 2019 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE. pp. 7717–7726. arXiv: 1905.05754 . doi:10.1109/iccv.2019.00781. ISBN   978-1-7281-4803-8. S2CID   153312868.
  17. Karashchuk, Pierre; Rupp, Katie L.; Dickinson, Evyn S.; Sanders, Elischa; Azim, Eiman; Brunton, Bingni W.; Tuthill, John C. (2020-05-29). "Anipose: a toolkit for robust markerless 3D pose estimation". bioRxiv. 36 (13). doi: 10.1101/2020.05.26.117325 . PMC   8498918 . PMID   34592148. S2CID   219167984.
  18. Günel, Semih; Rhodin, Helge; Morales, Daniel; Campagnolo, João; Ramdya, Pavan; Fua, Pascal (2019-10-04). O'Leary, Timothy; Calabrese, Ronald L; Shaevitz, Josh W (eds.). "DeepFly3D, a deep learning-based approach for 3D limb and appendage tracking in tethered, adult Drosophila". eLife. 8: e48571. doi: 10.7554/eLife.48571 . ISSN   2050-084X. PMC   6828327 . PMID   31584428.
  19. Dunn, Timothy W.; Marshall, Jesse D.; Severson, Kyle S.; Aldarondo, Diego E.; Hildebrand, David G. C.; Chettih, Selmaan N.; Wang, William L.; Gellis, Amanda J.; Carlson, David E.; Aronov, Dmitriy; Freiwald, Winrich A. (2021-04-19). "Geometric deep learning enables 3D kinematic profiling across species and environments". Nature Methods. 18 (5): 564–573. doi:10.1038/s41592-021-01106-6. ISSN   1548-7091. PMC   8530226 . PMID   33875887. S2CID   233310558.
  20. Zimmermann, Christian; Schneider, Artur; Alyahyay, Mansour; Brox, Thomas; Diester, Ilka (2020-02-27). "FreiPose: A Deep Learning Framework for Precise Animal Motion Capture in 3D Spaces". bioRxiv. doi:10.1101/2020.02.27.967620. S2CID   213583372 . Retrieved 2021-06-30.
  21. Loper, Matthew; Mahmood, Naureen; Romero, Javier; Pons-Moll, Gerard; Black, Michael J. (2015-11-04). "SMPL". ACM Transactions on Graphics. 34 (6): 1–16. doi:10.1145/2816795.2818013. ISSN   0730-0301. S2CID   229365481.
  22. Badger, Marc; Wang, Yufu; Modh, Adarsh; Perkes, Ammon; Kolotouros, Nikos; Pfrommer, Bernd G.; Schmidt, Marc F.; Daniilidis, Kostas (2020), "3D Bird Reconstruction: A Dataset, Model, and Shape Recovery from a Single View", Computer Vision – ECCV 2020, Lecture Notes in Computer Science, Cham: Springer International Publishing, vol. 12363, pp. 1–17, arXiv: 2008.06133 , doi:10.1007/978-3-030-58523-5_1, ISBN   978-3-030-58522-8, PMC   9273110 , PMID   35822859, S2CID   221135758 , retrieved 2021-06-30
  23. Zuffi, Silvia; Kanazawa, Angjoo; Black, Michael J. (June 2018). "Lions and Tigers and Bears: Capturing Non-rigid, 3D, Articulated Shape from Images". 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE. pp. 3955–3963. doi:10.1109/cvpr.2018.00416. ISBN   978-1-5386-6420-9. S2CID   46907802.
  24. Biggs, Benjamin; Roddick, Thomas; Fitzgibbon, Andrew; Cipolla, Roberto (2019), "Creatures Great and SMAL: Recovering the Shape and Motion of Animals from Video", Computer Vision – ACCV 2018, Lecture Notes in Computer Science, Cham: Springer International Publishing, vol. 11365, pp. 3–19, arXiv: 1811.05804 , doi:10.1007/978-3-030-20873-8_1, ISBN   978-3-030-20872-1, S2CID   53305772 , retrieved 2021-06-30
  25. Dent, Steven. "What you need to know about 3D motion capture". Engadget. AOL Inc. Retrieved 31 May 2017.
  26. Kohli, Pushmeet; Shotton, Jamie. "Key Developments in Human Pose Estimation for Kinect" (PDF). Microsoft. Retrieved 31 May 2017.
  27. Aroeira, Rozilene Maria C., Estevam B. de Las Casas, Antônio Eustáquio M. Pertence, Marcelo Greco, and João Manuel R.S. Tavares. “Non-Invasive Methods of Computer Vision in the Posture Evaluation of Adolescent Idiopathic Scoliosis.” Journal of Bodywork and Movement Therapies 20, no. 4 (October 2016): 832–43. https://doi.org/10.1016/j.jbmt.2016.02.004.
  28. Khan, Muhammad Hassan, Julien Helsper, Muhammad Shahid Farid, and Marcin Grzegorzek. “A Computer Vision-Based System for Monitoring Vojta Therapy.” International Journal of Medical Informatics 113 (May 2018): 85–95. https://doi.org/10.1016/j.ijmedinf.2018.02.010.
  29. "NUST-SMME RISE Research Center".