Motion analysis is used in computer vision, image processing, high-speed photography and machine vision. It studies methods and applications in which two or more consecutive images from an image sequence, e.g., produced by a video camera or high-speed camera, are processed to produce information based on the apparent motion in the images. In some applications the camera is fixed relative to the scene and objects move around in it; in others the scene is more or less fixed and the camera moves; and in some cases both the camera and the scene are moving.
The motion analysis processing can in the simplest case be to detect motion, i.e., find the points in the image where something is moving. More complex types of processing can be to track a specific object in the image over time, to group points that belong to the same rigid object that is moving in the scene, or to determine the magnitude and direction of the motion of every point in the image. The information that is produced is often related to a specific image in the sequence, corresponding to a specific time-point, but then depends also on the neighboring images. This means that motion analysis can produce time-dependent information about motion.
Applications of motion analysis can be found in rather diverse areas, such as surveillance, medicine, film industry, automotive crash safety, [1] ballistic firearm studies, [2] biological science, [3] flame propagation, [4] and navigation of autonomous vehicles to name a few examples.
A video camera can be seen as an approximation of a pinhole camera, which means that each point in the image is illuminated by some (normally one) point in the scene in front of the camera, usually by means of light that the scene point reflects from a light source. Each visible point in the scene is projected along a straight line that passes through the camera aperture and intersects the image plane. This means that at a specific point in time, each point in the image refers to a specific point in the scene. This scene point has a position relative to the camera, and if this relative position changes, it corresponds to a relative motion in 3D. It is a relative motion since it does not matter whether it is the scene point, or the camera, or both, that are moving. It is only when there is a change in the relative position that the camera is able to detect that some motion has happened. By projecting the relative 3D motion of all visible points back into the image, the result is the motion field, describing the apparent motion of each image point in terms of a magnitude and direction of velocity of that point in the image plane. A consequence of this observation is that if the relative 3D motion of some scene point is along its projection line, the corresponding apparent motion is zero.
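As a hedged illustration of this geometry (the notation here is assumed, not taken from the text): with focal length f and a scene point at camera-relative coordinates (X, Y, Z), the pinhole projection and its time derivative give the motion field velocity of the corresponding image point:

```latex
x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z}
\qquad\Longrightarrow\qquad
\dot{x} = f\,\frac{\dot{X}Z - X\dot{Z}}{Z^{2}}, \qquad
\dot{y} = f\,\frac{\dot{Y}Z - Y\dot{Z}}{Z^{2}}
```

In particular, if the 3D velocity is directed along the projection line, i.e. (Ẋ, Ẏ, Ż) is proportional to (X, Y, Z), both numerators vanish and the apparent motion is zero, in line with the observation above.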
The camera measures the intensity of light at each image point, a light field. In practice, a digital camera measures this light field at discrete points, pixels, but given that the pixels are sufficiently dense, the pixel intensities can be used to represent most characteristics of the light field that falls onto the image plane. A common assumption of motion analysis is that the light reflected from the scene points does not vary over time. As a consequence, if an intensity I has been observed at some point in the image, at some point in time, the same intensity I will be observed at a position that is displaced relative to the first one as a consequence of the apparent motion. Another common assumption is that there is a fair amount of variation in the detected intensity over the pixels in an image. A consequence of this assumption is that if the scene point that corresponds to a certain pixel in the image has a relative 3D motion, then the pixel intensity is likely to change over time.
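The constant-intensity assumption is commonly formalised as a brightness-constancy constraint; the following sketch (standard notation, assumed rather than taken from the text) shows how it links intensity change to the apparent motion (u, v) of an image point:

```latex
I(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) = I(x, y, t)
\qquad\Longrightarrow\qquad
\frac{\partial I}{\partial x}\,u + \frac{\partial I}{\partial y}\,v + \frac{\partial I}{\partial t} = 0
```

The second assumption matters here: the constraint only determines the component of the motion along the intensity gradient, so regions with little intensity variation carry little motion information.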
One of the simplest types of motion analysis is to detect image points that refer to moving points in the scene. The typical result of this processing is a binary image in which all image points (pixels) that relate to moving points in the scene are set to 1 and all other points are set to 0. This binary image is then further processed, e.g., to remove noise, group neighboring pixels, and label objects. Motion detection can be done using several methods; the two main groups are differential methods and methods based on background segmentation.
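A minimal sketch of a differential method (simple frame differencing) that produces the binary image described above; the threshold is an assumed value, and a practical system would follow this with the noise removal and grouping steps already mentioned:

```python
import numpy as np

def detect_motion(prev_frame: np.ndarray, curr_frame: np.ndarray,
                  threshold: float = 25.0) -> np.ndarray:
    """Return a binary image: 1 where the intensity changed noticeably, 0 elsewhere."""
    # Absolute per-pixel intensity difference between consecutive grayscale frames
    diff = np.abs(curr_frame.astype(np.float32) - prev_frame.astype(np.float32))
    # Pixels whose change exceeds the threshold are marked as "moving"
    return (diff > threshold).astype(np.uint8)

# Usage with two synthetic 8-bit grayscale frames
prev = np.zeros((120, 160), dtype=np.uint8)
curr = prev.copy()
curr[40:60, 70:90] = 200          # a bright patch appears, i.e. something moved
mask = detect_motion(prev, curr)
print(mask.sum(), "pixels flagged as moving")
```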
In the areas of medicine, sports, [5] video surveillance, physical therapy, [6] and kinesiology, [7] human motion analysis has become an investigative and diagnostic tool. See the section on motion capture for more detail on the technologies. Human motion analysis can be divided into three categories: human activity recognition, human motion tracking, and analysis of body and body part movement.
Human activity recognition is most commonly used for video surveillance, specifically automatic motion monitoring for security purposes. Most efforts in this area rely on state-space approaches, in which sequences of static postures are statistically analyzed and compared to modeled movements. Template-matching is an alternative method whereby static shape patterns are compared to pre-existing prototypes. [8]
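As an illustrative sketch of the template-matching idea (comparing a static shape pattern against a pre-existing prototype), using a plain sum-of-squared-differences search; the array sizes and names are hypothetical:

```python
import numpy as np

def match_template(image: np.ndarray, template: np.ndarray) -> tuple[int, int]:
    """Return the (row, col) where the template best matches the image (lowest SSD)."""
    ih, iw = image.shape
    th, tw = template.shape
    best_score, best_pos = np.inf, (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            patch = image[r:r + th, c:c + tw].astype(np.float32)
            score = np.sum((patch - template.astype(np.float32)) ** 2)
            if score < best_score:
                best_score, best_pos = score, (r, c)
    return best_pos

# A toy silhouette "prototype" placed inside a larger frame
frame = np.zeros((60, 80), dtype=np.uint8)
prototype = np.full((10, 6), 255, dtype=np.uint8)
frame[20:30, 50:56] = prototype
print(match_template(frame, prototype))   # -> (20, 50)
```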
Human motion tracking can be performed in two or three dimensions. Depending on the complexity of analysis, representations of the human body range from basic stick figures to volumetric models. Tracking relies on the correspondence of image features between consecutive frames of video, taking into consideration information such as position, color, shape, and texture. Edge detection can be performed by comparing the color and/or contrast of adjacent pixels, looking specifically for discontinuities or rapid changes. [9] Three-dimensional tracking is fundamentally identical to two-dimensional tracking, with the added factor of spatial calibration. [8]
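A hedged sketch of the adjacent-pixel comparison mentioned above: finite differences approximate the intensity gradient, and large gradient magnitudes mark discontinuities (the threshold is an assumed value):

```python
import numpy as np

def detect_edges(gray: np.ndarray, threshold: float = 30.0) -> np.ndarray:
    """Mark pixels whose intensity changes rapidly relative to their neighbours."""
    g = gray.astype(np.float32)
    gy, gx = np.gradient(g)            # per-pixel differences between adjacent pixels
    magnitude = np.hypot(gx, gy)       # strength of the local intensity change
    return (magnitude > threshold).astype(np.uint8)

# Usage: a step edge between a dark and a bright region
img = np.zeros((50, 50), dtype=np.uint8)
img[:, 25:] = 200
edges = detect_edges(img)
print(edges[:, 24:27].max(axis=0))     # edge pixels are flagged near column 25
```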
Motion analysis of body parts is critical in the medical field. In postural and gait analysis, joint angles are used to track the location and orientation of body parts. Gait analysis is also used in sports to optimize athletic performance or to identify motions that may cause injury or strain. Tracking software that does not require the use of optical markers is especially important in these fields, where the use of markers may impede natural movement. [8] [10]
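As a small worked sketch of how a joint angle can be computed from three tracked points in one frame; the hip, knee and ankle coordinates below are hypothetical example values:

```python
import numpy as np

def joint_angle(proximal, joint, distal) -> float:
    """Angle (degrees) at `joint` between the segments joint->proximal and joint->distal."""
    a = np.asarray(proximal, dtype=float) - np.asarray(joint, dtype=float)
    b = np.asarray(distal, dtype=float) - np.asarray(joint, dtype=float)
    cos_theta = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

# Hypothetical 2D pixel positions of hip, knee and ankle in a single frame
hip, knee, ankle = (100, 200), (110, 300), (105, 400)
print(round(joint_angle(hip, knee, ankle), 1))   # roughly 171 degrees (nearly straight leg)
```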
Motion analysis is also applicable in the manufacturing process. [11] Using high speed video cameras and motion analysis software, one can monitor and analyze assembly lines and production machines to detect inefficiencies or malfunctions. Manufacturers of sports equipment, such as baseball bats and hockey sticks, also use high speed video analysis to study the impact of projectiles. An experimental setup for this type of study typically uses a triggering device, external sensors (e.g., accelerometers, strain gauges), data acquisition modules, a high-speed camera, and a computer for storing the synchronized video and data. Motion analysis software calculates parameters such as distance, velocity, acceleration, and deformation angles as functions of time. This data is then used to design equipment for optimal performance. [12]
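A minimal sketch of the kind of calculation such software performs, assuming a tracked position sampled at a known frame rate; the numbers below are made up for illustration:

```python
import numpy as np

frame_rate = 1000.0                      # frames per second of the high-speed camera (assumed)
dt = 1.0 / frame_rate

# Hypothetical tracked x-positions of an object, one sample per frame (metres)
position = np.array([0.000, 0.002, 0.008, 0.018, 0.032, 0.050])

velocity = np.gradient(position, dt)     # first derivative: metres per second
acceleration = np.gradient(velocity, dt) # second derivative: metres per second squared

print(velocity)
print(acceleration)
```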
The object and feature detecting capabilities of motion analysis software can be applied to count and track particles, such as bacteria, [13] [14] viruses, [15] "ionic polymer-metal composites", [16] [17] micron-sized polystyrene beads, [18] aphids, [19] and projectiles. [20]
Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.
Chroma key compositing, or chroma keying, is a visual-effects and post-production technique for compositing (layering) two or more images or video streams together based on colour hues. The technique has been used in many fields to remove a background from the subject of a photo or video – particularly the newscasting, motion picture, and video game industries. A colour range in the foreground footage is made transparent, allowing separately filmed background footage or a static image to be inserted into the scene. The chroma keying technique is commonly used in video production and post-production. This technique is also referred to as colour keying, colour-separation overlay, or by various terms for specific colour-related variants such as green screen or blue screen; chroma keying can be done with backgrounds of any colour that are uniform and distinct, but green and blue backgrounds are more commonly used because they differ most distinctly in hue from any human skin colour. No part of the subject being filmed or photographed may duplicate the colour used as the backing, or the part may be erroneously identified as part of the backing.
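A hedged sketch of the core keying step: pixels whose colour is close to the backing colour are made transparent and replaced by the background plate. The green reference colour and tolerance are assumed values; a production keyer would also handle colour spill and soft edges:

```python
import numpy as np

def chroma_key(foreground: np.ndarray, background: np.ndarray,
               key_color=(0, 200, 0), tolerance: float = 80.0) -> np.ndarray:
    """Replace pixels near `key_color` in the foreground with the background image."""
    fg = foreground.astype(np.float32)
    distance = np.linalg.norm(fg - np.array(key_color, dtype=np.float32), axis=-1)
    mask = distance < tolerance                      # True where the backing colour shows
    out = foreground.copy()
    out[mask] = background[mask]                     # composite the background into the keyed area
    return out

# Usage with tiny synthetic RGB frames of the same size
fg = np.zeros((4, 4, 3), dtype=np.uint8); fg[...] = (0, 200, 0)   # green backing everywhere
fg[1, 1] = (180, 60, 60)                                          # one "subject" pixel
bg = np.full((4, 4, 3), 30, dtype=np.uint8)
composited = chroma_key(fg, bg)
print(composited[1, 1], composited[0, 0])   # subject kept, backing replaced
```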
Motion blur is the apparent streaking of moving objects in a photograph or a sequence of frames, such as a film or animation. It results when the image being recorded changes during the recording of a single exposure, due to rapid movement or long exposure.
Motion capture is the process of recording the movement of objects or people. It is used in military, entertainment, sports, medical applications, and for validation of computer vision and robots. In filmmaking and video game development, it refers to recording actions of human actors and using that information to animate digital character models in 2D or 3D computer animation. When it includes face and fingers or captures subtle expressions, it is often referred to as performance capture. In many fields, motion capture is sometimes called motion tracking, but in filmmaking and games, motion tracking usually refers more to match moving.
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
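As an illustrative sketch of the label-per-pixel idea; the intensity bands chosen here are arbitrary assumptions, and real segmentation methods use far richer criteria such as colour, texture or learned features:

```python
import numpy as np

def segment_by_intensity(gray: np.ndarray) -> np.ndarray:
    """Assign every pixel a label so that pixels with the same label share an intensity band."""
    labels = np.zeros(gray.shape, dtype=np.uint8)
    labels[gray >= 85] = 1        # mid-intensity region
    labels[gray >= 170] = 2       # bright region
    return labels                 # 0 = dark, 1 = mid, 2 = bright

img = np.array([[10, 10, 120], [120, 200, 200]], dtype=np.uint8)
print(segment_by_intensity(img))   # [[0 0 1], [1 2 2]]
```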
Motion detection is the process of detecting a change in the position of an object relative to its surroundings or a change in the surroundings relative to an object. It can be achieved by either mechanical or electronic methods. When it is done by natural organisms, it is called motion perception.
A high-speed camera is a device capable of capturing moving images with exposures of less than 1/1,000 second or frame rates in excess of 250 fps. It is used for recording fast-moving objects as photographic images onto a storage medium. After recording, the images stored on the medium can be played back in slow motion. Early high-speed cameras used film to record the high-speed events, but were superseded by entirely electronic devices using an image sensor, recording, typically, over 1,000 fps onto DRAM, to be played back slowly to study the motion for scientific study of transient phenomena.
Gesture recognition is a topic in computer science and language technology with the goal of interpreting human gestures via mathematical algorithms. It is a subdiscipline of computer vision. Gestures can originate from any bodily motion or state, but commonly originate from the face or hand. Focuses in the field include emotion recognition from the face and hand gesture recognition, since both are forms of expression. Users can make simple gestures to control or interact with devices without physically touching them. Many approaches have been made using cameras and computer vision algorithms to interpret sign language; however, the identification and recognition of posture, gait, proxemics, and human behaviors is also the subject of gesture recognition techniques. Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a better bridge between machines and humans than older text user interfaces or even GUIs, which still limit the majority of input to keyboard and mouse; gesture recognition allows users to interact naturally without any mechanical devices.
In visual effects, match moving is a technique that allows the insertion of computer graphics into live-action footage with correct position, scale, orientation, and motion relative to the photographed objects in the shot. The term is used loosely to describe several different methods of extracting camera motion information from a motion picture. Sometimes referred to as motion tracking or camera solving, match moving is related to rotoscoping and photogrammetry. Match moving is sometimes confused with motion capture, which records the motion of objects, often human actors, rather than the camera. Typically, motion capture requires special cameras and sensors and a controlled environment. Match moving is also distinct from motion control photography, which uses mechanical hardware to execute multiple identical camera moves. Match moving, by contrast, is typically a software-based technology, applied after the fact to normal footage recorded in uncontrolled environments with an ordinary camera.
Motion estimation is the process of determining motion vectors that describe the transformation from one 2D image to another; usually from adjacent frames in a video sequence. It is an ill-posed problem as the motion is in three dimensions but the images are a projection of the 3D scene onto a 2D plane. The motion vectors may relate to the whole image or specific parts, such as rectangular blocks, arbitrary shaped patches or even per pixel. The motion vectors may be represented by a translational model or many other models that can approximate the motion of a real video camera, such as rotation and translation in all three dimensions and zoom.
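A minimal sketch of block-based motion estimation: for one block of the current frame, a window of the reference frame is searched for the best-matching block, and the displacement is reported as a motion vector. The block size, search range and frame contents are assumed values:

```python
import numpy as np

def block_motion_vector(ref: np.ndarray, cur: np.ndarray,
                        top: int, left: int, block: int = 8, search: int = 4):
    """Motion vector (dy, dx) for the block of `cur` at (top, left), by exhaustive SAD search."""
    target = cur[top:top + block, left:left + block].astype(np.float32)
    best = (np.inf, (0, 0))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r, c = top + dy, left + dx
            if r < 0 or c < 0 or r + block > ref.shape[0] or c + block > ref.shape[1]:
                continue
            candidate = ref[r:r + block, c:c + block].astype(np.float32)
            sad = np.abs(candidate - target).sum()     # sum of absolute differences
            if sad < best[0]:
                best = (sad, (dy, dx))
    return best[1]

# Usage: a patch moves 2 pixels right and 1 pixel down between two frames
ref = np.zeros((32, 32), dtype=np.uint8); ref[8:16, 8:16] = 255
cur = np.zeros((32, 32), dtype=np.uint8); cur[9:17, 10:18] = 255
print(block_motion_vector(ref, cur, top=9, left=10))   # -> (-1, -2), pointing back to the reference
```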
Image stitching or photo stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image. Commonly performed through the use of computer software, most approaches to image stitching require nearly exact overlaps between images and identical exposures to produce seamless results, although some stitching algorithms actually benefit from differently exposed images by doing high-dynamic-range imaging in regions of overlap. Some digital cameras can stitch their photos internally.
In the fields of computing and computer vision, pose represents the position and orientation of an object, usually in three dimensions. Poses are often stored internally as transformation matrices. The term “pose” is largely synonymous with the term “transform”, but a transform may often include scale, whereas pose does not.
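A hedged sketch of the usual internal representation: a 4x4 homogeneous transformation matrix combining a rotation and a translation. The angle and offset below are arbitrary example values:

```python
import numpy as np

theta = np.radians(30.0)                      # example rotation about the z-axis
rotation = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                     [np.sin(theta),  np.cos(theta), 0.0],
                     [0.0,            0.0,           1.0]])
translation = np.array([1.0, 2.0, 0.5])       # example position offset

pose = np.eye(4)                              # 4x4 homogeneous pose: rotation + translation, no scale
pose[:3, :3] = rotation
pose[:3, 3] = translation

point = np.array([1.0, 0.0, 0.0, 1.0])        # a point in homogeneous coordinates
print(pose @ point)                           # the point after the pose is applied
```

A more general transform could additionally carry a scale factor in the upper-left 3x3 block, which a pose, as defined above, would not.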
The following are common definitions related to the machine vision field.
Range imaging is the name for a collection of techniques that are used to produce a 2D image showing the distance to points in a scene from a specific point, normally associated with some type of sensor device.
Rolling shutter is a method of image capture in which a still picture or each frame of a video is captured not by taking a snapshot of the entire scene at a single instant in time but rather by scanning across the scene rapidly, vertically, horizontally or rotationally. In other words, not all parts of the image of the scene are recorded at exactly the same instant. This produces predictable distortions of fast-moving objects or rapid flashes of light. This is in contrast with "global shutter" in which the entire frame is captured at the same instant.
In robotics and computer vision, visual odometry is the process of determining the position and orientation of a robot by analyzing the associated camera images. It has been used in a wide variety of robotic applications, such as on the Mars Exploration Rovers.
2D to 3D video conversion is the process of transforming 2D ("flat") film to 3D form, which in almost all cases is stereoscopic; that is, imagery for each eye is created from a single 2D image.
Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.
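A minimal sketch of background subtraction with a running-average background model; the learning rate and threshold are assumed values:

```python
import numpy as np

class BackgroundSubtractor:
    """Keep a running-average background and flag pixels that differ from it."""
    def __init__(self, first_frame: np.ndarray, alpha: float = 0.05, threshold: float = 25.0):
        self.background = first_frame.astype(np.float32)
        self.alpha = alpha              # how quickly the background model adapts
        self.threshold = threshold

    def apply(self, frame: np.ndarray) -> np.ndarray:
        frame = frame.astype(np.float32)
        foreground = (np.abs(frame - self.background) > self.threshold).astype(np.uint8)
        # Slowly blend the new frame into the background model
        self.background = (1.0 - self.alpha) * self.background + self.alpha * frame
        return foreground

# Usage with synthetic grayscale frames
first = np.zeros((100, 100), dtype=np.uint8)
subtractor = BackgroundSubtractor(first)
frame = first.copy(); frame[40:50, 40:50] = 255      # an object enters the scene
print(subtractor.apply(frame).sum())                  # 100 foreground pixels
```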
In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independently and rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Whereas image segmentation techniques label pixels that share certain characteristics at a particular time, here the pixels are segmented according to their relative movement over a period of time, i.e., over the duration of the video sequence.
An event camera, also known as a neuromorphic camera, silicon retina or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.