In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques labels the pixels to be a part of pixels with certain characteristics at a particular time. Here, the pixels are segmented depending on its relative movement over a period of time i.e. the time of the video sequence.
There are a number of methods that have been proposed to do so. [1] There is no consistent way to classify motion segmentation due to its large variation in literature. Depending on the segmentation criterion used in the algorithm it can be broadly classified into the following categories: image difference, statistical methods, wavelets, layering, optical flow and factorization. Moreover, depending on the number of views required the algorithms can be two or multi view-based. Rigid motion segmentation has found an increase in its application over the recent past with rise in surveillance and video editing. These algorithms are discussed further.
In general, motion can be considered to be a transformation of an object in space and time. If this transformation preserves size and shape of the object it is known as a Rigid Transformation. Rigid transform can be rotational, translational or reflective. We define rigid transformation mathematically as:
where F is a rigid transform if and only if it preserves isometry and space orientation.
In the sense of motion, rigid transform is the movement of a rigid object in space. As shown in Figure 1: this 3-D motion is the transformation from original co-ordinates (X,Y,Z) to transformed co-ordinates (X',Y',Z') which is a result of rotation and translation captured by rotational matrix R and translational vector T respectively. Hence the transform will be:
where,
has 9 unknowns which correspond to the rotational angle with each axis and has 3 unknowns () which account for translation in X,Y and Z directions respectively. This motion (3-D) in time when captured by a camera (2-D) corresponds to change of pixels in the subsequent frames of the video sequence. This transformation is also known as 2-D rigid body motion or the 2-D Euclidean transformation. It can be written as:
where,
X→ original pixel co-ordinate.
X'→ transformed pixel co-ordinate.
R→ orthonormal rotation matrix with R ⋅ RT = I and |R| = 1.
t→ translational vector but in the 2D image space.
To visualize this consider an example of a video sequence of a traffic surveillance camera. It will have moving cars and this movement does not change their shape and size. Moreover, the movement is a combination of rotation and transformation of the car in 3D which is reflected in its subsequent video frames. Thus the car is said to have a rigid motion.
Image segmentation techniques are interested in segmenting out different parts of the image as per the region of interest. As videos are sequences of images, motion segmentation aims at decomposing a video in moving objects and background by segmenting the objects that undergo different motion patterns. The analysis of these spatial and temporal changes occurring in the image sequence by separating visual features from the scenes into different groups lets us extract visual information. Each group corresponds to the motion of an object in the dynamic sequence. In the simplest case motion segmentation can mean extracting moving objects from a stationary camera but the camera can also move which introduces the relative motion of the static background. Depending upon the type of visual features that are extracted, motion segmentation algorithms can be broadly divided into two categories. The first is known as direct motion segmentation that uses pixel intensities from the image. Such algorithms assume constant illumination. The second category of algorithms computes a set of features corresponding to actual physical points on the objects. These sparse features are then used to characterize either the 2-D motion of the scene or the 3-D motion of the objects in the scene. There are a number of requirements to design a good motion segmentation algorithm. The algorithm must extract distinct features (corners or salient points) that represent the object by a limited number of points and it must have the ability to deal with occlusions. The images will also be affected by noise and will have missing data, thus they must be robust. Some algorithms detect only one object but the video sequence may have different motions. Thus the algorithm must be multiple object detectors. Moreover, the type of camera model, if used, also characterizes the algorithm. Depending upon the object characterization of an algorithm it can detect rigid, non-rigid motion or both. Moreover, algorithms used to estimate single rigid-body motions can provide accurate results with robustness to noise and outliers but when extended to multiple rigid-body motions they fail. In case of view-based segmentation techniques described below, this happens because the single fundamental matrix assumption is violated as each motion will now be represented by means of a new fundamental matrix corresponding to that motion.
As mentioned earlier that there is no particular way to distinguish Motion Segmentation techniques but depending on the basis of the segmentation criterion used in the algorithm it can be broadly classified as follows: [2]
It is a very useful technique for detecting changes in images due to its simplicity and ability to deal with occlusion and multiple motions. These techniques assume constant light source intensity. The algorithm first considers two frames at a time and then computes the pixel by pixel intensity difference. On this computation it thresholds the intensity difference and maps the changes onto a contour. Using this contour it extracts the spatial and temporal information required to define the motion in the scene. Though it is a simple technique to implement it is not robust to noise. Another difficulty with these techniques is the camera movement. When the camera moves there is a change in the entire image which has to be accounted for. Many new algorithm have been introduced to overcome these difficulties. [3] [4] [5] [6]
Motion segmentation can be seen as a classification problem where each pixel has to be classified as background or foreground. Such classifications are modeled under statistic theory and can be used in segmentation algorithms. These approaches can be further divided depending on the statistical framework used. Most commonly used frameworks are maximum a posteriori probability (MAP), [7] Particle Filter (PF) [8] and Expectation Maximization (EM). [9] MAP uses Bayes' Rule for implementation where a particular pixel has to be classified under predefined classes. PF is based on the concept of evolution of a variable with varying weights over time. The final estimation is the weighted sum of all the variables. Both of these methods are iterative. The EM algorithm is also an iterative estimation method. It computes the maximum likelihood (ML) estimate of the model parameters in presence of missing or hidden data and decided the most likely fit of the observed data.
Optical flow (OF) helps in determining the relative pixel velocity of points within an image sequence. Like image difference, it is also an old concept used for segmentation. Initially the main drawback of OF was the lack of robustness to noise and high computational costs but due to recent key-point matching techniques and hardware implementations, these limitations have diminished. To increase its robustness to occlusion and temporal stopping, OF is generally used with other statistical or image difference techniques. For complicated scenarios, particularly when the camera itself is moving, OF provides a basis for estimating the fundamental matrix where outliers represent other objects moving independently in the scene. [3] Alternatively, optical flow based on line segments instead of point features can also be used to segment multiple rigid-body motions. [10]
An image is made up of different frequency components. [11] Edges, corners and plane regions can be represented by means of different frequencies. Wavelet based methods perform analysis of the different frequency components of the images and then study each component with different resolution such that they are matched to its scale. Multi-scale decomposition is used generally in order to reduce the noise. Though this method provides good results, [12] it is limited with an assumption that the movement of objects is only in front of the camera. Implementations of Wavelet-based techniques are present with other approaches, such as optical flow and are applied at various scale to reduce the effect of noise.
Layers based techniques divide the images into layers that have uniform motion. This approach determines the different depth layer in the image and finds which layer the object or part of the image lies in. Such techniques are used in stereo vision where it is needed to compute the depth distance. The first layer based technique was proposed in 1993. [13] As humans also use layer based segmentation, this method is a natural solution to occlusion problems but it is very complex with requirement of manual tuning.
Tomasi and Kanade introduced the first factorization method. This method tracked features in a sequence of images and recovered the shape and motion. This technique factorized the trajectory matrix W, determined after the tracking of different features over the sequence into two matrices: motion and structure using Singular Value Decomposition. [14] The simplicity of the algorithm is the reason for its wide use but they are sensitive to noise and outliers. Most of these methods are implemented under the assumption of rigid and independent motion.
Further motion detection algorithms can also be classified depending upon the number of views: two and multi view-based approaches namely. The two-view based approaches are usually based on epipolar geometry. Consider two perspective camera views of a rigid body and find its feature correspondences. These correspondences are seen to satisfy either an epipolar constraint for a general rigid-body or a homography constraint for a planar object. Planar motion in a sequence is the motion of the background, facade or the ground. [15] Thus it is a degenerate case of rigid body motion together with general rigid body objects e.g. cars. Hence in a sequence we expect to see more than one type of motion, described by multiple epipolar constraints and homographies. The view based algorithms are sensitive to outliers but recent approaches deal with outliers by using random sample consensus (RANSAC) [16] and enhanced Dirichlet process mixture models. [3] [17] Other approaches use global dimension minimization to reveal the clusters corresponding to the underlying subspace. These approaches use only two frames for motion segmentation even if multiple frames are available as they cannot use multi frame information. Multiview-based approaches utilize the trajectory of feature points unlike two-view based approaches. [18] A number of approaches have been provided which include Principle Angles Configuration (PAC) [19] and Sparse Subspace Clustering (SSC) [20] methods. These work well in two or three motion cases. These algorithms are also robust to noise with a tradeoff with speed, i.e. they are less sensitive to noise but slow in computation. Other algorithms with a multi-view approach are spectral curvature clustering (SCC), latent low-rank representation-based method (LatLRR) [21] and ICLM-based approaches. [22] These algorithms are faster and more accurate than the two-view based but require greater number of frames to maintain the accuracy.
Motion segmentation is a field under research as there are many issues which provide scope of improvement. One of the major problems is of feature detection and finding correspondences. There are strong feature detection algorithms but they still give false positives which can lead to unexpected correspondences. Finding these pixel or feature correspondences is a difficult task. These mismatched feature points from the objects and the background often introduce outliers. The presence of image noise and outliers further affect the accuracy of structure from motion (SFM) estimation. Another issue is that of motion models or motion representations. It requires the motion to be modeled or estimated in the given model used in the algorithm. Most algorithms perform 2-D motion segmentation by assuming the motions in the scene can be modeled by 2-D affine motion models. Theoretically, this is valid because 2-D translational motion model can be represented by general affine motion model. However, such approximations in modeling can have negative consequences. The translational model has two parameters and the affine model has 6 parameters so we estimate four extra parameters. Moreover, there may not be enough data to estimate the affine motion model so the parameter estimation might be erroneous. Some of the other problems faced are:
Robust algorithms have been proposed to take care of the outliers and implement with greater accuracy. The Tomasi and Kanade factorization method is one of the methods as mentioned above under factorization.
Motion segmentation has many important applications. [1] It is used for video compression. With segmentation, it is possible to eliminate the redundancy related to the repetition of the same visual patterns in successive images. It can also be used for video description tasks, such as logging, annotation and indexing. By using Automatic object extraction techniques video content with object-specific information can be segregated. Thus concept can be used by search engines and video libraries. Some specific applications include:
Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and extraction of high-dimensional data from the real world in order to produce numerical or symbolic information, e.g. in the forms of decisions. Understanding in this context means the transformation of visual images into descriptions of the world that make sense to thought processes and can elicit appropriate action. This image understanding can be seen as the disentangling of symbolic information from image data using models constructed with the aid of geometry, physics, statistics, and learning theory.
Motion compensation in computing is an algorithmic technique used to predict a frame in a video given the previous and/or future frames by accounting for motion of the camera and/or objects in the video. It is employed in the encoding of video data for video compression, for example in the generation of MPEG-2 files. Motion compensation describes a picture in terms of the transformation of a reference picture to the current picture. The reference picture may be previous in time or even from the future. When images can be accurately synthesized from previously transmitted/stored images, the compression efficiency can be improved.
Digital image processing is the use of a digital computer to process digital images through an algorithm. As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing. It allows a much wider range of algorithms to be applied to the input data and can avoid problems such as the build-up of noise and distortion during processing. Since images are defined over two dimensions digital image processing may be modeled in the form of multidimensional systems. The generation and development of digital image processing are mainly affected by three factors: first, the development of computers; second, the development of mathematics ; third, the demand for a wide range of applications in environment, agriculture, military, industry and medical science has increased.
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
Template matching is a technique in digital image processing for finding small parts of an image which match a template image. It can be used for quality control in manufacturing, navigation of mobile robots, or edge detection in images.
Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles at SRI International in 1981. They used RANSAC to solve the Location Determination Problem (LDP), where the goal is to determine the points in the space that project onto an image into a set of landmarks with known locations.
The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.
Video tracking is the process of locating a moving object over time using a camera. It has a variety of uses, some of which are: human-computer interaction, security and surveillance, video communication and compression, augmented reality, traffic control, medical imaging and video editing. Video tracking can be a time-consuming process due to the amount of data that is contained in video. Adding further to the complexity is the possible need to use object recognition techniques for tracking, a challenging problem in its own right.
Simple interactive object extraction (SIOX) is an algorithm for extracting foreground objects from color images and videos with very little user interaction. It has been implemented as "foreground selection" tool in the GIMP, as part of the tracer tool in Inkscape, and as function in ImageJ and Fiji (plug-in). Experimental implementations were also reported for Blender and Krita. Although the algorithm was originally designed for videos, virtually all implementations use SIOX primarily for still image segmentation. In fact, it is often said to be the current de facto standard for this task in the open-source world.
Image stitching or photo stitching is the process of combining multiple photographic images with overlapping fields of view to produce a segmented panorama or high-resolution image. Commonly performed through the use of computer software, most approaches to image stitching require nearly exact overlaps between images and identical exposures to produce seamless results, although some stitching algorithms actually benefit from differently exposed images by doing high-dynamic-range imaging in regions of overlap. Some digital cameras can stitch their photos internally.
In computer vision, the Lucas–Kanade method is a widely used differential method for optical flow estimation developed by Bruce D. Lucas and Takeo Kanade. It assumes that the flow is essentially constant in a local neighbourhood of the pixel under consideration, and solves the basic optical flow equations for all the pixels in that neighbourhood, by the least squares criterion.
Camera resectioning is the process of estimating the parameters of a pinhole camera model approximating the camera that produced a given photograph or video; it determines which incoming light ray is associated with each pixel on the resulting image. Basically, the process determines the pose of the pinhole camera.
Motion analysis is used in computer vision, image processing, high-speed photography and machine vision that studies methods and applications in which two or more consecutive images from an image sequences, e.g., produced by a video camera or high-speed camera, are processed to produce information based on the apparent motion in the images. In some applications, the camera is fixed relative to the scene and objects are moving around in the scene, in some applications the scene is more or less fixed and the camera is moving, and in some cases both the camera and the scene are moving.
Image rectification is a transformation process used to project images onto a common image plane. This process has several degrees of freedom and there are many strategies for transforming images to the common plane. Image rectification is used in computer stereo vision to simplify the problem of finding matching points between images, and in geographic information systems to merge images taken from multiple perspectives into a common map coordinate system.
Part-based models refers to a broad class of detection algorithms used on images, in which various parts of the image are used separately in order to determine if and where an object of interest exists. Amongst these methods a very popular one is the constellation model which refers to those schemes which seek to detect a small number of features and their relative positions to then determine whether or not the object of interest is present.
The Tomasi–Kanade factorization is the seminal work by Carlo Tomasi and Takeo Kanade in the early 1990s. It charted out an elegant and simple solution based on a SVD-based factorization scheme for analysing image measurements of a rigid object captured from different views using a weak perspective camera model. The crucial observation made by authors was that if all the measurements are collected in a single matrix, the point trajectories will reside in a certain subspace. The dimension of the subspace in which the image data resides is a direct consequence of two factors:
ViBe is a background subtraction algorithm which has been presented at the IEEE ICASSP 2009 conference and was refined in later publications. More precisely, it is a software module for extracting background information from moving images. It has been developed by Oliver Barnich and Marc Van Droogenbroeck of the Montefiore Institute, University of Liège, Belgium.
Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.
In computer vision, pattern recognition, and robotics, point-set registration, also known as point-cloud registration or scan matching, is the process of finding a spatial transformation that aligns two point clouds. The purpose of finding such a transformation includes merging multiple data sets into a globally consistent model, and mapping a new measurement to a known data set to identify features or to estimate its pose. Raw 3D point cloud data are typically obtained from Lidars and RGB-D cameras. 3D point clouds can also be generated from computer vision algorithms such as triangulation, bundle adjustment, and more recently, monocular image depth estimation using deep learning. For 2D point set registration used in image processing and feature-based image registration, a point set may be 2D pixel coordinates obtained by feature extraction from an image, for example corner detection. Point cloud registration has extensive applications in autonomous driving, motion estimation and 3D reconstruction, object detection and pose estimation, robotic manipulation, simultaneous localization and mapping (SLAM), panorama stitching, virtual and augmented reality, and medical imaging.
Robust Principal Component Analysis (RPCA) is a modification of the widely used statistical procedure of principal component analysis (PCA) which works well with respect to grossly corrupted observations. A number of different approaches exist for Robust PCA, including an idealized version of Robust PCA, which aims to recover a low-rank matrix L0 from highly corrupted measurements M = L0 +S0. This decomposition in low-rank and sparse matrices can be achieved by techniques such as Principal Component Pursuit method (PCP), Stable PCP, Quantized PCP, Block based PCP, and Local PCP. Then, optimization methods are used such as the Augmented Lagrange Multiplier Method (ALM), Alternating Direction Method (ADM), Fast Alternating Minimization (FAM), Iteratively Reweighted Least Squares (IRLS ) or alternating projections (AP).
{{cite journal}}
: Cite journal requires |journal=
(help){{cite journal}}
: Cite journal requires |journal=
(help)