Moving object detection

Last updated August 22, 2024

Moving object detection is a technique used in computer vision and image processing. Multiple consecutive frames from a video are compared by various methods to determine if any moving object is detected.

Definition

Moving object detection recognizes the physical movement of an object in a given place or region.^[2] By segmenting the moving objects from the stationary area or region,^[3] the objects' motion can be tracked and analyzed later. Treating a video as a sequence of single frames, moving object detection finds the foreground moving target(s), either in every video frame or only when a moving target first appears in the video.^[4]

Traditional methods

Traditional moving object detection methods can be categorized into four major approaches: background subtraction, frame differencing, temporal differencing, and optical flow.^[2]

Frame differencing

Rather than the traditional approach of applying an image subtraction operator to the second and subsequent images, the frame differencing method compares two successive frames to detect moving targets.^[5]
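The comparison of two successive frames can be illustrated with a minimal sketch. It assumes grayscale frames stored as nested lists of 0-255 intensities and a fixed change threshold. The function name `frame_difference_mask` and the default threshold of 25 are illustrative choices, not taken from the cited sources.

```python
from typing import List

Frame = List[List[int]]   # grayscale image: rows of 0-255 intensities (assumed layout)
Mask = List[List[bool]]   # True where a pixel is flagged as moving


def frame_difference_mask(prev: Frame, curr: Frame, threshold: int = 25) -> Mask:
    """Flag pixels whose intensity changed by more than `threshold`
    between two successive frames. The threshold value is an assumption."""
    if len(prev) != len(curr) or any(len(p) != len(c) for p, c in zip(prev, curr)):
        raise ValueError("frames must have the same dimensions")
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


# A bright one-pixel "object" moves one column to the right between frames.
frame_a = [[0, 200, 0, 0],
           [0,   0, 0, 0]]
frame_b = [[0, 0, 200, 0],
           [0, 0,   0, 0]]
mask = frame_difference_mask(frame_a, frame_b)
```

In this toy input the mask flags both the position the object left and the position it moved to, since both pixels changed between the two frames.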

Temporal differencing

The temporal differencing method identifies moving objects by computing the pixel-wise difference between two or three consecutive frames.^[3]
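A three-frame variant can be sketched as follows. This is one common formulation, assumed here for illustration rather than taken from the cited survey: a pixel is flagged only if it differs from both the previous and the next frame. The names `three_frame_difference` and `_changed`, the nested-list frame layout, and the threshold of 25 are hypothetical.

```python
from typing import List

Frame = List[List[int]]   # grayscale image: rows of 0-255 intensities (assumed layout)
Mask = List[List[bool]]


def _changed(a: Frame, b: Frame, threshold: int) -> Mask:
    """Pixel-wise test: did the intensity change by more than `threshold`?"""
    return [
        [abs(y - x) > threshold for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def three_frame_difference(prev: Frame, curr: Frame, nxt: Frame,
                           threshold: int = 25) -> Mask:
    """Combine the two pairwise difference masks with a logical AND, so a
    pixel is flagged only if it changed in both comparisons."""
    d1 = _changed(prev, curr, threshold)
    d2 = _changed(curr, nxt, threshold)
    return [[p and q for p, q in zip(r1, r2)] for r1, r2 in zip(d1, d2)]


# A bright one-pixel "object" moves one column to the right in each frame.
f0 = [[200, 0, 0, 0]]
f1 = [[0, 200, 0, 0]]
f2 = [[0, 0, 200, 0]]
mask3 = three_frame_difference(f0, f1, f2)
```

With this input, only the object's position in the middle frame is flagged, because that is the only pixel that changed in both pairwise comparisons.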

References

[1] Chaquet, Jose M.; Carmona, Enrique J.; Fernández-Caballero, Antonio (June 2013). "A survey of video datasets for human action and activity recognition". Computer Vision and Image Understanding. 117 (6): 633–659. doi:10.1016/j.cviu.2013.01.013. hdl:10578/3697.
[2] J. S. Kulchandani and K. J. Dangarwala, "Moving object detection: Review of recent research trends," 2015 International Conference on Pervasive Computing (ICPC), Pune, 2015, pp. 1-5. doi:10.1109/PERVASIVE.2015.7087138.
[3] Weiming Hu, Tieniu Tan, Liang Wang, and Steve Maybank, "A Survey on Visual Surveillance of Object Motion and Behaviors," IEEE Trans. on Systems, Man, and Cybernetics—Part C: Applications and Reviews, vol. 34, no. 3, pp. 334-352, August 2004.
[4] Bahadir Karasulu and Serdar Korukoglu (2013). Performance Evaluation Software: Moving Object Detection and Tracking in Videos.
[5] R. Jain and H. Nagel, "On the Accumulative Difference Pictures for the Analysis of Real World Scene Sequences," IEEE Trans. on Pattern Anal. Mach. Intell., pp. 206-221, 1979.
