Pedestrian detection

Pedestrian detection is an essential and significant task in any intelligent video surveillance system, as it provides fundamental information for the semantic understanding of video footage. It also has an obvious extension to automotive applications because of its potential to improve safety systems. Many car manufacturers (e.g., Volvo, Ford, GM, Nissan) offered pedestrian detection as an ADAS option as of 2017.

Challenges

Pedestrian detection is complicated by the large variability of human appearance: people wear clothing of many styles, adopt a wide range of poses and articulations, carry occluding accessories, and frequently occlude one another in crowded scenes. Detectors must also cope with changes in scale and illumination and with cluttered backgrounds.

Existing approaches

Despite these challenges, pedestrian detection remains an active research area in computer vision, and numerous approaches have been proposed.

Holistic detection

Detectors are trained to search for pedestrians in the video frame by scanning the whole frame. The detector "fires" if the image features inside the local search window meet certain criteria. Some methods employ global features such as edge templates, [1] while others use local features such as histogram of oriented gradients (HOG) descriptors. [2] The drawback of this approach is that its performance is easily degraded by background clutter and occlusion.
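
As a concrete illustration, here is a minimal sketch of this sliding-window scheme using OpenCV's bundled HOG descriptor and its pre-trained linear SVM people detector; the file name and parameter values are illustrative, not prescriptive:

    import cv2

    img = cv2.imread("frame.jpg")          # illustrative input frame

    # OpenCV ships a HOG descriptor with a pre-trained linear SVM for
    # upright people (64x128 detection window).
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

    # detectMultiScale slides the window over an image pyramid; each
    # returned box is a location where the classifier "fired".
    boxes, weights = hog.detectMultiScale(img, winStride=(8, 8),
                                          padding=(8, 8), scale=1.05)
    for (x, y, w, h) in boxes:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)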

Part-based detection

Pedestrians are modeled as collections of parts. Part hypotheses are first generated by learning local features, which include edgelet [3] and orientation features. [4] These part hypotheses are then joined to form the best assembly of pedestrian hypotheses. Although this approach is attractive, part detection is itself a difficult task. Implementations of this approach follow a standard procedure for processing the image data: first build a densely sampled image pyramid, compute features at each scale, perform classification at all possible locations, and finally apply non-maximum suppression to generate the final set of bounding boxes. [5]
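
A skeleton of that pipeline is sketched below under stated assumptions: `score_fn` is a hypothetical stand-in for a trained part-based classifier, and the window size, stride, and thresholds are illustrative:

    import cv2

    def detect(img, score_fn, win=(64, 128), step=8, scale=1.2, thresh=0.5):
        # Densely sampled image pyramid: classify every window at every
        # scale, then prune overlapping hits with non-maximum suppression.
        boxes, s = [], 1.0
        while img.shape[0] >= win[1] and img.shape[1] >= win[0]:
            for y in range(0, img.shape[0] - win[1] + 1, step):
                for x in range(0, img.shape[1] - win[0] + 1, step):
                    score = score_fn(img[y:y + win[1], x:x + win[0]])
                    if score > thresh:
                        boxes.append((x * s, y * s, win[0] * s, win[1] * s, score))
            img = cv2.resize(img, (int(img.shape[1] / scale),
                                   int(img.shape[0] / scale)))
            s *= scale  # remember how much this pyramid level was downscaled
        return non_max_suppression(boxes)

    def iou(a, b):
        # Intersection-over-union of two (x, y, w, h, score) boxes.
        ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = a[2] * a[3] + b[2] * b[3] - inter
        return inter / union if union else 0.0

    def non_max_suppression(boxes, iou_thresh=0.5):
        # Greedy NMS: keep the best-scoring box, discard its overlaps.
        kept = []
        for b in sorted(boxes, key=lambda b: b[4], reverse=True):
            if all(iou(b, k) < iou_thresh for k in kept):
                kept.append(b)
        return kept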

Patch-based detection

In 2005, Leibe et al. [6] proposed an approach combining detection and segmentation, called the Implicit Shape Model (ISM). A codebook of local appearance is learned during training. At detection time, extracted local features are matched against the codebook entries, and each match casts one vote for a pedestrian hypothesis. Final detection results are obtained by further refining those hypotheses. An advantage of this approach is that only a small number of training images is required.
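
A toy sketch of the voting step, under stated assumptions: `extract_features` and `codebook` are hypothetical stand-ins for the learned local-feature extractor and for the codebook of appearances with their offsets to the object centre:

    import numpy as np

    def ism_votes(image, extract_features, codebook, match_thresh=0.7):
        # Each local feature is matched against the learned codebook;
        # every match casts one vote for the object centre it predicts.
        votes = []
        for desc, (x, y) in extract_features(image):
            for entry in codebook:                 # entries: descriptor + offset
                if np.dot(desc, entry["descriptor"]) > match_thresh:
                    dx, dy = entry["offset"]       # learned offset to centre
                    votes.append((x + dx, y + dy))
        # Dense clusters of votes are pedestrian hypotheses, to be
        # refined (e.g. by mean-shift) into final detections.
        return np.asarray(votes)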

Motion-based detection

When conditions permit (fixed camera, stationary lighting, etc.), background subtraction can help to detect pedestrians. Background subtraction classifies the pixels of video streams as either background, where no motion is detected, or foreground, where motion is detected. This procedure highlights the silhouettes (the connected components in the foreground) of every moving element in the scene, including people. An algorithm has been developed [7] [8] at the University of Liège to analyze the shape of these silhouettes in order to detect humans. Since methods that consider the silhouette as a whole and perform a single classification are, in general, highly sensitive to shape defects, a part-based method that splits the silhouettes into a set of smaller regions has been proposed to decrease the influence of defects. Unlike other part-based approaches, these regions do not have any anatomical meaning. This algorithm has been extended to the detection of humans in 3D video streams. [9]
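
A minimal sketch of the first stage only, using OpenCV's MOG2 background subtractor to produce a foreground mask and extract silhouettes as connected components; the file name and area threshold are illustrative, and the silhouette-shape analysis of [7] [8] would then run on each component:

    import cv2

    cap = cv2.VideoCapture("video.avi")            # illustrative file name
    backsub = cv2.createBackgroundSubtractorMOG2(detectShadows=True)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        fg = backsub.apply(frame)                  # per-pixel fg/bg labels
        # MOG2 marks shadows as 127; keep only confident foreground (255).
        fg = cv2.threshold(fg, 200, 255, cv2.THRESH_BINARY)[1]
        n, labels, stats, _ = cv2.connectedComponentsWithStats(fg)
        for i in range(1, n):                      # label 0 is the background
            x, y, w, h, area = stats[i]
            if area > 500:                         # ignore small noise blobs
                cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)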

Detection using multiple cameras

Fleuret et al. [10] suggested a method for integrating multiple calibrated cameras to detect multiple pedestrians. In this approach, the ground plane is partitioned into uniform, non-overlapping grid cells, typically 25 cm by 25 cm. The detector produces a probabilistic occupancy map (POM), which estimates the probability that each grid cell is occupied by a person. Given two to four synchronized video streams taken at eye level and from different angles, the method combines a generative model with dynamic programming to accurately follow up to six individuals across thousands of frames, in spite of significant occlusions and lighting changes. It can also derive metrically accurate trajectories for each of them.
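
For orientation, a toy sketch of the underlying data structure with illustrative dimensions; the probability assignment shown is a placeholder, not the generative model of Fleuret et al.:

    import numpy as np

    CELL = 0.25                               # cell size in metres
    EXTENT_X, EXTENT_Y = 20.0, 30.0           # monitored ground area (m)
    # One occupancy probability per non-overlapping ground-plane cell.
    pom = np.zeros((int(EXTENT_Y / CELL), int(EXTENT_X / CELL)))

    def cell_of(x_m, y_m):
        # Map a metric ground-plane position to its grid cell index.
        return int(y_m / CELL), int(x_m / CELL)

    # Evidence fused from the calibrated views would raise the occupancy
    # probability of, e.g., the cell at ground position (4.1 m, 7.3 m):
    pom[cell_of(4.1, 7.3)] = 0.9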

Related Research Articles

Boosting (machine learning)

In machine learning, boosting is an ensemble meta-algorithm used primarily to reduce bias, and also variance, in supervised learning, and a family of machine learning algorithms that convert weak learners into strong ones. Boosting is based on the question posed by Kearns and Valiant: "Can a set of weak learners create a single strong learner?" A weak learner is defined as a classifier that is only slightly correlated with the true classification. In contrast, a strong learner is a classifier that is arbitrarily well correlated with the true classification.
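
A minimal illustration with scikit-learn, whose AdaBoost implementation uses depth-1 decision trees ("stumps") as the default weak learner; the synthetic dataset is purely for demonstration:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier

    X, y = make_classification(n_samples=1000, random_state=0)
    # Each boosting round re-weights the training data so that the next
    # stump concentrates on previously misclassified examples.
    clf = AdaBoostClassifier(n_estimators=100).fit(X, y)
    print(clf.score(X, y))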

Canny edge detector

The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images. It was developed by John F. Canny in 1986. Canny also produced a computational theory of edge detection explaining why the technique works.
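
A one-call example with OpenCV; the two numbers are the hysteresis thresholds of the multi-stage algorithm, and the image path is illustrative:

    import cv2

    gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 100, 200)   # low/high hysteresis thresholds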

Image segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.
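
A short OpenCV sketch of detecting and matching SIFT features between two images (paths are illustrative), with Lowe's ratio test to discard ambiguous matches:

    import cv2

    img1 = cv2.imread("a.jpg", cv2.IMREAD_GRAYSCALE)
    img2 = cv2.imread("b.jpg", cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    # Keep a match only if its best candidate is clearly better than
    # the second-best one (Lowe's ratio test).
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]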

Feature (computer vision)

In computer vision and image processing, a feature is a piece of information about the content of an image; typically about whether a certain region of the image has certain properties. Features may be specific structures in the image such as points, edges or objects. Features may also be the result of a general neighborhood operation or feature detection applied to the image. Other examples of features are related to motion in image sequences, or to shapes defined in terms of curves or boundaries between different image regions.

The outline of computer vision provides an overview of, and topical guide to, the field of computer vision.

Structure from motion (SfM) is a photogrammetric range imaging technique for estimating three-dimensional structures from two-dimensional image sequences that may be coupled with local motion signals. It is studied in the fields of computer vision and visual perception. In biological vision, SfM refers to the phenomenon by which humans can recover 3D structure from the projected 2D (retinal) motion field of a moving object or scene.

Takeo Kanade

Takeo Kanade is a Japanese computer scientist and one of the world's foremost researchers in computer vision. He is the U.A. and Helen Whitaker Professor at the Carnegie Mellon School of Computer Science. He has approximately 300 peer-reviewed academic publications and holds around 20 patents.

Articulated body pose estimation in computer vision is the study of algorithms and systems that recover the pose of an articulated body, which consists of joints and rigid parts, from image-based observations. It is one of the longest-lasting problems in computer vision because of the complexity of the models that relate observation with pose, and because of the variety of situations in which it would be useful.

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.
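
For illustration, scikit-image exposes HOG directly; the parameter values below mirror the common Dalal–Triggs settings (9 orientation bins, 8×8-pixel cells, overlapping 2×2-cell blocks), and the built-in test image stands in for a real detection window:

    from skimage import color, data
    from skimage.feature import hog

    img = color.rgb2gray(data.astronaut())   # built-in test image
    # Gradient orientations are histogrammed per cell, contrast-normalised
    # over overlapping blocks, and concatenated into one descriptor.
    features = hog(img, orientations=9, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2), block_norm="L2-Hys")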

In computer vision, the bag-of-words model, sometimes called the bag-of-visual-words model, can be applied to image classification or retrieval by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words; that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.
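
A hedged sketch of building such a vector: cluster a pool of local descriptors into a visual vocabulary, then count nearest-word assignments per image. `all_descriptors` and `image_descriptors` are hypothetical placeholders for output from a local feature extractor such as SIFT or ORB:

    import numpy as np
    from sklearn.cluster import KMeans

    k = 200                                      # vocabulary size
    # `all_descriptors`: stacked local descriptors from the training set.
    vocab = KMeans(n_clusters=k).fit(all_descriptors)

    def bow_histogram(image_descriptors):
        # Assign each descriptor to its nearest visual word and count.
        words = vocab.predict(image_descriptors)
        hist = np.bincount(words, minlength=k).astype(float)
        return hist / hist.sum()                 # normalised word counts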

Object recognition – technology in the field of computer vision for finding and identifying objects in an image or video sequence. Humans recognize a multitude of objects in images with little effort, even though the appearance of an object may vary with viewpoint, size, and scale, or when the object is translated or rotated. Objects can even be recognized when they are partially obstructed from view. This task is still a challenge for computer vision systems. Many approaches to the task have been implemented over multiple decades.

Part-based models refer to a broad class of detection algorithms used on images, in which various parts of the image are used separately to determine whether and where an object of interest exists. Among these methods, a very popular one is the constellation model, which refers to schemes that seek to detect a small number of features and their relative positions in order to determine whether or not the object of interest is present.

In computer vision, 3D object recognition involves recognizing and determining 3D information, such as the pose, volume, or shape, of user-chosen 3D objects in a photograph or range scan. Typically, an example of the object to be recognized is presented to a vision system in a controlled environment, and then for an arbitrary input such as a video stream, the system locates the previously presented object. This can be done either off-line, or in real-time. The algorithms for solving this problem are specialized for locating a single pre-identified object, and can be contrasted with algorithms which operate on general classes of objects, such as face recognition systems or 3D generic object recognition. Due to the low cost and ease of acquiring photographs, a significant amount of research has been devoted to 3D object recognition in photographs.

Object detection

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class in digital images and videos. Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.

Local binary patterns (LBP) is a type of visual descriptor used for classification in computer vision. LBP is a particular case of the Texture Spectrum model proposed in 1990 and was first described in 1994. It has since been found to be a powerful feature for texture classification; it has further been determined that when LBP is combined with the histogram of oriented gradients (HOG) descriptor, detection performance improves considerably on some datasets. A comparison of several improvements of the original LBP in the field of background subtraction was made in 2015 by Silva et al., and a full survey of the different versions of LBP can be found in Bouwmans et al.
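
A minimal scikit-image example computing uniform LBP codes over a built-in test image:

    from skimage import data
    from skimage.feature import local_binary_pattern

    img = data.camera()                          # built-in grayscale image
    # Each pixel is coded by thresholding its P neighbours on a circle
    # of radius R against it; "uniform" merges rotation-equivalent codes.
    lbp = local_binary_pattern(img, P=8, R=1, method="uniform")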

Oriented FAST and rotated BRIEF (ORB) is a fast, robust local feature detector, first presented by Ethan Rublee et al. in 2011, that can be used in computer vision tasks like object recognition or 3D reconstruction. It is based on the FAST keypoint detector and a modified version of the visual descriptor BRIEF (Binary Robust Independent Elementary Features). Its aim is to provide a fast and efficient alternative to SIFT.
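
A minimal OpenCV example (the image path is illustrative); the resulting binary descriptors are typically matched with Hamming distance:

    import cv2

    gray = cv2.imread("image.jpg", cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=500)
    # FAST keypoints with orientation, described by rotated BRIEF
    # binary strings.
    keypoints, descriptors = orb.detectAndCompute(gray, None)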

Saliency map

In computer vision, a saliency map is an image that highlights the regions on which people's eyes focus first. The goal of a saliency map is to reflect the degree of importance of a pixel to the human visual system. For example, in a photograph of a fort under light clouds, a viewer looks first at the fort and the clouds, so those regions should be highlighted in the saliency map. Saliency maps engineered in artificial or computer vision are typically not the same as the actual saliency map constructed by biological or natural vision.

Integral Channel Features (ICF), also known as ChnFtrs, is a method for object detection in computer vision. It uses integral images to extract features such as local sums, histograms, and Haar-like features from multiple registered image channels. This method was exploited extensively by Dollár et al. in their work on pedestrian detection, first described at BMVC in 2009.

Michael J. Black

Michael J. Black is an American-born computer scientist working in Tübingen, Germany. He is a founding director at the Max Planck Institute for Intelligent Systems where he leads the Perceiving Systems Department in research focused on computer vision, machine learning, and computer graphics. He is also an Honorary Professor at the University of Tübingen.

References

  1. C. Papageorgiou and T. Poggio, "A Trainable Pedestrian Detection System", International Journal of Computer Vision (IJCV), 1:15–33, 2000.
  2. N. Dalal and B. Triggs, "Histograms of Oriented Gradients for Human Detection", IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1:886–893, 2005.
  3. B. Wu and R. Nevatia, "Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors", IEEE International Conference on Computer Vision (ICCV), 1:90–97, 2005.
  4. K. Mikolajczyk, C. Schmid, and A. Zisserman, "Human Detection Based on a Probabilistic Assembly of Robust Part Detectors", European Conference on Computer Vision (ECCV), volume 3021/2004, pages 69–82, 2004.
  5. H. Cho, P. E. Rybski, A. Bar-Hillel, and W. Zhang, "Real-time Pedestrian Detection with Deformable Part Models".
  6. B. Leibe, E. Seemann, and B. Schiele, "Pedestrian Detection in Crowded Scenes", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1:878–885, 2005.
  7. O. Barnich, S. Jodogne, and M. Van Droogenbroeck, "Robust Analysis of Silhouettes by Morphological Size Distributions", Advanced Concepts for Intelligent Vision Systems (ACIVS), pages 734–745, 2006.
  8. S. Piérard, A. Lejeune, and M. Van Droogenbroeck, "A Probabilistic Pixel-based Approach to Detect Humans in Video Streams", IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 921–924, 2011.
  9. S. Piérard, A. Lejeune, and M. Van Droogenbroeck, "3D Information Is Valuable for the Detection of Humans in Video Streams", Proceedings of 3D Stereo MEDIA, pages 1–4, 2010.
  10. F. Fleuret, J. Berclaz, R. Lengagne, and P. Fua, "Multi-Camera People Tracking with a Probabilistic Occupancy Map", IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):267–282, February 2008.