Features from accelerated segment test

Last updated

Features from accelerated segment test (FAST) is a corner detection method, which could be used to extract feature points and later used to track and map objects in many computer vision tasks. The FAST corner detector was originally developed by Edward Rosten and Tom Drummond, and was published in 2006. [1] The most promising advantage of the FAST corner detector is its computational efficiency. Referring to its name, it is indeed faster than many other well-known feature extraction methods, such as difference of Gaussians (DoG) used by the SIFT, SUSAN and Harris detectors. Moreover, when machine learning techniques are applied, superior performance in terms of computation time and resources can be realised. The FAST corner detector is very suitable for real-time video processing application because of this high-speed performance.

Contents

Segment test detector

The pixels used by the FAST corner detector FAST Corner Detector.jpg
The pixels used by the FAST corner detector

FAST corner detector uses a circle of 16 pixels (a Bresenham circle of radius 3) to classify whether a candidate point p is actually a corner. Each pixel in the circle is labeled from integer number 1 to 16 clockwise. If a set of N contiguous pixels in the circle are all brighter than the intensity of candidate pixel p (denoted by Ip) plus a threshold value t or all darker than the intensity of candidate pixel p minus threshold value t, then p is classified as corner. The conditions can be written as:

So when either of the two conditions is met, candidate p can be classified as a corner. There is a tradeoff of choosing N, the number of contiguous pixels and the threshold value t. On one hand the number of detected corner points should not be too many, on the other hand, the high performance should not be achieved by sacrificing computational efficiency. Without the improvement of machine learning, N is usually chosen as 12. A high-speed test method could be applied to exclude non-corner points.

High-speed test

The high-speed test for rejecting non-corner points is operated by examining 4 example pixels, namely pixel 1, 9, 5 and 13. Because there should be at least 12 contiguous pixels that are whether all brighter or darker than the candidate corner, so there should be at least 3 pixels out of these 4 example pixels that are all brighter or darker than the candidate corner. Firstly pixels 1 and 9 are examined, if both I1 and I9 are within [Ip - t, Ip + t], then candidate p is not a corner. Otherwise pixels 5 and 13 are further examined to check whether three of them are brighter than Ip + t or darker than Ip - t. If there exists 3 of them that are either brighter or darker, the rest pixels are then examined for final conclusion. And according to the inventor in his first paper, [2] on average 3.8 pixels are needed to check for candidate corner pixel. Compared with 8.5 pixels for each candidate corner, 3.8 is really a great reduction which could highly improve the performance.

However, there are several weaknesses for this test method:

  1. The high-speed test cannot be generalized well for N < 12. If N < 12, it would be possible that a candidate p is a corner and only 2 out of 4 example test pixels are both brighter Ip + t or darker than Ip - t.
  2. The efficiency of the detector depends on the choice and ordering of these selected test pixels. However it is unlikely that the chosen pixels are optimal which take concerns about the distribution of corner appearances.
  3. Multiple features are detected adjacent to one another

Improvement with machine learning

In order to address the first two weakness points of high-speed test, a machine learning approach is introduced to help improve the detecting algorithm. This machine learning approach operates in two stages. Firstly, corner detection with a given N is processed on a set of training images which are preferable from the target application domain. Corners are detected through the simplest implementation which literally extracts a ring of 16 pixels and compares the intensity values with an appropriate threshold.

For candidate p, each location on the circle x ∈ {1, 2, 3, ..., 16} can be denoted by p→x. The state of each pixel, Sp→x must be in one of the following three states:

Then choosing an x (same for all p) partitions P (the set of all pixels of all training images) into 3 different subsets, Pd, Ps, Pb where:

Secondly, a decision tree algorithm, the ID3 algorithm is applied to the 16 locations in order to achieve the maximum information gain. Let Kp be a boolean variable which indicates whether p is a corner, then the entropy of Kp is used to measure the information of p being a corner. For a set of pixels Q, the total entropy of KQ (not normalized) is:

The information gain can then be represented as:

A recursive process is applied to each subsets in order to select each x that could maximize the information gain. For example, at first an x is selected to partition P into Pd, Ps, Pb with the most information; then for each subset Pd, Ps, Pb, another y is selected to yield the most information gain (notice that the y could be the same as x ). This recursive process ends when the entropy is zero so that either all pixels in that subset are corners or non-corners.

This generated decision tree can then be converted into programming code, such as C and C++, which is just a bunch of nested if-else statements. For optimization purpose, profile-guided optimization is used to compile the code. The complied code is used as corner detector later for other images.

Notice that the corners detected using this decision tree algorithm should be slightly different from the results using segment test detector. This is because that decision tree model depends on the training data, which could not cover all possible corners.

Non-maximum suppression

"Since the segment test does not compute a corner response function, non-maximum suppression can not be applied directly to the resulting features." However, if N is fixed, for each pixel p the corner strength is defined as the maximum value of t that makes p a corner. Two approaches therefore could be used:

FAST-ER: Enhanced repeatability

FAST-ER detector is an improvement of the FAST detector using a metaheuristic algorithm, in this case simulated annealing. So that after the optimization, the structure of the decision tree would be optimized and suitable for points with high repeatability. However, since simulated annealing is a metaheurisic algorithm, each time the algorithm would generate a different optimized decision tree. So it is better to take efficiently large amount of iterations to find a solution that is close to the real optimal. According to Rosten, it takes about 200 hours on a Pentium 4 at 3 GHz which is 100 repeats of 100,000 iterations to optimize the FAST detector.

Comparison with other detectors

In Rosten's research, [3] FAST and FAST-ER detector are evaluated on several different datasets and compared with the DoG, Harris, Harris-Laplace, Shi-Tomasi, and SUSAN corner detectors.

The parameter settings for the detectors (other than FAST) are as follows:

DetectorParameter SettingValue
DoG
Scales per octave3
Initial blur σ0.8
Octaves4
SUSANDistance threshold4.0
Harris, Shi-TomasiBlur σ2.5
Harris-LaplaceInitial blur σ0.8
Harris blur3
Octaves4
Scales per octave10
General parametersε5 pixels
DetectorA
FAST-ER1313.6
FAST-91304.57
DOG1275.59
Shi & Tomasi1219.08
Harris1195.2
Harris-Laplace1153.13
FAST-121121.53
SUSAN1116.79
Random271.73
DetectorTraining set pixel rateTest set pixel rate
FAST n=9188179
FAST n=12158154
Original FAST n=127982.2
FAST-ER75.467.5
SUSAN12.313.6
Harris8.057.90
Shi-Tomasi6.506.50
DoG4.725.10

Related Research Articles

Edge detection includes a variety of mathematical methods that aim at identifying edges, curves in a digital image at which the image brightness changes sharply or, more formally, has discontinuities. The same problem of finding discontinuities in one-dimensional signals is known as step detection and the problem of finding signal discontinuities over time is known as change detection. Edge detection is a fundamental tool in image processing, machine vision and computer vision, particularly in the areas of feature detection and feature extraction.

The Hough transform is a feature extraction technique used in image analysis, computer vision, and digital image processing. The purpose of the technique is to find imperfect instances of objects within a certain class of shapes by a voting procedure. This voting procedure is carried out in a parameter space, from which object candidates are obtained as local maxima in a so-called accumulator space that is explicitly constructed by the algorithm for computing the Hough transform.

<span class="mw-page-title-main">Canny edge detector</span> Image edge detection algorithm

The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images. It was developed by John F. Canny in 1986. Canny also produced a computational theory of edge detection explaining why the technique works.

<span class="mw-page-title-main">Image segmentation</span> Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.

Voice activity detection (VAD), also known as speech activity detection or speech detection, is the detection of the presence or absence of human speech, used in speech processing. The main uses of VAD are in speaker diarization, speech coding and speech recognition. It can facilitate speech processing, and can also be used to deactivate some processes during non-speech section of an audio session: it can avoid unnecessary coding/transmission of silence packets in Voice over Internet Protocol (VoIP) applications, saving on computation and on network bandwidth.

<span class="mw-page-title-main">ALICE experiment</span> Detector experiments at the Large Hadron Collider

In computer vision and image processing, a feature is a piece of information about the content of an image; typically about whether a certain region of the image has certain properties. Features may be specific structures in the image such as points, edges or objects. Features may also be the result of a general neighborhood operation or feature detection applied to the image. Other examples of features are related to motion in image sequences, or to shapes defined in terms of curves or boundaries between different image regions.

<span class="mw-page-title-main">Corner detection</span>

Corner detection is an approach used within computer vision systems to extract certain kinds of features and infer the contents of an image. Corner detection is frequently used in motion detection, image registration, video tracking, image mosaicing, panorama stitching, 3D reconstruction and object recognition. Corner detection overlaps with the topic of interest point detection.

The Kadir–Brady saliency detector extracts features of objects in images that are distinct and representative. It was invented by Timor Kadir and J. Michael Brady in 2001 and an affine invariant version was introduced by Kadir and Brady in 2004 and a robust version was designed by Shao et al. in 2007.

The histogram of oriented gradients (HOG) is a feature descriptor used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientation in localized portions of an image. This method is similar to that of edge orientation histograms, scale-invariant feature transform descriptors, and shape contexts, but differs in that it is computed on a dense grid of uniformly spaced cells and uses overlapping local contrast normalization for improved accuracy.

In the fields of computer vision and image analysis, the Harris affine region detector belongs to the category of feature detection. Feature detection is a preprocessing step of several algorithms that rely on identifying characteristic points or interest points so to make correspondences between images, recognize textures, categorize objects or build panoramas.

In computer vision, maximally stable extremal regions (MSER) are used as a method of blob detection in images. This technique was proposed by Matas et al. to find correspondences between image elements from two images with different viewpoints. This method of extracting a comprehensive number of corresponding image elements contributes to the wide-baseline matching, and it has led to better stereo matching and object recognition algorithms.

The Viola–Jones object detection framework is a machine learning object detection framework proposed in 2001 by Paul Viola and Michael Jones. It was motivated primarily by the problem of face detection, although it can be adapted to the detection of other object classes.

One-shot learning is an object categorization problem, found mostly in computer vision. Whereas most machine learning-based object categorization algorithms require training on hundreds or thousands of examples, one-shot learning aims to classify objects from one, or only a few, examples. The term few-shot learning is also used for these problems, especially when more than one example is needed.

Cascading is a particular case of ensemble learning based on the concatenation of several classifiers, using all information collected from the output from a given classifier as additional information for the next classifier in the cascade. Unlike voting or stacking ensembles, which are multiexpert systems, cascading is a multistage one.

Geometric feature learning is a technique combining machine learning and computer vision to solve visual tasks. The main goal of this method is to find a set of representative features of geometric form to represent an object by collecting geometric features from images and learning them using efficient machine learning methods. Humans solve visual tasks and can give fast response to the environment by extracting perceptual information from what they see. Researchers simulate humans' ability of recognizing objects to solve computer vision problems. For example, M. Mata et al.(2002) applied feature learning techniques to the mobile robot navigation tasks in order to avoid obstacles. They used genetic algorithms for learning features and recognizing objects (figures). Geometric feature learning methods can not only solve recognition problems but also predict subsequent actions by analyzing a set of sequential input sensory images, usually some extracting features of images. Through learning, some hypothesis of the next action are given and according to the probability of each hypothesis give a most probable action. This technique is widely used in the area of artificial intelligence.

<span class="mw-page-title-main">Foreground detection</span>

Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.

Chessboards arise frequently in computer vision theory and practice because their highly structured geometry is well-suited for algorithmic detection and processing. The appearance of chessboards in computer vision can be divided into two main areas: camera calibration and feature extraction. This article provides a unified discussion of the role that chessboards play in the canonical methods from these two areas, including references to the seminal literature, examples, and pointers to software implementations.

The Harris corner detector is a corner detection operator that is commonly used in computer vision algorithms to extract corners and infer features of an image. It was first introduced by Chris Harris and Mike Stephens in 1988 upon the improvement of Moravec's corner detector. Compared to its predecessor, Harris' corner detector takes the differential of the corner score into account with reference to direction directly, instead of using shifting patches for every 45 degree angles, and has been proved to be more accurate in distinguishing between edges and corners. Since then, it has been improved and adopted in many algorithms to preprocess images for subsequent applications.

References

  1. Rosten, Edward; Drummond, Tom (2006). "Machine Learning for High-speed Corner Detection". Computer Vision – ECCV 2006. Lecture Notes in Computer Science. Vol. 3951. pp. 430–443. doi:10.1007/11744023_34. ISBN   978-3-540-33832-1. S2CID   1388140.
  2. Edward Rosten, Real-time Video Annotations for Augmented Reality
  3. Edward Rosten, FASTER and better: A machine learning approach to corner detection

Bibliography