Region Based Convolutional Neural Networks

Figure: R-CNN architecture

Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, and specifically for object detection and localization. [1] The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object together with its category (e.g. car or pedestrian). In general, R-CNN architectures perform selective search [2] over feature maps output by a CNN.


R-CNN has been extended to perform other computer vision tasks, such as tracking objects from a drone-mounted camera, [3] locating text in an image, [4] and enabling object detection in Google Lens. [5]

Mask R-CNN is also one of seven tasks in the MLPerf Training Benchmark, which is a competition to speed up the training of neural networks. [6]

History

The following covers some of the versions of R-CNN that have been developed.

- November 2013: R-CNN. [7]
- April 2015: Fast R-CNN. [8]
- June 2015: Faster R-CNN. [9]
- March 2017: Mask R-CNN. [10]
- June 2019: Mesh R-CNN, which adds the ability to generate a 3D mesh from a 2D image. [11]

Architecture

For review articles, see [1] and [12].

Given an image (or an image-like feature map), selective search (also called Hierarchical Grouping) first segments the image using the graph-based method of Felzenszwalb and Huttenlocher (2004), [13] then performs the following: [2]

Input: (colour) image
Output: set of object location hypotheses L

Segment image into initial regions R = {r₁, ..., rₙ} using Felzenszwalb and Huttenlocher (2004)
Initialise similarity set S = ∅
foreach neighbouring region pair (rᵢ, rⱼ) do
    Calculate similarity s(rᵢ, rⱼ)
    S = S ∪ s(rᵢ, rⱼ)
while S ≠ ∅ do
    Get highest similarity s(rᵢ, rⱼ) = max(S)
    Merge corresponding regions rₜ = rᵢ ∪ rⱼ
    Remove similarities regarding rᵢ: S = S \ s(rᵢ, r∗)
    Remove similarities regarding rⱼ: S = S \ s(r∗, rⱼ)
    Calculate similarity set Sₜ between rₜ and its neighbours
    S = S ∪ Sₜ
    R = R ∪ rₜ
Extract object location boxes L from all regions in R
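A minimal, runnable Python sketch of this greedy merging loop is given below. It is illustrative only: the initial Felzenszwalb-Huttenlocher segmentation is taken as given, a region is simplified to a set of pixels, and the paper's colour/texture/size/fill similarity measure is replaced by a caller-supplied function.

def merge(a, b):
    # A region is modelled as a set of (x, y) pixels; merging is set union.
    return a | b

def bbox(region):
    # Tight bounding box (x0, y0, x1, y1) around a pixel set.
    xs = [x for x, _ in region]
    ys = [y for _, y in region]
    return (min(xs), min(ys), max(xs), max(ys))

def selective_search(regions, neighbours, sim):
    """regions: {id: pixel set}; neighbours: set of (i, j) pairs with i < j;
    sim(a, b): similarity of two pixel sets (higher merges sooner)."""
    live = set(regions)
    nbr = {i: set() for i in regions}
    S = {}
    for i, j in neighbours:
        nbr[i].add(j)
        nbr[j].add(i)
        S[(i, j)] = sim(regions[i], regions[j])
    while S:
        i, j = max(S, key=S.get)                    # most similar pair
        t = max(regions) + 1                        # id of the merged region
        regions[t] = merge(regions[i], regions[j])
        live -= {i, j}
        live.add(t)
        # Drop all similarities involving rᵢ or rⱼ.
        S = {p: v for p, v in S.items() if i not in p and j not in p}
        nbr[t] = (nbr[i] | nbr[j]) & live           # neighbours of the new region
        for k in nbr[t]:
            nbr[k].add(t)
            S[(min(k, t), max(k, t))] = sim(regions[k], regions[t])
    # Hypotheses come from every region ever created, not just the survivors.
    return [bbox(r) for r in regions.values()]

# Toy run: four 1-pixel regions on a 2x2 grid, merging smallest pairs first
# (a crude stand-in for the paper's size similarity).
regs = {1: {(0, 0)}, 2: {(1, 0)}, 3: {(0, 1)}, 4: {(1, 1)}}
adj = {(1, 2), (1, 3), (2, 4), (3, 4)}
print(selective_search(regs, adj, lambda a, b: -(len(a) + len(b))))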

R-CNN


Given an input image, R-CNN begins by applying selective search to extract regions of interest (ROI), where each ROI is a rectangle that may represent the boundary of an object in the image. Depending on the scenario, there may be as many as two thousand ROIs. Each ROI is then fed through a neural network to produce output features, and for each ROI's features an ensemble of support-vector machine classifiers is used to determine what type of object (if any) is contained within the ROI. [7]
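As a schematic illustration (not the original implementation), the inference pipeline can be sketched in a few lines of Python. Here cnn stands for any pretrained network mapping a warped crop to a feature vector, and warp is a crude nearest-neighbour stand-in for the paper's anisotropic warp:

import numpy as np

def warp(crop, size=227):
    # Nearest-neighbour stand-in for the anisotropic warp to the CNN's fixed
    # input size (227x227 for the AlexNet used in the original R-CNN).
    h, w = crop.shape[:2]
    return crop[np.arange(size) * h // size][:, np.arange(size) * w // size]

def rcnn_detect(image, proposals, cnn, svm_w, svm_b):
    """image: (H, W, 3) array; proposals: (x0, y0, x1, y1) boxes from
    selective search; cnn: callable from a warped crop to a feature vector;
    svm_w, svm_b: (num_classes, feat_dim) weights and (num_classes,) biases,
    i.e. one linear SVM per object class."""
    detections = []
    for (x0, y0, x1, y1) in proposals:
        feat = cnn(warp(image[y0:y1, x0:x1]))  # CNN features for this ROI
        scores = svm_w @ feat + svm_b          # per-class SVM scores
        detections.append(((x0, y0, x1, y1), scores))
    return detections  # non-maximum suppression would follow in practice

Because each of the up-to-two-thousand crops passes through the CNN separately, this per-ROI loop is the main computational bottleneck that Fast R-CNN was designed to remove.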

Fast R-CNN

Figure: Fast R-CNN

While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image. [8]

Figure: RoI pooling to size 2×2. In this example, the region proposal (an input parameter) has size 7×5.

At the end of the network is a ROIPooling module, which slices each ROI out of the network's output tensor and reshapes it to a fixed size so that it can be classified. As in the original R-CNN, Fast R-CNN uses selective search to generate its region proposals.
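The fixed-size output is what lets arbitrarily shaped ROIs feed a fully connected classifier. A minimal NumPy sketch of RoI max pooling over a single feature-map channel follows; the bin-edge rounding is one plausible choice, as real implementations differ in exactly how they quantise:

import numpy as np

def roi_max_pool(fmap, roi, out_h, out_w):
    """fmap: (H, W) feature-map channel; roi: (y0, x0, y1, x1) in integer
    feature-map coordinates; returns a fixed (out_h, out_w) array."""
    y0, x0, y1, x1 = roi
    ys = np.linspace(y0, y1, out_h + 1).astype(int)  # bin edges along y
    xs = np.linspace(x0, x1, out_w + 1).astype(int)  # bin edges along x
    out = np.empty((out_h, out_w), dtype=fmap.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each output cell is the max over its (roughly equal) bin.
            out[i, j] = fmap[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

# The figure's example: a 7x5 region proposal pooled to a 2x2 output.
fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_max_pool(fmap, (0, 0, 7, 5), 2, 2))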

Faster R-CNN

Figure: Faster R-CNN

While Fast R-CNN used selective search to generate ROIs, Faster R-CNN integrates ROI generation into the neural network itself via a region proposal network. [9]
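For a sense of the end-to-end behaviour, here is a short usage sketch of the Faster R-CNN reference model shipped with torchvision (assuming torchvision 0.13 or newer; this is a standard re-implementation, not the authors' original code). Proposal generation and detection happen in a single forward pass:

import torch
import torchvision

# Faster R-CNN with a ResNet-50 FPN backbone and pretrained COCO weights.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)   # dummy RGB image with values in [0, 1]
with torch.no_grad():
    (pred,) = model([image])      # one dict per input image
print(pred["boxes"].shape, pred["labels"].shape, pred["scores"].shape)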

Mask R-CNN

Figure: Mask R-CNN

While previous versions of R-CNN focused on object detection, Mask R-CNN adds instance segmentation. Mask R-CNN also replaced ROIPooling with a new method called ROIAlign, which avoids quantising ROI boundaries by sampling the feature map at fractional pixel coordinates using bilinear interpolation. [10]
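The following NumPy sketch isolates the bilinear sampling at the heart of ROIAlign, simplified to one sample per output bin taken at the bin centre (production implementations, such as torchvision.ops.roi_align, average several samples per bin):

import numpy as np

def bilinear(fmap, y, x):
    # Sample fmap (H, W) at a fractional (y, x) location by interpolating
    # between the four surrounding pixels.
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, fmap.shape[0] - 1)
    x1 = min(x0 + 1, fmap.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * fmap[y0, x0] + (1 - dy) * dx * fmap[y0, x1]
            + dy * (1 - dx) * fmap[y1, x0] + dy * dx * fmap[y1, x1])

def roi_align(fmap, roi, out_h, out_w):
    # roi = (y0, x0, y1, x1) may have fractional coordinates: nothing is
    # snapped to the integer grid, unlike ROIPooling.
    y0, x0, y1, x1 = roi
    bh, bw = (y1 - y0) / out_h, (x1 - x0) / out_w
    return np.array([[bilinear(fmap, y0 + (i + 0.5) * bh, x0 + (j + 0.5) * bw)
                      for j in range(out_w)] for i in range(out_h)])

fmap = np.arange(64, dtype=float).reshape(8, 8)
print(roi_align(fmap, (0.4, 1.2, 5.4, 4.2), 2, 2))  # a fractional ROI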


References

  1. Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "14.8. Region-based CNNs (R-CNNs)". Dive into Deep Learning. Cambridge University Press. ISBN 978-1-009-38943-3.
  2. Uijlings, J. R. R.; van de Sande, K. E. A.; Gevers, T.; Smeulders, A. W. M. (2013-09-01). "Selective Search for Object Recognition". International Journal of Computer Vision. 104 (2): 154–171. doi:10.1007/s11263-013-0620-5. ISSN 1573-1405.
  3. Nene, Vidi (Aug 2, 2019). "Deep Learning-Based Real-Time Multiple-Object Detection and Tracking via Drone". Drone Below. Retrieved Mar 28, 2020.
  4. Ray, Tiernan (Sep 11, 2018). "Facebook pumps up character recognition to mine memes". ZDNET. Retrieved Mar 28, 2020.
  5. Sagar, Ram (Sep 9, 2019). "These machine learning methods make Google Lens a success". Analytics India. Retrieved Mar 28, 2020.
  6. Mattson, Peter; et al. (2019). "MLPerf Training Benchmark". arXiv:1910.01500v3 [cs.LG].
  7. Girshick, Ross; Donahue, Jeff; Darrell, Trevor; Malik, Jitendra (2016-01-01). "Region-Based Convolutional Networks for Accurate Object Detection and Segmentation". IEEE Transactions on Pattern Analysis and Machine Intelligence. 38 (1): 142–158. doi:10.1109/TPAMI.2015.2437384. ISSN 0162-8828.
  8. Girshick, Ross (7–13 December 2015). "Fast R-CNN". 2015 IEEE International Conference on Computer Vision (ICCV): 1440–1448. doi:10.1109/ICCV.2015.169. ISBN 978-1-4673-8391-2.
  9. Ren, Shaoqing; He, Kaiming; Girshick, Ross; Sun, Jian (2017-06-01). "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". IEEE Transactions on Pattern Analysis and Machine Intelligence. 39 (6): 1137–1149. arXiv:1506.01497. doi:10.1109/TPAMI.2016.2577031. ISSN 0162-8828.
  10. He, Kaiming; Gkioxari, Georgia; Dollár, Piotr; Girshick, Ross (October 2017). "Mask R-CNN". 2017 IEEE International Conference on Computer Vision (ICCV): 2980–2988. doi:10.1109/ICCV.2017.322. ISBN 978-1-5386-1032-9.
  11. Gkioxari, Georgia; Malik, Jitendra; Johnson, Justin (2019). "Mesh R-CNN". 2019 IEEE/CVF International Conference on Computer Vision (ICCV): 9785–9795.
  12. Weng, Lilian (December 31, 2017). "Object Detection for Dummies Part 3: R-CNN Family". Lil'Log. Retrieved March 12, 2020.
  13. Felzenszwalb, Pedro F.; Huttenlocher, Daniel P. (2004-09-01). "Efficient Graph-Based Image Segmentation". International Journal of Computer Vision. 59 (2): 167–181. doi:10.1023/B:VISI.0000022288.19776.77. ISSN 1573-1405.
