Region-based Convolutional Neural Networks (R-CNN) are a family of machine learning models for computer vision, and specifically for object detection and localization. [1] The original goal of R-CNN was to take an input image and produce a set of bounding boxes as output, where each bounding box contains an object together with its category (e.g. car or pedestrian). In general, R-CNN architectures combine region proposals, originally generated by selective search, [2] with features computed by a convolutional neural network (CNN).
R-CNN has been extended to perform other computer vision tasks, such as: tracking objects from a drone-mounted camera, [3] locating text in an image, [4] and enabling object detection in Google Lens. [5]
Mask R-CNN is also one of seven tasks in the MLPerf Training Benchmark, which is a competition to speed up the training of neural networks. [6]
The following sections cover some of the versions of R-CNN that have been developed.
For review articles, see [1] and [12].
Given an image (or an image-like feature map), selective search (also called Hierarchical Grouping) first segments the image using the graph-based algorithm of Felzenszwalb and Huttenlocher (2004), [13] and then performs the following: [2]
Input: (colour) image
Output: Set of object location hypotheses L

Segment image into initial regions R = {r₁, ..., rₙ} using Felzenszwalb and Huttenlocher (2004)
Initialise similarity set S = ∅
foreach neighbouring region pair (rᵢ, rⱼ) do
    Calculate similarity s(rᵢ, rⱼ)
    S = S ∪ s(rᵢ, rⱼ)
while S ≠ ∅ do
    Get highest similarity s(rᵢ, rⱼ) = max(S)
    Merge corresponding regions rₜ = rᵢ ∪ rⱼ
    Remove similarities regarding rᵢ: S = S \ s(rᵢ, r∗)
    Remove similarities regarding rⱼ: S = S \ s(r∗, rⱼ)
    Calculate similarity set Sₜ between rₜ and its neighbours
    S = S ∪ Sₜ
    R = R ∪ rₜ
Extract object location boxes L from all regions in R
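The greedy merging loop can be sketched in Python as follows. This is only an illustrative sketch, not the reference implementation: initial_regions, similarity, and the Region methods merge, is_neighbour, and bounding_box are hypothetical stand-ins for the Felzenszwalb-Huttenlocher segmentation, the paper's colour/texture/size/fill similarity measure, and region bookkeeping.

def selective_search(image, initial_regions, similarity):
    R = list(initial_regions(image))        # every region ever created (the output pool)
    active = set(range(len(R)))             # regions present in the current segmentation
    S = {(i, j): similarity(R[i], R[j])     # similarities of neighbouring pairs
         for i in active for j in active
         if i < j and R[i].is_neighbour(R[j])}

    while S:
        i, j = max(S, key=S.get)            # most similar neighbouring pair
        t = len(R)
        R.append(R[i].merge(R[j]))          # r_t = r_i ∪ r_j
        active -= {i, j}
        # remove every similarity involving r_i or r_j
        S = {pair: s for pair, s in S.items() if i not in pair and j not in pair}
        # add similarities between the new region and its neighbours
        for k in active:
            if R[t].is_neighbour(R[k]):
                S[(k, t)] = similarity(R[k], R[t])
        active.add(t)

    # object location hypotheses L: one box per region ever created
    return [r.bounding_box() for r in R]

In practice, ready-made implementations exist; for example, OpenCV's contrib package exposes one through cv2.ximgproc.segmentation.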
Given an input image, R-CNN begins by applying selective search to extract regions of interest (ROI), where each ROI is a rectangle that may represent the boundary of an object in the image. Depending on the scenario, there may be as many as two thousand ROIs. After that, each ROI is fed through a neural network to produce output features. For each ROI's output features, an ensemble of support-vector machine classifiers is used to determine what type of object (if any) is contained within the ROI. [7]
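As an illustration of this per-ROI pipeline, the following sketch extracts CNN features for each proposal and leaves classification to per-class linear SVMs. It is a simplified approximation rather than the original implementation: the pretrained AlexNet backbone from torchvision and scikit-learn's LinearSVC stand in for the paper's fine-tuned network and SVM ensemble, and roi_features is a hypothetical helper.

# Simplified sketch of the R-CNN classification stage (not the authors' code):
# every proposal is cropped, warped to a fixed size, passed through a CNN,
# and its feature vector is then scored by one linear SVM per class.
import torch
import torchvision
from torchvision.transforms import functional as TF
from sklearn.svm import LinearSVC

cnn = torchvision.models.alexnet(weights="DEFAULT")
cnn.classifier = cnn.classifier[:-1]         # drop the last layer, keep 4096-d features
cnn.eval()

def roi_features(image, proposals):
    """image: 3xHxW float tensor in [0, 1]; proposals: integer (x1, y1, x2, y2) boxes."""
    feats = []
    for x1, y1, x2, y2 in proposals:
        crop = image[:, y1:y2, x1:x2]
        crop = TF.resize(crop, [224, 224])   # warp every ROI to the CNN's input size
        with torch.no_grad():
            feats.append(cnn(crop.unsqueeze(0)).squeeze(0))
    return torch.stack(feats).numpy()

# With labelled ROI features, one binary SVM per class would then be trained, e.g.
#   svms = {c: LinearSVC().fit(train_feats, train_labels == c) for c in classes}
# and at test time an ROI is assigned the class whose SVM scores highest,
# or "background" if every score falls below a threshold.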
While the original R-CNN independently computed the neural network features on each of as many as two thousand regions of interest, Fast R-CNN runs the neural network once on the whole image. [8]
At the end of the network is an ROIPooling module, which slices out each ROI from the network's output tensor, reshapes it to a fixed size, and classifies it. As in the original R-CNN, Fast R-CNN uses selective search to generate its region proposals.
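A minimal sketch of this shared-computation idea, using torchvision's roi_pool operator (the backbone, placeholder image, and example boxes below are assumptions for illustration, not part of Fast R-CNN itself):

# Sketch of Fast R-CNN's shared computation using torchvision's roi_pool operator.
import torch
import torchvision
from torchvision.ops import roi_pool

backbone = torchvision.models.resnet18(weights="DEFAULT")
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])  # keep only the conv layers
backbone.eval()

image = torch.rand(1, 3, 512, 512)            # placeholder input image
with torch.no_grad():
    feature_map = backbone(image)             # run the CNN once on the whole image

# Region proposals in image coordinates (x1, y1, x2, y2); in Fast R-CNN these
# still come from selective search.
boxes = [torch.tensor([[ 32.,  48., 256., 300.],
                       [100., 100., 400., 480.]])]

# Slice each ROI out of the shared feature map and pool it to a fixed 7x7 grid;
# spatial_scale converts image coordinates to feature-map coordinates.
pooled = roi_pool(feature_map, boxes, output_size=(7, 7),
                  spatial_scale=feature_map.shape[-1] / image.shape[-1])
print(pooled.shape)                           # (num_rois, channels, 7, 7)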
While Fast R-CNN used selective search to generate ROIs, Faster R-CNN integrates ROI generation into the neural network itself, using a dedicated region proposal network. [9]
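For illustration, torchvision ships a ready-made Faster R-CNN whose proposals come from the learned region proposal network; a minimal inference sketch (the random placeholder input is an assumption) might look like:

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)               # placeholder image tensor in [0, 1]
with torch.no_grad():
    (prediction,) = model([image])            # one dict per input image

# Each prediction holds detected boxes, class labels, and confidence scores.
print(prediction["boxes"].shape, prediction["labels"], prediction["scores"])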
While previous versions of R-CNN focused on object detection, Mask R-CNN adds instance segmentation. Mask R-CNN also replaced ROIPooling with a new method called ROIAlign, which can represent fractions of a pixel. [10]
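The difference is visible in torchvision's ops, where roi_align accepts fractional box coordinates and samples the feature map by bilinear interpolation, while roi_pool snaps them to integer bins. The following sketch (with a placeholder feature map and an arbitrary fractional box, both assumptions for illustration) contrasts the two:

import torch
from torchvision.ops import roi_align, roi_pool

feature_map = torch.rand(1, 256, 32, 32)            # placeholder feature map
boxes = [torch.tensor([[3.7, 5.2, 20.3, 18.9]])]    # fractional ROI coordinates

aligned = roi_align(feature_map, boxes, output_size=(7, 7),
                    spatial_scale=1.0, sampling_ratio=2, aligned=True)
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0)
print(aligned.shape, pooled.shape)                  # both (1, 256, 7, 7)

torchvision also provides a full Mask R-CNN model (maskrcnn_resnet50_fpn) whose predictions include per-instance masks alongside boxes, labels, and scores.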