You Only Look Once

Original author(s): Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
Initial release: 2015
Written in: Python
Website: https://pjreddie.com/darknet/yolo/

You Only Look Once (YOLO) is a series of real-time object detection systems based on convolutional neural networks. First introduced by Joseph Redmon et al. in 2015,[1] YOLO has undergone several iterations and improvements, becoming one of the most popular object detection frameworks.[2]

The name "You Only Look Once" refers to the fact that the algorithm requires only a single forward propagation pass through the neural network to make its predictions, unlike earlier region proposal-based techniques such as R-CNN, which require thousands of network evaluations for a single image.

Overview

Unlike previous methods such as R-CNN and OverFeat,[3] which apply a model to an image at multiple locations and scales, YOLO applies a single neural network to the full image. The network divides the image into regions and predicts bounding boxes and class probabilities for each region; the bounding boxes are then weighted by the predicted probabilities.
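
This idea can be summarized in a few lines: the single forward pass yields a fixed grid of box predictions plus class probabilities, and each box's final score is its confidence weighted by a class probability. A minimal NumPy sketch, with entirely hypothetical shapes and values:

```python
import numpy as np

# Hypothetical single-pass output for S = 7, B = 2, C = 20 (YOLOv1-style):
# each of the S*S grid cells predicts B boxes (x, y, w, h, confidence)
# plus C conditional class probabilities.
S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)  # stand-in for a real network output

boxes = pred[..., : B * 5].reshape(S, S, B, 5)  # (x, y, w, h, conf) per box
class_probs = pred[..., B * 5:]                 # P(class | object) per cell

# Score each box/class pair: box confidence times conditional class probability.
scores = boxes[..., 4:5] * class_probs[:, :, None, :]  # shape (S, S, B, C)

# Threshold low-scoring boxes; non-maximum suppression would normally follow.
keep = scores > 0.5
```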

Versions

The YOLO series falls into two parts. The original part comprised YOLOv1, v2, and v3, all released on a website maintained by Joseph Redmon.[4]

YOLOv1

The original YOLO algorithm, introduced in 2015, divided the image into an S × S grid. If the center of an object fell into a grid cell, that cell was responsible for detecting the object. Each grid cell predicted B bounding boxes along with confidence scores for those boxes, reflecting both how confident the model is that a box contains an object and how accurate it believes the predicted box to be.
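
Responsibility assignment reduces to finding which cell the object's center falls in. A short illustrative sketch (not the original Darknet code), assuming centers are given in normalized image coordinates:

```python
def responsible_cell(cx, cy, S=7):
    """Return the (row, col) of the grid cell responsible for an object
    whose center lies at normalized image coordinates (cx, cy) in [0, 1)."""
    return int(cy * S), int(cx * S)

# An object centered at (0.52, 0.31) on a 7 x 7 grid falls to cell (2, 3).
print(responsible_cell(0.52, 0.31))  # -> (2, 3)
```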

YOLOv2

Released in 2016, YOLOv2 (also known as YOLO9000)[5][6] improved upon the original model by incorporating batch normalization, a higher-resolution classifier, and anchor boxes for predicting bounding boxes. It could detect over 9000 object categories. It was also released on GitHub under the Apache 2.0 license.[7]
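
With anchor boxes, the network predicts offsets relative to a prior box rather than raw coordinates. A sketch of this decoding, following the parameterization given in the YOLO9000 paper: a cell at offset (cx, cy) with an anchor prior of size (pw, ph) turns raw outputs (tx, ty, tw, th) into a box, all in units of grid cells:

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode YOLOv2-style raw network outputs into a box center and size,
    measured in grid-cell units."""
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))
    bx = cx + sigmoid(tx)   # sigmoid keeps the center inside its cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # width and height scale the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```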

YOLOv3

YOLOv3, introduced in 2018, contained only "incremental" improvements, including a more complex backbone network, detection at multiple scales, and a more sophisticated loss function.[8]
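
Detecting at multiple scales means the network emits boxes from feature maps of several resolutions: in the YOLOv3 paper's standard 416 × 416 configuration, predictions come from 13 × 13, 26 × 26, and 52 × 52 grids with three anchors each. A quick count of the boxes this produces:

```python
# For a 416 x 416 input, YOLOv3 detects on three grids with 3 anchors each.
grids = [13, 26, 52]
anchors_per_cell = 3
total_boxes = sum(s * s * anchors_per_cell for s in grids)
print(total_boxes)  # 10647 boxes per image, before thresholding and NMS
```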

YOLOv4 and beyond

Subsequent versions of YOLO (v4, v5, etc.) have been developed by different researchers, further improving performance and introducing new features. These versions are not officially associated with the original YOLO authors but build upon their work.[4] As of 2023, versions up to YOLOv8 had been released.[2]

References

  1. Redmon, Joseph; Divvala, Santosh; Girshick, Ross; Farhadi, Ali (2016). "You Only Look Once: Unified, Real-Time Object Detection". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 779–788.
  2. Terven, Juan; Córdova-Esparza, Diana-Margarita; Romero-González, Julio-Alejandro (2023-11-20). "A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS". Machine Learning and Knowledge Extraction. 5 (4): 1680–1716. doi:10.3390/make5040083. ISSN 2504-4990.
  3. Sermanet, Pierre; Eigen, David; Zhang, Xiang; Mathieu, Michael; Fergus, Rob; LeCun, Yann (2014-02-23). OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks. arXiv:1312.6229.
  4. "YOLO: Real-Time Object Detection". pjreddie.com. Retrieved 2024-09-12.
  5. Redmon, Joseph; Farhadi, Ali (2017). "YOLO9000: Better, Faster, Stronger". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 7263–7271.
  6. "YOLOv2: Real-Time Object Detection". pjreddie.com. Retrieved 2024-09-12.
  7. Rémy, Philippe (2024-09-05). philipperemy/yolo-9000. GitHub. Retrieved 2024-09-12.
  8. Redmon, Joseph; Farhadi, Ali (2018-04-08). YOLOv3: An Incremental Improvement. arXiv:1804.02767. Retrieved 2024-09-12.