Small object detection

Small object detection is a particular case of object detection where various techniques are employed to detect small objects in digital images and videos. "Small objects" are objects with a small pixel footprint in the input image. In areas such as aerial imagery, state-of-the-art object detection techniques often underperform because the objects of interest are small.

Uses

An example of object tracking

Small object detection has applications in various fields, such as video surveillance (traffic video surveillance, [1] [2] small object retrieval, [3] [4] and anomaly detection [5] ), maritime surveillance, drone surveying, traffic flow analysis, [6] and object tracking.

Problems with small objects

Shadow and drone movement effect

Because small objects occupy few pixels, factors such as shadows [14] and the movement of the capturing platform, for example a drone, [15] further degrade detection quality.

Methods

Various methods [16] are available to detect small objects; they fall into three categories:

YOLOv5 detection result
YOLOv5 and SAHI interface
YOLOv7 detection output

Improving existing techniques

There are various ways to detect small objects with existing techniques; some of them are described below.

Choosing a data set that has small objects

A machine learning model's output depends on how well it is trained. [17] The data set must therefore include small objects if the model is to detect them. Modern detectors such as YOLO also rely on anchors; [18] recent versions of YOLO (starting from YOLOv5 [19] ) use an auto-anchor algorithm to find good anchors based on the distribution of object sizes in the data set. For this to work, the data set must contain small objects.
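
As a quick check on a candidate data set, one can measure what fraction of its annotations count as "small" (the COCO benchmark, for example, defines small objects as those covering less than 32 × 32 pixels). A minimal sketch, with a hypothetical `small_object_fraction` helper and made-up box sizes:

```python
def small_object_fraction(boxes, threshold=32 * 32):
    """Fraction of boxes whose pixel area falls below `threshold`.

    `boxes` is a list of (width, height) tuples in pixels; the
    default 32x32 threshold follows the COCO small-object definition.
    """
    if not boxes:
        return 0.0
    small = sum(1 for w, h in boxes if w * h < threshold)
    return small / len(boxes)

# Hypothetical annotations: two small boxes, two larger ones.
boxes = [(12, 20), (30, 30), (64, 48), (100, 120)]
print(small_object_fraction(boxes))  # → 0.5
```

If the fraction is near zero, the auto-anchor step has nothing small to fit anchors to, and small-object performance will suffer regardless of the detector used.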

Generating more data via augmentation, if required

Deep learning models have a very large number of parameters whose values are set during training, so good training requires data of sufficient quantity and quality. [20] Data augmentation is a useful technique for generating more diverse data [17] from an existing data set.
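
As an illustration, two common augmentations, horizontal flipping and random cropping, can be sketched in a few lines of NumPy (the function names and the toy image are illustrative, not from any particular library):

```python
import numpy as np

def horizontal_flip(image):
    # Mirror the image along its width axis.
    return image[:, ::-1]

def random_crop(image, crop_h, crop_w, rng):
    # Cut a random crop_h x crop_w window out of the image.
    h, w = image.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
img = np.arange(6 * 8 * 3).reshape(6, 8, 3)  # toy 6x8 RGB image
print(horizontal_flip(img).shape)         # (6, 8, 3)
print(random_crop(img, 4, 4, rng).shape)  # (4, 4, 3)
```

In a real detection pipeline, the bounding-box annotations must be flipped, shifted, and clipped to match each augmented image.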

Increasing image capture resolution and model’s input resolution

Higher resolutions provide more pixels, and therefore more features, per object, so the model can learn from them more effectively. For example, a bike in a 1280 × 1280 image covers roughly four times as many pixels as the same bike in a 640 × 640 image.
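
The effect is easy to quantify: doubling the resolution quadruples an object's pixel area. A small sketch with a hypothetical helper:

```python
def scaled_pixel_area(area_px, base_res, new_res):
    """Pixel area of the same object when the capture (or model input)
    resolution changes from base_res to new_res (same aspect ratio)."""
    return area_px * (new_res / base_res) ** 2

# A bike covering 32 x 32 = 1024 pixels in a 640 x 640 image:
print(scaled_pixel_area(1024, 640, 1280))  # → 4096.0 (four times the pixels)
```

The trade-off is compute: model memory and inference cost also grow with the square of the input resolution, which is one motivation for the tiling approach described below.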

Auto learning anchors

Anchor size plays a vital role in small object detection. [21] Instead of hand-picking anchors, algorithms can derive them from the data set; YOLOv5, for example, uses a k-means-based algorithm to define anchor sizes.
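
A plain Euclidean k-means over annotated (width, height) pairs illustrates the idea (YOLOv5's auto-anchor additionally uses an IoU-style fitness measure and genetic refinement, which this sketch omits; the box sizes here are made up):

```python
import numpy as np

def kmeans_anchors(wh, k, iters=50, seed=0):
    """Cluster (width, height) pairs into k anchor sizes with plain
    Euclidean k-means; returns anchors sorted by area."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to its nearest center.
        d = ((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Move each center to the mean of its assigned boxes.
        for j in range(k):
            if (labels == j).any():
                centers[j] = wh[labels == j].mean(0)
    return centers[np.argsort(centers.prod(1))]

# Hypothetical box sizes: a cluster of small and a cluster of large objects.
wh = np.array([[10, 12], [12, 10], [11, 11],
               [90, 100], [100, 90], [95, 95]], dtype=float)
print(kmeans_anchors(wh, 2))
```

With a data set dominated by small objects, the resulting anchors shrink accordingly, which is exactly the behaviour hand-picked anchors often miss.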

Tiling approach during training and inference

State-of-the-art object detectors accept only a fixed input size and rescale the input image to match it. This rescaling may deform or shrink the small objects in the image. The tiling approach [22] helps when an image has a higher resolution than the model's fixed input size: instead of scaling the image down, it is broken into tiles, which are then used in training. The same approach is used during inference as well.
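
The tile layout can be computed from the image size, the tile size, and a chosen overlap (overlap avoids cutting objects at tile borders). A sketch with a hypothetical `tile_coordinates` helper:

```python
def tile_coordinates(img_w, img_h, tile, overlap):
    """Top-left corners of fixed-size square tiles covering the image,
    with a given overlap in pixels between neighbouring tiles."""
    step = tile - overlap
    xs = list(range(0, max(img_w - tile, 0) + 1, step))
    ys = list(range(0, max(img_h - tile, 0) + 1, step))
    # Make sure the right and bottom edges are covered.
    if xs[-1] + tile < img_w:
        xs.append(img_w - tile)
    if ys[-1] + tile < img_h:
        ys.append(img_h - tile)
    return [(x, y) for y in ys for x in xs]

# A 1920x1080 frame cut into 640x640 tiles with 128 px overlap:
tiles = tile_coordinates(1920, 1080, 640, 128)
print(len(tiles))  # → 8
```

Each tile is then fed to the detector at its native resolution, so a small object keeps its full pixel footprint instead of being scaled down with the rest of the frame.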

Feature Pyramid Network (FPN)

A feature pyramid network [23] learns features at multiple scales; variants include Twin Feature Pyramid Networks (TFPN) [24] and the Extended Feature Pyramid Network (EFPN). [25] An FPN helps preserve the features of small objects as they pass through successive convolution layers.
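
The top-down pathway of an FPN can be sketched in NumPy by upsampling each coarser level and adding it to the lateral map below it (the 1 × 1 lateral and 3 × 3 smoothing convolutions of a real FPN are omitted, and the feature maps are toy arrays):

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour 2x upsampling of a (H, W, C) feature map.
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(features):
    """Merge a fine-to-coarse list of feature maps FPN-style:
    each level becomes its lateral map plus the upsampled coarser level,
    so high-resolution levels (where small objects live) also carry
    semantic information from deeper layers."""
    merged = [features[-1]]
    for lateral in reversed(features[:-1]):
        merged.append(lateral + upsample2x(merged[-1]))
    return merged[::-1]  # finest level first

# Three hypothetical pyramid levels with matching channel depth.
feats = [np.ones((8, 8, 4)), np.ones((4, 4, 4)), np.ones((2, 2, 4))]
out = fpn_top_down(feats)
print([m.shape for m in out])  # [(8, 8, 4), (4, 4, 4), (2, 2, 4)]
```

The finest output map keeps its spatial resolution while accumulating context from the coarser levels, which is what lets the detection head find small objects there.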

Add-on techniques

Instead of modifying existing methods, some add-on techniques can be placed directly on top of existing approaches to detect smaller objects. One such technique is Slicing Aided Hyper Inference (SAHI). [26] The image is sliced into multiple overlapping patches of different sizes, whose dimensions are defined by hyper-parameters. During fine-tuning, the patches are resized while maintaining their aspect ratio, and the resized patches are then used to train the model.
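
After per-patch inference, detections must be shifted back into full-image coordinates, and duplicates arising from overlapping patches merged, typically with non-maximum suppression. A sketch with hypothetical helpers (not the SAHI library's own API):

```python
def to_global(box, patch_origin):
    """Shift an (x1, y1, x2, y2) box from patch to full-image coordinates."""
    ox, oy = patch_origin
    x1, y1, x2, y2 = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)

def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_patch_detections(dets, iou_thresh=0.5):
    """Greedy NMS over (score, box) detections pooled from all patches."""
    dets = sorted(dets, reverse=True)  # highest score first
    kept = []
    for score, box in dets:
        if all(iou(box, k) < iou_thresh for _, k in kept):
            kept.append((score, box))
    return kept

# The same object seen in two overlapping patches:
d1 = (0.9, to_global((10, 10, 30, 30), (0, 0)))
d2 = (0.8, to_global((2, 2, 22, 22), (8, 8)))  # maps to (10, 10, 30, 30)
print(merge_patch_detections([d1, d2]))  # keeps only the 0.9 detection
```

This merging step is what makes the patch-based scheme transparent to the rest of the pipeline: downstream code sees a single set of full-image detections.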

Well-optimised techniques for small object detection

Various deep learning techniques focus specifically on small object detection, e.g., Feature-Fused SSD [27] and YOLO-Z. [28] Such methods address how to preserve the features of small objects as they pass through convolutional networks.

Other applications

Small object detection is also applied in crowd surveillance and counting, [29] [30] [31] [32] vehicle re-identification, [33] animal detection, [34] [35] [36] [37] and fish detection. [38]


References

  1. Saran K B; Sreelekha G (2015). "Traffic video surveillance: Vehicle detection and classification". 2015 International Conference on Control Communication & Computing India (ICCC). Trivandrum, Kerala, India: IEEE. pp. 516–521. doi:10.1109/ICCC.2015.7432948. ISBN   978-1-4673-7349-4. S2CID   14779393.
  2. Nemade, Bhushan (2016-01-01). "Automatic Traffic Surveillance Using Video Tracking". Procedia Computer Science. Proceedings of International Conference on Communication, Computing and Virtualization (ICCCV) 2016. 79: 402–409. doi: 10.1016/j.procs.2016.03.052 . ISSN   1877-0509.
  3. Guo, Haiyun; Wang, Jinqiao; Xu, Min; Zha, Zheng-Jun; Lu, Hanqing (2015-10-13). "Learning Multi-view Deep Features for Small Object Retrieval in Surveillance Scenarios". Proceedings of the 23rd ACM international conference on Multimedia. MM '15. New York, NY, USA: Association for Computing Machinery. pp. 859–862. doi:10.1145/2733373.2806349. ISBN   978-1-4503-3459-4. S2CID   9041849.
  4. Galiyawala, Hiren; Raval, Mehul S.; Patel, Meet (2022-05-20). "Person retrieval in surveillance videos using attribute recognition". Journal of Ambient Intelligence and Humanized Computing. doi:10.1007/s12652-022-03891-0. ISSN   1868-5145. S2CID   248951090.
  5. Ingle, Palash Yuvraj; Kim, Young-Gab (2022-05-19). "Real-Time Abnormal Object Detection for Video Surveillance in Smart Cities". Sensors. 22 (10): 3862. Bibcode:2022Senso..22.3862I. doi: 10.3390/s22103862 . ISSN   1424-8220. PMC   9143895 . PMID   35632270.
  6. Tsuboi, Tsutomu; Yoshikawa, Noriaki (2020-03-01). "Traffic flow analysis in Ahmedabad (India)". Case Studies on Transport Policy. 8 (1): 215–228. doi: 10.1016/j.cstp.2019.06.001 . ISSN   2213-624X. S2CID   195543435.
  7. Redmon, Joseph; Divvala, Santosh; Girshick, Ross; Farhadi, Ali (2016-05-09). "You Only Look Once: Unified, Real-Time Object Detection". arXiv: 1506.02640 [cs.CV].
  8. Redmon, Joseph; Farhadi, Ali (2016-12-25). "YOLO9000: Better, Faster, Stronger". arXiv: 1612.08242 [cs.CV].
  9. Redmon, Joseph; Farhadi, Ali (2018-04-08). "YOLOv3: An Incremental Improvement". arXiv: 1804.02767 [cs.CV].
  10. Bochkovskiy, Alexey; Wang, Chien-Yao; Liao, Hong-Yuan Mark (2020-04-22). "YOLOv4: Optimal Speed and Accuracy of Object Detection". arXiv: 2004.10934 [cs.CV].
  11. Wang, Chien-Yao; Bochkovskiy, Alexey; Liao, Hong-Yuan Mark (2021-02-21). "Scaled-YOLOv4: Scaling Cross Stage Partial Network". arXiv: 2011.08036 [cs.CV].
  12. Li, Chuyi; Li, Lulu; Jiang, Hongliang; Weng, Kaiheng; Geng, Yifei; Li, Liang; Ke, Zaidan; Li, Qingyuan; Cheng, Meng; Nie, Weiqiang; Li, Yiduo; Zhang, Bo; Liang, Yufei; Zhou, Linyuan; Xu, Xiaoming (2022-09-07). "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications". arXiv: 2209.02976 [cs.CV].
  13. Wang, Chien-Yao; Bochkovskiy, Alexey; Liao, Hong-Yuan Mark (2022-07-06). "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors". arXiv: 2207.02696 [cs.CV].
  14. Zhang, Mingrui; Zhao, Wenbing; Li, Xiying; Wang, Dan (2020-12-11). "Shadow Detection of Moving Objects in Traffic Monitoring Video". 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC). Vol. 9. Chongqing, China: IEEE. pp. 1983–1987. doi:10.1109/ITAIC49862.2020.9338958. ISBN   978-1-7281-5244-8. S2CID   231824327.
  15. "Interactive workshop "How drones are changing the world we live in"". 2016 Integrated Communications Navigation and Surveillance (ICNS). Herndon, VA: IEEE. 2016. pp. 1–17. doi:10.1109/ICNSURV.2016.7486437. ISBN   978-1-5090-2149-9. S2CID   21388151.
  16. Nguyen, Nhat-Duy; Do, Tien; Ngo, Thanh Duc; Le, Duy-Dinh (2020). "An Evaluation of Deep Learning Methods for Small Object Detection". Journal of Electrical and Computer Engineering. 2020: 1–18. doi: 10.1155/2020/3189691 .
  17. Gong, Zhiqiang; Zhong, Ping; Hu, Weidong (2019). "Diversity in Machine Learning". IEEE Access. 7: 64323–64350. doi: 10.1109/ACCESS.2019.2917620 . ISSN   2169-3536. S2CID   206491718.
  18. Christiansen, Anders (2022-06-10). "Anchor Boxes — The key to quality object detection". Medium. Retrieved 2022-09-14.
  19. Jocher, Glenn; Chaurasia, Ayush; Stoken, Alex; Borovec, Jirka; NanoCode012; Kwon, Yonghye; TaoXie; Michael, Kalen; Fang, Jiacong (2022-08-17). "ultralytics/yolov5: v6.2 - YOLOv5 Classification Models, Apple M1, Reproducibility, ClearML and Deci.ai integrations". doi:10.5281/zenodo.3908559. Retrieved 2022-09-14.
  20. "The Size and Quality of a Data Set | Machine Learning". Google Developers. Retrieved 2022-09-14.
  21. Zhong, Yuanyi; Wang, Jianfeng; Peng, Jian; Zhang, Lei (2020-01-26). "Anchor Box Optimization for Object Detection". arXiv: 1812.00469 [cs.CV].
  22. Unel, F. Ozge; Ozkalayci, Burak O.; Cigla, Cevahir (2019). "The Power of Tiling for Small Object Detection". 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). Long Beach, CA, USA: IEEE. pp. 582–591. doi:10.1109/CVPRW.2019.00084. ISBN   978-1-7281-2506-0. S2CID   198903617.
  23. Lin, Tsung-Yi; Dollár, Piotr; Girshick, Ross; He, Kaiming; Hariharan, Bharath; Belongie, Serge (2017-04-19). "Feature Pyramid Networks for Object Detection". arXiv: 1612.03144 [cs.CV].
  24. Liang, Yi; Changjian, Wang; Fangzhao, Li; Yuxing, Peng; Qin, Lv; Yuan, Yuan; Zhen, Huang (2019). "TFPN: Twin Feature Pyramid Networks for Object Detection". 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI). Portland, OR, USA: IEEE. pp. 1702–1707. doi:10.1109/ICTAI.2019.00251. ISBN   978-1-7281-3798-8. S2CID   211211764.
  25. Deng, Chunfang; Wang, Mengmeng; Liu, Liang; Liu, Yong (2020-04-09). "Extended Feature Pyramid Network for Small Object Detection". arXiv: 2003.07021 [cs.CV].
  26. Akyon, Fatih Cagatay; Altinuc, Sinan Onur; Temizel, Alptekin (2022-07-12). "Slicing Aided Hyper Inference and Fine-Tuning for Small Object Detection". 2022 IEEE International Conference on Image Processing (ICIP). pp. 966–970. arXiv: 2202.06934 . doi:10.1109/ICIP46576.2022.9897990. ISBN   978-1-6654-9620-9. S2CID   246823962.
  27. Cao, Guimei; Xie, Xuemei; Yang, Wenzhe; Liao, Quan; Shi, Guangming; Wu, Jinjian (2018-04-10). "Feature-fused SSD: Fast detection for small objects". In Dong, Junyu; Yu, Hui (eds.). Ninth International Conference on Graphic and Image Processing (ICGIP 2017). Vol. 10615. SPIE. pp. 381–388. arXiv: 1709.05054 . Bibcode:2018SPIE10615E..1EC. doi:10.1117/12.2304811. ISBN   9781510617414. S2CID   20592770.
  28. Benjumea, Aduen; Teeti, Izzeddin; Cuzzolin, Fabio; Bradley, Andrew (2021-12-23). "YOLO-Z: Improving small object detection in YOLOv5 for autonomous vehicles". arXiv: 2112.11798 [cs.CV].
  29. Rajendran, Logesh; Shyam Shankaran, R (2021). "Bigdata Enabled Realtime Crowd Surveillance Using Artificial Intelligence and Deep Learning". 2021 IEEE International Conference on Big Data and Smart Computing (BigComp). Jeju Island, Korea (South): IEEE. pp. 129–132. doi:10.1109/BigComp51126.2021.00032. ISBN   978-1-7281-8924-6. S2CID   232236614.
  30. Sivachandiran, S.; Mohan, K. Jagan; Nazer, G. Mohammed (2022-03-29). "Deep Transfer Learning Enabled High-Density Crowd Detection and Classification using Aerial Images". 2022 6th International Conference on Computing Methodologies and Communication (ICCMC). Erode, India: IEEE. pp. 1313–1317. doi:10.1109/ICCMC53470.2022.9753982. ISBN   978-1-6654-1028-1. S2CID   248131806.
  31. Santhini, C.; Gomathi, V. (2018). "Crowd Scene Analysis Using Deep Learning Network". 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT). pp. 1–5. doi:10.1109/ICCTCT.2018.8550851. ISBN   978-1-5386-3702-9. S2CID   54438440.
  32. Sharath, S.V.; Biradar, Vidyadevi; Prajwal, M.S.; Ashwini, B. (2021-11-19). "Crowd Counting in High Dense Images using Deep Convolutional Neural Network". 2021 IEEE International Conference on Distributed Computing, VLSI, Electrical Circuits and Robotics (DISCOVER). Nitte, India: IEEE. pp. 30–34. doi:10.1109/DISCOVER52564.2021.9663716. ISBN   978-1-6654-1244-5. S2CID   245707782.
  33. Wang, Hongbo; Hou, Jiaying; Chen, Na (2019). "A Survey of Vehicle Re-Identification Based on Deep Learning". IEEE Access. 7: 172443–172469. doi: 10.1109/ACCESS.2019.2956172 . ISSN   2169-3536. S2CID   209319743.
  34. Santhanam, Sanjay; B, Sudhir Sidhaarthan; Panigrahi, Sai Sudha; Kashyap, Suryakant Kumar; Duriseti, Bhargav Krishna (2021-11-26). "Animal Detection for Road safety using Deep Learning". 2021 International Conference on Computational Intelligence and Computing Applications (ICCICA). Nagpur, India: IEEE. pp. 1–5. doi:10.1109/ICCICA52458.2021.9697287. ISBN   978-1-6654-2040-2. S2CID   246663727.
  35. Li, Nopparut; Kusakunniran, Worapan; Hotta, Seiji (2020). "Detection of Animal Behind Cages Using Convolutional Neural Network". 2020 17th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON). Phuket, Thailand: IEEE. pp. 242–245. doi:10.1109/ECTI-CON49241.2020.9158137. ISBN   978-1-7281-6486-1. S2CID   221086279.
  36. Oishi, Yu; Matsunaga, Tsuneo (2010). "Automatic detection of moving wild animals in airborne remote sensing images". 2010 IEEE International Geoscience and Remote Sensing Symposium. pp. 517–519. doi:10.1109/IGARSS.2010.5654227. ISBN   978-1-4244-9565-8. S2CID   16812504.
  37. Ramanan, D.; Forsyth, D.A.; Barnard, K. (2006). "Building models of animals from video". IEEE Transactions on Pattern Analysis and Machine Intelligence. 28 (8): 1319–1334. doi:10.1109/TPAMI.2006.155. ISSN   0162-8828. PMID   16886866. S2CID   1699015.
  38. Cui, Suxia; Zhou, Yu; Wang, Yonghui; Zhai, Lujun (2020). "Fish Detection Using Deep Learning". Applied Computational Intelligence and Soft Computing. 2020: 1–13. doi: 10.1155/2020/3738108 .