Crowd counting is the act of counting the total number of people present in a certain area; the people gathered in such an area are collectively referred to as a crowd. The most direct method is to count each person individually. For example, turnstiles are often used to precisely count the number of people entering an event. [1]
Since the early 2000s, the understanding of the phrase "crowd counting" has shifted. Methods have progressed from simple direct counts to approaches based on clusters and density maps, bringing several improvements. Crowd counting can also be defined as estimating the number of people present in a single image. [2]
Due to the rapid progress in technology and the growth of convolutional neural networks (CNNs) over the last decade, the use of CNNs in crowd counting has grown rapidly. CNN-based methods can largely be grouped into the following models: [3]
The most common technique for counting crowds at protests and rallies is Jacobs' method, named for its inventor, Herbert Jacobs. Jacobs' method involves dividing the area occupied by a crowd into sections, determining an average number of people in each section, and multiplying by the number of sections occupied. According to a report by Life's Little Mysteries, technologies sometimes used to assist such estimations include "lasers, satellites, aerial photography, 3-D grid systems, recorded video footage and surveillance balloons, usually tethered several blocks around an event's location and flying 400 to 800 feet (120 to 240 meters) overhead." [2]
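The arithmetic behind Jacobs' method can be illustrated with a short sketch; the section size, per-section density, and number of occupied sections below are hypothetical values chosen only to show the calculation.

```python
# Minimal sketch of Jacobs' method. The section size, the per-section
# density, and the number of occupied sections are hypothetical values
# chosen only to illustrate the calculation.

def jacobs_estimate(sections_occupied: int, people_per_section: float) -> float:
    """Crowd size = occupied sections x average people per section."""
    return sections_occupied * people_per_section

# Example: a plaza divided into 10 m x 10 m sections (100 m^2 each).
# Assuming roughly 2 people per square metre in the occupied sections,
# each section holds about 200 people.
estimate = jacobs_estimate(sections_occupied=45, people_per_section=200)
print(f"Estimated crowd size: {estimate:.0f} people")   # -> 9000
```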
This crowd counting method applies regression to global image features, i.e. features computed over the whole image. Global image features describe overall properties of the image, such as "contour representations, shape descriptions, texture features." [4]
Because the distribution of objects within the image is not accounted for, regression-based methods cannot perform object localisation. [5] Additionally, since this model estimates crowd density from descriptions of crowd patterns, it does not track individuals. [2] This makes regression-based models very efficient for crowded pictures; when the density per pixel is very high, regression models are best suited.
Earlier crowd counting methods employed classical regression models. [6]
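As an illustration, the sketch below regresses a crowd count from a handful of whole-image features. The particular features (brightness, contrast, edge statistics), the ridge regressor, and the random placeholder training data are assumptions for the example, not a prescribed recipe.

```python
# Sketch of count estimation by regression on global image features.
# The chosen features, the ridge regressor, and the random placeholder
# data are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import Ridge

def global_features(img: np.ndarray) -> np.ndarray:
    """A few whole-image features from a greyscale image with values in [0, 1]."""
    gy, gx = np.gradient(img.astype(float))
    edges = np.hypot(gx, gy)
    return np.array([
        img.mean(),              # overall brightness
        img.std(),               # contrast, a crude texture cue
        edges.mean(),            # average edge strength
        (edges > 0.1).mean(),    # fraction of strongly "edgy" pixels
    ])

train_images = [np.random.rand(240, 320) for _ in range(50)]  # placeholder images
train_counts = np.random.randint(10, 500, size=50)            # placeholder ground-truth counts

X = np.stack([global_features(im) for im in train_images])
model = Ridge(alpha=1.0).fit(X, train_counts)

new_image = np.random.rand(240, 320)
predicted_count = model.predict(global_features(new_image)[None, :])[0]
```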
Object density maps give the total number of objects located in a particular area, obtained by integrating (summing) the density map over that area. [5] Because density values are estimated at a fine, per-pixel level, density-based counting offers the advantages of regression-based models while also retaining localisation of information, i.e. information about where in the image the objects are. [5]
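The relationship between a density map and a count can be shown in a few lines; in the sketch below the density map is a random placeholder standing in for the output of a trained model.

```python
# Sketch: recovering counts by summing (integrating) a density map.
# `density_map` stands in for the output of a trained model; its values
# are estimated people-per-pixel.
import numpy as np

density_map = np.random.rand(120, 160) * 0.01     # placeholder model output

total_count = density_map.sum()                    # count for the whole image
region_count = density_map[40:80, 60:120].sum()    # count for one sub-region only

# Summing over sub-regions is what preserves localisation of information:
# the same map tells you both how many people there are and roughly where.
print(f"whole image ~ {total_count:.1f}, region ~ {region_count:.1f}")
```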
In order to use the above-mentioned models efficiently, a large amount of data is needed. In practice, however, the available data is often limited to the original images. To compensate, techniques such as random cropping are employed, in which sub-images are randomly selected from the existing original image.
After several iterations of random cropping, the resulting sub-images are fed into the machine learning algorithm to help it generalize better.
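A minimal sketch of random cropping is shown below; the crop size and the number of crops per image are arbitrary choices for illustration, and in practice the matching region of the ground-truth density map would be cropped with the same offsets.

```python
# Sketch of random-crop augmentation. Crop size and the number of crops
# per image are arbitrary illustration values.
import numpy as np

rng = np.random.default_rng(0)

def random_crops(image: np.ndarray, crop_h: int, crop_w: int, n_crops: int):
    """Yield n_crops random sub-images of size (crop_h, crop_w)."""
    h, w = image.shape[:2]
    for _ in range(n_crops):
        top = int(rng.integers(0, h - crop_h + 1))
        left = int(rng.integers(0, w - crop_w + 1))
        yield image[top:top + crop_h, left:left + crop_w]

original = np.random.rand(480, 640)                       # placeholder crowd image
crops = list(random_crops(original, 224, 224, n_crops=8))
# Each crop (together with the matching crop of its density map) is an
# additional training sample for the learning algorithm.
```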
To tackle the problems associated with crowd counting in high-density areas, image pyramids can be employed. Image pyramids are commonly used for counting crowds at large gatherings, such as religious rituals and pilgrimages, because people appear at different scales in different parts of the image; a sketch of the idea follows.
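The sketch below builds a simple image pyramid by repeatedly blurring and downscaling, so that a counting model can be applied at several scales; the number of levels and the blur strength are assumptions for the example.

```python
# Sketch of an image pyramid: each level is a blurred, half-resolution
# copy of the previous one, so people appearing at different scales can
# be matched at some level. Level count and blur strength are assumed.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_pyramid(image: np.ndarray, levels: int = 4):
    pyramid = [image]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma=1.0)  # anti-alias before subsampling
        pyramid.append(blurred[::2, ::2])                  # halve the resolution
    return pyramid

scene = np.random.rand(512, 512)          # placeholder crowd image
for level, img in enumerate(build_pyramid(scene)):
    print(f"level {level}: shape {img.shape}")
# A counting model is then run on each level and the outputs are combined.
```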
However, because the algorithms required for image pyramids are very expensive to run, relying on these methods alone is often impractical. As a result, deep fusion models can be employed instead. [7]
These deep fusion models employ "neural network(s) to promote the density map regression accuracy." [8] The models first mark the location of each person within the picture, and then determine the density maps of the area using the "pedestrian's location, shape, and perspective distortion." [8] Because the algorithm performs many iterations and scanning passes, and because pedestrians' bodies frequently overlap with one another, the number of people is counted by heads rather than whole bodies.
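A common way to turn marked head locations into a density map is to place a small Gaussian at each head so that the map integrates to the number of annotated people. The sketch below assumes a fixed kernel width, whereas methods that account for shape and perspective distortion adapt the kernel per head.

```python
# Sketch: building a density map from annotated head locations. A unit of
# mass is placed at each head and spread with a Gaussian, so the map sums
# to the number of people. The fixed sigma is an assumption; geometry- or
# perspective-adaptive kernels are used by more elaborate methods.
import numpy as np
from scipy.ndimage import gaussian_filter

def heads_to_density_map(head_points, shape, sigma=4.0):
    """head_points: iterable of (row, col) head annotations."""
    density = np.zeros(shape, dtype=float)
    for r, c in head_points:
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            density[int(r), int(c)] += 1.0
    return gaussian_filter(density, sigma=sigma)

heads = [(30, 40), (32, 45), (100, 200)]              # hypothetical annotations
dmap = heads_to_density_map(heads, shape=(240, 320))
print(round(dmap.sum(), 2))                           # ~3.0, one unit per head
```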
Crowd counting plays an important role in areas such as public safety and video surveillance. [9] Without crowd control, or through poor planning, terrible accidents can occur. One of the most notable is the Hillsborough disaster, which took place on 15 April 1989 in Sheffield, England. Another memorable incident occurred when Louis Farrakhan threatened to sue the Washington, D.C. Park Police for announcing that only 400,000 people attended the 1995 Million Man March he organized.
At events held in streets or parks rather than an enclosed venue, crowd counting is more difficult and less precise. For many events, especially political rallies or protests, the number of people in a crowd carries political significance, and the resulting counts are often controversial. For example, the global protests against the Iraq war included many demonstrations for which organizers on one side and the police on the other offered widely differing counts.
Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.
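As a minimal illustration of the paradigm, the sketch below fits a model on labelled training pairs and measures its error on data held out from training; the toy data and the choice of a linear model are assumptions for the example.

```python
# Minimal sketch of supervised learning: fit on labelled (input, output)
# pairs, then measure error on unseen data as a proxy for generalization.
# The toy data and the linear model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 3)                                            # input objects
y = 3.0 * X[:, 0] - X[:, 1] + np.random.normal(scale=0.1, size=200)   # desired outputs

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)         # learn the mapping
test_error = np.mean((model.predict(X_test) - y_test) ** 2)
print(f"held-out mean squared error: {test_error:.4f}")  # estimate of generalization error
```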
In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.
Vector quantization (VQ) is a classical quantization technique from signal processing that allows the modeling of probability density functions by the distribution of prototype vectors. Developed in the early 1980s by Robert M. Gray, it was originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms. In simpler terms, vector quantization chooses a set of points to represent a larger set of points.
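The sketch below quantizes a set of two-dimensional vectors with a k-means codebook, replacing each vector by its nearest prototype; the codebook size and the random data are arbitrary choices for the example.

```python
# Sketch of vector quantization via a k-means codebook: each vector is
# replaced by the index of its nearest centroid, so the data set is
# represented by a small set of prototype vectors. Codebook size and the
# random data are arbitrary.
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(1000, 2)                                   # placeholder vectors
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(data)

codes = codebook.predict(data)                                   # index of nearest prototype
reconstructed = codebook.cluster_centers_[codes]                 # quantized representation
print("mean quantization error:",
      np.mean(np.linalg.norm(data - reconstructed, axis=1)))
```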
Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
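A minimal sketch of bagging with scikit-learn is shown below; the toy regression data and the ensemble size are assumptions, and the base learner defaults to a decision tree.

```python
# Sketch of bootstrap aggregating (bagging): many models are trained on
# bootstrap resamples of the data and their predictions are averaged.
# The toy data and ensemble size are placeholders; the default base
# learner here is a decision tree.
import numpy as np
from sklearn.ensemble import BaggingRegressor

X = np.random.rand(200, 3)
y = 2.0 * X[:, 0] + np.random.normal(scale=0.1, size=200)   # noisy toy target

bagged = BaggingRegressor(n_estimators=50, random_state=0).fit(X, y)
print(bagged.predict(X[:5]))   # each prediction is an average over 50 resampled models
```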
Sensor fusion is the process of combining sensor data or data derived from disparate sources such that the resulting information has less uncertainty than would be possible when these sources were used individually. For instance, one could potentially obtain a more accurate location estimate of an indoor object by combining multiple data sources such as video cameras and WiFi localization signals. The term uncertainty reduction in this case can mean more accurate, more complete, or more dependable, or refer to the result of an emerging view, such as stereoscopic vision.
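The sketch below fuses two independent position estimates by weighting each with the inverse of its variance, one simple way to reduce uncertainty; the sensor readings, variances, and the choice of inverse-variance weighting are assumptions for the example (many real systems use Kalman or Bayesian filtering instead).

```python
# Sketch of inverse-variance fusion of two position estimates (e.g. a
# camera fix and a WiFi-localization fix). Readings and variances are
# hypothetical; many real systems use Kalman or Bayesian filters.
import numpy as np

def fuse(estimates, variances):
    """Combine independent estimates, weighting each by 1 / variance."""
    weights = 1.0 / np.asarray(variances, dtype=float)
    fused = np.average(np.asarray(estimates, dtype=float), axis=0, weights=weights)
    fused_variance = 1.0 / weights.sum()
    return fused, fused_variance

camera_xy, camera_var = np.array([2.0, 5.1]), 0.25   # metres, metres squared
wifi_xy, wifi_var = np.array([2.6, 4.7]), 1.00

position, variance = fuse([camera_xy, wifi_xy], [camera_var, wifi_var])
print(position, variance)   # the fused variance is smaller than either input's
```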
In statistics, classification is the task of assigning an observation to one of a set of categories. When classification is performed by a computer, statistical methods are normally used to develop the algorithm.
Spatial analysis is any of the formal techniques which studies entities using their topological, geometric, or geographic properties. Spatial analysis includes a variety of techniques using different analytic approaches, especially spatial statistics. It may be applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, or to chip fabrication engineering, with its use of "place and route" algorithms to build complex wiring structures. In a more restricted sense, spatial analysis is geospatial analysis, the technique applied to structures at the human scale, most notably in the analysis of geographic data. It may also be applied to genomics, as in transcriptomics data.
Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
There are many types of artificial neural networks (ANN).
In computer science, landmark detection is the process of finding significant landmarks in an image. This originally referred to finding landmarks for navigational purposes – for instance, in robot vision or creating maps from satellite images. Methods used in navigation have been extended to other fields, notably in facial recognition where it is used to identify key points on a face. It also has important applications in medicine, identifying anatomical landmarks in medical images.
In machine learning, a hyperparameter is a parameter, such as the learning rate or choice of optimizer, which specifies details of the learning process, hence the name hyperparameter. This is in contrast to parameters which determine the model itself.
Multimedia information retrieval is a research discipline of computer science that aims at extracting semantic information from multimedia data sources. Data sources include directly perceivable media such as audio, image and video, indirectly perceivable sources such as text, semantic descriptions and biosignals, as well as not perceivable sources such as bioinformation, stock prices, etc. The methodology of MMIR can be organized into three groups.
Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.
The following outline is provided as an overview of and topical guide to machine learning.
Multi-focus image fusion is a multiple image compression technique using input images with different focus depths to make one output image that preserves all information.
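As a minimal illustration, the sketch below fuses two images with different focus depths by picking, at each pixel, the image that is locally sharper; the local-variance focus measure and the placeholder inputs are assumptions for the example.

```python
# Sketch of multi-focus image fusion: at each pixel, keep the value from
# whichever input image is locally sharper (higher local variance). The
# focus measure and the placeholder inputs are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def local_sharpness(img: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    local_mean = gaussian_filter(img, sigma)
    return gaussian_filter((img - local_mean) ** 2, sigma)   # local variance

def fuse_focus(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    take_a = local_sharpness(img_a) >= local_sharpness(img_b)
    return np.where(take_a, img_a, img_b)

near_focus = np.random.rand(256, 256)    # placeholder: foreground in focus
far_focus = np.random.rand(256, 256)     # placeholder: background in focus
fused = fuse_focus(near_focus, far_focus)
```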
Applications of machine learning in earth sciences include geological mapping, gas leakage detection and geological features identification. Machine learning (ML) is a type of artificial intelligence (AI) that enables computer systems to classify, cluster, identify and analyze vast and complex sets of data while eliminating the need for explicit instructions and programming. Earth science is the study of the origin, evolution, and future of the planet Earth. The Earth system can be subdivided into four major components including the solid earth, atmosphere, hydrosphere and biosphere.
Small object detection is a particular case of object detection in which various techniques are employed to detect small objects in digital images and videos. "Small objects" are objects having a small pixel footprint in the input image. In areas such as aerial imagery, state-of-the-art object detection techniques have underperformed because of small objects.