Crowd counting is the act of counting the total number of people present in a certain area; the people gathered in such an area are collectively referred to as a crowd. The most direct method is to count each person individually. For example, turnstiles are often used to precisely count the number of people entering an event. [1]
Since the early 2000s, the understanding of the phrase "crowd counting" has shifted. Methods have progressed from simple direct counts to approaches based on clusters and density maps, bringing several improvements. Crowd counting can also be defined as estimating the number of people present in a single image. [2]
Due to the rapid progress in technology and the growth of convolutional neural networks (CNNs) over the last decade, the use of CNNs in crowd counting has grown rapidly. CNN-based methods can largely be grouped into the following models: [3]
The most common technique for counting crowds at protests and rallies is Jacobs' method, named for its inventor, Herbert Jacobs. Jacobs' method involves dividing the area occupied by a crowd into sections, determining an average number of people in each section, and multiplying by the number of sections occupied. According to a report by Life's Little Mysteries, technologies sometimes used to assist such estimations include "lasers, satellites, aerial photography, 3-D grid systems, recorded video footage and surveillance balloons, usually tethered several blocks around an event's location and flying 400 to 800 feet (120 to 240 meters) overhead." [2]
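The arithmetic behind Jacobs' method can be illustrated with a short sketch; the section size, per-section density, and number of occupied sections below are hypothetical values chosen only to show the calculation.

```python
# Minimal sketch of Jacobs' method. The section size, the per-section
# density, and the number of occupied sections are hypothetical values
# chosen only to illustrate the calculation.

def jacobs_estimate(sections_occupied: int, people_per_section: float) -> float:
    """Crowd size = occupied sections x average people per section."""
    return sections_occupied * people_per_section

# Example: a plaza divided into 10 m x 10 m sections (100 m^2 each).
# Assuming roughly 2 people per square metre in the occupied sections,
# each section holds about 200 people.
estimate = jacobs_estimate(sections_occupied=45, people_per_section=200)
print(f"Estimated crowd size: {estimate:.0f} people")   # -> 9000
```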
This crowd counting method applies regression to global image features, i.e. features computed over the whole image. Global image features describe overall properties of the image, such as "contour representations, shape descriptions, texture features." [4]
Because the distribution of objects within the image is not accounted for, regression-based methods cannot perform object localisation. [5] Additionally, since this model estimates crowd density from descriptions of crowd patterns, it does not track individuals. [2] This makes regression-based models very efficient for crowded pictures; when the density per pixel is very high, regression models are best suited.
Earlier crowd counting methods employed classical regression models. [6]
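As an illustration, the sketch below regresses a crowd count from a handful of whole-image features. The particular features (brightness, contrast, edge statistics), the ridge regressor, and the random placeholder training data are assumptions for the example, not a prescribed recipe.

```python
# Sketch of count estimation by regression on global image features.
# The chosen features, the ridge regressor, and the random placeholder
# data are illustrative assumptions only.
import numpy as np
from sklearn.linear_model import Ridge

def global_features(img: np.ndarray) -> np.ndarray:
    """A few whole-image features from a greyscale image with values in [0, 1]."""
    gy, gx = np.gradient(img.astype(float))
    edges = np.hypot(gx, gy)
    return np.array([
        img.mean(),              # overall brightness
        img.std(),               # contrast, a crude texture cue
        edges.mean(),            # average edge strength
        (edges > 0.1).mean(),    # fraction of strongly "edgy" pixels
    ])

train_images = [np.random.rand(240, 320) for _ in range(50)]  # placeholder images
train_counts = np.random.randint(10, 500, size=50)            # placeholder ground-truth counts

X = np.stack([global_features(im) for im in train_images])
model = Ridge(alpha=1.0).fit(X, train_counts)

new_image = np.random.rand(240, 320)
predicted_count = model.predict(global_features(new_image)[None, :])[0]
```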
Object density maps give the total number of objects located in a particular area, obtained by integrating (summing) the density map over that area. [5] Because density values are estimated at a fine, per-pixel level, density-based counting offers the advantages of regression-based models while also retaining localisation of information, i.e. information about where in the image the objects are. [5]
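The relationship between a density map and a count can be shown in a few lines; in the sketch below the density map is a random placeholder standing in for the output of a trained model.

```python
# Sketch: recovering counts by summing (integrating) a density map.
# `density_map` stands in for the output of a trained model; its values
# are estimated people-per-pixel.
import numpy as np

density_map = np.random.rand(120, 160) * 0.01     # placeholder model output

total_count = density_map.sum()                    # count for the whole image
region_count = density_map[40:80, 60:120].sum()    # count for one sub-region only

# Summing over sub-regions is what preserves localisation of information:
# the same map tells you both how many people there are and roughly where.
print(f"whole image ~ {total_count:.1f}, region ~ {region_count:.1f}")
```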
In order to use the above-mentioned models efficiently, a large amount of data is needed. In practice, however, the available data is often limited to the original images. To compensate, techniques such as random cropping are employed, in which sub-images are randomly selected from the existing original image.
After several iterations of random cropping, the resulting sub-images are fed into the machine learning algorithm to help it generalize better.
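A minimal sketch of random cropping is shown below; the crop size and the number of crops per image are arbitrary choices for illustration, and in practice the matching region of the ground-truth density map would be cropped with the same offsets.

```python
# Sketch of random-crop augmentation. Crop size and the number of crops
# per image are arbitrary illustration values.
import numpy as np

rng = np.random.default_rng(0)

def random_crops(image: np.ndarray, crop_h: int, crop_w: int, n_crops: int):
    """Yield n_crops random sub-images of size (crop_h, crop_w)."""
    h, w = image.shape[:2]
    for _ in range(n_crops):
        top = int(rng.integers(0, h - crop_h + 1))
        left = int(rng.integers(0, w - crop_w + 1))
        yield image[top:top + crop_h, left:left + crop_w]

original = np.random.rand(480, 640)                       # placeholder crowd image
crops = list(random_crops(original, 224, 224, n_crops=8))
# Each crop (together with the matching crop of its density map) is an
# additional training sample for the learning algorithm.
```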
To tackle the problems associated with crowd counting in high-density areas, image pyramids can be employed. Image pyramids are commonly used for counting crowds at large gatherings, such as religious rituals and pilgrimages, because people appear at different scales in different parts of the image; a sketch of the idea follows.
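The sketch below builds a simple image pyramid by repeatedly blurring and downscaling, so that a counting model can be applied at several scales; the number of levels and the blur strength are assumptions for the example.

```python
# Sketch of an image pyramid: each level is a blurred, half-resolution
# copy of the previous one, so people appearing at different scales can
# be matched at some level. Level count and blur strength are assumed.
import numpy as np
from scipy.ndimage import gaussian_filter

def build_pyramid(image: np.ndarray, levels: int = 4):
    pyramid = [image]
    for _ in range(levels - 1):
        blurred = gaussian_filter(pyramid[-1], sigma=1.0)  # anti-alias before subsampling
        pyramid.append(blurred[::2, ::2])                  # halve the resolution
    return pyramid

scene = np.random.rand(512, 512)          # placeholder crowd image
for level, img in enumerate(build_pyramid(scene)):
    print(f"level {level}: shape {img.shape}")
# A counting model is then run on each level and the outputs are combined.
```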
However, because the algorithms required for image pyramids are very expensive to run, relying on these methods alone is often impractical. As a result, deep fusion models can be employed instead. [7]
These deep fusion models employ "neural network(s) to promote the density map regression accuracy." [8] The models first mark the location of each person within the picture, and then determine the density maps of the area using the "pedestrian's location, shape, and perspective distortion." [8] Because the algorithm performs many iterations and scanning passes, and because pedestrians' bodies frequently overlap with one another, the number of people is counted by heads rather than whole bodies.
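A common way to turn marked head locations into a density map is to place a small Gaussian at each head so that the map integrates to the number of annotated people. The sketch below assumes a fixed kernel width, whereas methods that account for shape and perspective distortion adapt the kernel per head.

```python
# Sketch: building a density map from annotated head locations. A unit of
# mass is placed at each head and spread with a Gaussian, so the map sums
# to the number of people. The fixed sigma is an assumption; geometry- or
# perspective-adaptive kernels are used by more elaborate methods.
import numpy as np
from scipy.ndimage import gaussian_filter

def heads_to_density_map(head_points, shape, sigma=4.0):
    """head_points: iterable of (row, col) head annotations."""
    density = np.zeros(shape, dtype=float)
    for r, c in head_points:
        if 0 <= r < shape[0] and 0 <= c < shape[1]:
            density[int(r), int(c)] += 1.0
    return gaussian_filter(density, sigma=sigma)

heads = [(30, 40), (32, 45), (100, 200)]              # hypothetical annotations
dmap = heads_to_density_map(heads, shape=(240, 320))
print(round(dmap.sum(), 2))                           # ~3.0, one unit per head
```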
Crowd counting plays an important role in areas such as public safety and video surveillance. [9] Without crowd control, or through poor planning, terrible accidents can occur. One of the most notable is the Hillsborough disaster, which took place on 15 April 1989 in Sheffield, England. Another memorable incident occurred when Louis Farrakhan threatened to sue the Washington, D.C. Park Police for announcing that only 400,000 people attended the 1995 Million Man March he organized.
At events held in streets or parks rather than an enclosed venue, crowd counting is more difficult and less precise. For many events, especially political rallies or protests, the number of people in a crowd carries political significance, and the resulting counts are often controversial. For example, the global protests against the Iraq war included many demonstrations for which organizers on one side and the police on the other offered widely differing counts.
Supervised learning (SL) is a paradigm in machine learning where input objects and a desired output value train a model. The training data is processed, building a function that maps new data on expected output values. An optimal scenario will allow for the algorithm to correctly determine output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way. This statistical quality of an algorithm is measured through the so-called generalization error.
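As a minimal illustration of the paradigm, the sketch below fits a model on labelled training pairs and measures its error on data held out from training; the toy data and the choice of a linear model are assumptions for the example.

```python
# Minimal sketch of supervised learning: fit on labelled (input, output)
# pairs, then measure error on unseen data as a proxy for generalization.
# The toy data and the linear model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 3)                                            # input objects
y = 3.0 * X[:, 0] - X[:, 1] + np.random.normal(scale=0.1, size=200)   # desired outputs

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_train, y_train)         # learn the mapping
test_error = np.mean((model.predict(X_test) - y_test) ** 2)
print(f"held-out mean squared error: {test_error:.4f}")  # estimate of generalization error
```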
In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.
Vector quantization (VQ) is a classical quantization technique from signal processing that allows the modeling of probability density functions by the distribution of prototype vectors. Developed in the early 1980s by Robert M. Gray, it was originally used for data compression. It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms. In simpler terms, vector quantization chooses a set of points to represent a larger set of points.
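The sketch below quantizes a set of two-dimensional vectors with a k-means codebook, replacing each vector by its nearest prototype; the codebook size and the random data are arbitrary choices for the example.

```python
# Sketch of vector quantization via a k-means codebook: each vector is
# replaced by the index of its nearest centroid, so the data set is
# represented by a small set of prototype vectors. Codebook size and the
# random data are arbitrary.
import numpy as np
from sklearn.cluster import KMeans

data = np.random.rand(1000, 2)                                   # placeholder vectors
codebook = KMeans(n_clusters=16, n_init=10, random_state=0).fit(data)

codes = codebook.predict(data)                                   # index of nearest prototype
reconstructed = codebook.cluster_centers_[codes]                 # quantized representation
print("mean quantization error:",
      np.mean(np.linalg.norm(data - reconstructed, axis=1)))
```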
Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
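A minimal sketch of bagging with scikit-learn is shown below; the toy regression data and the ensemble size are assumptions, and the base learner defaults to a decision tree.

```python
# Sketch of bootstrap aggregating (bagging): many models are trained on
# bootstrap resamples of the data and their predictions are averaged.
# The toy data and ensemble size are placeholders; the default base
# learner here is a decision tree.
import numpy as np
from sklearn.ensemble import BaggingRegressor

X = np.random.rand(200, 3)
y = 2.0 * X[:, 0] + np.random.normal(scale=0.1, size=200)   # noisy toy target

bagged = BaggingRegressor(n_estimators=50, random_state=0).fit(X, y)
print(bagged.predict(X[:5]))   # each prediction is an average over 50 resampled models
```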
Sensor fusion is the process of combining sensor data or data derived from disparate sources such that the resulting information has less uncertainty than would be possible when these sources were used individually. For instance, one could potentially obtain a more accurate location estimate of an indoor object by combining multiple data sources such as video cameras and WiFi localization signals. The term uncertainty reduction in this case can mean more accurate, more complete, or more dependable, or refer to the result of an emerging view, such as stereoscopic vision.
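The sketch below fuses two independent position estimates by weighting each with the inverse of its variance, one simple way to reduce uncertainty; the sensor readings, variances, and the choice of inverse-variance weighting are assumptions for the example (many real systems use Kalman or Bayesian filtering instead).

```python
# Sketch of inverse-variance fusion of two position estimates (e.g. a
# camera fix and a WiFi-localization fix). Readings and variances are
# hypothetical; many real systems use Kalman or Bayesian filters.
import numpy as np

def fuse(estimates, variances):
    """Combine independent estimates, weighting each by 1 / variance."""
    weights = 1.0 / np.asarray(variances, dtype=float)
    fused = np.average(np.asarray(estimates, dtype=float), axis=0, weights=weights)
    fused_variance = 1.0 / weights.sum()
    return fused, fused_variance

camera_xy, camera_var = np.array([2.0, 5.1]), 0.25   # metres, metres squared
wifi_xy, wifi_var = np.array([2.6, 4.7]), 1.00

position, variance = fuse([camera_xy, wifi_xy], [camera_var, wifi_var])
print(position, variance)   # the fused variance is smaller than either input's
```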
In statistics, classification is the task of assigning an observation to one of a set of categories. When classification is performed by a computer, statistical methods are normally used to develop the algorithm.
Spatial analysis is any of the formal techniques which studies entities using their topological, geometric, or geographic properties. Spatial analysis includes a variety of techniques using different analytic approaches, especially spatial statistics. It may be applied in fields as diverse as astronomy, with its studies of the placement of galaxies in the cosmos, or to chip fabrication engineering, with its use of "place and route" algorithms to build complex wiring structures. In a more restricted sense, spatial analysis is geospatial analysis, the technique applied to structures at the human scale, most notably in the analysis of geographic data. It may also be applied to genomics, as in transcriptomics data.
Group method of data handling (GMDH) is a family of inductive algorithms for computer-based mathematical modeling of multi-parametric datasets that features fully automatic structural and parametric optimization of models.
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone. Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.
There are many types of artificial neural networks (ANN).
In computer science, landmark detection is the process of finding significant landmarks in an image. This originally referred to finding landmarks for navigational purposes – for instance, in robot vision or creating maps from satellite images. Methods used in navigation have been extended to other fields, notably in facial recognition where it is used to identify key points on a face. It also has important applications in medicine, identifying anatomical landmarks in medical images.
In machine learning, a hyperparameter is a parameter, such as the learning rate or choice of optimizer, which specifies details of the learning process, hence the name hyperparameter. This is in contrast to parameters which determine the model itself.
Multimedia information retrieval is a research discipline of computer science that aims at extracting semantic information from multimedia data sources. Data sources include directly perceivable media such as audio, image and video, indirectly perceivable sources such as text, semantic descriptions and biosignals, as well as not perceivable sources such as bioinformation, stock prices, etc. The methodology of MMIR can be organized into three groups.
Adversarial machine learning is the study of the attacks on machine learning algorithms, and of the defenses against such attacks. A survey from May 2020 exposes the fact that practitioners report a dire need for better protecting machine learning systems in industrial applications.
The following outline is provided as an overview of and topical guide to machine learning.
Multi-focus image fusion is a multiple image compression technique using input images with different focus depths to make one output image that preserves all information.
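As a minimal illustration, the sketch below fuses two images with different focus depths by picking, at each pixel, the image that is locally sharper; the local-variance focus measure and the placeholder inputs are assumptions for the example.

```python
# Sketch of multi-focus image fusion: at each pixel, keep the value from
# whichever input image is locally sharper (higher local variance). The
# focus measure and the placeholder inputs are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def local_sharpness(img: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    local_mean = gaussian_filter(img, sigma)
    return gaussian_filter((img - local_mean) ** 2, sigma)   # local variance

def fuse_focus(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    take_a = local_sharpness(img_a) >= local_sharpness(img_b)
    return np.where(take_a, img_a, img_b)

near_focus = np.random.rand(256, 256)    # placeholder: foreground in focus
far_focus = np.random.rand(256, 256)     # placeholder: background in focus
fused = fuse_focus(near_focus, far_focus)
```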
Applications of machine learning in earth sciences include geological mapping, gas leakage detection and geological features identification. Machine learning (ML) is a type of artificial intelligence (AI) that enables computer systems to classify, cluster, identify and analyze vast and complex sets of data while eliminating the need for explicit instructions and programming. Earth science is the study of the origin, evolution, and future of the planet Earth. The Earth system can be subdivided into four major components including the solid earth, atmosphere, hydrosphere and biosphere.
Small object detection is a particular case of object detection in which various techniques are employed to detect small objects in digital images and videos. "Small objects" are objects having a small pixel footprint in the input image. In areas such as aerial imagery, state-of-the-art object detection techniques have underperformed because of small objects.