In data analysis, anomaly detection (also referred to as outlier detection and sometimes as novelty detection) is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. [1] Such examples may arouse suspicions of being generated by a different mechanism, [2] or appear inconsistent with the remainder of that set of data. [3]
Anomaly detection finds application in many domains including cybersecurity, medicine, machine vision, statistics, neuroscience, law enforcement and financial fraud, to name only a few. Anomalies were initially identified so that they could be rejected or omitted from the data to aid statistical analysis, for example when computing the mean or standard deviation. They were also removed to improve the predictions of models such as linear regression, and more recently their removal has been used to improve the performance of machine learning algorithms. However, in many applications anomalies themselves are of interest and are often the most valuable observations in the entire data set; these need to be identified and separated from noise or irrelevant outliers.
Three broad categories of anomaly detection techniques exist. [1] Supervised anomaly detection techniques require a data set in which instances have been labeled as "normal" or "abnormal" and involve training a classifier. However, this approach is rarely used in anomaly detection due to the general unavailability of labelled data and the inherently unbalanced nature of the classes. Semi-supervised anomaly detection techniques assume that some portion of the data is labelled. This may be any combination of the normal or anomalous data, but more often than not the techniques construct a model representing normal behavior from a given normal training data set, and then test the likelihood of a test instance being generated by that model. Unsupervised anomaly detection techniques assume the data is unlabelled and are by far the most commonly used due to their wider applicability.
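As a concrete illustration of the semi-supervised setting just described, the following sketch (a minimal example with illustrative data, not taken from any cited work) fits a single Gaussian to instances assumed to be normal and flags test instances whose likelihood under that model falls below an arbitrarily chosen threshold:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Training data assumed to contain only normal behaviour (illustrative values).
rng = np.random.default_rng(0)
normal_data = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# Model normal behaviour with a single multivariate Gaussian.
model = multivariate_normal(mean=normal_data.mean(axis=0),
                            cov=np.cov(normal_data, rowvar=False))

def is_anomalous(x, threshold=1e-3):
    """Flag a test instance whose likelihood under the normal model is low."""
    return model.pdf(x) < threshold

print(is_anomalous([0.1, -0.2]))  # near the training data -> False
print(is_anomalous([6.0, 6.0]))   # far from the training data -> True
```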
Many attempts have been made in the statistical and computer science communities to define an anomaly. The most prevalent definitions fall into three groups: those that are ambiguous, those that are specific to a method with pre-defined thresholds usually chosen empirically, and those that are formally defined.
The concept of intrusion detection, a critical component of anomaly detection, has evolved significantly over time. Initially, it was a manual process where system administrators would monitor for unusual activities, such as a vacationing user's account being accessed or unexpected printer activity. This approach was not scalable and was soon superseded by the analysis of audit logs and system logs for signs of malicious behavior. [4]
By the late 1970s and early 1980s, the analysis of these logs was primarily used retrospectively to investigate incidents, as the volume of data made it impractical for real-time monitoring. The affordability of digital storage eventually led to audit logs being analyzed online, with specialized programs being developed to sift through the data. These programs, however, were typically run during off-peak hours due to their computational intensity. [4]
The 1990s brought the advent of real-time intrusion detection systems capable of analyzing audit data as it was generated, allowing for immediate detection of and response to attacks. This marked a significant shift towards proactive intrusion detection. [4]
As the field has continued to develop, the focus has shifted to creating solutions that can be efficiently implemented across large and complex network environments, adapting to the ever-growing variety of security threats and the dynamic nature of modern computing infrastructures. [4]
Anomaly detection is applicable in a very large number and variety of domains, and is an important subarea of unsupervised machine learning. As such it has applications in cyber-security, intrusion detection, fraud detection, fault detection, system health monitoring, event detection in sensor networks, detecting ecosystem disturbances, defect detection in images using machine vision, medical diagnosis and law enforcement. [5]
Anomaly detection was proposed for intrusion detection systems (IDS) by Dorothy Denning in 1986. [6] Anomaly detection for IDS is normally accomplished with thresholds and statistics, but can also be done with soft computing, and inductive learning. [7] Types of features proposed by 1999 included profiles of users, workstations, networks, remote hosts, groups of users, and programs based on frequencies, means, variances, covariances, and standard deviations. [8] The counterpart of anomaly detection in intrusion detection is misuse detection.
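A minimal sketch of the threshold-and-statistics style of detection mentioned above: a profile consisting of the mean and standard deviation of a per-user activity metric is built from past observations, and a new observation is flagged when it lies more than a fixed number of standard deviations above the mean. The metric, the data and the factor of three are illustrative assumptions, not details of Denning's model.

```python
import statistics

# Historical counts of failed logins per hour for one user (illustrative data).
history = [2, 1, 0, 3, 2, 1, 2, 0, 1, 2]

mu = statistics.mean(history)        # profile mean
sigma = statistics.stdev(history)    # profile standard deviation

def flag(observation, k=3.0):
    """Flag observations more than k standard deviations above the profile mean."""
    return observation > mu + k * sigma

print(flag(3))    # within the profile -> False
print(flag(25))   # far outside the profile -> True
```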
Anomaly detection is vital in fintech for fraud prevention. [9] [10]
Preprocessing data to remove anomalies can be an important step in data analysis, and is done for a number of reasons. Statistics such as the mean and standard deviation are more accurate after the removal of anomalies, and the visualisation of data can also be improved. In supervised learning, removing the anomalous data from the dataset often results in a statistically significant increase in accuracy. [11] [12]
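A small sketch of this preprocessing step, assuming a simple z-score rule: values more than three standard deviations from the mean are dropped before the summary statistics are recomputed. The data and the cut-off of three are illustrative conventions, not prescriptions.

```python
import numpy as np

# Fifty typical measurements plus one gross outlier (illustrative data).
rng = np.random.default_rng(0)
values = np.append(rng.normal(10.0, 0.5, size=50), 55.0)

z = (values - values.mean()) / values.std()   # z-score of every value
cleaned = values[np.abs(z) < 3]               # drop values more than 3 sigma away

print(values.mean(), values.std())     # skewed by the outlier
print(cleaned.mean(), cleaned.std())   # closer to the bulk of the data
```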
Anomaly detection has become increasingly vital in video surveillance to enhance security and safety. [13] [14] With the advent of deep learning technologies, methods using Convolutional Neural Networks (CNNs) and Simple Recurrent Units (SRUs) have shown significant promise in identifying unusual activities or behaviors in video data. [13] These models can process and analyze extensive video feeds in real time, recognizing patterns that deviate from the norm, which may indicate potential security threats or safety violations. [13] An important aspect for video surveillance is the development of scalable real-time frameworks. [15] [16] Such pipelines are required for processing multiple video streams with low computational resources.
In IT infrastructure management, anomaly detection is crucial for ensuring the smooth operation and reliability of services. [17] Techniques like the IT Infrastructure Library (ITIL) and monitoring frameworks are employed to track and manage system performance and user experience. [17] Detecting anomalies can help identify and pre-empt potential performance degradations or system failures, thus maintaining productivity and business process effectiveness. [17]
Anomaly detection is critical for the security and efficiency of Internet of Things (IoT) systems. [18] It helps in identifying system failures and security breaches in complex networks of IoT devices. [18] The methods must manage real-time data, diverse device types, and scale effectively. Garbe et al. [19] have introduced a multi-stage anomaly detection framework that improves upon traditional methods by incorporating spatial clustering, density-based clustering, and locality-sensitive hashing. This tailored approach is designed to better handle the vast and varied nature of IoT data, thereby enhancing security and operational reliability in smart infrastructure and industrial IoT systems. [19]
Anomaly detection is crucial in the petroleum industry for monitoring critical machinery. [20] Martí et al. used a novel segmentation algorithm to analyze sensor data for real-time anomaly detection. [20] This approach helps promptly identify and address any irregularities in sensor readings, ensuring the reliability and safety of petroleum operations. [20]
In the oil and gas sector, anomaly detection is not just crucial for maintenance and safety, but also for environmental protection. [21] Aljameel et al. propose an advanced machine learning-based model for detecting minor leaks in oil and gas pipelines, a task traditional methods may miss. [21]
Many anomaly detection techniques have been proposed in the literature. [1] [22] The performance of different methods usually depends on the data set. For example, some may be suited to detecting local outliers and others to global outliers, and no method shows a systematic advantage over the others when compared across many data sets. [23] [24] Almost all algorithms also require the setting of non-intuitive parameters that are critical for performance and usually unknown before application. Some of the popular techniques are mentioned below and are broken down into categories:
Also referred to as frequency-based or counting-based, the simplest non-parametric anomaly detection method is to build a histogram from the training data or a set of known normal instances, and then either mark a test point as anomalous if it does not fall in any of the histogram bins, or assign it an anomaly score based on the height of the bin it falls in. [1] The size of the bins is key to the effectiveness of this technique but must be determined by the implementer.
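A minimal sketch of this histogram approach for a single feature; the bin count and data are illustrative assumptions, the score is the inverse of the relative height of the bin a test point falls into, and points outside every bin are treated as maximally anomalous:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=1000)       # known-normal training instances (illustrative)

counts, edges = np.histogram(train, bins=20)  # bin count must be chosen by the implementer

def anomaly_score(x):
    """Higher score means more anomalous; points outside all bins score infinity."""
    i = np.searchsorted(edges, x, side="right") - 1
    if i < 0 or i >= len(counts) or counts[i] == 0:
        return float("inf")
    return counts.sum() / counts[i]           # inverse of the relative bin height

print(anomaly_score(0.1))   # dense bin -> low score
print(anomaly_score(5.0))   # outside every bin -> inf
```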
A more sophisticated technique uses kernel functions to approximate the distribution of the normal data. Instances in low probability areas of the distribution are then considered anomalies. [25]
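A sketch of this kernel-based variant using scikit-learn's KernelDensity estimator (an assumed dependency): instances whose estimated log-density falls below a threshold derived from the training scores are treated as anomalies. The bandwidth and the first-percentile threshold are illustrative choices rather than recommendations.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
normal_data = rng.normal(0.0, 1.0, size=(1000, 2))   # assumed-normal training set

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(normal_data)

# Use a low percentile of the training log-densities as the anomaly threshold.
threshold = np.percentile(kde.score_samples(normal_data), 1)

def is_anomaly(point):
    """True when the estimated log-density of the point falls below the threshold."""
    return kde.score_samples(np.atleast_2d(point))[0] < threshold

print(is_anomaly([0.2, -0.1]))   # high-density region -> False
print(is_anomaly([5.0, 5.0]))    # low-density region -> True
```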
Histogram-based Outlier Score (HBOS) uses value histograms and assumes feature independence for fast predictions. [55]
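A simplified sketch of the HBOS idea as just described: one histogram per feature, features treated as independent, and the score taken as the sum over features of the log of the inverse bin height. The bin count and smoothing constant are illustrative assumptions, not details of the cited implementation.

```python
import numpy as np

def hbos_fit(X, bins=10):
    """Build one density-normalised histogram per feature (features treated as independent)."""
    return [np.histogram(X[:, j], bins=bins, density=True) for j in range(X.shape[1])]

def hbos_score(x, hists, eps=1e-9):
    """Sum of log(1 / bin height) over features; higher means more anomalous."""
    score = 0.0
    for value, (heights, edges) in zip(x, hists):
        # Out-of-range values are assigned the nearest edge bin.
        i = np.clip(np.searchsorted(edges, value, side="right") - 1, 0, len(heights) - 1)
        score += np.log(1.0 / (heights[i] + eps))
    return score

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(1000, 3))
hists = hbos_fit(X)
print(hbos_score([0.0, 0.1, -0.2], hists))   # typical point -> lower score
print(hbos_score([4.0, -4.0, 4.0], hists))   # extreme point -> higher score
```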
Dynamic networks, such as those representing financial systems, social media interactions, and transportation infrastructure, are subject to constant change, making anomaly detection within them a complex task. Unlike static graphs, dynamic networks reflect evolving relationships and states, requiring adaptive techniques for anomaly detection.
Many of the methods discussed above only yield an anomaly score, which can often be explained to users as the point lying in a region of low data density (or of relatively low density compared to its neighbors' densities). In explainable artificial intelligence, users demand methods with higher explainability, and some methods allow for more detailed explanations.