Isolation Forest is an algorithm for data anomaly detection using binary trees. It was developed by Fei Tony Liu in 2008. [1] It has linear time complexity and low memory use, which works well for high-volume data. [2] [3] It is based on the assumption that because anomalies are few and different from other data, they can be isolated using few partitions. Like decision tree algorithms, it does not perform density estimation. Unlike decision tree algorithms, it uses only path length to output an anomaly score, and does not use leaf node statistics of class distribution or target value.
Isolation Forest is fast because it splits the data space by randomly selecting an attribute and a split point. The anomaly score is inversely associated with the path length: because anomalies are few and different, they need fewer splits to be isolated.
The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008. [2] In 2012 the same authors showed that iForest has linear time complexity, a small memory requirement, and is applicable to high-dimensional data. [3] In 2010, an extension of the algorithm, SCiForest, was published to address clustered and axis-parallel anomalies. [4]
The premise of the Isolation Forest algorithm is that anomalous data points are easier to separate from the rest of the sample. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute.
An example of random partitioning in a 2D dataset of normally distributed points is shown in the first figure for a non-anomalous point and in the second one for a point that is more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.
Recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of the point in the first figure is greater than the path length of the point in the second figure.
Let $X = \{x_1, \ldots, x_n\}$ be a set of $d$-dimensional points and $X' \subseteq X$ a subsample of it. An Isolation Tree (iTree) is defined as a binary tree in which every internal node holds a test of the form $x_q < p$ for a randomly chosen attribute $q$ and split value $p$ and has exactly two children, while every external node has no children.
In order to build an iTree, the algorithm recursively divides $X'$ by randomly selecting an attribute $q$ and a split value $p$, until either the current node contains a single point or all points at the node have identical values.
When the iTree is fully grown, each point in $X'$ is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smaller path length in the tree, where the path length $h(x)$ of a point $x$ is defined as the number of edges $x$ traverses from the root node to reach an external node.
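As an illustration of the construction just described, the following is a minimal sketch of an iTree and of the path-length computation. The function and field names (build_itree, path_length) are illustrative choices, not the reference implementation, and the standard adjustment added at external nodes that still contain several points is omitted for brevity.

```python
import random

def build_itree(points, depth=0, max_depth=10):
    """Recursively partition a list of d-dimensional points (tuples/lists)."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}                      # external node
    d = len(points[0])
    q = random.randrange(d)                               # random attribute
    lo = min(p[q] for p in points)
    hi = max(p[q] for p in points)
    if lo == hi:                                          # all values identical
        return {"size": len(points)}
    split = random.uniform(lo, hi)                        # random split value
    left = [p for p in points if p[q] < split]
    right = [p for p in points if p[q] >= split]
    return {"q": q, "split": split,
            "left": build_itree(left, depth + 1, max_depth),
            "right": build_itree(right, depth + 1, max_depth)}

def path_length(x, node, depth=0):
    """Number of edges traversed from the root to the external node isolating x."""
    if "size" in node:                                    # reached an external node
        return depth
    child = node["left"] if x[node["q"]] < node["split"] else node["right"]
    return path_length(x, child, depth + 1)
```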
A probabilistic explanation of iTree is provided in the original iForest paper. [2]
Anomaly detection with Isolation Forest proceeds as follows: [4] an ensemble of iTrees is built from (sub)samples of the training data, each instance is passed through every tree to record its path length, and the average path length across the ensemble is converted into an anomaly score as described below.
The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination at an external node of the iTree corresponds to an unsuccessful search in the BST. [4] Therefore, the estimation of the average $h(x)$ for external node terminations is the same as that of the unsuccessful searches in a BST, that is [5]

$$c(m) = \begin{cases} 2H(m-1) - \dfrac{2(m-1)}{m} & \text{for } m > 2 \\ 1 & \text{for } m = 2 \\ 0 & \text{otherwise} \end{cases}$$

where $m$ is the number of points in the sample and $H(i)$ is the harmonic number, which can be estimated by $H(i) \approx \ln(i) + \gamma$, where $\gamma \approx 0.5772156649$ is the Euler–Mascheroni constant.
Above, $c(m)$ is the average of $h(x)$ given $m$, so we can use it to normalize $h(x)$ and obtain an estimate of the anomaly score for a given instance $x$:

$$s(x, m) = 2^{-\frac{E[h(x)]}{c(m)}}$$

where $E[h(x)]$ is the average value of $h(x)$ over a collection of iTrees. For any data point $x$:

- if $s(x, m)$ is close to 1, then $x$ is very likely an anomaly;
- if $s(x, m)$ is well below 0.5, then $x$ is likely a normal point;
- if all points in the sample score around 0.5, then the sample likely contains no clearly distinct anomalies.
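The normalization constant and the score can be computed directly from the formulas above; the following is a minimal sketch (function names are illustrative, not taken from any library):

```python
import math

EULER_MASCHERONI = 0.5772156649

def c(m):
    """Average path length of an unsuccessful BST search over m points."""
    if m <= 1:
        return 0.0
    if m == 2:
        return 1.0
    harmonic = math.log(m - 1) + EULER_MASCHERONI        # H(m-1) ~ ln(m-1) + gamma
    return 2.0 * harmonic - 2.0 * (m - 1) / m

def anomaly_score(avg_path_length, m):
    """s(x, m) = 2^(-E[h(x)] / c(m)); values near 1 indicate likely anomalies."""
    return 2.0 ** (-avg_path_length / c(m))

# Example: a point with average path length 4.2 in trees built on 256-point subsamples
print(anomaly_score(4.2, 256))
```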
The Isolation Forest algorithm has proven effective in spotting anomalies in data sets such as credit card transactions by European cardholders: on this highly unbalanced dataset it can distinguish fraudulent from legitimate activity by identifying the rare patterns that differ notably from the rest of the data. [6]
The dataset contains 284,807 transactions in total, of which only 492 (0.172%) are fraudulent. Because of this imbalance between authentic and fraudulent transactions, fraud detection is demanding, and specialized metrics such as the Area Under the Precision-Recall Curve (AUPRC) are essential for accurate evaluation rather than relying solely on traditional accuracy measures. [6]
The dataset consists of PCA-transformed features (V1 to V28) as well as Time (time elapsed since the initial transaction) and Amount (transaction value). The dataset was processed with the following steps (an illustrative sketch of these steps follows the list):
Scaling: The Time and Amount features were standardized with StandardScaler to bring them to a comparable input range. [7]
Imputation: Missing values were filled with the mean of the corresponding column using SimpleImputer. [7]
Feature Selection: The features with the highest kurtosis were chosen for further examination, since such features tend to contain the most extreme outliers and are therefore most likely to signal anomalies. The ten highest-kurtosis features were retained for training. [8]
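The sketch below illustrates these preprocessing steps with pandas and scikit-learn; the file name creditcard.csv and the exact column handling are assumptions for the sake of the example, not details from the cited study.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

df = pd.read_csv("creditcard.csv")          # assumed file name for the dataset

# Scaling: standardize Time and Amount
scaler = StandardScaler()
df[["Time", "Amount"]] = scaler.fit_transform(df[["Time", "Amount"]])

# Imputation: fill any missing values with the column mean
features = df.drop(columns=["Class"])
imputer = SimpleImputer(strategy="mean")
features = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)

# Feature selection: keep the 10 features with the highest kurtosis
top_features = features.kurtosis().nlargest(10).index
X = features[top_features]
y = df["Class"]
```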
The Isolation Forest model was trained only on normal transactions (Class = 0), so that it learns the common behavioral patterns in the data. The algorithm isolates instances by measuring the path length needed to separate them within a collection of randomly partitioned trees. [6]
Hyperparameter Tuning:
A grid search was performed over the following hyperparameters (an illustrative sketch follows the list):
Contamination: Expected percentage of anomalies in the dataset, tested at values 0.01, 0.02, and 0.05 [8]
Max Features: Number of features to sample for each tree, tested at values 5, 8, and 10. [8]
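Because Isolation Forest is unsupervised, such a grid search can be written as a plain loop that fits a model for each parameter combination and evaluates it on a labeled hold-out set. The sketch below illustrates this under several assumptions: the variables X and y come from the preprocessing step above, the split ratio and n_estimators are arbitrary illustrative choices, and training is restricted to normal transactions as described earlier.

```python
from itertools import product

from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

best_score, best_params = -1.0, None
for contamination, max_features in product([0.01, 0.02, 0.05], [5, 8, 10]):
    model = IsolationForest(n_estimators=100,
                            contamination=contamination,
                            max_features=max_features,
                            random_state=42)
    model.fit(X_train[y_train == 0])            # train on normal transactions only
    # score_samples is higher for normal points, so negate it to get an anomaly score
    scores = -model.score_samples(X_test)
    auprc = average_precision_score(y_test, scores)
    if auprc > best_score:
        best_score, best_params = auprc, (contamination, max_features)
```

Note that in scikit-learn the contamination parameter mainly shifts the decision threshold used by predict; the raw scores from score_samples are unaffected by it.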
The best configuration was found with:
The model was evaluated on a separate test set using accuracy, precision, recall, and the Area Under the Precision-Recall Curve (AUPRC). Below are the key results:
While the accuracy seems impressive at first glance, it mainly reflects the prevalence of regular transactions in the dataset. Precision and recall highlight how hard fraud is to detect given the severe class imbalance, and the AUPRC summarizes the trade-off between the two in a single measure. [6]
1. Scatter Plot of Detected Anomalies
Observation: The plot shows that many fraudulent transactions (red points) are located on the edges or far from the central cluster of normal transactions (blue points). However, some red points overlap with the blue cluster, indicating potential false positives or challenging cases for the model.
The Isolation Forest algorithm provides a robust solution for anomaly detection, particularly in domains like fraud detection where anomalies are rare and challenging to identify. However, its reliance on hyperparameters and sensitivity to imbalanced data necessitate careful tuning and complementary techniques for optimal results. [6] [8]
The performance of the Isolation Forest algorithm is highly dependent on the selection of its parameters. Properly tuning these parameters can significantly enhance the algorithm's ability to accurately identify anomalies. Understanding the role and impact of each parameter is crucial for optimizing the model's performance. [10]
The Isolation Forest algorithm involves several key parameters that influence its behavior and effectiveness. These parameters control various aspects of the tree construction process, the size of the sub-samples, and the thresholds for identifying anomalies. [10] Selecting appropriate parameters is the key to the performance of the Isolation Forest algorithm. Each of the parameters influences anomaly detection differently. Key parameters include:
Number of Trees : This parameter determines the number of trees in the Isolation Forest. A higher number of trees improves anomaly detection accuracy but increases computational costs. The optimal number balances resource availability with performance needs. For example, a smaller dataset might require fewer trees to save on computation, while larger datasets benefit from additional trees to capture more complexity. [2]
Subsample Size : The subsample size dictates the number of data points used to construct each tree. Smaller subsample sizes reduce computational complexity but may capture less variability in the data. For instance, a subsample size of 256 is commonly used, but the optimal value depends on dataset characteristics. [2]
Contamination Factor : This parameter estimates the proportion of outliers in the dataset. Higher contamination values flag more data points as anomalies, which can lead to false positives. Tuning this parameter carefully based on domain knowledge or cross-validation is critical to avoid bias or misclassification. [3]
Maximum Features: This parameter specifies the number of random features to consider for each split in the tree. Limiting the number of features increases randomness, making the model more robust. However, in high-dimensional datasets, selecting only the most informative features prevents overfitting and improves generalization. [2] [3]
Tree Depth : Tree depth determines the maximum number of splits for a tree. Deeper trees better capture data complexity but risk overfitting, especially in small datasets. Shallow trees, on the other hand, improve computational efficiency. [3]
The table below summarizes parameter selection strategies based on dataset characteristics; a short configuration sketch follows the table.
Parameter | Small Datasets | Large Datasets | High-Dimensional Data | Imbalanced Datasets
---|---|---|---|---
Number of Trees (n_estimators) | Use fewer trees to save on computation. [12] | More trees improve performance, but are costly. [13] | Requires more trees to capture complexity. [14] | Adjust based on dataset size. [14]
Subsample Size (max_samples) | Smaller subsample reduces cost. [12] | Larger subsample increases accuracy. [13] | Dimensionality reduction can optimize subsample size. [14] | Smaller subsample for efficiency. [14]
Contamination Factor (contamination) | Tune based on domain knowledge. [12] | Cross-validation for tuning. [13] | Careful tuning to avoid misclassification. [14] | Lower contamination helps avoid bias. [14]
Maximum Features (max_features) | Use all features unless limited by computation. [12] | Logarithmic or √n scaling for large datasets. [13] | Select most informative features. [14] | Balance feature selection to avoid overfitting. [14]
Tree Depth (max_depth) | Moderate depth to prevent overfitting. [12] | Shallower depth to save on computation. [13] | Deeper trees to capture data complexity. [14] | Adjust to balance overfitting. [14]
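As a rough illustration of how these parameters map onto a common implementation, the sketch below configures scikit-learn's IsolationForest; the values shown are arbitrary examples, not recommendations. Note that scikit-learn exposes n_estimators, max_samples, contamination and max_features directly, while tree depth is not a separate argument there and is instead bounded internally by the subsample size.

```python
from sklearn.ensemble import IsolationForest

# Illustrative configuration only; suitable values depend on the dataset.
model = IsolationForest(
    n_estimators=100,      # number of trees in the forest
    max_samples=256,       # subsample size per tree (also bounds tree depth)
    contamination=0.01,    # expected proportion of anomalies
    max_features=1.0,      # fraction (or count) of features drawn per tree
    bootstrap=False,       # sample without replacement
    random_state=42,
)
```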
Benefits of Proper Parameter Tuning:
Improved Accuracy: Fine-tuning parameters helps the algorithm better distinguish between normal data and anomalies, reducing false positives and negatives. [10]
Computational Efficiency: Selecting appropriate values for parameters like the number of trees and sub-sample size makes the algorithm more efficient without sacrificing accuracy. [10]
Generalization: Limiting tree depth and using bootstrap sampling helps the model generalize better to new data, reducing overfitting. [10]
SCiForest (Isolation Forest with Split-selection Criterion) is an extension of the original Isolation Forest algorithm, specifically designed to target clustered anomalies. It introduces a split-selection criterion and uses random hyper-planes that are non-axis-parallel to the original attributes. SCiForest does not require an optimal hyper-plane at every node; instead, it generates multiple random hyper-planes, and through sufficient trials, a good-enough hyper-plane is selected. This approach makes the resulting model highly effective due to the aggregate power of the ensemble learner. [4]
The implementation of SCiForest involves four primary steps, each tailored to improve anomaly detection by isolating clustered anomalies more effectively than standard Isolation Forest methods.
Using techniques like k-means or hierarchical clustering, SCiForest organizes features into clusters to identify meaningful subsets. By sampling random subspaces, SCiForest emphasizes meaningful feature groups, reducing noise and the impact of irrelevant or noisy dimensions. [4]
Within each selected subspace, isolation trees are constructed. These trees isolate points through random recursive splitting: at each node a hyperplane is generated from a random combination of attributes, a split is chosen according to the split-selection criterion, and the process repeats until every point is isolated or a height limit is reached.
Anomalous points, being sparse or distinct, are isolated more quickly (shorter path lengths) compared to normal points. [2]
For each data point, the isolation depth $h(x)$ is calculated for all trees across all subspaces. The anomaly score for a data point is defined, as in the original algorithm, as

$$s(x) = 2^{-\frac{E[h(x)]}{c(m)}}$$

where $E[h(x)]$ is the average isolation depth of $x$ over all trees and $c(m)$ is the normalization constant introduced above. Points with lower average path lengths $E[h(x)]$ are more likely to be anomalies. [3]
The final anomaly scores are compared against a predefined threshold $\theta$ to classify data points: if $s(x)$ exceeds $\theta$, the point is classified as anomalous; otherwise, it is normal. The threshold $\theta$ can be tailored to specific applications to control the proportion of identified anomalies. [6]
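Assuming the per-tree isolation depths have already been computed, the aggregation and thresholding step might be sketched as follows; the threshold value and function names are illustrative assumptions, not the reference SCiForest implementation, and c(m) repeats the normalization constant defined earlier so the sketch is self-contained.

```python
import math

def c(m):
    """Normalization constant from the earlier anomaly-score definition."""
    if m <= 1:
        return 0.0
    if m == 2:
        return 1.0
    return 2.0 * (math.log(m - 1) + 0.5772156649) - 2.0 * (m - 1) / m

def classify(depths, m, theta=0.6):
    """depths: isolation depths of one point across all trees in all subspaces."""
    avg_depth = sum(depths) / len(depths)       # E[h(x)]
    score = 2.0 ** (-avg_depth / c(m))          # s(x) as defined above
    return "anomalous" if score >= theta else "normal"
```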
These steps collectively enable SCiForest to adapt to varied data distributions while maintaining efficiency in anomaly detection.
This flowchart visually represents the step-by-step process of SCiForest implementation, from inputting high-dimensional datasets to detecting anomalies. Each step is highlighted with its key functionality, providing a clear overview of the methodology.
Extended Isolation Forest (Extended IF or EIF) is another extension of the original Isolation Forest algorithm. Extended IF uses rotated trees in different planes, similarly to SCiForest, and splits the data using randomly chosen values such as a random slope and intercept.
The standard Isolation Forest requires two pieces of information: 1) a random feature or coordinate, and 2) a random value for that feature drawn from the range of values present in the data. The Extended IF likewise requires only two pieces of information: 1) a random slope for the branch cut, and 2) a random intercept for the branch cut, chosen from the range of values present in the training data. This makes the Extended IF simpler than using rotation trees. [15]
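A single non-axis-parallel branch cut of this kind can be sketched as follows: a random slope (normal vector) and a random intercept drawn from the range of the training data define a hyperplane, and points are routed by the sign of their projection onto it. This is an illustrative sketch under those assumptions, not the reference EIF implementation.

```python
import numpy as np

def extended_split(X, rng=np.random.default_rng(0)):
    """One non-axis-parallel branch cut: random slope (normal vector) + intercept."""
    n_samples, n_features = X.shape
    normal = rng.normal(size=n_features)                     # random slope
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))    # random intercept point
    side = (X - intercept) @ normal <= 0                     # which side of the hyperplane
    return X[side], X[~side]

# Example: split a small 2-D dataset once
left, right = extended_split(np.array([[0.0, 1.0], [2.0, 3.0], [5.0, 0.5], [9.0, 9.0]]))
```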
The figure depicts a score map of a regular Isolation Forest in comparison with an Extended Isolation Forest for a sinusoid-shaped data set. The image shows how the Extended Isolation Forest evaluates anomaly scores that track the shape of the data much more closely. The regular Isolation Forest fails to capture the sinusoid shape and instead forms the anomaly scores into a rectangular region, effectively assuming that any point near the sinusoid data is not anomalous. The EIF evaluates anomaly scores in more detail and, unlike its predecessor, can detect anomalies that lie close to the sinusoid shape of the data but are nonetheless anomalous. The original EIF publication also includes this comparison for a single-blob-shaped and a two-blob-shaped data set, and additionally compares the EIF results to an isolation forest using rotation trees. [15]
The Extended Isolation Forest enhances the traditional Isolation Forest algorithm by addressing some of its limitations, particularly in handling high-dimensional data and improving anomaly detection accuracy. Key improvements in EIF include:
Enhanced Splitting Mechanism: Unlike traditional Isolation Forest, which uses random axis-aligned splits, EIF uses hyperplanes for splitting data. This approach allows for more flexible and accurate partitioning of the data space, which is especially useful in high-dimensional datasets.
Improved Anomaly Scoring: EIF refines the anomaly scoring process by considering the distance of data points from the hyperplane used in splitting. This provides a more granular and precise anomaly score, leading to better differentiation between normal and anomalous points.
Handling of High-Dimensional Data: The use of hyperplanes also improves EIF's performance in high-dimensional spaces. Traditional Isolation Forest can suffer from the curse of dimensionality in such scenarios, but EIF mitigates this issue by creating more meaningful and informative partitions in the data space. [16]
The original implementation, by Fei Tony Liu, is Isolation Forest in R.
Other implementations (in alphabetical order):
Other variations of Isolation Forest algorithm implementations:
The Isolation Forest algorithm is commonly used by data scientists through the version available in the scikit-learn library. The snippet below shows a brief implementation, with explanations given in comments.
```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# 'data.csv' is assumed to contain samples as rows, features as columns,
# and a column labeled 'Class' with a binary classification of the samples.
df = pd.read_csv('data.csv')
X = df.drop(columns=['Class'])
y = df['Class']

# Determine what fraction of samples will be treated as outliers, based on the classification
outlier_fraction = len(df[df['Class'] == 1]) / float(len(df[df['Class'] == 0]))

# Create and fit the model on the feature columns; parameters can be optimized
model = IsolationForest(n_estimators=100, contamination=outlier_fraction, random_state=42)
model.fit(X)
```
In this snippet we can observe the simplicity of a standard implementation of the algorithm. The only input the user needs to adjust is the outlier fraction, i.e. the percentage of samples expected to be classified as outliers; here it is derived from the ratio of positive to negative samples under the given classification. Most of the other steps are standard for any decision-tree-based technique available through scikit-learn: the user splits the target variable from the features and fits the model, defined with a given number of estimators (trees).
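Once fitted, predictions and scores can be obtained with the standard scikit-learn interface; continuing the snippet above, for example:

```python
# -1 marks predicted outliers, 1 marks inliers
labels = model.predict(X)
# score_samples is higher for normal points; negate it so higher means more anomalous
scores = -model.score_samples(X)
```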
This snippet is a shortened and adapted version of an implementation explored by GeeksforGeeks, which can be consulted for further exploration. [19]
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are sometimes excluded from the data set. An outlier can be an indication of exciting possibility, but can also cause serious problems in statistical analyses.
A decision tree is a decision support recursive partitioning structure that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.
Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.
In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.
The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. The curse generally refers to issues that arise when the number of datapoints is small relative to the intrinsic dimension of the data.
Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles at SRI International in 1981. They used RANSAC to solve the location determination problem (LDP), where the goal is to determine the points in the space that project onto an image into a set of landmarks with known locations.
Bootstrap aggregating, also called bagging or bootstrapping, is a machine learning (ML) ensemble meta-algorithm designed to improve the stability and accuracy of ML classification and regression algorithms. It also reduces variance and overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the ensemble averaging approach.
Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees. Random forests correct for decision trees' habit of overfitting to their training set.
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method. It was first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. Most often, it is used for classification, as a k-NN classifier, the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.
Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression. These are also of interest while wanting to find a representative using some distance other than squared euclidean distance.
Robust statistics are statistics that maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.
In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.
In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, tries to identify objects of a specific class amongst all objects, by primarily learning from a training set containing only the objects of that class, although there exist variants of one-class classifiers where counter-examples are used to further refine the classification boundary. This is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or the operational status of a nuclear plant as 'normal': In this scenario, there are few, if any, examples of catastrophic system states; only the statistics of normal operation are known.
BIRCH is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources. In most cases, BIRCH only requires a single scan of the database.
ELKI is a data mining software framework developed for use in research and teaching. It was originally created by the database systems research unit at the Ludwig Maximilian University of Munich, Germany, led by Professor Hans-Peter Kriegel. The project has continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.
The Point Cloud Library (PCL) is an open-source library of algorithms for point cloud processing tasks and 3D geometry processing, such as occur in three-dimensional computer vision. The library contains algorithms for filtering, feature estimation, surface reconstruction, 3D registration, model fitting, object recognition, and segmentation. Each module is implemented as a smaller library that can be compiled separately. PCL has its own data format for storing point clouds - PCD, but also allows datasets to be loaded and saved in many other formats. It is written in C++ and released under the BSD license.
Wireless sensor networks (WSN) are a spatially distributed network of autonomous sensors used for monitoring an environment. Energy cost is a major limitation for WSN requiring the need for energy efficient networks and processing. One of major energy costs in WSN is the energy spent on communication between nodes and it is sometimes desirable to only send data to a gateway node when an event of interest is triggered at a sensor. Sensors will then only open communication during a probable event, saving on communication costs. Fields interested in this type of network include surveillance, home automation, disaster relief, traffic control, health care and more.