Isolation forest

Last updated

Isolation Forest is an algorithm for data anomaly detection using binary trees. It was developed by Fei Tony Liu in 2008. [1] It has a linear time complexity and a low memory use, which works well for high-volume data. [2] [3] It is based on the assumption that because anomalies are few and different from other data, they can be isolated using few partitions. Like decision tree algorithms, it does not perform density estimation. Unlike decision tree algorithms, it uses only path length to output an anomaly score, and does not use leaf node statistics of class distribution or target value.

Contents

Isolation Forest is fast because it splits the data space, randomly selecting an attribute and split point. The anomaly score is inversely associated with the path-length because anomalies need fewer splits to be isolated, because they are few and different.

History

The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008. [2] In 2012 the same authors showed that iForest has linear time complexity, a small memory requirement, and is applicable to high-dimensional data. [3] In 2010, an extension of the algorithm, SCiforest, was published to address clustered and axis-paralleled anomalies. [4]

Isolation trees

An example of isolating a non-anomalous point in a 2D Gaussian distribution. Isolating a Non-Anomalous Point.png
An example of isolating a non-anomalous point in a 2D Gaussian distribution.

The premise of the Isolation Forest algorithm is that anomalous data points are easier to separate from the rest of the sample. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute.

An example of isolating an anomalous point in a 2D Gaussian distribution. Isolating an Anomalous Point.png
An example of isolating an anomalous point in a 2D Gaussian distribution.

An example of random partitioning in a 2D dataset of normally distributed points is shown in the first figure for a non-anomalous point and in the second one for a point that is more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.

Recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point in the first figure is greater than the path length of in the second figure.

Let be a set of d-dimensional points and . An Isolation Tree (iTree) is defined as a data structure with the following properties:

  1. for each node in the Tree, is either an external-node with no child, or an internal-node with one “test” and exactly two child nodes ( and )
  2. a test at node consists of an attribute and a split value such that the test determines the traversal of a data point to either or .

In order to build an iTree, the algorithm recursively divides by randomly selecting an attribute and a split value , until either

  1. the node has only one instance, or
  2. all data at the node have the same values.

When the iTree is fully grown, each point in is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smaller path length in the tree, where the path length of point is defined as the number of edges traverses from the root node to get to an external node.

A probabilistic explanation of iTree is provided in the original iForest paper. [2]

Anomaly detection

Anomaly detection with Isolation Forest is done as follows: [4]

  1. Use the training dataset to build some number of iTrees
  2. For each data point in the test set:
    1. Pass it through all the iTrees, counting the path length for each tree
    2. Assign an “anomaly score” to the instance
    3. Label the point as “anomaly” if its score is greater than a predefined threshold, which depends on the domain

Anomaly score

The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST. [4] Therefore, the estimation of average for external node terminations is the same as that of the unsuccessful searches in BST, that is [5]

where is the test set size, is the sample set size and is the harmonic number, which can be estimated by , where is the Euler-Mascheroni constant.

Above, is the average given , so we can use it to normalize to get an estimate of the anomaly score for a given instance x:

where is the average value of from a collection of iTrees. For any data point :

Application of Isolation Forest for Credit Card Fraud Detection (Anomaly)

The Isolation Forest algorithm has shown its effectiveness in spotting anomalies in data sets like uncovering credit card fraud instances among transactions, by European cardholders with an unbalanced dataset where it can distinguish fraudulent activities from legitimate ones by identifying rare patterns that show notable differences. [6]

Dataset and Preprocessing

In this research projects dataset, there are 284807 transactions recorded in total out of which only 492 are identified as fraudulent (0.172%). Due to this imbalance between authentic and fraudulent transactions detection of fraud becomes quite demanding; hence specialized metrics such as the Area Under the Precision Recall Curve (AUPRC) are essential for accurate evaluation rather, than relying solely on traditional accuracy measures. [6]

The dataset consists of PCA transformed features (from V1, to V28) well as the Time (time elapsed since the initial transaction) and Amount (transaction value). We processed the dataset using the steps:

Scaling : The Time and Amount features by utilizing StandardScaler to standardize their input range. [7]

Imputation: Missing data in the dataset were filled in by using the average of the corresponding columns, with SimpleImputer. [7]

Feature Selection : To enhance the model's effectiveness and accuracy in predictions and analysis tasks was to choose features with the kurtosis value for further examination because these particular features tend to have the most significant outliers that could potentially signal irregularities or anomalies within the data set used for modeling purposes. For training purposes specifically, a selection of the 10 features were identified and prioritized as key components, in refining the model's capabilities and enhancing its overall performance. [8]

Model Training and Hyperparameter Tuning

The Isolation Forest model was specifically trained on transactions (Class=0) focusing on recognizing common behavioral patterns in data analysis tasks. The algorithm separates out instances by measuring the distance needed to isolate them within a collection of randomly divided trees. [6]

Hyperparameter Tuning:

A grid search was performed over the following hyperparameters

Contamination: Expected percentage of anomalies in the dataset, tested at values 0.01, 0.02, and 0.05 [8]

Max Features: Number of features to sample for each tree, tested at values 5, 8, and 10. [8]

The best configuration was found with:

Results and Evaluation

The model was evaluated on a separate test set using accuracy, precision, recall, and the Area Under the Precision-Recall Curve (AUPRC). Below are the key results:

While the accuracy seems impressive at glance it mainly showcases the prevalence of regular transactions within the dataset. Precision and recall emphasize the challenges in detecting fraud because of the significant imbalance present. In assessing both precision and recall the AUPRC offers an evaluation by considering the balance between precision and recall. [6]

Visualizing Results

1. Scatter Plot of Detected Anomalies

The scatter plot uses Credit Card Fraud Detection dataset and represents the anomalies (transactions) pinpointed by the Isolation Forest algorithm in a two-dimensional manner using two specific dataset features. V10 along the x axis and V20 along the y axis are selected for this purpose due to their high kurtosis values signifying extreme outlier traits crucial for detecting anomalies effectively. Anomaly Dectection.png
The scatter plot uses Credit Card Fraud Detection dataset and represents the anomalies (transactions) pinpointed by the Isolation Forest algorithm in a two-dimensional manner using two specific dataset features. V10 along the x axis and V20 along the y axis are selected for this purpose due to their high kurtosis values signifying extreme outlier traits crucial for detecting anomalies effectively.

Key Details

  • Red Points: Represent the fraudulent transactions flagged by the model as anomalies. These points deviate significantly from the dense clusters of normal transactions, showcasing the algorithm's capability to isolate outliers effectively.
  • Blue Points: Represent the normal transactions, which form the dense cluster at the center of the plot. These are transactions the model identified as not being anomalies.
  • Interpretability: The use of two features with high kurtosis helps in visually understanding the model's decision-making process. In high-dimensional data like this (28 PCA-transformed features), reducing to two dimensions with the most extreme outliers provides an interpretable representation of the results.

Observation: The plot shows that many fraudulent transactions (red points) are located on the edges or far from the central cluster of normal transactions (blue points). However, some red points overlap with the blue cluster, indicating potential false positives or challenging cases for the model.

2. Precision-Recall Curve

The Precision-Recall Curve (PRC) is a widely used evaluation tool for models dealing with highly imbalanced datasets, such as fraud detection. It shows the trade-off between Precision and Recall across different thresholds using Credit Card Fraud Detection dataset . Precision and Recall Curve.png
The Precision-Recall Curve (PRC) is a widely used evaluation tool for models dealing with highly imbalanced datasets, such as fraud detection. It shows the trade-off between Precision and Recall across different thresholds using Credit Card Fraud Detection dataset .

Key Details:

  • X-axis (Recall): Represents the proportion of true fraudulent transactions (positives) that were correctly identified by the model. A higher recall indicates fewer missed frauds.
  • Y-axis(Precision): Represents the proportion of transactions flagged as fraud that were fraudulent. A higher precision indicates fewer false positives.
  • AUPRC Value: The Area Under the Precision-Recall Curve (AUPRC) quantifies the model's performance. For this dataset, the AUPRC is 0.22, reflecting the difficulty of detecting fraud in a highly imbalanced dataset.

Observation:

  • At high recall values, precision decreases sharply, indicating that as the model becomes more aggressive in identifying anomalies, it also flags more normal transactions as fraud, leading to a higher false-positive rate.
  • Conversely, at higher precision values, recall decreases, meaning the model becomes more conservative and misses many fraudulent transactions.

Strengths of Isolation Forest

  • Scalability: With a linear time complexity of O(n*logn), Isolation Forest is efficient for large datasets. [6]
  • Unsupervised Nature: The model does not rely on labeled data, making it suitable for anomaly detection in various domains. [8]
  • Feature-agnostic: The algorithm adapts to different datasets without making assumptions about feature distributions. [7]
Challenges
  • Imbalanced Data: Low precision indicates that many normal transactions are incorrectly flagged as fraud, leading to false positives. [7]
  • Sensitivity to Hyperparameters: Contamination rate and feature sampling heavily influence the model's performance, requiring extensive tuning. [8]
  • Interpretability: While effective, the algorithm's outputs can be challenging to interpret without domain-specific knowledge. [6]
Future Directions
  • Combining Models: A hybrid approach, integrating supervised learning with Isolation Forest, may enhance performance by leveraging labeled data for known fraud cases. [7]
  • Active Learning: Incorporating feedback loops to iteratively refine the model using misclassified transactions could improve recall and precision. [8]
  • Feature Engineering: Adding transaction metadata, such as merchant location and transaction type, could further aid anomaly detection. [6]

Conclusion

The Isolation Forest algorithm provides a robust solution for anomaly detection, particularly in domains like fraud detection where anomalies are rare and challenging to identify. However, its reliance on hyperparameters and sensitivity to imbalanced data necessitate careful tuning and complementary techniques for optimal results. [6] [8]

Properties

Parameter Selection

The performance of the Isolation Forest algorithm is highly dependent on the selection of its parameters. Properly tuning these parameters can significantly enhance the algorithm's ability to accurately identify anomalies. Understanding the role and impact of each parameter is crucial for optimizing the model's performance. [10]

The bar chart shows the performance sensitivity of key Isolation Forest parameters across different value ranges, based on Dal Pozzolo et al.'s Credit Card Fraud Detection dataset (2014), highlighting tuning complexity. Isolation forest performance chart.png
The bar chart shows the performance sensitivity of key Isolation Forest parameters across different value ranges, based on Dal Pozzolo et al.'s Credit Card Fraud Detection dataset (2014), highlighting tuning complexity.

The Isolation Forest algorithm involves several key parameters that influence its behavior and effectiveness. These parameters control various aspects of the tree construction process, the size of the sub-samples, and the thresholds for identifying anomalies. [10] Selecting appropriate parameters is the key to the performance of the Isolation Forest algorithm. Each of the parameters influences anomaly detection differently. Key parameters include:

Number of Trees : This parameter determines the number of trees in the Isolation Forest. A higher number of trees improves anomaly detection accuracy but increases computational costs. The optimal number balances resource availability with performance needs. For example, a smaller dataset might require fewer trees to save on computation, while larger datasets benefit from additional trees to capture more complexity. [2]

Subsample Size : The subsample size dictates the number of data points used to construct each tree. Smaller subsample sizes reduce computational complexity but may capture less variability in the data. For instance, a subsample size of 256 is commonly used, but the optimal value depends on dataset characteristics. [2]

Contamination Factor : This parameter estimates the proportion of outliers in the dataset. Higher contamination values flag more data points as anomalies, which can lead to false positives. Tuning this parameter carefully based on domain knowledge or cross-validation is critical to avoid bias or misclassification. [3]

Maximum Features: This parameter specifies the number of random features to consider for each split in the tree. Limiting the number of features increases randomness, making the model more robust. However, in high-dimensional datasets, selecting only the most informative features prevents overfitting and improves generalization. [2] [3]

Tree Depth : Tree depth determines the maximum number of splits for a tree. Deeper trees better capture data complexity but risk overfitting, especially in small datasets. Shallow trees, on the other hand, improve computational efficiency. [3]

The table below summarizes parameter selection strategies based on dataset characteristics.

Guidelines for Selecting Key Parameters
ParameterSmall DatasetsLarge DatasetsHigh-Dimensional DataImbalanced Datasets
Number of Trees
n_estimators
Use fewer trees to save on computation. [12] More trees improve performance, but are costly. [13] Requires more trees to capture complexity. [14] Adjust based on dataset size. [14]
Subsample Size
max_samples
Smaller subsample reduces cost. [12] Larger subsample increases accuracy. [13] Dimensionality reduction can optimize subsample size. [14] Smaller subsample for efficiency. [14]
Contamination Factor
contamination
Tune based on domain knowledge. [12] Cross-validation for tuning. [13] Careful tuning to avoid misclassification. [14] Lower contamination helps avoid bias. [14]
Maximum Features
max_features
Use all features unless limited by computation. [12] Logarithmic or √n scaling for large datasets. [13] Select most informative features. [14] Balance feature selection to avoid overfitting. [14]
Tree Depth
max_depth
Moderate depth to prevent overfitting. [12] Shallower depth to save on computation. [13] Deeper trees to capture data complexity. [14] Adjust to balance overfitting. [14]

Benefits of Proper Parameter Tuning:
Improved Accuracy: Fine-tuning parameters helps the algorithm better distinguish between normal data and anomalies, reducing false positives and negatives. [10]
Computational Efficiency: Selecting appropriate values for parameters like the number of trees and sub-sample size makes the algorithm more efficient without sacrificing accuracy. [10]
Generalization: Limiting tree depth and using bootstrap sampling helps the model generalize better to new data, reducing overfitting. [10]

SCiForest

SCiForest (Isolation Forest with Split-selection Criterion) is an extension of the original Isolation Forest algorithm, specifically designed to target clustered anomalies. It introduces a split-selection criterion and uses random hyper-planes that are non-axis-parallel to the original attributes. SCiForest does not require an optimal hyper-plane at every node; instead, it generates multiple random hyper-planes, and through sufficient trials, a good-enough hyper-plane is selected. This approach makes the resulting model highly effective due to the aggregate power of the ensemble learner. [4]

Steps in SCiForest Implementation

The implementation of SciForest involves four primary steps, each tailored to improve anomaly detection by isolating clustered anomalies more effectively than standard Isolation Forest methods.

1. Subspace Selection

Using techniques like KMeans or hierarchical clustering, SciForest organizes features into clusters to identify meaningful subsets. By sampling random subspaces, SciForest emphasizes meaningful feature groups, reducing noise and improving focus. This reduces the impact of irrelevant or noisy dimensions. [4]

2. Isolation Tree Construction

Within each selected subspace, isolation trees are constructed. These trees isolate points through random recursive splitting:

  • A feature is selected randomly from the subspace.
  • A random split value within the feature's range is chosen to partition the data.

Anomalous points, being sparse or distinct, are isolated more quickly (shorter path lengths) compared to normal points. [2]

3. Anomaly Scoring

For each data point, the isolation depth () is calculated for all trees across all subspaces. The anomaly score for a data point is defined as:

Where:

  • : Path length for data point in the -th tree.
  • : Total number of isolation trees.

Points with lower average path lengths () are more likely to be anomalies. [3]

4. Thresholding

The final anomaly scores are compared against a predefined threshold to classify data points. If , the point is classified as anomalous; otherwise, it is normal. The anomaly score threshold, θ, can be tailored to specific applications to control the proportion of identified anomalies. [6]

These steps collectively enable SciForest to adapt to varied data distributions while maintaining efficiency in anomaly detection.

SCiForest Implementation Flowchart

This flowchart visually represents the step-by-step process of SCiForest implementation, from inputting high-dimensional datasets to detecting anomalies. Each step is highlighted with its key functionality, providing a clear overview of the methodology.

Extended Isolation Forest

Extended Isolation Forest (Extended IF or EIF) is another extension of the original Isolation Forest algorithm. Extended IF uses rotated trees in different planes, similarly to SCiForest and random values are selected to split the data, such as a random slope or intercept.

The standard Isolation Forest requires two pieces of information, those being 1) a random feature or coordinate, and 2) a random value for the feature from the range of available values in the data. The Extended IF also requires only two pieces of information, this time being 1) a random slope for the branch cut, and 2) a random intercept for the branch cut which is chosen from the range of available values of the training data. This makes the Extended IF simpler than using rotation trees. [15]

A comparison between a score map for a regular isolation forest (in the left) and a score map for an extended isolation forest (in the right). This image is a recreation from the code provided by the original EIF paper, using the Python matplotlib library . EIF sinusoid.png
A comparison between a score map for a regular isolation forest (in the left) and a score map for an extended isolation forest (in the right). This image is a recreation from the code provided by the original EIF paper, using the Python matplotlib library .

The figure depicts a score map of a regular Isolation Forest in comparison to an Extended Isolation Forest for a sinusoidal-shaped data-set. This image allows us to clearly observe the improvement made by the Extended Isolation Forest in evaluating the scores much more accurately when compared to the shape of the data. While the regular isolation forest fails in capturing the sinusoid shape of the data and properly evaluating the anomaly scores. The regular Isolation Forest shapes the anomaly scores into a rectangular shape and simply assumes that any region nearby the sinusoid data point is not to be anomalous. In comparison, the EIF is more accurate in evaluating anomaly scores with more detail and unlike its predecessor, the EIF is able to detect anomalies that are close to the sinusoid shape of the data but are still anomalous. The original EIF publication includes also this comparison with a single-blob-shaped data-set and a two-blob-shaped data-set, also comparing the EIF results to isolation forest using rotation trees. [15]

Improvements in Extended Isolation Forest

The Extended Isolation Forest enhances the traditional Isolation Forest algorithm by addressing some of its limitations, particularly in handling high-dimensional data and improving anomaly detection accuracy. Key improvements in EIF include:

Enhanced Splitting Mechanism: Unlike traditional Isolation Forest, which uses random axis-aligned splits, EIF uses hyperplanes for splitting data. This approach allows for more flexible and accurate partitioning of the data space, which is especially useful in high-dimensional datasets.

Improved Anomaly Scoring: EIF refines the anomaly scoring process by considering the distance of data points from the hyperplane used in splitting. This provides a more granular and precise anomaly score, leading to better differentiation between normal and anomalous points.

Handling of High-Dimensional Data: The use of hyperplanes also improves EIF's performance in high-dimensional spaces. Traditional Isolation Forest can suffer from the curse of dimensionality in such scenarios, but EIF mitigates this issue by creating more meaningful and informative partitions in the data space. [16]

Open source implementations

Original implementation by Fei Tony Liu is Isolation Forest in R.

Other implementations (in alphabetical order):

Other variations of Isolation Forest algorithm implementations:

Python implementation with Scikit-learn

The isolation forest algorithm is commonly used by data scientists through the version made available in the scikit-learn library. The snippet below depicts a brief implementation of an isolation forest, with direct explanations with comments.

importpandasaspdfromsklearn.ensembleimportIsolationForest# Consider 'data.csv' is a file containing samples as rows and features as column, and a column labeled 'Class' with a binary classification of your samples.df=pd.read_csv('data.csv')X=df.drop(columns=['Class'])y=df['Class']# Determine how many samples will be outliers based on the classificationoutlier_fraction=len(df[df['Class']==1])/float(len(df[df['Class']==0]))# Create and fit model, parameters can be optimizedmodel=IsolationForest(n_estimators=100,contamination=outlier_fraction,random_state=42)model.fit(df)

In this snippet we can observe the simplicity of a standard implementation of the algorithm. The only requirement data that the user needs to adjust is the outlier fraction in which the user determines a percentage of the samples to be classifier as outliers. This can be commonly done by selection a group among the positive and negative samples according to a giving classification. Most of the other steps are pretty standard to any decision tree based technique made available through scikit-learn, in which the user simply needs to split the target variable from the features and fit the model after it is defined with a giving number of estimators (or trees).

This snippet is a shortened adapted version of an implementation explored by GeeksforGeeks, which can be accessed for further explorations. [19]

See also

Related Research Articles

<span class="mw-page-title-main">Outlier</span> Observation far apart from others in statistics and data science

In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to a variability in the measurement, an indication of novel data, or it may be the result of experimental error; the latter are sometimes excluded from the data set. An outlier can be an indication of exciting possibility, but can also cause serious problems in statistical analyses.

<span class="mw-page-title-main">Decision tree</span> Decision support tool

A decision tree is a decision support recursive partitioning structure that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains conditional control statements.

Unsupervised learning is a framework in machine learning where, in contrast to supervised learning, algorithms learn patterns exclusively from unlabeled data. Other frameworks in the spectrum of supervisions include weak- or semi-supervision, where a small portion of the data is tagged, and self-supervision. Some researchers consider self-supervised learning a form of unsupervised learning.

<span class="mw-page-title-main">Image segmentation</span> Partitioning a digital image into segments

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

<span class="mw-page-title-main">Cluster analysis</span> Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. The curse generally refers to issues that arise when the number of datapoints is small relative to the intrinsic dimension of the data.

Random sample consensus (RANSAC) is an iterative method to estimate parameters of a mathematical model from a set of observed data that contains outliers, when outliers are to be accorded no influence on the values of the estimates. Therefore, it also can be interpreted as an outlier detection method. It is a non-deterministic algorithm in the sense that it produces a reasonable result only with a certain probability, with this probability increasing as more iterations are allowed. The algorithm was first published by Fischler and Bolles at SRI International in 1981. They used RANSAC to solve the location determination problem (LDP), where the goal is to determine the points in the space that project onto an image into a set of landmarks with known locations.

Bootstrap aggregating, also called bagging or bootstrapping, is a machine learning (ML) ensemble meta-algorithm designed to improve the stability and accuracy of ML classification and regression algorithms. It also reduces variance and overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the ensemble averaging approach.

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that works by creating a multitude of decision trees during training. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the output is the average of the predictions of the trees. Random forests correct for decision trees' habit of overfitting to their training set.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method. It was first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. Most often, it is used for classification, as a k-NN classifier, the output of which is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression. These are also of interest while wanting to find a representative using some distance other than squared euclidean distance.

Robust statistics are statistics that maintain their properties even if the underlying distributional assumptions are incorrect. Robust statistical methods have been developed for many common problems, such as estimating location, scale, and regression parameters. One motivation is to produce statistical methods that are not unduly affected by outliers. Another motivation is to provide methods with good performance when there are small departures from a parametric distribution. For example, robust methods work well for mixtures of two normal distributions with different standard deviations; under this model, non-robust methods like a t-test work poorly.

In data analysis, anomaly detection is generally understood to be the identification of rare items, events or observations which deviate significantly from the majority of the data and do not conform to a well defined notion of normal behavior. Such examples may arouse suspicions of being generated by a different mechanism, or appear inconsistent with the remainder of that set of data.

In machine learning, one-class classification (OCC), also known as unary classification or class-modelling, tries to identify objects of a specific class amongst all objects, by primarily learning from a training set containing only the objects of that class, although there exist variants of one-class classifiers where counter-examples are used to further refine the classification boundary. This is different from and more difficult than the traditional classification problem, which tries to distinguish between two or more classes with the training set containing objects from all the classes. Examples include the monitoring of helicopter gearboxes, motor failure prediction, or the operational status of a nuclear plant as 'normal': In this scenario, there are few, if any, examples of catastrophic system states; only the statistics of normal operation are known.

BIRCH is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources. In most cases, BIRCH only requires a single scan of the database.

<span class="mw-page-title-main">ELKI</span> Data mining framework

ELKI is a data mining software framework developed for use in research and teaching. It was originally created by the database systems research unit at the Ludwig Maximilian University of Munich, Germany, led by Professor Hans-Peter Kriegel. The project has continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

<span class="mw-page-title-main">Point Cloud Library</span> Open-source algorithm library

The Point Cloud Library (PCL) is an open-source library of algorithms for point cloud processing tasks and 3D geometry processing, such as occur in three-dimensional computer vision. The library contains algorithms for filtering, feature estimation, surface reconstruction, 3D registration, model fitting, object recognition, and segmentation. Each module is implemented as a smaller library that can be compiled separately. PCL has its own data format for storing point clouds - PCD, but also allows datasets to be loaded and saved in many other formats. It is written in C++ and released under the BSD license.

<span class="mw-page-title-main">Event detection for WSN</span>

Wireless sensor networks (WSN) are a spatially distributed network of autonomous sensors used for monitoring an environment. Energy cost is a major limitation for WSN requiring the need for energy efficient networks and processing. One of major energy costs in WSN is the energy spent on communication between nodes and it is sometimes desirable to only send data to a gateway node when an event of interest is triggered at a sensor. Sensors will then only open communication during a probable event, saving on communication costs. Fields interested in this type of network include surveillance, home automation, disaster relief, traffic control, health care and more.

References

  1. Liu, Fei Tony (7 July 2014). "First Isolation Forest implementation on Sourceforge".
  2. 1 2 3 4 5 6 7 8 9 10 11 Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining. pp. 413–422. doi:10.1109/ICDM.2008.17. ISBN   978-0-7695-3502-9. S2CID   6505449.
  3. 1 2 3 4 5 6 7 Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation-Based Anomaly Detection". ACM Transactions on Knowledge Discovery from Data. 6: 3:1–3:39. doi:10.1145/2133360.2133363. S2CID   207193045.
  4. 1 2 3 4 5 6 7 Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (September 2010). "On Detecting Clustered Anomalies Using SCiForest". Joint European Conference on Machine Learning and Knowledge Discovery in Databases - ECML PKDD 2010: Machine Learning and Knowledge Discovery in Databases. Lecture Notes in Computer Science. Vol. 6322. pp. 274–290. doi: 10.1007/978-3-642-15883-4_18 . ISBN   978-3-642-15882-7.
  5. Shaffer, Clifford A. (2011). Data structures & algorithm analysis in Java (3rd Dover ed.). Mineola, NY: Dover Publications. ISBN   9780486485812. OCLC   721884651.
  6. 1 2 3 4 5 6 7 8 9 Dal Pozzolo, Andrea; Caelen, Olivier; Johnson, Reid A; Bontempi, Gianluca (2015). "Calibrating Probability with Undersampling for Unbalanced Classification". 2015 IEEE Symposium Series on Computational Intelligence. pp. 159–166. doi:10.1109/SSCI.2015.33. ISBN   978-1-4799-7560-0.{{cite book}}: |journal= ignored (help)
  7. 1 2 3 4 5 6 7 "Credit Card Fraud Detection Dataset" . Retrieved 2024-12-04.
  8. 1 2 3 4 5 6 7 Dal Pozzolo, Andrea; Caelen, Olivier; Le Borgne, Yann-Ael; Waterschoot, Serge; Bontempi, Gianluca. "Learned lessons in credit card fraud detection from a practitioner perspective". Expert Systems with Applications. 41 (10): 4915–4928. doi:10.1016/j.eswa.2014.03.026.
  9. Dilini Talagala, Priyanga; Hyndman, Rob J.; Smith-Miles, Kate (12 Aug 2019). "Anomaly Detection in High Dimensional Data". arXiv: 1908.04000 [stat.ML].
  10. 1 2 3 4 5 "Hyperparameter Tuning Isolation Forest | Restackio". www.restack.io. Retrieved 2024-12-05.
  11. "Andrea Dal Pozzolo". dalpozz.github.io. Retrieved 2024-12-06.
  12. 1 2 3 4 5 Michael Heigl; Ashutosh Anand Kumar; Andreas Urmann; Dalibor Fiala; Martin Schramm; Robert Hable (2021). "On the Improvement of the Isolation Forest Algorithm for Outlier Detection with Streaming Data". Electronics. 10 (13): 1534. doi: 10.3390/electronics10131534 .
  13. 1 2 3 4 5 Yassine Chabchoub; M. U. Togbe; Aboubacar Boly; Rachid Chiky (2022). "An In-Depth Study and Improvement of Isolation Forest". IEEE Access. 10: 10219–10237. Bibcode:2022IEEEA..1010219C. doi:10.1109/ACCESS.2022.3144425.
  14. 1 2 3 4 5 6 7 8 9 10 "Tuning Isolation Forest for Anomaly Detection | Restackio". www.restack.io. Retrieved 2024-12-05.
  15. 1 2 3 4 Hariri, Sahand; Kind, Matias Carrasco; Brunner, Robert J. (April 2021). "Extended Isolation Forest". IEEE Transactions on Knowledge and Data Engineering. 33 (4): 1479–1489. arXiv: 1811.02141 . doi:10.1109/TKDE.2019.2947676. ISSN   1558-2191. S2CID   53236735.
  16. Hariri, Sahand; Kind, Matias Carrasco; Brunner, Robert J. (2021-04-01). "Extended Isolation Forest". IEEE Transactions on Knowledge and Data Engineering. 33 (4): 1479–1489. arXiv: 1811.02141 . doi:10.1109/TKDE.2019.2947676. ISSN   1041-4347.
  17. Verbus, James (13 August 2019). "Detecting and preventing abuse on LinkedIn using isolation forests". LinkedIn Engineering Blog. Retrieved 2023-07-02.
  18. Cortes, David (2019). "Distance approximation using Isolation Forests". arXiv: 1910.12362 [stat.ML].
  19. GeeksforGeeks, What is Isolation Forest?, accessed November 19, 2024.