Isolation forest is an unsupervised learning algorithm for anomaly detection that works on the principle of explicitly isolating anomalies, [1] rather than on the more common technique of profiling normal points. [2]
In statistics, an anomaly (a.k.a. outlier) is an observation or event that deviates so much from other events as to arouse suspicion that it was generated by a different mechanism. For example, the graph in Fig. 1 represents ingress traffic to a web server, expressed as the number of requests in 3-hour intervals, over a period of one month. It is quite evident by simply looking at the picture that some points (marked with a red circle) are unusually high, to the point of raising the suspicion that the web server might have been under attack at that time. On the other hand, the flat segment indicated by the red arrow also seems unusual and might be a sign that the server was down during that time period.
Anomalies in a big dataset may follow very complicated patterns, which are difficult to detect “by eye” in the great majority of cases. This is the reason why the field of anomaly detection is well suited for the application of Machine Learning techniques.
The most common techniques employed for anomaly detection are based on the construction of a profile of what is “normal”: anomalies are reported as those instances in the dataset that do not conform to the normal profile. [2] Isolation Forest uses a different approach: instead of trying to build a model of normal instances, it explicitly isolates anomalous points in the dataset. The main advantage of this approach is that it can exploit sampling techniques to an extent that is not possible with profile-based methods, which makes it a very fast algorithm with a low memory demand. [1] [3] [4]
The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008. [1] The authors exploited two quantitative properties of anomalous data points in a sample: they are few, making up a small minority of the instances, and they have attribute values that are very different from those of normal instances.
Since anomalies are typically few and very different from the other points in the sample, they must be easier to “isolate” compared to normal points. On the basis of this principle, Isolation Forest builds an ensemble of “Isolation Trees” (iTrees) for the data set and marks as anomalies the points that have short average path lengths on the iTrees.
In a later paper, published in 2012, [2] the same authors described a set of experiments demonstrating, among other properties, the linear time complexity and low memory requirements of iForest.
In 2013 Zhiguo Ding and Minrui Fei proposed a framework based on iForest to resolve the problem of detecting anomalies in streaming data. [5] More applications of iForest to streaming data are described in papers by Swee Chuan Tan et al., [4] G. A. Susto et al. [6] and Yu Weng et al. [7]
One of the main problems in applying iForest to anomaly detection lay not in the model itself, but rather in the way the “anomaly score” was computed. This problem was highlighted by Sahand Hariri, Matias Carrasco Kind and Robert J. Brunner in a 2018 paper, [8] wherein they proposed an improved iForest model named Extended Isolation Forest (EIF). In the same paper the authors describe the improvements made to the original model and how they enhance the consistency and reliability of the anomaly score produced for a given data point.
The Isolation Forest algorithm is based on the observation that anomalous instances in a dataset tend to be easier to separate (isolate) from the rest of the sample than normal points. In order to isolate a data point, the algorithm recursively partitions the sample by randomly selecting an attribute and then randomly selecting a split value for that attribute, between the minimum and maximum values observed for that attribute.
An example of random partitioning in a 2D dataset of normally distributed points is given in Fig. 2 for a non-anomalous point and in Fig. 3 for a point that is more likely to be an anomaly. It is apparent from the pictures that anomalies require fewer random partitions to be isolated than normal points.
From a mathematical point of view, recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point xi in Fig. 2 is greater than the path length of xj in Fig. 3.
More formally, let X = { x1, ..., xn } be a set of d-dimensional points and X’ ⊂ X a subset of X. An Isolation Tree (iTree) is defined as a data structure with the following properties: each node T of the tree is either an external node with no children, or an internal node with one “test” and exactly two child nodes (Tl and Tr); the test at an internal node consists of an attribute q and a split value p such that the condition q < p determines whether a point traverses to Tl or to Tr.
In order to build an iTree, the algorithm recursively divides X’ by randomly selecting an attribute q and a split value p, until either (i) the node has only one instance or (ii) all data at the node have the same values.
When the iTree is fully grown, each point in X is isolated at one of the external nodes. Intuitively, the anomalous points are those with the smaller path lengths in the tree, since they are easier to isolate; here the path length h(xi) of a point xi is defined as the number of edges xi traverses from the root node to reach an external node.
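The recursive partitioning described above is straightforward to sketch in code. The following Python snippet is an illustrative sketch, not the authors’ reference implementation; the class and function names are ours, and the data points are assumed to be plain lists of floats. It builds a single iTree by repeatedly choosing a random attribute q and a random split value p, and computes the path length h(x) of a point.

```python
import random

class Node:
    def __init__(self, q=None, p=None, left=None, right=None, size=0):
        self.q = q          # attribute (feature index) used for the split, None for external nodes
        self.p = p          # split value
        self.left = left    # child receiving points whose attribute q is < p
        self.right = right  # child receiving points whose attribute q is >= p
        self.size = size    # number of points stored at an external node

def build_itree(points, depth=0, max_depth=50):
    """Recursively partition `points` with random splits until isolation."""
    # Stop when a single point is isolated, all points are identical,
    # or a maximum depth is reached.
    if len(points) <= 1 or depth >= max_depth or all(pt == points[0] for pt in points):
        return Node(size=len(points))
    d = len(points[0])
    q = random.randrange(d)                       # random attribute
    lo = min(pt[q] for pt in points)
    hi = max(pt[q] for pt in points)
    if lo == hi:                                  # attribute is constant, cannot split on it
        return Node(size=len(points))
    p = random.uniform(lo, hi)                    # random split value between min and max
    left_pts = [pt for pt in points if pt[q] < p]
    right_pts = [pt for pt in points if pt[q] >= p]
    return Node(q, p,
                build_itree(left_pts, depth + 1, max_depth),
                build_itree(right_pts, depth + 1, max_depth))

def path_length(x, node, depth=0):
    """h(x): number of edges from the root to the external node that contains x."""
    if node.q is None:                            # external node reached
        return depth
    child = node.left if x[node.q] < node.p else node.right
    return path_length(x, child, depth + 1)
```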
A probabilistic explanation of iTree is provided in the iForest original paper. [1]
Anomaly detection with Isolation Forest is a process composed of two main stages [3]: in the first stage, a training dataset is used to build an ensemble of iTrees; in the second stage, each instance of the test set is passed through the iTrees built in the previous stage and is assigned an “anomaly score”.
Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as “anomaly” any point whose score is greater than a predefined threshold, which depends on the domain the analysis is being applied to.
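In practice, the two stages are usually handled by an existing implementation such as scikit-learn’s sklearn.ensemble.IsolationForest. The snippet below is a minimal usage sketch under that assumption; the synthetic data and the chosen threshold are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Stage 1: build an ensemble of iTrees on (mostly normal) training data.
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=rng)
forest.fit(X_train)

# Stage 2: score the instances of the test set. In scikit-learn, score_samples
# returns the *negated* anomaly score, so lower values indicate more anomalous points.
X_test = np.vstack([rng.normal(size=(10, 2)), np.array([[6.0, 6.0]])])
scores = forest.score_samples(X_test)

# Mark as anomalies the points whose (negated) score falls below a
# domain-dependent threshold; -0.6 here is an arbitrary illustrative choice.
threshold = -0.6
anomalies = X_test[scores < threshold]
print(anomalies)
```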
The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination at an external node of the iTree corresponds to an unsuccessful search in the BST. [3] As a consequence, the estimate of the average h(x) for external-node terminations is the same as the average length of an unsuccessful search in a BST, that is [10]

c(m) = 2H(m − 1) − 2(m − 1)/n

where n is the testing data size, m is the size of the sample set and H(i) is the harmonic number, which can be estimated by H(i) ≈ ln(i) + γ, where γ ≈ 0.5772156649 is the Euler-Mascheroni constant.
The value of c(m) above represents the average of h(x) given m, so we can use it to normalise h(x) and obtain an estimate of the anomaly score for a given instance x:

s(x, m) = 2^(−E(h(x)) / c(m))

where E(h(x)) is the average value of h(x) over a collection of iTrees. For any given instance x: if s is close to 1, then x is very likely to be an anomaly; if s is much smaller than 0.5, then x can quite safely be regarded as a normal instance; and if all the instances in the sample return s ≈ 0.5, then the sample does not really contain any distinct anomaly.
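The normalisation above can be written as a small helper. The following sketch follows the expressions given in this section (harmonic numbers estimated via ln(i) + γ); the function names are ours and the code is illustrative rather than a reference implementation.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant

def harmonic(i):
    """Estimate of the i-th harmonic number, H(i) ~ ln(i) + gamma (valid for i >= 1)."""
    return math.log(i) + EULER_GAMMA

def c(m, n):
    """Average path length of an unsuccessful BST search, as defined above (requires m > 2)."""
    return 2.0 * harmonic(m - 1) - 2.0 * (m - 1) / n

def anomaly_score(avg_path_length, m, n):
    """s(x, m) = 2^(-E(h(x)) / c(m)): close to 1 for anomalies, well below 0.5 for normal points."""
    return 2.0 ** (-avg_path_length / c(m, n))
```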
As described in the previous sections, the Isolation Forest algorithm performs very well from both the computational and the memory consumption points of view. The main problem with the original algorithm is that the way the branching of the trees takes place introduces a bias, which is likely to reduce the reliability of the anomaly scores for ranking the data. This is the main motivation behind the introduction of the Extended Isolation Forest (EIF) algorithm by Hariri et al. [8]
In order to understand why the original Isolation Forest suffers from that bias, the authors provide a practical example based on a random dataset taken from a 2-D normal distribution with zero mean and covariance given by the identity matrix. An example of such a dataset is shown in Fig. 4.
It is easy to see from the picture that points close to (0, 0) are likely to be normal, while points far away from (0, 0) are likely to be anomalous. As a consequence, the anomaly score of a point should increase in an almost circular and symmetric pattern as the point moves radially outward from the “centre” of the distribution. In practice this is not the case, as the authors demonstrate by generating the anomaly score map produced for the distribution by the Isolation Forest algorithm. Although the anomaly scores correctly increase as points move radially outward, the algorithm also produces rectangular regions along the x and y directions with lower anomaly scores than other points at roughly the same radial distance from the centre.
It is possible to demonstrate that these unexpected rectangular regions in the anomaly score map are indeed an artifact introduced by the algorithm and are mainly due to the fact that the decision boundaries of Isolation Forest are limited to be either vertical or horizontal (see Fig. 2 and Fig. 3). [8]
This is why, in their paper, Hariri et al. propose to improve the original Isolation Forest in the following way: rather than selecting a random feature and a random split value within the range of the data, they select a branch cut with a random “slope”. An example of random partitioning with EIF is shown in Fig. 5.
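Conceptually, a branch cut with a random slope can be implemented as a split along a random hyperplane rather than an axis-parallel one. The following sketch is our illustration in the spirit of the EIF branch cut, not the authors’ code: it draws a random direction and a random intercept point within the data range, and assigns each point to one side of the resulting cut.

```python
import numpy as np

def random_slope_split(points, rng):
    """Partition `points` (an (n, d) array) with a branch cut of random slope."""
    d = points.shape[1]
    normal = rng.normal(size=d)                     # random direction ("slope") of the cut
    normal /= np.linalg.norm(normal)
    # Random intercept point drawn within the bounding box of the data.
    intercept = rng.uniform(points.min(axis=0), points.max(axis=0))
    # Points on one side of the hyperplane go left, the rest go right.
    side = (points - intercept) @ normal <= 0.0
    return points[side], points[~side]

# Example: one random cut through a 2-D standard normal sample.
rng = np.random.default_rng(0)
sample = rng.normal(size=(500, 2))
left, right = random_slope_split(sample, rng)
```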
The authors show how the new approach overcomes the limits of the original Isolation Forest, ultimately leading to an improved anomaly score map.