Frequent subtree mining

Last updated

In computer science, frequent subtree mining is the problem of finding all patterns in a given database whose support (a metric related to its number of occurrences in other subtrees) is over a given threshold. [1] It is a more general form of the maximum agreement subtree problem. [2]

Contents

Definition

Frequent subtree mining is the problem of trying to find all of the patterns whose "support" is over a certain user-specified level, where "support" is calculated as the number of trees in a database which have at least one subtree isomorphic to a given pattern. [3]

Formal definition

The problem of frequent subtree mining has been formally defined as: [1]

Given a threshold minfreq, a class of trees , a transitive subtree relation between trees , a finite set of trees , the frequent subtree mining problem is the problem of finding all trees such that no two trees in are isomorphic and
where d is an anti-monotone function such that if then

TreeMiner

In 2002, Mohammed J. Zaki introduced TreeMiner, an efficient algorithm for solving the frequent subtree mining problem, which used a "scope list" to represent tree nodes and which was contrasted with PatternMatcher, an algorithm based on pattern matching. [4]

Definitions

Induced sub-trees

A sub-tree is an induced sub-tree of if and only if and . In other words, any two nodes in S that are directly connected by an edge is also directly connected in T. For any node A and B in S, if node A is the parent of node B in S, then node A must also be the parent of node B in T.

Embedded sub-trees

A sub-tree is an embedded sub-tree of if and only if and two endpoint nodes of any edge in S are on the same path from the root to a leaf node in T. In other words, for any node A and B in S, if node A is the parent of node B in S, then node A must be an ancestor of node B in T. Any induced sub-trees are also embedded sub-trees, and thus the concept of embedded sub-trees is a generalization of induced sub-trees. As such embedded sub-trees characterizes the hidden patterns in a tree that are missing in traditional induced sub-tree mining. A sub-tree of size k is often called a k-sub-tree.

Support

The support of a sub-tree is the number of trees in a database that contains the sub-tree. A sub-tree is frequent if its support is not less than a user-specified threshold (often denoted as minsup). The goal of TreeMiner is to find all embedded sub-trees that have support at least the minimum support.

String representation of trees

There are several different ways of encoding a tree structure. TreeMiner uses string representations of trees for efficient tree manipulation and support counting. Initially the string is set to . Starting from the root of the tree, node labels are added to the string in depth-first search order. -1 is added to the string whenever the search process backtracks from a child to its parent. For example, a simple binary tree with root labelled A, a left child labelled B and right child labelled C can be represented by a string A B -1 C -1.

Prefix equivalence class

Two k-sub-trees are said to be in the same prefix equivalence class if the string representation of them are identical up to the (k-1)-th node. In other words, all elements in a prefix equivalence class only differ by the last node. For example, two trees with string representation A B -1 C -1 and A B -1 D -1 are in the prefix equivalence class A B with elements (C, 0) and (D,0). An element of a prefix class is specified by the node label paired with the 0-based depth first index of the node it is attached to. In this example, both elements of prefix class A B are attached to the root, which has an index of 0.

Scope

The scope of a node A is given by a pair of numbers where l and r are the minimum and maximum node index in the sub-tree rooted at A. In other words, l is the index of A, and r is the index of the rightmost leaf among the descendants of A. As such the index of any descendant of A must lie in the scope of A, which will be a very useful property when counting the support of sub-trees.

Algorithm

Candidate generation

Frequent sub-tree patterns follow the anti-monotone property. In other words, the support of a k-sub-tree is less than or equal to the support of its (k-1)-sub-trees. Only super patterns of known frequent patterns can possibly be frequent. By utilizing this property, k-sub-trees candidates can be generated based on frequent (k-1)-sub-trees through prefix class extension. Let C be a prefix equivalence class with two elements (x,i) and (y,j). Let C' be the class representing the extension of element (x,i). The elements of C' are added by performing join operation on the two (k-1)-sub-trees in C. The join operation on (x,i) and (y,j) is defined as the following.

  • If , then add (y,j) to C'.
  • If , then add (y,j) and (y, ni) to C' where ni the depth-first index of x in C
  • If , no possible element can be added to C'

This operation is repeated for any two ordered, but not necessarily distinct elements in C to construct the extended prefix classes of k-sub-trees.

Scope-list representation

TreeMiner performs depth first candidate generation using scope-list representation of sub-trees to facilitate faster support counting. A k-sub-tree S can be representation by a triplet (t,m,s) where t is the tree id the sub-tree comes from, m is the prefix match label, and s the scope of the last node in S. Depending on how S occurs in different trees across the database, S can have different scope-list representation. TreeMiner defines scope-list join that performs class extension on scope-list representation of sub-trees. Two elements (x,i) and (y,j) can be joined if there exists two sub-trees and that satisfy either of the following conditions.

  • In-scope test: , which corresponds to the case when .
  • Out-scope test: , which correspond to the case when .

By keeping track of distinct tree ids used in the scope-list tests, the support of sub-trees can be calculated efficiently.

Applications

Domains in which frequent subtree mining is useful tend to involve complex relationships between data entities: for instance, the analysis of XML documents often requires frequent subtree mining. [1] Another domain where this is useful is the web usage mining problem: since the actions taken by users when visiting a web site can be recorded and categorized in many different ways, complex databases of trees need to be analyzed with frequent subtree mining. [4] Other domains in which frequent subtree mining is useful include computational biology, [5] [6] RNA structure analysis, [6] pattern recognition, [7] bioinformatics, [8] and analysis of the KEGG GLYCAN database. [9]

Challenges

Checking whether a pattern (or a transaction) supports a given subgraph is an NP-complete problem, since it is an NP-complete instance of the subgraph isomorphism problem. [7] Furthermore, due to combinatorial explosion, according to Lei et al., "mining all frequent subtree patterns becomes infeasible for a large and dense tree database". [10]

Related Research Articles

<span class="mw-page-title-main">AVL tree</span> Self-balancing binary search tree

In computer science, an AVL tree is a self-balancing binary search tree. In an AVL tree, the heights of the two child subtrees of any node differ by at most one; if at any time they differ by more than one, rebalancing is done to restore this property. Lookup, insertion, and deletion all take O(log n) time in both the average and worst cases, where is the number of nodes in the tree prior to the operation. Insertions and deletions may require the tree to be rebalanced by one or more tree rotations.

<span class="mw-page-title-main">Binary search tree</span> Rooted binary tree data structure

In computer science, a binary search tree (BST), also called an ordered or sorted binary tree, is a rooted binary tree data structure with the key of each internal node being greater than all the keys in the respective node's left subtree and less than the ones in its right subtree. The time complexity of operations on the binary search tree is linear with respect to the height of the tree.

In computer science, a red–black tree is a specialised binary search tree data structure noted for fast storage and retrieval of ordered information, and a guarantee that operations will complete within a known time. Compared to other self-balancing binary search trees, the nodes in a red-black tree hold an extra bit called "color" representing "red" and "black" which is used when re-organising the tree to ensure that it is always approximately balanced.

A splay tree is a binary search tree with the additional property that recently accessed elements are quick to access again. Like self-balancing binary search trees, a splay tree performs basic operations such as insertion, look-up and removal in O(log n) amortized time. For random access patterns drawn from a non-uniform random distribution, their amortized time can be faster than logarithmic, proportional to the entropy of the access pattern. For many patterns of non-random operations, also, splay trees can take better than logarithmic time, without requiring advance knowledge of the pattern. According to the unproven dynamic optimality conjecture, their performance on all access patterns is within a constant factor of the best possible performance that could be achieved by any other self-adjusting binary search tree, even one selected to fit that pattern. The splay tree was invented by Daniel Sleator and Robert Tarjan in 1985.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. In any given transaction with a variety of items, association rules are meant to discover the rules that determine how or why certain items are connected.

In computer science, corecursion is a type of operation that is dual to recursion. Whereas recursion works analytically, starting on data further from a base case and breaking it down into smaller data and repeating until one reaches a base case, corecursion works synthetically, starting from a base case and building it up, iteratively producing data further removed from a base case. Put simply, corecursive algorithms use the data that they themselves produce, bit by bit, as they become available, and needed, to produce further bits of data. A similar but distinct concept is generative recursion which may lack a definite "direction" inherent in corecursion and recursion.

Random forests or random decision forests is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees. For regression tasks, the mean or average prediction of the individual trees is returned. Random decision forests correct for decision trees' habit of overfitting to their training set.

In coding theory, the Kraft–McMillan inequality gives a necessary and sufficient condition for the existence of a prefix code or a uniquely decodable code for a given set of codeword lengths. Its applications to prefix codes and trees often find use in computer science and information theory. The prefix code can contain either finitely many or infinitely many codewords.

In set theory, the Baire space is the set of all infinite sequences of natural numbers with a certain topology. This space is commonly used in descriptive set theory, to the extent that its elements are often called "reals". It is denoted NN, ωω, by the symbol or also ωω, not to be confused with the countable ordinal obtained by ordinal exponentiation.

<i>m</i>-ary tree Tree data structure in which each node has at most m children.

In graph theory, an m-ary tree is an arborescence in which each node has no more than m children. A binary tree is the special case where m = 2, and a ternary tree is another case with m = 3 that limits its children to three.

<span class="mw-page-title-main">Backjumping</span> In backtracking algorithms, technique that reduces search space

In backtracking algorithms, backjumping is a technique that reduces search space, therefore increasing efficiency. While backtracking always goes up one level in the search tree when all values for a variable have been tested, backjumping may go up more levels. In this article, a fixed order of evaluation of variables is used, but the same considerations apply to a dynamic order of evaluation.

In computer science, the prefix sum, cumulative sum, inclusive scan, or simply scan of a sequence of numbers x0, x1, x2, ... is a second sequence of numbers y0, y1, y2, ..., the sums of prefixes of the input sequence:

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of distinct numbers. To construct the Cartesian tree, set its root to be the minimum number in the sequence, and recursively construct its left and right subtrees from the subsequences before and after this number. It is uniquely defined as a min-heap whose symmetric (in-order) traversal returns the original sequence.

In computer science, the longest common prefix array is an auxiliary data structure to the suffix array. It stores the lengths of the longest common prefixes (LCPs) between all pairs of consecutive suffixes in a sorted suffix array.

In computer science, one approach to the dynamic optimality problem on online algorithms for binary search trees involves reformulating the problem geometrically, in terms of augmenting a set of points in the plane with as few additional points as possible to avoid rectangles with only two points on their boundary.

In the field of computational biology, a planted motif search (PMS) also known as a (l, d)-motif search (LDMS) is a method for identifying conserved motifs within a set of nucleic acid or peptide sequences.

In machine learning, the kernel embedding of distributions comprises a class of nonparametric methods in which a probability distribution is represented as an element of a reproducing kernel Hilbert space (RKHS). A generalization of the individual data-point feature mapping done in classical kernel methods, the embedding of distributions into infinite-dimensional feature spaces can preserve all of the statistical features of arbitrary distributions, while allowing one to compare and manipulate distributions using Hilbert space operations such as inner products, distances, projections, linear transformations, and spectral analysis. This learning framework is very general and can be applied to distributions over any space on which a sensible kernel function may be defined. For example, various kernels have been proposed for learning from data which are: vectors in , discrete classes/categories, strings, graphs/networks, images, time series, manifolds, dynamical systems, and other structured objects. The theory behind kernel embeddings of distributions has been primarily developed by Alex Smola, Le Song , Arthur Gretton, and Bernhard Schölkopf. A review of recent works on kernel embedding of distributions can be found in.

In mathematics, specifically group theory, a descendant tree is a hierarchical structure that visualizes parent-descendant relations between isomorphism classes of finite groups of prime power order , for a fixed prime number and varying integer exponents . Such groups are briefly called finitep-groups. The vertices of a descendant tree are isomorphism classes of finite p-groups.

-dimensional hypercube is a network topology for parallel computers with processing elements. The topology allows for an efficient implementation of some basic communication primitives such as Broadcast, All-Reduce, and Prefix sum. The processing elements are numbered through . Each processing element is adjacent to processing elements whose numbers differ in one and only one bit. The algorithms described in this page utilize this structure efficiently.

References

  1. 1 2 3 Chi, Yun; Muntz, Richard R.; Nijssen, Siegfried; Kok, Joost N. (28 June 2005). "Frequent Subtree Mining - An Overview". Fundamenta Informaticae. 66: 161–198. S2CID   14827585.
  2. Deepak, Akshay; Fernández-Baca, David; Tirthapura, Srikanta; Sanderson, Michael J.; McMahon, Michelle M. (July 2013). "EvoMiner: frequent subtree mining in phylogenetic databases". Knowledge and Information Systems. 41 (3): 559–590. doi:10.1007/s10115-013-0676-0. S2CID   254145982.
  3. Dai, H., Srikant, R. and Zhang, C. (2004). "Advances in Knowledge Discovery and Data Mining." 8th Pacific-Asia Conference, PAKDD 2004, Sydney, Australia, May 26–28, 2004, Proceedings. 1st ed. p. 65.
  4. 1 2 Zaki, Mohammed J. (2002). "Efficiently mining frequent trees in a forest". Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 71–80. doi:10.1145/775047.775058. ISBN   978-1581135671. S2CID   1649653 . Retrieved 16 June 2014.
  5. Deepak, Akshay, David Fernández-Baca, Srikanta Tirthapura, Michael J. Sanderson, and Michelle M. McMahon. "EvoMiner: frequent subtree mining in phylogenetic databases." Knowledge and Information Systems (2011): 1-32.
  6. 1 2 Chi, Yun, Yirong Yang, and Richard R. Muntz. "Canonical forms for labelled trees and their applications in frequent subtree mining." Knowledge and Information Systems 8, no. 2 (2005): 203–234.
  7. 1 2 Chi, Yun; Yang, Yirong; Muntz, Richard R. (2004). "Mining Frequent Rooted Trees and Free Trees Using Canonical Forms" (PDF). Knowledge and Information Systems. Retrieved 16 June 2014.
  8. Xiao, Yongqiao; Yao, Jenq-Foung; Li, Zhigang; Dunham, Margaret H. (2003). "Efficient data mining for maximal frequent subtrees". Third IEEE International Conference on Data Mining. ICDM 2003. pp. 379–386. doi:10.1109/ICDM.2003.1250943.
  9. Aoki-Kinoshita, Kiyoko F. (2009). Glycome Informatics: Methods and Applications. CRC Press. p. 141. ISBN   9781420083347.
  10. Zou, Lei; Lu, Yansheng; Zhang, Huaming; Hu, Rong (2006). "Mining Frequent Induced Subtree Patterns with Subtree-Constraint". Sixth IEEE International Conference on Data Mining Workshops. ICDM Workshops 2006. pp. 3–7. doi:10.1109/ICDMW.2006.112.