M-tree

Last updated

In computer science, M-trees are tree data structures that are similar to R-trees and B-trees. It is constructed using a metric and relies on the triangle inequality for efficient range and k-nearest neighbor (k-NN) queries. While M-trees can perform well in many conditions, the tree can also have large overlap and there is no clear strategy on how to best avoid overlap. In addition, it can only be used for distance functions that satisfy the triangle inequality, while many advanced dissimilarity functions used in information retrieval do not satisfy this. [1]

Contents

Overview

2D M-Tree visualized using ELKI. Every blue sphere (leaf) is contained in a red sphere (directory nodes). Leaves overlap, but not too much; directory nodes overlap much more here. M-tree built with MMRad split.png
2D M-Tree visualized using ELKI. Every blue sphere (leaf) is contained in a red sphere (directory nodes). Leaves overlap, but not too much; directory nodes overlap much more here.

As in any Tree-based data structure, the M-Tree is composed of Nodes and Leaves. In each node there is a data object that identifies it uniquely and a pointer to a sub-tree where its children reside. Every leaf has several data objects. For each node there is a radius that defines a Ball in the desired metric space. Thus, every node and leaf residing in a particular node is at most distance from , and every node and leaf with node parent keep the distance from it.

M-Tree construction

Components

An M-Tree has these components and sub-components:

  1. Non-leaf nodes
    1. A set of routing objects NRO.
    2. Pointer to Node's parent object Op.
  2. Leaf nodes
    1. A set of objects NO.
    2. Pointer to Node's parent object Op.
  3. Routing Object
    1. (Feature value of) routing object Or.
    2. Covering radius r(Or).
    3. Pointer to covering tree T(Or).
    4. Distance of Or from its parent object d(Or,P(Or))
  4. Object
    1. (Feature value of the) object Oj.
    2. Object identifier oid(Oj).
    3. Distance of Oj from its parent object d(Oj,P(Oj))

Insert

The main idea is first to find a leaf node N where the new object O belongs. If N is not full then just attach it to N. If N is full then invoke a method to split N. The algorithm is as follows:

Algorithm Insert   Input: Node N of M-Tree MT, Entry    Output: A new instance of MT containing all entries in original MT plus 
's routing objects or objects   ifN is not a leaf then   {        /* Look for entries that the new object fits into */        let  be routing objects from 's set of routing objects such that if is not empty then        {           /* If there are one or more entry, then look for an entry such that is closer to the new object */                   }        else        {           /* If there are no such entry, then look for an object with minimal distance from */            /* its covering radius's edge to the new object */                      /* Upgrade the new radii of the entry */                   }        /* Continue inserting in the next level */        return insert();else   {        /* If the node has capacity then just insert the new object */        ifN is not full then        { store() }        /* The node is at full capacity, then it is needed to do a new split in this level */        else        { split() }   }
  • "" denotes assignment. For instance, "largestitem" means that the value of largest changes to the value of item.
  • "return" terminates the algorithm and outputs the following value.

Split

If the split method arrives to the root of the tree, then it choose two routing objects from N, and creates two new nodes containing all the objects in original N, and store them into the new root. If split methods arrives to a node N that is not the root of the tree, the method choose two new routing objects from N, re-arrange every routing object in N in two new nodes and , and store these new nodes in the parent node of original N. The split must be repeated if has not enough capacity to store . The algorithm is as follow:

Algorithm Split   Input: Node N of M-Tree MT, Entry    Output: A new instance of MT containing a new partition.
  /* The new routing objects are now all those in the node plus the new routing object */   let be NN entries of ifN is not the root then   {      /*Get the parent node and the parent routing object*/      let  be the parent routing object of Nlet  be the parent node of N   }   /* This node will contain part of the objects of the node to be split */   Create a new node N'   /* Promote two routing objects from the node to be split, to be new routing objects */   Create new objects  and .   Promote()   /* Choose which objects from the node being split will act as new routing objects */   Partition()   /* Store entries in each new routing object */   Store 's entries in N and 's entries in N'ifN is the current root then   {       /* Create a new node and set it as new root and store the new routing objects */       Create a new root node Store  and  in    }   else   {       /* Now use the parent routing object to store one of the new objects */       Replace entry  with entry  in if is no full then       {           /* The second routing object is stored in the parent only if it has free capacity */           Store  in        }       else       {            /*If there is no free capacity then split the level up*/            split()       }   }
  • "" denotes assignment. For instance, "largestitem" means that the value of largest changes to the value of item.
  • "return" terminates the algorithm and outputs the following value.

M-Tree Queries

Range Query

A range query is where a minimum similarity/maximum distance value is specified. For a given query object and a maximum search distance , the range query range(Q, r(Q)) selects all the indexed objects such that . [2]

Algorithm RangeSearch starts from the root node and recursively traverses all the paths which cannot be excluded from leading to qualifying objects.

Algorithm RangeSearch Input: Node N of M-Tree MT,  Q: query object, : search radius
Output: all the DB objects such that 
{    letbe the parent object of node N;ifNis not a leaf then {      for eachentry() inNdo {           ifthen {              Compute ;ifthenRangeSearch(*ptr()),Q,);            }     }   }   else {      for eachentry() inNdo {           ifthen {              Compute ;ifthenadd  to the result;           }     }   } }
  • "" denotes assignment. For instance, "largestitem" means that the value of largest changes to the value of item.
  • "return" terminates the algorithm and outputs the following value.

k-NN queries

K Nearest Neighbor (k-NN) query takes the cardinality of the input set as an input parameter. For a given query object Q ∈ D and an integer k ≥ 1, the k-NN query NN(Q, k) selects the k indexed objects which have the shortest distance from Q, according to the distance function d. [2]

See also

Related Research Articles

In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for nodes with more than two children. Unlike other self-balancing binary search trees, the B-tree is well suited for storage systems that read and write relatively large blocks of data, such as databases and file systems.

In computer science, a red–black tree is a kind of self-balancing binary search tree. Each node stores an extra bit representing "color", used to ensure that the tree remains balanced during insertions and deletions.

Treap

In computer science, the treap and the randomized binary search tree are two closely related forms of binary search tree data structures that maintain a dynamic set of ordered keys and allow binary searches among the keys. After any sequence of insertions and deletions of keys, the shape of the tree is a random variable with the same probability distribution as a random binary tree; in particular, with high probability its height is proportional to the logarithm of the number of keys, so that each search, insertion, or deletion operation takes logarithmic time to perform.

Kademlia is a distributed hash table for decentralized peer-to-peer computer networks designed by Petar Maymounkov and David Mazières in 2002. It specifies the structure of the network and the exchange of information through node lookups. Kademlia nodes communicate among themselves using UDP. A virtual or overlay network is formed by the participant nodes. Each node is identified by a number or node ID. The node ID serves not only as identification, but the Kademlia algorithm uses the node ID to locate values. In fact, the node ID provides a direct map to file hashes and that node stores information on where to obtain the file or resource.

Quadtree Tree data structure in which each internal node has exactly four children, to partition a 2D area

A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are the two-dimensional analog of octrees and are most often used to partition a two-dimensional space by recursively subdividing it into four quadrants or regions. The data associated with a leaf cell varies by application, but the leaf cell represents a "unit of interesting spatial information".

R-tree Data structures used in spatial indexing

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

In computer science, a disjoint-set data structure, also called a union–find data structure or merge–find set, is a data structure that stores a collection of disjoint (non-overlapping) sets. Equivalently, it stores a partition of a set into disjoint subsets. It provides operations for adding new sets, merging sets, and finding a representative member of a set. The last operation makes it possible to find out efficiently if any two elements are in the same or different sets.

A B+ tree is an m-ary tree with a variable but often large number of children per node. A B+ tree consists of a root, internal nodes and leaves. The root may be either a leaf or a node with two or more children.

In data processing R*-trees are a variant of R-trees used for indexing spatial information. R*-trees have slightly higher construction cost than standard R-trees, as the data may need to be reinserted; but the resulting tree will usually have a better query performance. Like the standard R-tree, it can store both point and spatial data. It was proposed by Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger in 1990.

In computer science, an interval tree is a tree data structure to hold intervals. Specifically, it allows one to efficiently find all intervals that overlap with any given interval or point. It is often used for windowing queries, for instance, to find all roads on a computerized map inside a rectangular viewport, or to find all visible elements inside a three-dimensional scene. A similar data structure is the segment tree.

<i>k</i>-d tree Multidimensional search tree for points in k dimensional space

In computer science, a k-d tree is a space-partitioning data structure for organizing points in a k-dimensional space. k-d trees are a useful data structure for several applications, such as searches involving a multidimensional search key and creating point clouds. k-d trees are a special case of binary space partitioning trees.

<i>m</i>-ary tree

In graph theory, an m-ary tree is a rooted tree in which each node has no more than m children. A binary tree is the special case where m = 2, and a ternary tree is another case with m = 3 that limits its children to three.

Bidirectional search is a graph search algorithm that finds a shortest path from an initial vertex to a goal vertex in a directed graph. It runs two simultaneous searches: one forward from the initial state, and one backward from the goal, stopping when the two meet. The reason for this approach is that in many cases it is faster: for instance, in a simplified model of search problem complexity in which both searches expand a tree with branching factor b, and the distance from start to goal is d, each of the two searches has complexity O(bd/2), and the sum of these two search times is much less than the O(bd) complexity that would result from a single search from the beginning to the goal.

In graph theory and computer science, the lowest common ancestor (LCA) of two nodes v and w in a tree or directed acyclic graph (DAG) T is the lowest node that has both v and w as descendants, where we define each node to be a descendant of itself.

Hilbert R-tree, an R-tree variant, is an index for multidimensional objects such as lines, regions, 3-D objects, or high-dimensional feature-based parametric objects. It can be thought of as an extension to B+-tree for multidimensional objects.

In computational complexity the decision tree model is the model of computation in which an algorithm is considered to be basically a decision tree, i.e., a sequence of queries or tests that are done adaptively, so the outcome of the previous tests can influence the test is performed next.

In computer science, a ball tree, balltree or metric tree, is a space partitioning data structure for organizing points in a multi-dimensional space. The ball tree gets its name from the fact that it partitions data points into a nested set of intersecting hyperspheres known as "balls". The resulting data structure has characteristics that make it useful for a number of applications, most notably nearest neighbor search.

In data structures, a range query consists of preprocessing some input data into a data structure to efficiently answer any number of queries on any subset of the input. Particularly, there is a group of problems that have been extensively studied where the input is an array of unsorted numbers and a query consists of computing some function, such as the minimum, on a specific range of the array.

In computational geometry, a well-separated pair decomposition (WSPD) of a set of points , is a sequence of pairs of sets , such that each pair is well-separated, and for each two distinct points , there exists precisely one pair which separates the two.

The PH-tree is a tree data structure used for spatial indexing of multi-dimensional data (keys) such as geographical coordinates, points, feature vectors, rectangles or bounding boxes. The PH-tree is space partitioning index with a structure similar to that of a quadtree or octree. However, unlike quadtrees, it uses a splitting policy based on tries and similar to Crit bit trees that is based on the bit-representation of the keys. The bit-based splitting policy, when combined with the use of different internal representations for nodes, provides scalability with high-dimensional data. The bit-representation splitting policy also imposes a maximum depth, thus avoiding degenerated trees and the need for rebalancing.

References

  1. Ciaccia, Paolo; Patella, Marco; Zezula, Pavel (1997). "M-tree An Efficient Access Method for Similarity Search in Metric Spaces" (PDF). Proceedings of the 23rd VLDB Conference Athens, Greece, 1997. IBM Almaden Research Center: Very Large Databases Endowment Inc. pp. 426–435. p426. Retrieved 2010-09-07.
  2. 1 2 P. Ciaccia; M. Patella; F. Rabitti; P. Zezula. "Indexing Metric Spaces with M-tree" (PDF). Department of Computer Science and Engineering. University of Bologna. p. 3. Retrieved 19 November 2013.