Nested set model

Last updated

The nested set model is a technique for representing nested set collections (also known as trees or hierarchies) in relational databases.

Contents

It is based on Nested Intervals, that "are immune to hierarchy reorganization problem, and allow answering ancestor path hierarchical queries algorithmically without accessing the stored hierarchy relation". [1]

Motivation

The standard relational algebra and relational calculus, and the SQL operations based on them, are unable to express directly all desirable operations on hierarchies. The nested set model is a solution to that problem.

An alternative solution is the expression of the hierarchy as a parent-child relation. Celko called this the adjacency list model. If the hierarchy can have arbitrary depth, the adjacency list model does not allow the expression of operations such as comparing the contents of hierarchies of two elements, or determining whether an element is somewhere in the subhierarchy of another element. When the hierarchy is of fixed or bounded depth, the operations are possible, but expensive, due to the necessity of performing one relational join per level. This is often known as the bill of materials problem.

Hierarchies may be expressed easily by switching to a graph database. Alternatively, several resolutions exist for the relational model and are available as a workaround in some relational database management systems:

When these solutions are not available or not feasible, another approach must be taken.

Technique

The nested set model is to number the nodes according to a tree traversal, which visits each node twice, assigning numbers in the order of visiting, and at both visits. This leaves two numbers for each node, which are stored as two attributes. Querying becomes inexpensive: hierarchy membership can be tested by comparing these numbers. Updating requires renumbering and is therefore expensive. Refinements that use rational numbers instead of integers can avoid renumbering, and so are faster to update, although much more complicated. [2]

Example

In a clothing store catalog, clothing may be categorized according to the hierarchy given on the left:

A hierarchy: types of clothing NestedSetModel.svg
A hierarchy: types of clothing
The numbering assigned by tree traversal Clothing-hierarchy-traversal-2.svg
The numbering assigned by tree traversal
NodeLeftRight
Clothing122
Men's29
Women's1021
Suits38
Slacks45
Jackets67
Dresses1116
Skirts1718
Blouses1920
Evening Gowns1213
Sun Dresses1415
The resulting representation

The "Clothing" category, with the highest position in the hierarchy, encompasses all subordinating categories. It is therefore given left and right domain values of 1 and 22, the latter value being the double of the total number of nodes being represented. The next hierarchical level contains "Men's" and "Women's", both containing levels within themselves that must be accounted for. Each level's data node is assigned left and right domain values according to the number of sublevels contained within, as shown in the table data.

Performance

Queries using nested sets can be expected to be faster than queries using a stored procedure to traverse an adjacency list, and so are the faster option for databases which lack native recursive query constructs, such as MySQL 5.x. [3] However, recursive SQL queries can be expected to perform comparably for 'find immediate descendants' queries, and much faster for other depth search queries, and so are the faster option for databases which provide them, such as PostgreSQL, [4] Oracle, [5] and Microsoft SQL Server. [6] MySQL used to lack recursive query constructs but added such features in version 8. [7]

Drawbacks

The use case for a dynamic endless database tree hierarchy is rare. The Nested Set model is appropriate where the tree element and one or two attributes are the only data, but is a poor choice when more complex relational data exists for the elements in the tree. Given an arbitrary starting depth for a category of 'Vehicles' and a child of 'Cars' with a child of 'Mercedes', a foreign key table relationship must be established unless the tree table is natively non-normalized. Attributes of a newly created tree item may not share all attributes with a parent, child or even a sibling. If a foreign key table is established for a table of 'Plants' attributes, no integrity is given to the child attribute data of 'Trees' and its child 'Oak'. Therefore, in each case of an item inserted into the tree, a foreign key table of the item's attributes must be created for all but the most trivial of use cases.

If the tree isn't expected to change often, a properly normalized hierarchy of attribute tables can be created in the initial design of a system, leading to simpler, more portable SQL statements; specifically ones that don't require an arbitrary number of runtime, programmatically created or deleted tables for changes to the tree. For more complex systems, hierarchy can be developed through relational models rather than an implicit numeric tree structure. Depth of an item is simply another attribute rather than the basis for an entire DB architecture. As stated in SQL Antipatterns: [8]

Nested Sets is a clever solution – maybe too clever. It also fails to support referential integrity. It’s best used when you need to query a tree more frequently than you need to modify the tree. [9]

The model doesn't allow for multiple parent categories. For example, an 'Oak' could be a child of 'Tree-Type', but also 'Wood-Type'. An additional tagging or taxonomy has to be established to accommodate this, again leading to a design more complex than a straightforward fixed model.

Nested sets are very slow for inserts because it requires updating left and right domain values for all records in the table after the insert. This can cause a lot of database stress as many rows are rewritten and indexes rebuilt. However, if it is possible to store a forest of small trees in table instead of a single big tree, the overhead may be significantly reduced, since only one small tree must be updated.

The nested interval model does not suffer from this problem, but is more complex to implement, and is not as well known. It still suffers from the relational foreign-key table problem. The nested interval model stores the position of the nodes as rational numbers expressed as quotients (n/d).

Variations

Using the nested set model as described above has some performance limitations during certain tree traversal operations. For example, trying to find the immediate child nodes given a parent node requires pruning the subtree to a specific level as in the following SQL code example:

SELECTChild.Node,Child.Left,Child.RightFROMTreeasParent,TreeasChildWHEREChild.LeftBETWEENParent.LeftANDParent.RightANDNOTEXISTS(-- No Middle NodeSELECT*FROMTreeasMidWHEREMid.LeftBETWEENParent.LeftANDParent.RightANDChild.LeftBETWEENMid.LeftANDMid.RightANDMid.NodeNOTIN(Parent.Node,Child.Node))ANDParent.Left=1-- Given Parent Node Left Index

Or, equivalently:

SELECTDISTINCTChild.Node,Child.Left,Child.RightFROMTreeasChild,TreeasParentWHEREParent.Left<Child.LeftANDParent.Right>Child.Right-- associate Child Nodes with ancestorsGROUPBYChild.Node,Child.Left,Child.RightHAVINGmax(Parent.Left)=1-- Subset for those with the given Parent Node as the nearest ancestor

The query will be more complicated when searching for children more than one level deep. To overcome this limitation and simplify tree traversal an additional column is added to the model to maintain the depth of a node within a tree.

NodeLeftRightDepth
Clothing1220
Men's291
Women's10211
Suits382
Slacks453
Jackets673
Dresses11162
Skirts17182
Blouses19202
Evening Gowns12133
Sun Dresses14153
The resulting representation

In this model, finding the immediate children given a parent node can be accomplished with the following SQL code:

SELECTChild.Node,Child.Left,Child.RightFROMTreeasChild,TreeasParentWHEREChild.Depth=Parent.Depth+1ANDChild.Left>Parent.LeftANDChild.Right<Parent.RightANDParent.Depth=1-- Given Parent Node Left Index

See also

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

A relational database (RDB) is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A database management system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

<span class="mw-page-title-main">Tree (data structure)</span> Linked node hierarchical data structure

In computer science, a tree is a widely used abstract data type that represents a hierarchical tree structure with a set of connected nodes. Each node in the tree can be connected to many children, but must be connected to exactly one parent, except for the root node, which has no parent. These constraints mean there are no cycles or "loops", and also that each child can be treated like the root node of its own subtree, making recursion a useful technique for tree traversal. In contrast to linear data structures, many trees cannot be represented by relationships between neighboring nodes in a single straight line.

<span class="mw-page-title-main">Object–relational database</span> Database management system

An object–relational database (ORD), or object–relational database management system (ORDBMS), is a database management system (DBMS) similar to a relational database, but with an object-oriented database model: objects, classes and inheritance are directly supported in database schemas and in the query language. In addition, just as with pure relational systems, it supports extension of the data model with custom data types and methods.

In database theory, relational algebra is a theory that uses algebraic structures for modeling data, and defining queries on it with a well founded semantics. The theory was introduced by Edgar F. Codd.

First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if no attribute domain has relations as elements. Or more informally, that no table column can have tables as values. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal is a minimal requirement. SQL-92 does not support creating or using table-valued columns, which means that using only the "traditional relational database features" most relational databases will be in first normal form by necessity. Database systems which do not require first normal form are often called NoSQL systems. Newer SQL standards like SQL:1999 have started to allow so called non-atomic types, which include composite types. Even newer versions like SQL:2016 allow JSON.

A hierarchical database model is a data model in which the data are organized into a tree-like structure. The data are stored as records which are connected to one another through links. A record is a collection of fields, with each field containing only one value. The type of a record defines which fields the record contains.

In computer science, tree traversal is a form of graph traversal and refers to the process of visiting each node in a tree data structure, exactly once. Such traversals are classified by the order in which the nodes are visited. The following algorithms are described for a binary tree, but they may be generalized to other trees as well.

In a database, a view is the result set of a stored query, which can be queried in the same manner as a persistent database collection object. This pre-established query command is kept in the data dictionary. Unlike ordinary base tables in a relational database, a view does not form part of the physical schema: as a result set, it is a virtual table computed or collated dynamically from data in the database when access to that view is requested. Changes applied to the data in a relevant underlying table are reflected in the data shown in subsequent invocations of the view.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

An entity–attribute–value model (EAV) is a data model optimized for the space-efficient storage of sparse—or ad-hoc—property or data values, intended for situations where runtime usage patterns are arbitrary, subject to user variation, or otherwise unforeseeable using a fixed design. The use-case targets applications which offer a large or rich system of defined property types, which are in turn appropriate to a wide set of entities, but where typically only a small, specific selection of these are instantiated for a given entity. Therefore, this type of data model relates to the mathematical notion of a sparse matrix. EAV is also known as object–attribute–value model, vertical database model, and open schema.

<span class="mw-page-title-main">Database model</span> Type of data model

A database model is a type of data model that determines the logical structure of a database. It fundamentally determines in which manner data can be stored, organized and manipulated. The most popular example of a database model is the relational model, which uses a table-based format.

<span class="mw-page-title-main">Segment tree</span> Computer science data structure

In computer science, the segment tree is a data structure used for storing information about intervals or segments. It allows querying which of the stored segments contain a given point. A similar data structure is the interval tree.

A document-oriented database, or document store, is a computer program and data storage system designed for storing, retrieving and managing document-oriented information, also known as semi-structured data.

A hierarchical query is a type of SQL query that handles hierarchical model data. They are special cases of more general recursive fixpoint queries, which compute transitive closures.

A graph database (GDB) is a database that uses graph structures for semantic queries with nodes, edges, and properties to represent and store data. A key concept of the system is the graph. The graph relates the data items in the store to a collection of nodes and edges, the edges representing the relationships between the nodes. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases hold the relationships between data as a priority. Querying relationships is fast because they are perpetually stored in the database. Relationships can be intuitively visualized using graph databases, making them useful for heavily inter-connected data.

<span class="mw-page-title-main">Amazon DynamoDB</span> NoSQL database service

Amazon DynamoDB is a fully managed proprietary NoSQL database offered by Amazon.com as part of the Amazon Web Services portfolio. DynamoDB offers a fast persistent key–value datastore with built-in support for replication, autoscaling, encryption at rest, and on-demand backup among other features.

The following is provided as an overview of and topical guide to databases:

Cypher is a declarative graph query language that allows for expressive and efficient data querying in a property graph.

References

  1. "Nested Intervals Tree Encoding in SQL", Vadim Tropashko; Oracle Corp. Original at https://web.archive.org/web/20111119165033/http://sigmod.org/publications/sigmod-record/0506/p47-article-tropashko.pdf
  2. Hazel, Daniel (2008). "Using rational numbers to key nested sets". arXiv: 0806.3115 [cs.DB].
  3. Quassnoi (29 September 2009), "Adjacency list vs. nested sets: MySQL", Explain Extended, retrieved 11 December 2010
  4. Quassnoi (24 September 2009), "Adjacency list vs. nested sets: PostgreSQL", Explain Extended, retrieved 11 December 2010
  5. Quassnoi (28 September 2009), "Adjacency list vs. nested sets: Oracle", Explain Extended, retrieved 11 December 2010
  6. Quassnoi (25 September 2009), "Adjacency list vs. nested sets: SQL Server", Explain Extended, retrieved 11 December 2010
  7. "MySQL :: MySQL 8.0 Reference Manual :: 13.2.15 WITH (Common Table Expressions)". dev.mysql.com. Retrieved 2021-09-01.
  8. Bill, Karwin (2010-06-17). SQL Antipatterns. p. 328.
  9. Bill, Karwin. SQL Antipatterns. p. 44.