Conc-tree list

Last updated April 09, 2022

A conc-tree^[1]^[2] is a data structure that stores element sequences and provides amortized O(1) time append and prepend operations, O(log n) time insert and remove operations, and O(log n) time concatenation. It is particularly well suited to functional task-parallel and data-parallel programming, and is relatively simple to implement compared to other data structures with similar asymptotic complexity.^[1] Conc-trees were designed to improve the efficiency of data-parallel operations that do not require sequential left-to-right iteration order,^[3] and to improve constant factors in these operations by avoiding unnecessary copies of the data.^[2] Orthogonally, they are used to efficiently aggregate data in functional-style task-parallel algorithms, as an implementation of the conc-list data abstraction.^[4] The conc-list is a parallel programming counterpart to functional cons-lists, and was originally introduced by the Fortress language.

Operations

The basic conc-tree operation is concatenation. Conc-trees are built from the following basic data types:

    trait Conc[T] {
      def left: Conc[T]
      def right: Conc[T]
      def level: Int
      def size: Int
    }

    case class Empty[T] extends Conc[T] {
      def level = 0
      def size = 0
    }

    case class Single[T](elem: T) extends Conc[T] {
      def level = 0
      def size = 1
    }

    case class <>[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
      val level = 1 + math.max(left.level, right.level)
      val size = left.size + right.size
    }

The <> type represents inner nodes and is pronounced conc. The name is inspired by :: (the cons type) of functional lists, which are used in sequential programming.
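As an illustration of how these types fit together, the following is a minimal self-contained sketch. It makes two assumptions that the text leaves open: Empty is given an empty parameter list so that it compiles as a case class, and the leaves report an error when asked for a child. The toList helper and the ConcBasics wrapper object are hypothetical names introduced only for this example.

```scala
object ConcBasics {
  sealed trait Conc[T] {
    def left: Conc[T]
    def right: Conc[T]
    def level: Int
    def size: Int
  }

  // Leaves have no children; asking for one is treated as an error
  // (an assumption of this sketch).
  case class Empty[T]() extends Conc[T] {
    def left: Conc[T] = sys.error("Empty has no children")
    def right: Conc[T] = sys.error("Empty has no children")
    def level = 0
    def size = 0
  }

  case class Single[T](elem: T) extends Conc[T] {
    def left: Conc[T] = sys.error("Single has no children")
    def right: Conc[T] = sys.error("Single has no children")
    def level = 0
    def size = 1
  }

  // Inner node: level and size are computed once, when the node is built.
  case class <>[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
    val level = 1 + math.max(left.level, right.level)
    val size = left.size + right.size
  }

  // Hypothetical helper: collects the elements in left-to-right order.
  def toList[T](t: Conc[T]): List[T] = t match {
    case Empty()   => Nil
    case Single(x) => List(x)
    case <>(l, r)  => toList(l) ++ toList(r)
  }

  // ((1 <> 2) <> 3): two inner nodes over three leaves.
  val tree: Conc[Int] = new <>(new <>(Single(1), Single(2)), Single(3))
}
```

Here ConcBasics.tree has size 3 and level 2, and ConcBasics.toList(ConcBasics.tree) yields the elements in the order 1, 2, 3.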

Concatenation in O(log n) time then works by ensuring that the difference in levels (i.e. heights) between any two sibling trees is one or less, similar to invariants maintained in AVL trees. This invariant ensures that the height of the tree (length of the longest path from the root to some leaf) is always logarithmic in the number of elements in the tree. Concatenation is implemented as follows:

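    def concat[T](xs: Conc[T], ys: Conc[T]): Conc[T] = {
      val diff = ys.level - xs.level
      if (math.abs(diff) <= 1) new <>(xs, ys)  // levels are close enough: link directly
      else if (diff < -1) {
        // xs is the taller tree: descend along its right side
        if (xs.left.level >= xs.right.level) {
          val nr = concat(xs.right, ys)
          new <>(xs.left, nr)
        } else {
          val nrr = concat(xs.right.right, ys)
          if (nrr.level == xs.level - 3) {
            val nr = new <>(xs.right.left, nrr)
            new <>(xs.left, nr)
          } else {
            val nl = new <>(xs.left, xs.right.left)
            new <>(nl, nrr)
          }
        }
      } else {
        // symmetric case
      }
    }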
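For readers who want to run the concatenation, the sketch below spells out the symmetric case by mirroring the left-heavy branch (left and right swapped). This mirrored branch is an assumption of the sketch, derived from the code above rather than taken from the source. The helpers toList, balanced and fromRange, and the ConcConcat wrapper object, are hypothetical names used only to exercise the code. The sketch does not treat Empty specially, so it should be called with non-empty trees.

```scala
object ConcConcat {
  sealed trait Conc[T] {
    def left: Conc[T]
    def right: Conc[T]
    def level: Int
    def size: Int
  }
  // Leaves have no children; child access is an error (assumption of this sketch).
  case class Empty[T]() extends Conc[T] {
    def left: Conc[T] = sys.error("no children")
    def right: Conc[T] = sys.error("no children")
    def level = 0
    def size = 0
  }
  case class Single[T](elem: T) extends Conc[T] {
    def left: Conc[T] = sys.error("no children")
    def right: Conc[T] = sys.error("no children")
    def level = 0
    def size = 1
  }
  case class <>[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
    val level = 1 + math.max(left.level, right.level)
    val size = left.size + right.size
  }

  def concat[T](xs: Conc[T], ys: Conc[T]): Conc[T] = {
    val diff = ys.level - xs.level
    if (math.abs(diff) <= 1) new <>(xs, ys)
    else if (diff < -1) {
      // xs is taller: push ys down the right side of xs.
      if (xs.left.level >= xs.right.level) new <>(xs.left, concat(xs.right, ys))
      else {
        val nrr = concat(xs.right.right, ys)
        if (nrr.level == xs.level - 3) new <>(xs.left, new <>(xs.right.left, nrr))
        else new <>(new <>(xs.left, xs.right.left), nrr)
      }
    } else {
      // ys is taller: mirror image of the branch above.
      if (ys.right.level >= ys.left.level) new <>(concat(xs, ys.left), ys.right)
      else {
        val nll = concat(xs, ys.left.left)
        if (nll.level == ys.level - 3) new <>(new <>(nll, ys.left.right), ys.right)
        else new <>(nll, new <>(ys.left.right, ys.right))
      }
    }
  }

  // Hypothetical helper: elements in left-to-right order.
  def toList[T](t: Conc[T]): List[T] = t match {
    case Empty()   => Nil
    case Single(x) => List(x)
    case <>(l, r)  => toList(l) ++ toList(r)
  }

  // Hypothetical helper: checks that sibling levels differ by at most one everywhere.
  def balanced[T](t: Conc[T]): Boolean = t match {
    case <>(l, r) => math.abs(l.level - r.level) <= 1 && balanced(l) && balanced(r)
    case _        => true
  }

  // Hypothetical helper: builds the tree start, start + 1, ..., end - 1 by repeated concat.
  def fromRange(start: Int, end: Int): Conc[Int] =
    (start + 1 until end).foldLeft(Single(start): Conc[Int]) {
      (acc, x) => concat(acc, Single(x))
    }
}
```

Concatenating a short tree with a tall one, for example ConcConcat.concat(ConcConcat.fromRange(0, 3), ConcConcat.fromRange(3, 200)), exercises the mirrored branch. The result keeps the element order, and balanced can be used to confirm that the level invariant still holds.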

Amortized O(1) time appends (or prepends) are achieved by introducing a new inner node type called Append, and using it to encode a logarithmic-length list of conc-trees, strictly decreasing in height. Every Append node ap must satisfy the following invariants:

1. Level of ap.left.right is always strictly larger than the level of ap.right.

2. The tree ap.right never contains any Append nodes (i.e. it is in the normalized form, composed only from <>, Single and Empty).

With these invariants, appending is isomorphic to binary number addition: two adjacent trees of the same height can be linked in constant time, with at most a logarithmic number of carry operations. This is illustrated in the following figure, where an element is being appended to a conc-tree that corresponds to the binary number 11:

Figure: Conc-tree append operation.
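The carry analogy can be modelled on tree heights alone. The sketch below is a simplified model rather than part of the conc-tree implementation: it assumes every tree in the list is a complete binary tree, so a tree of height h holds exactly 2^h elements, and it keeps only the list of heights, smallest first. The names AppendCounter, appendLeaf, afterAppends and asBinary are hypothetical.

```scala
object AppendCounter {
  // Heights of the trees in the list, smallest first and strictly increasing.
  // Appending one element adds a tree of height 0 and then links equal-height
  // neighbours, exactly like propagating a carry in binary addition.
  def appendLeaf(heights: List[Int]): List[Int] = {
    @annotation.tailrec
    def carry(h: Int, rest: List[Int]): List[Int] = rest match {
      case x :: tail if x == h => carry(h + 1, tail) // two trees of height h link into one of height h + 1
      case _                   => h :: rest
    }
    carry(0, heights)
  }

  // Heights present after n single-element appends to an empty list.
  def afterAppends(n: Int): List[Int] =
    (1 to n).foldLeft(List.empty[Int])((hs, _) => appendLeaf(hs))

  // Reads the heights as the set bits of a binary number.
  def asBinary(heights: List[Int]): Int = heights.map(1 << _).sum
}
```

After three appends the heights are List(0, 1), which corresponds to binary 11. The fourth append carries twice and leaves List(2), which corresponds to binary 100.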

This binary number representation is similar to that of purely functional random access lists by Okasaki,^[5] with the difference that random access lists require all the trees to be complete binary trees, whereas conc-trees are more relaxed, and only require balanced trees. These more relaxed invariants allow conc-trees to retain logarithmic time concatenation, while random access lists allow only O(n) concatenation.

The following is an implementation of an append method that is worst-case O(log n) time and amortized O(1) time:

    case class Append[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
      val level = 1 + math.max(left.level, right.level)
      val size = left.size + right.size
    }

    private def append[T](xs: Append[T], ys: Conc[T]): Conc[T] =
      if (xs.right.level > ys.level) new Append(xs, ys)
      else {
        // the rightmost tree is no taller than ys: link them and carry the result leftwards
        val zs = new <>(xs.right, ys)
        xs.left match {
          case ws @ Append(_, _) => append(ws, zs)
          case ws =>
            if (ws.level <= zs.level) concat(ws, zs)
            else new Append(ws, zs)
        }
      }
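To try the append method end to end, it needs an entry point that accepts any tree, which the text does not show. The sketch below adds one under the hypothetical name appendLeaf, together with the helpers toList, appendNodes and build. The mirrored branch of concat is likewise an assumption obtained by swapping left and right.

```scala
object ConcAppend {
  sealed trait Conc[T] {
    def left: Conc[T]
    def right: Conc[T]
    def level: Int
    def size: Int
  }
  // Leaves have no children; child access is an error (assumption of this sketch).
  case class Empty[T]() extends Conc[T] {
    def left: Conc[T] = sys.error("no children")
    def right: Conc[T] = sys.error("no children")
    def level = 0
    def size = 0
  }
  case class Single[T](elem: T) extends Conc[T] {
    def left: Conc[T] = sys.error("no children")
    def right: Conc[T] = sys.error("no children")
    def level = 0
    def size = 1
  }
  case class <>[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
    val level = 1 + math.max(left.level, right.level)
    val size = left.size + right.size
  }
  case class Append[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
    val level = 1 + math.max(left.level, right.level)
    val size = left.size + right.size
  }

  // Balanced concatenation of two normalized trees (second branch mirrors the first).
  def concat[T](xs: Conc[T], ys: Conc[T]): Conc[T] = {
    val diff = ys.level - xs.level
    if (math.abs(diff) <= 1) new <>(xs, ys)
    else if (diff < -1) {
      if (xs.left.level >= xs.right.level) new <>(xs.left, concat(xs.right, ys))
      else {
        val nrr = concat(xs.right.right, ys)
        if (nrr.level == xs.level - 3) new <>(xs.left, new <>(xs.right.left, nrr))
        else new <>(new <>(xs.left, xs.right.left), nrr)
      }
    } else {
      if (ys.right.level >= ys.left.level) new <>(concat(xs, ys.left), ys.right)
      else {
        val nll = concat(xs, ys.left.left)
        if (nll.level == ys.level - 3) new <>(new <>(nll, ys.left.right), ys.right)
        else new <>(nll, new <>(ys.left.right, ys.right))
      }
    }
  }

  // Carry-style append onto a list of trees encoded with Append nodes.
  private def append[T](xs: Append[T], ys: Conc[T]): Conc[T] =
    if (xs.right.level > ys.level) new Append(xs, ys)
    else {
      val zs = new <>(xs.right, ys)
      xs.left match {
        case ws @ Append(_, _) => append(ws, zs)
        case ws =>
          if (ws.level <= zs.level) concat(ws, zs)
          else new Append(ws, zs)
      }
    }

  // Hypothetical entry point: appends one element to any tree.
  def appendLeaf[T](xs: Conc[T], y: T): Conc[T] = xs match {
    case Empty()          => Single(y)
    case s @ Single(_)    => new <>(s, Single(y))
    case t @ <>(_, _)     => new Append(t, Single(y))
    case a @ Append(_, _) => append(a, Single(y))
  }

  def toList[T](t: Conc[T]): List[T] = t match {
    case Empty()      => Nil
    case Single(x)    => List(x)
    case <>(l, r)     => toList(l) ++ toList(r)
    case Append(l, r) => toList(l) ++ toList(r)
  }

  // Counts the Append nodes along the left spine.
  def appendNodes[T](t: Conc[T]): Int = t match {
    case Append(l, _) => 1 + appendNodes(l)
    case _            => 0
  }

  def build(n: Int): Conc[Int] =
    (0 until n).foldLeft(Empty[Int](): Conc[Int])((acc, x) => appendLeaf(acc, x))
}
```

With this sketch, build(7) ends in two Append nodes (trees of 4, 2 and 1 elements), while build(8) carries all the way and collapses into a single balanced tree of level 3 with no Append nodes. The sketch compares ws.level with zs.level: a comparison against xs.level would always succeed, because xs.level is one more than the larger of its children's levels, and the final Append branch would never run.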

A conc-tree constructed this way never has more than O(log n) Append nodes, and can be converted back to the normalized form (one using only <>, Single and Empty nodes) in O(log n) time.
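One way to perform this conversion is to walk the Append nodes from the outside in and concatenate the trees they hold, smallest first. The sketch below is an assumption about how such a step could look, not code from the source. The names normalized, wrap, containsAppend and toList, and the hand-built sample tree, are hypothetical, and the mirrored branch of concat is obtained by swapping left and right.

```scala
object ConcNormalize {
  sealed trait Conc[T] {
    def left: Conc[T]
    def right: Conc[T]
    def level: Int
    def size: Int
  }
  // Leaves have no children; child access is an error (assumption of this sketch).
  case class Empty[T]() extends Conc[T] {
    def left: Conc[T] = sys.error("no children")
    def right: Conc[T] = sys.error("no children")
    def level = 0
    def size = 0
  }
  case class Single[T](elem: T) extends Conc[T] {
    def left: Conc[T] = sys.error("no children")
    def right: Conc[T] = sys.error("no children")
    def level = 0
    def size = 1
  }
  case class <>[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
    val level = 1 + math.max(left.level, right.level)
    val size = left.size + right.size
  }
  case class Append[T](left: Conc[T], right: Conc[T]) extends Conc[T] {
    val level = 1 + math.max(left.level, right.level)
    val size = left.size + right.size
  }

  // Balanced concatenation of two normalized trees (second branch mirrors the first).
  def concat[T](xs: Conc[T], ys: Conc[T]): Conc[T] = {
    val diff = ys.level - xs.level
    if (math.abs(diff) <= 1) new <>(xs, ys)
    else if (diff < -1) {
      if (xs.left.level >= xs.right.level) new <>(xs.left, concat(xs.right, ys))
      else {
        val nrr = concat(xs.right.right, ys)
        if (nrr.level == xs.level - 3) new <>(xs.left, new <>(xs.right.left, nrr))
        else new <>(new <>(xs.left, xs.right.left), nrr)
      }
    } else {
      if (ys.right.level >= ys.left.level) new <>(concat(xs, ys.left), ys.right)
      else {
        val nll = concat(xs, ys.left.left)
        if (nll.level == ys.level - 3) new <>(new <>(nll, ys.left.right), ys.right)
        else new <>(nll, new <>(ys.left.right, ys.right))
      }
    }
  }

  // Folds the list of trees held by Append nodes into one normalized tree.
  def normalized[T](xs: Conc[T]): Conc[T] = {
    @annotation.tailrec
    def wrap(rest: Conc[T], acc: Conc[T]): Conc[T] = rest match {
      case Append(ws, zs) => wrap(ws, concat(zs, acc)) // absorb the next tree to the left
      case other          => concat(other, acc)
    }
    xs match {
      case Append(l, r) => wrap(l, r)
      case other        => other // already normalized
    }
  }

  def containsAppend[T](t: Conc[T]): Boolean = t match {
    case Append(_, _) => true
    case <>(l, r)     => containsAppend(l) || containsAppend(r)
    case _            => false
  }

  def toList[T](t: Conc[T]): List[T] = t match {
    case Empty()      => Nil
    case Single(x)    => List(x)
    case <>(l, r)     => toList(l) ++ toList(r)
    case Append(l, r) => toList(l) ++ toList(r)
  }

  // Hand-built sample: trees of 4, 2 and 1 elements chained by Append nodes.
  private def leaf(i: Int): Conc[Int] = Single(i)
  val t4: Conc[Int] = new <>(new <>(leaf(0), leaf(1)), new <>(leaf(2), leaf(3)))
  val z: Conc[Int] = new <>(leaf(4), leaf(5))
  val appended: Conc[Int] = Append(Append(t4, z), leaf(6))
}
```

For the sample value, normalized(appended) first links the two smallest trees and then concatenates the result with the four-element tree, giving a tree of level 3 that holds the elements 0 to 6 in order and contains no Append nodes.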

A detailed demonstration of these operations can be found in online resources,^[6]^[7] or in the original conc-tree paper.^[1] It was shown that these basic operations can be extended to support worst-case O(1) deque operations,^[2] while keeping the O(log n) concatenation time bound, at the cost of increasing the constant factors of all operations.


References

[1] Prokopec, A. et al. (2015). Conc-Trees for Functional and Parallel Programming. Research paper.
[2] Prokopec, A. (2014). Data Structures and Algorithms for Data-Parallel Computing in a Managed Runtime. Doctoral thesis.
[3] Steele, G. (2009). Organizing Functional Code for Parallel Execution; or, foldl and foldr Considered Slightly Harmful.
[4] Steele, G. (2011). How to Think about Parallel Programming: Not!
[5] Okasaki, C. (1995). Purely Functional Random Access Lists.
[6] Conc-Tree presentation.
[7] Parallel Programming lecture on Conc-Trees at EPFL.

