PAM library

Last updated

PAM (Parallel Augmented Maps) is an open-source parallel C++ library implementing the interface for sequence, ordered sets, ordered maps, and augmented maps. [1] The library is available on GitHub. It uses the underlying balanced binary tree structure using join-based algorithms. [1] PAM supports four balancing schemes, including AVL trees, red-black trees, treaps and weight-balanced trees.

Contents

PAM is a parallel library and is also safe for concurrency. Its parallelism can be supported by cilk, OpenMP or the scheduler in PBBS. [2] Theoretically, all algorithms in PAM are work-efficient and have polylogarithmic depth. PAM uses underlying persistent tree structure such that multi-versioning is allowed. PAM also supports efficient GC.

Interface

Sequences

To define a sequence, users need to specify the key type of the sequence.

PAM supports functions on sequences including construction, find an entry with a certain rank, first, last, next, previous, size, empty, filter, map-reduce, concatenating, etc.

Ordered sets

To define an ordered set, users need to specify the key type and the comparison function defining a total ordering on the key type.

On top of the sequence interface, PAM also supports functions for ordered sets including insertion, deletion, union, intersection, difference, etc.

Ordered maps

To define an ordered map, users need to specify the key type, the comparison function on the key type, and the value type.

On top of the ordered set interface, PAM also supports functions for ordered maps, such as insertion with combining values.

Augmented maps

To define an augmented map, users need to specify the key type, the comparison function on the key type, the value type, the augmented value type, the base function, the combine function and the identity of the combine function.

On top of the ordered map interface, PAM also supports functions for augmented maps, such as aug_range.

In addition to the tree structures, PAM also implements the prefix structure for augmented maps.

Implementation for Example Applications

The library also provides example implementations for a number of applications, including 1D stabbing query (using interval trees, 2D range query (using a range tree and a sweepline algorithm), 2D segment query (using a segment tree and a sweepline algorithm), 2D rectangle query (using a tree structure and a sweepline algorithm), inverted index searching, etc.

Used in applications

The library has been tested in various applications, including database benchmarks, [3] 2D segment tree, [4] 2D interval tree, [1] inverted index [1] and multiversion concurrency control. [5]

Related Research Articles

In computer science, an abstract data type (ADT) is a mathematical model for data types. An abstract data type is defined by its behavior (semantics) from the point of view of a user, of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This mathematical model contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

<span class="mw-page-title-main">Data structure</span> Particular way of storing and organizing data in a computer

In computer science, a data structure is a data organization, management, and storage format that is usually chosen for efficient access to data. More precisely, a data structure is a collection of data values, the relationships among them, and the functions or operations that can be applied to the data, i.e., it is an algebraic structure about data.

In computer science, a priority queue is an abstract data-type similar to a regular queue or stack data structure. Each element in a priority queue has an associated priority. In a priority queue, elements with high priority are served before elements with low priority. In some implementations, if two elements have the same priority, they are served in the same order in which they were enqueued. In other implementations, the order of elements with the same priority is undefined.

In computer programming, an iterator is an object that enables a programmer to traverse a container, particularly lists. Various types of iterators are often provided via a container's interface. Though the interface and semantics of a given iterator are fixed, iterators are often implemented in terms of the structures underlying a container implementation and are often tightly coupled to the container to enable the operational semantics of the iterator. An iterator performs traversal and also gives access to data elements in a container, but does not itself perform iteration.

In computer science, a set is an abstract data type that can store unique values, without any particular order. It is a computer implementation of the mathematical concept of a finite set. Unlike most other collection types, rather than retrieving a specific element from a set, one typically tests a value for membership in a set.

In computer science, declarative programming is a programming paradigm—a style of building the structure and elements of computer programs—that expresses the logic of a computation without describing its control flow.

In computing, a persistent data structure or not ephemeral data structure is a data structure that always preserves the previous version of itself when it is modified. Such data structures are effectively immutable, as their operations do not (visibly) update the structure in-place, but instead always yield a new updated structure. The term was introduced in Driscoll, Sarnak, Sleator, and Tarjans' 1986 article.

Skeleton programming is a style of computer programming based on simple high-level program structures and so called dummy code. Program skeletons resemble pseudocode, but allow parsing, compilation and testing of the code. Dummy code is inserted in a program skeleton to simulate processing and avoid compilation error messages. It may involve empty function declarations, or functions that return a correct result only for a simple test case where the expected response of the code is known.

In computer science, weight-balanced binary trees (WBTs) are a type of self-balancing binary search trees that can be used to implement dynamic sets, dictionaries (maps) and sequences. These trees were introduced by Nievergelt and Reingold in the 1970s as trees of bounded balance, or BB[α] trees. Their more common name is due to Knuth.

Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.

Microsoft SQL Server is a proprietary relational database management system developed by Microsoft. As a database server, it is a software product with the primary function of storing and retrieving data as requested by other software applications—which may run either on the same computer or on another computer across a network. Microsoft markets at least a dozen different editions of Microsoft SQL Server, aimed at different audiences and for workloads ranging from small single-machine applications to large Internet-facing applications with many concurrent users.

<span class="mw-page-title-main">Cartesian tree</span> Binary tree derived from a sequence of numbers

In computer science, a Cartesian tree is a binary tree derived from a sequence of numbers. The smallest number in the sequence is at the root of the tree; its left and right subtrees are constructed recursively from the subsequences to the left and right of this number. When all numbers are distinct, the Cartesian tree is uniquely defined from the properties that it is heap-ordered and that a symmetric (in-order) traversal of the tree returns the original sequence.

In computing, algorithmic skeletons, or parallelism patterns, are a high-level parallel programming model for parallel and distributed computing.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications which devote most of their execution time to computational requirements are deemed compute-intensive, whereas computing applications which require large volumes of data and devote most of their processing time to I/O and manipulation of data are deemed data-intensive.

eXtremeDB is a high-performance, low-latency, ACID-compliant embedded database management system using an in-memory database system (IMDS) architecture and designed to be linked into C/C++ based programs. It works on Windows, Linux, and other real-time and embedded operating systems.

SequenceL is a general purpose functional programming language and auto-parallelizing compiler and tool set, whose primary design objectives are performance on multi-core processor hardware, ease of programming, platform portability/optimization, and code clarity and readability. Its main advantage is that it can be used to write straightforward code that automatically takes full advantage of all the processing power available, without programmers needing to be concerned with identifying parallelisms, specifying vectorization, avoiding race conditions, and other challenges of manual directive-based programming approaches such as OpenMP.

In computer science, join-based tree algorithms are a class of algorithms for self-balancing binary search trees. This framework aims at designing highly-parallelized algorithms for various balanced binary search trees. The algorithmic framework is based on a single operation join. Under this framework, the join operation captures all balancing criteria of different balancing schemes, and all other functions join have generic implementation across different balancing schemes. The join-based algorithms can be applied to at least four balancing schemes: AVL trees, red–black trees, weight-balanced trees and treaps.

In computer science, the augmented map is an abstract data type (ADT) based on ordered maps, which associates each ordered map an augmented value. For an ordered map with key type , comparison function on and value type , the augmented value is defined based on two functions: a base function and a combine function , where is the type of the augmented value. The base function converts a single entry in to an augmented value, and the combine function combines multiple augmented values. The combine function is required to be associative and have an identity . We extend the definition of the associative function as follows:

<span class="mw-page-title-main">Concurrent hash table</span>

A concurrent hash table or concurrent hash map is an implementation of hash tables allowing concurrent access by multiple threads using a hash function.

The flattening transformation is an algorithm that transforms nested data parallelism into flat data parallelism. It was pioneered by Guy Blelloch as part of the NESL programming language. The flattening transformation is also sometimes called vectorization, but is completely unrelated to automatic vectorization. The original flattening algorithm was concerned solely with first-order multidimensional arrays containing primitive types, but was extended to handle higher-order and recursive data types in the work on Data Parallel Haskell.

References

  1. 1 2 3 4 Sun, Yihan; Ferizovic, Daniel; Belloch, Guy E. (23 March 2018). "PAM: parallel augmented maps". ACM SIGPLAN Notices. 53 (1): 290–304. doi:10.1145/3200691.3178509. ISSN   0362-1340 . Retrieved 5 September 2020.
  2. Problem Based Benchmark Suite Library
  3. Sun, Yihan; Blelloch, Guy E.; Lim, Wan Shen; Pavlo, Andrew (1 October 2019). "On supporting efficient snapshot isolation for hybrid workloads with multi-versioned indexes". Proceedings of the VLDB Endowment. 13 (2): 211–225. doi:10.14778/3364324.3364334. ISSN   2150-8097. S2CID   204841857.
  4. Sun, Yihan; Blelloch, Guy E. (1 January 2019). "Parallel Range, Segment and Rectangle Queries with Augmented Maps". 2019 Proceedings of the Meeting on Algorithm Engineering and Experiments (ALENEX). Society for Industrial and Applied Mathematics: 159–173. doi: 10.1137/1.9781611975499.13 . ISBN   978-1-61197-549-9.
  5. Ben-David, Naama; Blelloch, Guy E.; Sun, Yihan; Wei, Yuanhao (17 June 2019). "Multiversion Concurrency with Bounded Delay and Precise Garbage Collection". The 31st ACM Symposium on Parallelism in Algorithms and Architectures. Association for Computing Machinery. pp. 241–252. doi: 10.1145/3323165.3323185 . ISBN   9781450361842.