Data stream management system

Last updated

A data stream management system (DSMS) is a computer software system to manage continuous data streams. It is similar to a database management system (DBMS), which is, however, designed for static data in conventional databases. A DBMS also offers a flexible query processing so that the information needed can be expressed using queries. However, in contrast to a DBMS, a DSMS executes a continuous query that is not only performed once, but is permanently installed. Therefore, the query is continuously executed until it is explicitly uninstalled. Since most DSMS are data-driven, a continuous query produces new results as long as new data arrive at the system. This basic concept is similar to Complex event processing so that both technologies are partially coalescing.

Contents

Functional principle

One important feature of a DSMS is the possibility to handle potentially infinite and rapidly changing data streams by offering flexible processing at the same time, although there are only limited resources such as main memory. The following table provides various principles of DSMS and compares them to traditional DBMS.

Database management system (DBMS)Data stream management system (DSMS)
Persistent data (relations)Volatile data streams
Random accessSequential access
One-time queriesContinuous queries
(Theoretically) unlimited secondary storageLimited main memory
Only the current state is relevantConsideration of the order of the input
Relatively low update ratePotentially extremely high update rate
Little or no time requirementsReal-time requirements
Assumes exact dataAssumes outdated/inaccurate data
Plannable query processingVariable data arrival and data characteristics

Processing and streaming models

One of the biggest challenges for a DSMS is to handle potentially infinite data streams using a fixed amount of memory and no random access to the data. There are different approaches to limit the amount of data in one pass, which can be divided into two classes. For the one hand, there are compression techniques that try to summarize the data and for the other hand there are window techniques that try to portion the data into (finite) parts.

Synopses

The idea behind compression techniques is to maintain only a synopsis of the data, but not all (raw) data points of the data stream. The algorithms range from selecting random data points called sampling to summarization using histograms, wavelets or sketching. One simple example of a compression is the continuous calculation of an average. Instead of memorizing each data point, the synopsis only holds the sum and the number of items. The average can be calculated by dividing the sum by the number. However, it should be mentioned that synopses cannot reflect the data accurately. Thus, a processing that is based on synopses may produce inaccurate results.

Windows

Instead of using synopses to compress the characteristics of the whole data streams, window techniques only look on a portion of the data. This approach is motivated by the idea that only the most recent data are relevant. Therefore, a window continuously cuts out a part of the data stream, e.g. the last ten data stream elements, and only considers these elements during the processing. There are different kinds of such windows like sliding windows that are similar to FIFO lists or tumbling windows that cut out disjoint parts. Furthermore, the windows can also be differentiated into element-based windows, e.g., to consider the last ten elements, or time-based windows, e.g., to consider the last ten seconds of data. There are also different approaches to implementing windows. There are, for example, approaches that use timestamps or time intervals for system-wide windows or buffer-based windows for each single processing step. Sliding-window query processing is also suitable to being implemented in parallel processors by exploiting parallelism between different windows and/or within each window extent. [1]

Query processing

Since there are a lot of prototypes, there is no standardized architecture. However, most DSMS are based on the query processing in DBMS by using declarative languages to express queries, which are translated into a plan of operators. These plans can be optimized and executed. A query processing often consists of the following steps.

Formulation of continuous queries

The formulation of queries is mostly done using declarative languages like SQL in DBMS. Since there are no standardized query languages to express continuous queries, there are a lot of languages and variations. However, most of them are based on SQL, such as the Continuous Query Language (CQL), StreamSQL and ESP. There are also graphical approaches where each processing step is a box and the processing flow is expressed by arrows between the boxes.

The language strongly depends on the processing model. For example, if windows are used for the processing, the definition of a window has to be expressed. In StreamSQL, a query with a sliding window for the last 10 elements looks like follows:

SELECTAVG(price)FROMexamplestream[SIZE10ADVANCE1TUPLES]WHEREvalue>100.0

This stream continuously calculates the average value of "price" of the last 10 tuples, but only considers those tuples whose prices are greater than 100.0.

In the next step, the declarative query is translated into a logical query plan. A query plan is a directed graph where the nodes are operators and the edges describe the processing flow. Each operator in the query plan encapsulates the semantic of a specific operation, such as filtering or aggregation. In DSMSs that process relational data streams, the operators are equal or similar to the operators of the Relational algebra, so that there are operators for selection, projection, join, and set operations. This operator concept allows the very flexible and versatile processing of a DSMS.

Optimization of queries

The logical query plan can be optimized, which strongly depends on the streaming model. The basic concepts for optimizing continuous queries are equal to those from database systems. If there are relational data streams and the logical query plan is based on relational operators from the Relational algebra, a query optimizer can use the algebraic equivalences to optimize the plan. These may be, for example, to push selection operators down to the sources, because they are not so computationally intensive like join operators.

Furthermore, there are also cost-based optimization techniques like in DBMS, where a query plan with the lowest costs is chosen from different equivalent query plans. One example is to choose the order of two successive join operators. In DBMS this decision is mostly done by certain statistics of the involved databases. But, since the data of a data streams is unknown in advance, there are no such statistics in a DSMS. However, it is possible to observe a data stream for a certain time to obtain some statistics. Using these statistics, the query can also be optimized later. So, in contrast to a DBMS, some DSMS allows to optimize the query even during runtime. Therefore, a DSMS needs some plan migration strategies to replace a running query plan with a new one.

Transformation of queries

Since a logical operator is only responsible for the semantics of an operation but does not consist of any algorithms, the logical query plan must be transformed into an executable counterpart. This is called a physical query plan. The distinction between a logical and a physical operator plan allows more than one implementation for the same logical operator. The join, for example, is logically the same, although it can be implemented by different algorithms like a Nested loop join or a Sort-merge join. Notice, these algorithms also strongly depend on the used stream and processing model. Finally, the query is available as a physical query plan.

Execution of queries

Since the physical query plan consists of executable algorithms, it can be directly executed. For this, the physical query plan is installed into the system. The bottom of the graph (of the query plan) is connected to the incoming sources, which can be everything like connectors to sensors. The top of the graph is connected to the outgoing sinks, which may be for example a visualization. Since most DSMSs are data-driven, a query is executed by pushing the incoming data elements from the source through the query plan to the sink. Each time when a data element passes an operator, the operator performs its specific operation on the data element and forwards the result to all successive operators.

Examples

See also

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

Denormalization is a strategy used on a previously-normalized database to increase performance. In computing, denormalization is the process of trying to improve the read performance of a database, at the expense of losing some write performance, by adding redundant copies of data or by grouping data. It is often motivated by performance or scalability in relational database software needing to carry out very large numbers of read operations. Denormalization differs from the unnormalized form in that denormalization benefits can only be fully realized on a data model that is otherwise normalized.

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

The relational model (RM) is an approach to managing data using a structure and language consistent with first-order predicate logic, first described in 1969 by English computer scientist Edgar F. Codd, where all data is represented in terms of tuples, grouped into relations. A database organized in terms of the relational model is a relational database.

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB/2, then DB2 until 2017 and finally changed to its present form.

In database theory, relational algebra is a theory that uses algebraic structures for modeling data, and defining queries on it with a well founded semantics. The theory was introduced by Edgar F. Codd.

Dataphor is an open-source truly-relational database management system (RDBMS) and its accompanying user interface technologies, which together are designed to provide highly declarative software application development. The Dataphor Server has its own storage engine or it can be a virtual, or federated, DBMS, meaning that it can utilize other database engines for storage.

A query plan is a sequence of steps used to access data in a SQL relational database management system. This is a specific case of the relational model concept of access plans.

<span class="mw-page-title-main">MonetDB</span> Open source column-oriented relational database management system

MonetDB is an open-source column-oriented relational database management system (RDBMS) originally developed at the Centrum Wiskunde & Informatica (CWI) in the Netherlands. It is designed to provide high performance on complex queries against large databases, such as combining tables with hundreds of columns and millions of rows. MonetDB has been applied in high-performance applications for online analytical processing, data mining, geographic information system (GIS), Resource Description Framework (RDF), text retrieval and sequence alignment processing.

Object–relational impedance mismatch creates difficulties going from data in relational data stores to usage in domain-driven object models. Object-orientation (OO) is the default method for business-centric design in programming languages. The problem lies in neither relational nor OO, but in the conceptual difficulty mapping between the two logic models. Both are logical models implementable differently on database servers, programming languages, design patterns, or other technologies. Issues range from application to enterprise scale, whenever stored relational data is used in domain-driven object models, and vice versa. Object-oriented data stores can trade this problem for other implementation difficulties.

Query optimization is a feature of many relational database management systems and other databases such as NoSQL and graph databases. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans.

StreamSQL is a query language that extends SQL with the ability to process real-time data streams. SQL is primarily intended for manipulating relations, which are finite bags of tuples (rows). StreamSQL adds the ability to manipulate streams, which are infinite sequences of tuples that are not all available at the same time. Because streams are infinite, operations over streams must be monotonic. Queries over streams are generally "continuous", executing for long periods of time and returning incremental results.

Database administration is the function of managing and maintaining database management systems (DBMS) software. Mainstream DBMS software such as Oracle, IBM Db2 and Microsoft SQL Server need ongoing management. As such, corporations that use DBMS software often hire specialized information technology personnel called database administrators or DBAs.

In a SQL database query, a correlated subquery is a subquery that uses values from the outer query. Because the subquery may be evaluated once for each row processed by the outer query, it can be slow.

Strozzi NoSQL is a shell-based relational database management system initialized and developed by Carlo Strozzi that runs under Unix-like operating systems, or others with compatibility layers. Its file name NoSQL merely reflects the fact that it does not express its queries using Structured Query Language; the NoSQL RDBMS is distinct from the circa-2009 general concept of NoSQL databases, which are typically non-relational, unlike the NoSQL RDBMS. Strozzi NoSQL is released under the GNU GPL.

QUEL is a relational database query language, based on tuple relational calculus, with some similarities to SQL. It was created as a part of the Ingres DBMS effort at University of California, Berkeley, based on Codd's earlier suggested but not implemented Data Sub-Language ALPHA. QUEL was used for a short time in most products based on the freely available Ingres source code, most notably in an implementation called POSTQUEL supported by POSTGRES. As Oracle and DB2 gained market share in the early 1980s, most companies then supporting QUEL moved to SQL instead. QUEL continues to be available as a part of the Ingres DBMS, although no QUEL-specific language enhancements have been added for many years.

The following is provided as an overview of and topical guide to databases:

rasdaman Database management system

rasdaman is an Array DBMS, that is: a Database Management System which adds capabilities for storage and retrieval of massive multi-dimensional arrays, such as sensor, image, simulation, and statistics data. A frequently used synonym to arrays is raster data, such as in 2-D raster graphics; this actually has motivated the name rasdaman. However, rasdaman has no limitation in the number of dimensions - it can serve, for example, 1-D measurement data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z exploration data, 4-D ocean and climate data, and even beyond spatio-temporal dimensions.

<span class="mw-page-title-main">Array DBMS</span> System that provides database services specifically for arrays

An array database management system or array DBMS provides database services specifically for arrays, that is: homogeneous collections of data items, sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be Big Data, with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.

A relational data stream management system (RDSMS) is a distributed, in-memory data stream management system (DSMS) that is designed to use standards-compliant SQL queries to process unstructured and structured data streams in real-time. Unlike SQL queries executed in a traditional RDBMS, which return a result and exit, SQL queries executed in a RDSMS do not exit, generating results continuously as new data become available. Continuous SQL queries in a RDSMS use the SQL Window function to analyze, join and aggregate data streams over fixed or sliding windows. Windows can be specified as time-based or row-based.

References

  1. De Matteis, Tiziano; Mencagli, Gabriele (25 March 2016). "Parallel Patterns for Window-Based Stateful Operators on Data Streams: An Algorithmic Skeleton Approach". International Journal of Parallel Programming. 45 (2): 382–401. doi:10.1007/s10766-016-0413-x. S2CID   255600.
  2. Abadi; et al. Aurora: A Data Stream Management System. SIGMOD 2003. CiteSeerX   10.1.1.67.8671 .
  3. Jianjun Chen; David J. DeWitt; Feng Tian; Yuan Wang (2000). "NiagaraCQ: A Scalable Continuous Query System for Internet Databases" (PDF). Computer Sciences Department. University of Wisconsin–Madison . SIGMOD. Retrieved 21 November 2018.
  4. Arasu, A., et al. STREAM: The Stanford Data Stream Management System. Technical Report. 2004, Stanford InfoLab.
  5. "Chandrasekaran, S. et al, "TelegraphCQ: Continuous Dataflow Processing for an Uncertain World." CIDR 2003" (PDF). Archived from the original (PDF) on 7 February 2014. Retrieved 26 August 2011.