Relational data stream management system

Last updated

A relational data stream management system (RDSMS) is a distributed, in-memory data stream management system (DSMS) that is designed to use standards-compliant SQL queries to process unstructured and structured data streams in real-time. Unlike SQL queries executed in a traditional RDBMS, which return a result and exit, SQL queries executed in a RDSMS do not exit, generating results continuously as new data become available. Continuous SQL queries in a RDSMS use the SQL Window function to analyze, join and aggregate data streams over fixed or sliding windows. Windows can be specified as time-based or row-based.

Contents

RDSMS SQL Query Examples

Continuous SQL queries in a RDSMS conform to the ANSI SQL standards. The most common RDSMS SQL query is performed with the declarative SELECT statement. A continuous SQL SELECT operates on data across one or more data streams, with optional keywords and clauses that include FROM with an optional JOIN subclause to specify the rules for joining multiple data streams, the WHERE clause and comparison predicate to restrict the records returned by the query, GROUP BY to project streams with common values into a smaller set, HAVING to filter records resulting from a GROUP BY, and ORDER BY to sort the results.

The following is an example of a continuous data stream aggregation using a SELECT query that aggregates a sensor stream from a weather monitoring station. The SELECTquery aggregates the minimum, maximum and average temperature values over a one-second time period, returning a continuous stream of aggregated results at one second intervals.

SELECTSTREAMFLOOR(WEATHERSTREAM.ROWTIMEtoSECOND)ASFLOOR_SECOND,MIN(TEMP)ASMIN_TEMP,MAX(TEMP)ASMAX_TEMP,AVG(TEMP)ASAVG_TEMPFROMWEATHERSTREAMGROUPBYFLOOR(WEATHERSTREAM.ROWTIMETOSECOND);

RDSMS SQL queries also operate on data streams over time or row-based windows. The following example shows a second continuous SQL query using the WINDOW clause with a one-second duration. The WINDOW clause changes the behavior of the query, to output a result for each new record as it arrives. Hence the output is a stream of incrementally updated results with zero result latency.

SELECTSTREAMROWTIME,MIN(TEMP)OVERW1ASWMIN_TEMP,MAX(TEMP)OVERW1ASWMAX_TEMP,AVG(TEMP)OVERW1ASWAVG_TEMPFROMWEATHERSTREAMWINDOWW1AS(RANGEINTERVAL'1'SECONDPRECEDING);

See also

Related Research Articles

A relational database is a digital database based on the relational model of data, as proposed by E. F. Codd in 1970. A software system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems have an option of using the SQL for querying and maintaining the database.

SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e. data incorporating relations among entities and variables.

A join clause in SQL – corresponding to a join operation in relational algebra – combines columns from one or more tables into a new table. ANSI-standard SQL specifies five types of JOIN: INNER, LEFT OUTER, RIGHT OUTER, FULL OUTER and CROSS.

The SQL SELECT statement returns a result set of records, from one or more tables.

In a database, a view is the result set of a stored query on the data, which the database users can query just as they would in a persistent database collection object. This pre-established query command is kept in the database dictionary. Unlike ordinary base tables in a relational database, a view does not form part of the physical schema: as a result set, it is a virtual table computed or collated dynamically from data in the database when access to that view is requested. Changes applied to the data in a relevant underlying table are reflected in the data shown in subsequent invocations of the view. In some NoSQL databases, views are the only way to query data.

Null (SQL) Marker used in SQL databases to indicate a value does not exist

Null or NULL is a special marker used in Structured Query Language to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL Null serves to fulfil the requirement that all true relational database management systems (RDMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent Null in database theory. In SQL, NULL is a reserved word used to identify this marker.

StreamSQL is a query language that extends SQL with the ability to process real-time data streams. SQL is primarily intended for manipulating relations, which are finite bags of tuples (rows). StreamSQL adds the ability to manipulate streams, which are infinite sequences of tuples that are not all available at the same time. Because streams are infinite, operations over streams must be monotonic. Queries over streams are generally "continuous", executing for long periods of time and returning incremental results.

Gadfly is a relational database management system written in Python. Gadfly is a collection of Python modules that provides relational database functionality entirely implemented in Python. It supports a subset of the standard RDBMS Structured Query Language (SQL).

In database management, an aggregate function or aggregation function is a function where the values of multiple rows are grouped together to form a single summary value.

A GROUP BY statement in SQL specifies that a SQL SELECT statement partitions result rows into groups, based on their values in one or several columns. Typically, grouping is used to apply some sort of aggregate function for each group.

Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.

In SQL, a window function or analytic function is a function which uses values from one or multiple rows to return a value for each row. Window functions have an OVER clause; any function without an OVER clause is not a window function, but rather an aggregate or single-row (scalar) function.

A hierarchical query is a type of SQL query that handles hierarchical model data. They are special cases of more general recursive fixpoint queries, which compute transitive closures.

A data stream management system (DSMS) is a computer software system to manage continuous data streams. It is similar to a database management system (DBMS), which is, however, designed for static data in conventional databases. A DSMS also offers a flexible query processing so that the information needed can be expressed using queries. However, in contrast to a DBMS, a DSMS executes a continuous query that is not only performed once, but is permanently installed. Therefore, the query is continuously executed until it is explicitly uninstalled. Since most DSMS are data-driven, a continuous query produces new results as long as new data arrive at the system. This basic concept is similar to Complex event processing so that both technologies are partially coalescing.

Apache Hive

Apache Hive is a data warehouse software project built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries (HiveQL) into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

SQLstream is a distributed, SQL standards-compliant plus Java stream processing platform. SQLstream, Inc. is based in San Francisco, California and was launched in 2009 by Damian Black, Edan Kabatchnik and Julian Hyde, author of the open source Mondrian Relational OLAP Server Engine.

QUEL is a relational database query language, based on tuple relational calculus, with some similarities to SQL. It was created as a part of the Ingres DBMS effort at University of California, Berkeley, based on Codd's earlier suggested but not implemented Data Sub-Language ALPHA. QUEL was used for a short time in most products based on the freely available Ingres source code, most notably in an implementation called POSTQUEL supported by POSTGRES. As Oracle and DB2 gained market share in the early 1980s, most companies then supporting QUEL moved to SQL instead. QUEL continues to be available as a part of the Ingres DBMS, although no QUEL-specific language enhancements have been added for many years.

Cypher is a declarative graph query language that allows for expressive and efficient data querying in a property graph.

SQLf is a SQL extended with fuzzy set theory application for expressing flexible (fuzzy) queries to traditional Relational Databases. Among the known extensions proposed to SQL, at the present time, this is the most complete, because it allows the use of diverse fuzzy elements in all the constructions of the language SQL.

The syntax of the SQL programming language is defined and maintained by ISO/IEC SC 32 as part of ISO/IEC 9075. This standard is not freely available. Despite the existence of the standard, SQL code is not completely portable among different database systems without adjustments.