Array DBMS

Last updated

An array database management system or array DBMS provides database services specifically for arrays (also called raster data), that is: homogeneous collections of data items (often called pixels, voxels, etc.), sitting on a regular grid of one, two, or more dimensions. Often arrays are used to represent sensor, simulation, image, or statistics data. Such arrays tend to be Big Data, with single objects frequently ranging into Terabyte and soon Petabyte sizes; for example, today's earth and space observation archives typically grow by Terabytes a day. Array databases aim at offering flexible, scalable storage and retrieval on this information category.

Contents

Euclidean neighborhood of elements in arrays Euclidean neighborhood in n-D arrays.png
Euclidean neighborhood of elements in arrays

Overview

In the same style as standard database systems do on sets, Array DBMSs offer scalable, flexible storage and flexible retrieval/manipulation on arrays of (conceptually) unlimited size. As in practice arrays never appear standalone, such an array model normally is embedded into some overall data model, such as the relational model. Some systems implement arrays as an analogy to tables, some introduce arrays as an additional attribute type.

Management of arrays requires novel techniques, particularly due to the fact that traditional database tuples and objects tend to fit well into a single database page  a unit of disk access on server, typically 4  KB   while array objects easily can span several media. The prime task of the array storage manager is to give fast access to large arrays and sub-arrays. To this end, arrays get partitioned, during insertion, into so-called tiles or chunks of convenient size which then act as units of access during query evaluation.

Array DBMSs offer query languages giving declarative access to such arrays, allowing to create, manipulate, search, and delete them. Like with, e.g., SQL, expressions of arbitrary complexity can be built on top of a set of core array operations. Due to the extensions made in the data and query model, Array DBMSs sometimes are subsumed under the NoSQL category, in the sense of "not only SQL". Query optimization and parallelization are important for achieving scalability; actually, many array operators lend themselves well towards parallel evaluation, by processing each tile on separate nodes or cores.

Important application domains of Array DBMSs include Earth, Space, Life, and Social sciences, as well as the related commercial applications (such as hydrocarbon exploration in industry and OLAP in business). The variety occurring can be observed, e.g., in geo data where 1-D environmental sensor time series, 2-D satellite images, 3-D x/y/t image time series and x/y/z geophysics data, as well as 4-D x/y/z/t climate and ocean data can be found.

History and status

The relational data model, which is prevailing today, does not directly support the array paradigm to the same extent as sets and tuples. ISO SQL lists an array-valued attribute type, but this is only one-dimensional, with almost no operational support, and not usable for the application domains of Array DBMSs. Another option is to resort to BLOBs ("binary large objects") which are the equivalent to files: byte strings of (conceptually) unlimited length, but again without any query language functionality, such as multi-dimensional subsetting.

First significant work in going beyond BLOBs has been established with PICDMS. [1] This system offers the precursor of a 2-D array query language, albeit still procedural and without suitable storage support.

A first declarative query language suitable for multiple dimensions and with an algebra-based semantics has been published by Baumann, together with a scalable architecture. [2] [3] Another array database language, constrained to 2-D, has been presented by Marathe and Salem. [4] Seminal theoretical work has been accomplished by Libkin et al.; [5] in their model, called NCRA, they extend a nested relational calculus with multidimensional arrays; among the results are important contributions on array query complexity analysis. A map algebra, suitable for 2-D and 3-D spatial raster data, has been published by Mennis et al. [6]

In terms of Array DBMS implementations, the rasdaman system has the longest implementation track record of n-D arrays with full query support. Oracle GeoRaster offers chunked storage of 2-D raster maps, albeit without SQL integration. TerraLib is an open-source GIS software that extends object-relational DBMS technology to handle spatio-temporal data types; while main focus is on vector data, there is also some support for rasters. Starting with version 2.0, PostGIS embeds raster support for 2-D rasters; a special function offers declarative raster query functionality. SciQL is an array query language being added to the MonetDB DBMS. SciDB is a more recent initiative to establish array database support. Like SciQL, arrays are seen as an equivalent to tables, rather than a new attribute type as in rasdaman and PostGIS.

For the special case of sparse data, OLAP data cubes are well established; they store cell values together with their location  an adequate compression technique in face of the few locations carrying valid information at all  and operate with SQL on them. As this technique does not scale in density, standard databases are not used today for dense data, like satellite images, where most cells carry meaningful information; rather, proprietary ad hoc implementations prevail in scientific data management and similar situations. Hence, this is where Array DBMSs can make a particular contribution.

Generally, Array DBMSs are an emerging technology. While operationally deployed systems exist, like Oracle GeoRaster, PostGIS 2.0 and rasdaman, there are still many open research questions, including query language design and formalization, query optimization, parallelization and distributed processing, and scalability issues in general. Besides, scientific communities still appear reluctant in taking up array database technology and tend to favor specialized, proprietary technology.

Concepts

When adding arrays to databases, all facets of database design need to be reconsidered  ranging from conceptual modeling (such as suitable operators) over storage management (such as management of arrays spanning multiple media) to query processing (such as efficient processing strategies).

Conceptual modeling

Formally, an array A is given by a (total or partial) function A: XV where X, the domain is a d-dimensional integer interval for some d> 0 and V, called range, is some (non-empty) value set; in set notation, this can be rewritten as { (p,v) | pX, vV}. Each (p,v) in A denotes an array element or cell, and following common notation we write A[p] = v. Examples for X include {0..767} × {0..1023} (for XGA sized images), examples for V include {0..255} for 8-bit greyscale images and {0..255} × {0..255} × {0..255} for standard RGB imagery.

Following established database practice, an array query language should be declarative and safe in evaluation. As iteration over an array is at the heart of array processing, declarativeness very much centers on this aspect. The requirement, then, is that conceptually all cells should be inspected simultaneously  in other words, the query does not enforce any explicit iteration sequence over the array cells during evaluation. Evaluation safety is achieved when every query terminates after a finite number of (finite-time) steps; again, avoiding general loops and recursion is a way of achieving this. At the same time, avoiding explicit loop sequences opens up manifold optimization opportunities.

Array querying

As an example for array query operators the rasdaman algebra and query language can serve, which establish an expression language over a minimal set of array primitives. We begin with the generic core operators and then present common special cases and shorthands.

The marray operator creates an array over some given domain extent and initializes its cells:

marray index-range-specification values cell-value-expression 

where index-range-specification defines the result domain and binds an iteration variable to it, without specifying iteration sequence. The cell-value-expression is evaluated at each location of the domain.

Example: "A cutout of array A given by the corner points (10,20) and (40,50)."

marray p in [10:20,40:50] values A[p] 

This special case, pure subsetting, can be abbreviated as

A[10:20,40:50] 

This subsetting keeps the dimension of the array; to reduce dimension by extracting slices, a single slicepoint value is indicated in the slicing dimension.

Example: "A slice through an x/y/t timeseries at position t=100, retrieving all available data in x and y."

A[*:*,*:*,100] 

The wildcard operator * indicates that the current boundary of the array is to be used; note that arrays where dimension boundaries are left open at definition time may change size in that dimensions over the array's lifetime.

The above examples have simply copied the original values; instead, these values may be manipulated.

Example: "Array A, with a log() applied to each cell value."

marray p in domain(A) values log( A[p] ) 

This can be abbreviated as:

log( A ) 

Through a principle called induced operations, [7] the query language offers all operations the cell type offers on array level, too. Hence, on numeric values all the usual unary and binary arithmetic, exponential, and trigonometric operations are available in a straightforward manner, plus the standard set of Boolean operators.

The condense operator aggregates cell values into one scalar result, similar to SQL aggregates. Its application has the general form:

condense condense-op over index-range-specification using cell-value-expression 

As with marray before, the index-range-specification specifies the domain to be iterated over and binds an iteration variable to it  again, without specifying iteration sequence. Likewise, cell-value-expression is evaluated at each domain location. The condense-op clause specifies the aggregating operation used to combine the cell value expressions into one single value.

Example: "The sum over all values in A."

condense + over p in sdom(A) using A[p] 

A shorthand for this operation is:

add_cells( A ) 

In the same manner and in analogy to SQL aggregates, a number of further shorthands are provided, including counting, average, minimum, maximum, and Boolean quantifiers.

The next example demonstrates combination of marray and condense operators by deriving a histogram.

Example: "A histogram over 8-bit greyscale image A."

marray bucket in [0:255] values count_cells( A = bucket ) 

The induced comparison, A=bucket, establishes a Boolean array of the same extent as A. The aggregation operator counts the occurrences of true for each value of bucket, which subsequently is put into the proper array cell of the 1-D histogram array.

Such languages allow formulating statistical and imaging operations which can be expressed analytically without using loops. It has been proven [8] that the expressive power of such array languages in principle is equivalent to relational query languages with ranking.

Array storage

Array storage has to accommodate arrays of different dimensions and typically large sizes. A core task is to maintain spatial proximity on disk so as to reduce the number of disk accesses during subsetting. Note that an emulation of multi-dimensional arrays as nested lists (or 1-D arrays) will not per se accomplish this and, therefore, in general will not lead to scalable architectures.

Commonly arrays are partitioned into sub-arrays which form the unit of access. Regular partitioning where all partitions have the same size (except possibly for boundaries) is referred to as chunking. [9] A generalization which removes the restriction to equally sized partitions by supporting any kind of partitioning is tiling. [10] Array partitioning can improve access to array subsets significantly: by adjusting tiling to the access pattern, the server ideally can fetch all required data with only one disk access.

Compression of tiles can sometimes reduce substantially the amount of storage needed. Also for transmission of results compression is useful, as for the large amounts of data under consideration networks bandwidth often constitutes a limiting factor.

Query processing

A tile-based storage structure suggests a tile-by-tile processing strategy (in rasdaman called tile streaming). A large class of practically relevant queries can be evaluated by loading tile after tile, thereby allowing servers to process arrays orders of magnitude beyond their main memory.

Transformation of a query to a more efficient, but equivalent version during array query optimization Sample rule for heuristic array query optimization.png
Transformation of a query to a more efficient, but equivalent version during array query optimization

Due to the massive sizes of arrays in scientific/technical applications in combination with often complex queries, optimization plays a central role in making array queries efficient. Both hardware and software parallelization can be applied. An example for heuristic optimization is the rule "maximum value of an array resulting from the cell-wise addition of two input images is equivalent to adding the maximum values of each input array". By replacing the left-hand variant by the right-hand expression, costs shrink from three (costly) array traversals to two array traversals plus one (cheap) scalar operation (see Figure, which uses the SQL/MDA query standard).

Application domains

In many if not most cases where some phenomenon is sampled or simulated the result is a rasterized data set which can conveniently be stored, retrieved, and forwarded as an array. Typically, the array data are ornamented with metadata describing them further; for example, geographically referenced imagery will carry its geographic position and the coordinate reference system in which it is expressed.

The following are representative domains in which large-scale multi-dimensional array data are handled:

These are but examples; generally, arrays frequently represent sensor, simulation, image, and statistics data. More and more spatial and time dimensions are combined with abstract axes, such as sales and products; one example where such abstract axes are explicitly foreseen is the [Open_Geospatial_Consortium |Open Geospatial Consortium] (OGC) coverage model.

Standardization

Many communities have established data exchange formats, such as HDF, NetCDF, and TIFF. A de facto standard in the Earth Science communities is OPeNDAP, a data transport architecture and protocol. While this is not a database specification, it offers important components that characterize a database system, such as a conceptual model and client/server implementations.

A declarative geo raster query language, Web Coverage Processing Service (WCPS), has been standardized by the Open Geospatial Consortium (OGC).

In June 2014, ISO/IEC JTC1 SC32 WG3, which maintains the SQL database standard, has decided to add multi-dimensional array support to SQL as a new column type, [11] based on the initial array support available since the 2003 version of SQL. The new standard, adopted in Fall 2018, is named ISO 9075 SQL Part 15: MDA (Multi-Dimensional Arrays).

List of array DBMSs

See also

Related Research Articles

<span class="mw-page-title-main">Database</span> Organized collection of data in computing

In computing, a database is an organized collection of data or a type of data store based on the use of a database management system (DBMS), the software that interacts with end users, applications, and the database itself to capture and analyze the data. The DBMS additionally encompasses the core facilities provided to administer the database. The sum total of the database, the DBMS and the associated applications can be referred to as a database system. Often the term "database" is also used loosely to refer to any of the DBMS, the database system or an application associated with the database.

A relational database is a database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relational database systems are equipped with the option of using SQL for querying and updating the database.

Online analytical processing, or OLAP, is an approach to answer multi-dimensional analytical (MDA) queries swiftly in computing. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if no attribute domain has relations as elements. Or more informally, that no table column can have tables as values. Database normalization is the process of representing a database in terms of relations in standard normal forms, where first normal is a minimal requirement. SQL-92 does not support creating or using table-valued columns, which means that using only the "traditional relational database features" most relational databases will be in first normal form by necessity. Database systems which do not require first normal form are often called NoSQL systems. Newer SQL standards like SQL:1999 have started to allow so called non-atomic types, which include composite types. Even newer versions like SQL:2016 allow JSON.

A coverage is the digital representation of some spatio-temporal phenomenon. ISO 19123 provides the definition:

<span class="mw-page-title-main">Hierarchical Data Format</span> Set of file formats

Hierarchical Data Format (HDF) is a set of file formats designed to store and organize large amounts of data. Originally developed at the U.S. National Center for Supercomputing Applications, it is supported by The HDF Group, a non-profit corporation whose mission is to ensure continued development of HDF5 technologies and the continued accessibility of data stored in HDF.

K is a proprietary array processing programming language developed by Arthur Whitney and commercialized by Kx Systems. The language serves as the foundation for kdb+, an in-memory, column-based database, and other related financial products. The language, originally developed in 1993, is a variant of APL and contains elements of Scheme. Advocates of the language emphasize its speed, facility in handling arrays, and expressive syntax.

<span class="mw-page-title-main">Row- and column-major order</span> Array representation in computer memory

In computing, row-major order and column-major order are methods for storing multidimensional arrays in linear storage such as random access memory.

Essbase is a multidimensional database management system (MDBMS) that provides a platform upon which to build analytic applications. Essbase began as a product from Arbor Software, which merged with Hyperion Software in 1998. Oracle Corporation acquired Hyperion Solutions Corporation in 2007. Until late 2005 IBM also marketed an OEM version of Essbase as DB2 OLAP Server.

In computer programming contexts, a data cube is a multi-dimensional ("n-D") array of values. Typically, the term data cube is applied in contexts where these arrays are massively larger than the hosting computer's main memory; examples include multi-terabyte/petabyte data warehouses and time series of image data.

<span class="mw-page-title-main">Null (SQL)</span> Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.

Multidimensional Expressions (MDX) is a query language for online analytical processing (OLAP) using a database management system. Much like SQL, it is a query language for OLAP cubes. It is also a calculation language, with syntax similar to spreadsheet formulae.

Oracle Spatial and Graph, formerly Oracle Spatial, is a free option component of the Oracle Database. The spatial features in Oracle Spatial and Graph aid users in managing geographic and location-data in a native type within an Oracle database, potentially supporting a wide range of applications — from automated mapping, facilities management, and geographic information systems (AM/FM/GIS), to wireless location services and location-enabled e-business. The graph features in Oracle Spatial and Graph include Oracle Network Data Model (NDM) graphs used in traditional network applications in major transportation, telcos, utilities and energy organizations and RDF semantic graphs used in social networks and social interactions and in linking disparate data sets to address requirements from the research, health sciences, finance, media and intelligence communities.

<span class="mw-page-title-main">Peter Baumann (computer scientist)</span> German computer scientist

Peter Baumann is a German computer scientist and professor at Constructor University, Bremen, Germany, where he is head of the Large-Scale Scientific Information Systems research group in the Department of Computer Science and Electrical Engineering.

The Open Geospatial Consortium Web Coverage Service Interface Standard (WCS) defines Web-based retrieval of coverages – that is, digital geospatial information representing space/time-varying phenomena.

In computer science, the Bx tree is a query that is used to update efficient B+ tree-based index structures for moving objects.

In database management systems (DBMS), a prepared statement, parameterized statement, or parameterized query is a feature where the database pre-compiles SQL code and stores the results, separating it from data. Benefits of prepared statements are:

The following is provided as an overview of and topical guide to databases:

rasdaman Database management system

rasdaman is an Array DBMS, that is: a Database Management System which adds capabilities for storage and retrieval of massive multi-dimensional arrays, such as sensor, image, simulation, and statistics data. A frequently used synonym to arrays is raster data, such as in 2-D raster graphics; this actually has motivated the name rasdaman. However, rasdaman has no limitation in the number of dimensions - it can serve, for example, 1-D measurement data, 2-D satellite imagery, 3-D x/y/t image time series and x/y/z exploration data, 4-D ocean and climate data, and even beyond spatio-temporal dimensions.

Cubes is a light-weight open source multidimensional modelling and OLAP toolkit for development reporting applications and browsing of aggregated data written in Python programming language released under the MIT License.

References

  1. Chock, M., Cardenas, A., Klinger, A.: Database structure and manipulation capabilities of a picture database management system (PICDMS). IEEE ToPAMI, 6(4):484–492, 1984
  2. Baumann, P.: On the Management of Multidimensional Discrete Data. VLDB Journal 4(3)1994, Special Issue on Spatial Database Systems, pp. 401–444
  3. Baumann, P.: A Database Array Algebra for Spatio-Temporal Data and Beyond. Proc. NGITS'99, LNCS 1649, Springer 1999, pp.76-93
  4. Marathe, A., Salem, K.: A language for manipulating arrays. Proc. VLDB'97, Athens, Greece, August 1997, pp. 46–55
  5. Libkin, L., Machlin, R., Wong, L.: A query language for multidimensional arrays: design, implementation and optimization techniques. Proc. ACM SIGMOD'96, Montreal, Canada, pp. 228–239
  6. Mennis, J., Viger, R., Tomlin, C.D.: Cubic Map Algebra Functions for Spatio-Temporal Analysis. Cartography and Geographic Information Science 32(1)2005, pp. 17–32
  7. Ritter, G. and Wilson, J. and Davidson, J.: Image Algebra: An Overview. Computer Vision, Graphics, and Image Processing, 49(1)1994, 297-336
  8. Machlin, R.: Index-Based Multidimensional Array Queries: Safety and Equivalence. Proc. ACM PODS'07, Beijing, China, June 2007, pp. 175–184
  9. Sarawagi, S., Stonebraker, M.: Efficient Organization of Large Multidimensional Arrays. Proc. ICDE'94, Houston, USA, 1994, pp. 328-336
  10. Furtado, P., Baumann, P.: Storage of Multidimensional Arrays based on Arbitrary Tiling. Proc. ICDE'99, March 23–26, 1999, Sydney, Australia, pp. 328–336
  11. Chirgwin, R.: SQL fights back against NoSQL's big data cred with SQL/MDA spec, The Register, 26 Jun 2014