Data cube

In computer programming contexts, a data cube (or datacube) is a multi-dimensional ("n-D") array of values. Typically, the term data cube is applied in contexts where these arrays are massively larger than the hosting computer's main memory; examples include multi-terabyte/petabyte data warehouses and time series of image data.

The data cube is used to represent data (sometimes called facts) along some dimensions of interest. For example, in online analytical processing (OLAP) such dimensions could be a company's subsidiaries, the products it offers, and time; in this setup, a fact would be a sales event in which a particular product was sold in a particular subsidiary at a particular time. In a satellite image time series, the dimensions would be latitude, longitude, and time; a fact (sometimes called a measure) would be a pixel at a given location and time as taken by the satellite (following some processing that is not of concern here).

Even though it is called a cube (and the examples above happen to be 3-dimensional for brevity), a data cube is a general multi-dimensional concept and can be 1-dimensional, 2-dimensional, 3-dimensional, or higher-dimensional. In any case, every dimension divides the data into groups of cells, and each cell in the cube represents a single measure of interest. Sometimes cubes hold only a few values, with the rest being empty (i.e. undefined), while sometimes most or all cube coordinates hold a cell value. In the first case the data are called sparse, and in the second case dense, although there is no hard delineation between the two.
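The difference between sparse and dense cubes can be illustrated with a minimal Python sketch. The axis labels, the sales figures, and the helper names `to_dense` and `density` are all hypothetical and chosen only for illustration; real systems use far more compact storage schemes.

```python
# Hypothetical axes for a small sales cube (all labels are illustrative).
subsidiaries = ["Berlin", "Paris"]
products = ["bolt", "nut", "washer"]
months = ["2024-01", "2024-02"]
axes = [subsidiaries, products, months]

# Sparse representation: store only the cells that hold a value,
# keyed by their coordinate along every dimension.
sparse_cube = {
    ("Berlin", "bolt", "2024-01"): 120.0,
    ("Paris", "nut", "2024-02"): 75.5,
}


def to_dense(sparse, axes, fill=None):
    """Expand a coordinate-keyed dict into nested lists, one level per axis."""
    def build(prefix, remaining):
        if not remaining:
            # All coordinates are fixed: look up the cell or use the fill value.
            return sparse.get(prefix, fill)
        return [build(prefix + (label,), remaining[1:]) for label in remaining[0]]
    return build((), axes)


def density(sparse, axes):
    """Fraction of cube coordinates that actually hold a value."""
    total = 1
    for axis in axes:
        total *= len(axis)
    return len(sparse) / total


dense_cube = to_dense(sparse_cube, axes)   # 2 x 3 x 2 nested lists
fill_ratio = density(sparse_cube, axes)    # 2 defined cells out of 12
```

In this sketch the dense form spends storage on every coordinate, including empty ones, whereas the sparse form grows only with the number of defined cells.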

History

Multi-dimensional arrays have long been familiar in programming languages. Fortran offers multi-dimensional arrays with arbitrary index bounds, with up to 15 dimensions since Fortran 2008. APL supports n-D arrays with a rich set of operations. What these have in common is that arrays must fit into main memory and are available only while the program maintaining them (such as image processing software) is running.

A series of data exchange formats support storage and transmission of data cube-like data, often tailored towards particular application domains. Examples include MDX for statistical (in particular, business) data, Hierarchical Data Format for general scientific data, and TIFF for imagery.

In 1992, Peter Baumann introduced management of massive data cubes with high-level user functionality combined with an efficient software architecture. [1] Datacube operations include subset extraction, processing, fusion, and in general queries in the spirit of data manipulation languages like SQL.

A few years later, Jim Gray et al. [2] and Venky Harinarayan, Anand Rajaraman and Jeff Ullman [3] applied the data cube concept to time-varying business data; both papers rank among the 500 most cited computer science articles over a 25-year period. [4]

Around that time, a working group on Multi-Dimensional Databases ("Arbeitskreis Multi-Dimensionale Datenbanken") was established at the German Gesellschaft für Informatik. [5] [6]

Datacube Inc. was an image processing company that sold hardware and software applications for the PC market in 1996, although it did not address data cubes as such.

The EarthServer initiative has established geo data cube service requirements. [7]

Standardization

In 2018, the ISO SQL database language was extended with data cube functionality as "SQL – Part 15: Multi-dimensional arrays (SQL/MDA)". [8]

Web Coverage Processing Service (WCPS) is a geo data cube analytics language issued by the Open Geospatial Consortium in 2008. In addition to the common data cube operations, the language is aware of the semantics of space and time and supports both regular and irregular grid data cubes, based on the concept of coverage data.

An industry standard for querying business data cubes, originally developed by Microsoft, is MultiDimensional eXpressions.

Implementation

Many high-level programming languages and array libraries treat data cubes and other large arrays as single entities distinct from their contents. Examples include Fortran, APL, IDL, NumPy, PDL, and S-Lang; they allow the programmer to manipulate complete film clips and other data en masse with simple expressions derived from linear algebra and vector mathematics. Some (such as PDL) distinguish between a list of images and a data cube, while many (such as IDL) do not.
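As an illustration of such whole-array expressions, the following NumPy sketch treats a small synthetic film clip as one object. The clip size, the axis order (time, y, x), and the variable names are assumptions made for the example.

```python
import numpy as np

# A hypothetical greyscale film clip: 4 frames of 3 x 5 pixels, axes (time, y, x).
clip = np.arange(4 * 3 * 5, dtype=np.float64).reshape(4, 3, 5)

# One expression rescales every pixel of every frame; no explicit loops.
brighter = clip * 1.5 + 10.0

# Reducing along the time axis yields a single 2-D image (the temporal mean).
mean_frame = clip.mean(axis=0)

# Subtracting the mean frame from each frame relies on broadcasting:
# the (3, 5) image is applied to all 4 frames of the (4, 3, 5) cube.
anomalies = clip - mean_frame
```

Each statement operates on the cube as a single value; the iteration over cells is left to the library.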

Array DBMSs (Database Management Systems) offer a data model which generically supports definition, management, retrieval, and manipulation of n-dimensional data cubes. This database category has been pioneered by the rasdaman system since 1994. [9]

Applications

Multi-dimensional arrays can meaningfully represent spatio-temporal sensor, image, and simulation data, but also statistics data where the semantics of dimensions is not necessarily of spatial or temporal nature. Generally, any kind of axis can be combined with any other into a data cube.

Mathematics

In mathematics, a one-dimensional array corresponds to a vector and a two-dimensional array to a matrix; more generally, a tensor may be represented as an n-dimensional data cube.
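A short NumPy sketch of this correspondence, with arbitrary example values, might look as follows. The contraction at the end is just one example of a tensor operation expressed on an n-D array.

```python
import numpy as np

vector = np.array([1.0, 2.0, 3.0])                     # 1-D array: a vector
matrix = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # 2-D array: a matrix
tensor = np.arange(24, dtype=np.float64).reshape(2, 3, 4)  # 3-D array

# The number of array dimensions of each object.
ranks = (vector.ndim, matrix.ndim, tensor.ndim)

# The ordinary matrix-vector product.
product = matrix @ vector

# Contracting the last axis of the 3-D array with a vector of ones
# sums over that axis and leaves a 2-D result of shape (2, 3).
contracted = np.tensordot(tensor, np.ones(4), axes=([2], [0]))
```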

Science and engineering

For a time sequence of color images, the array is generally four-dimensional, with the dimensions representing image X and Y coordinates, time, and RGB (or other color space) color plane. For example, the EarthServer initiative [10] unites data centers from different continents offering 3-D x/y/t satellite image time series and 4-D x/y/z/t weather data for retrieval and server-side processing through the Open Geospatial Consortium WCPS geo data cube query language standard.
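The four-dimensional layout of a color image sequence, and typical subset extractions from it, can be sketched in NumPy. The sizes, the random pixel values, and the axis order (time, y, x, channel) are assumptions; other layouts such as (time, channel, y, x) are equally common.

```python
import numpy as np

# Hypothetical sizes for a tiny synthetic RGB image sequence.
n_t, n_y, n_x, n_c = 6, 4, 5, 3
rng = np.random.default_rng(seed=0)
video = rng.integers(0, 256, size=(n_t, n_y, n_x, n_c), dtype=np.uint8)

# Slicing: fixing one coordinate removes that dimension.
frame_2 = video[2]               # 3-D: one RGB image, shape (y, x, channel)
red_over_time = video[..., 0]    # 3-D: red plane of every frame, shape (t, y, x)

# Trimming: keeping a sub-range along some axes leaves the dimensionality unchanged.
window = video[1:4, 0:2, 0:3, :]

# The time series at a single pixel is a 2-D (time, channel) extract.
pixel_series = video[:, 1, 2, :]
```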

A data cube is also used in the field of imaging spectroscopy, since a spectrally-resolved image is represented as a three-dimensional volume. Earth observation data cubes combine satellite imagery such as Landsat 8 and Sentinel-2 with geographic information system (GIS) analytics. [11]
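For a spectrally-resolved image, the two spatial axes and the spectral axis can be accessed as in the following NumPy sketch. The cube is synthetic, the axis order (y, x, band) is an assumption, and the band indices used for the normalised difference are placeholders rather than real sensor band assignments.

```python
import numpy as np

# Hypothetical hyperspectral cube with axes (y, x, band); values are synthetic.
n_y, n_x, n_bands = 3, 4, 8
cube = np.arange(n_y * n_x * n_bands, dtype=np.float64).reshape(n_y, n_x, n_bands)

# Fixing a band gives a 2-D image; fixing a pixel gives a 1-D spectrum.
band_image = cube[:, :, 5]
spectrum = cube[1, 2, :]

# A normalised difference of two bands, computed for every pixel at once.
# Band indices 6 and 2 are placeholders chosen for this example.
a, b = cube[:, :, 6], cube[:, :, 2]
index_map = (a - b) / (a + b)
```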

Business intelligence

In online analytical processing (OLAP), data cubes are a common arrangement of business data suitable for analysis from different perspectives through operations like slicing, dicing, pivoting, and aggregation.
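The four operations named above can be sketched in plain Python on a small coordinate-keyed cube. The sales records, the dimension names, and the helper functions `slice_cube`, `dice_cube`, `roll_up`, and `pivot` are hypothetical; they show the idea only and are not the API of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: (subsidiary, product, quarter) -> sales amount.
DIMS = ("subsidiary", "product", "quarter")
facts = {
    ("Berlin", "bolt", "Q1"): 100,
    ("Berlin", "nut", "Q1"): 40,
    ("Berlin", "bolt", "Q2"): 60,
    ("Paris", "bolt", "Q1"): 80,
    ("Paris", "nut", "Q2"): 30,
}


def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value and drop that dimension."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}


def dice_cube(cube, **allowed):
    """Dice: keep cells whose coordinates fall within the allowed value sets."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }


def roll_up(cube, keep):
    """Aggregate: sum the measure over every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    totals = defaultdict(int)
    for k, v in cube.items():
        totals[tuple(k[i] for i in idx)] += v
    return dict(totals)


def pivot(cube, order):
    """Pivot: reorder the dimensions of the coordinate keys."""
    idx = [DIMS.index(d) for d in order]
    return {tuple(k[i] for i in idx): v for k, v in cube.items()}


q1 = slice_cube(facts, "quarter", "Q1")
berlin_bolts = dice_cube(facts, subsidiary={"Berlin"}, product={"bolt"})
by_subsidiary = roll_up(facts, keep=["subsidiary"])
product_first = pivot(facts, order=["product", "subsidiary", "quarter"])
```

Here slicing reduces the cube from three to two dimensions, dicing keeps all three but restricts their ranges, and the roll-up collapses product and quarter into per-subsidiary totals.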

References

  1. Baumann, Peter (April 1992). "Language Support for Raster Image Manipulation in Databases". Graphics Modeling and Visualization in Science and Technology. Int. Workshop on Graphics Modeling, Visualization in Science & Technology. Darmstadt, Germany: Springer (published 1993). pp. 236–45. doi:10.1007/978-3-642-77811-7_19.
  2. Gray, Jim; Chaudhuri, Surajit; Bosworth, Adam; Layman, Andrew; Reichart, Don; Venkatrao, Murali; Pellow, Frank; Pirahesh, Hamid (January 1997). "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals". Data Mining and Knowledge Discovery. 1 (1): 29–53. doi:10.1023/A:1009726021843. S2CID 12502175.
  3. Harinarayan, Venky; Rajaraman, Anand; Ullman, Jeffrey D. (1996). "Implementing data cubes efficiently". Proceedings of the 1996 ACM SIGMOD international conference on Management of data – SIGMOD '96. Vol. 25. ACM SIGMOD. pp. 205–16. CiteSeerX 10.1.1.41.1205. doi:10.1145/233269.233333. ISBN 978-0897917940. S2CID 3104453.
  4. 500 Most Cited Computer Science Articles (501–600), CiteSeer. 12 June 2009. Retrieved 21 March 2017.
  5. "Datenbank Rundbrief, Ausgabe 19, Mai 1997". dblp. DE: Uni Trier.
  6. "Datenbank Rundbrief, Ausgabe 23, Mai 1999". dblp. DE: Uni Trier.
  7. "The DatabaseManifesto". Earth server. EU. Retrieved 2017-09-21.
  8. "Part 15: Multi-dimensional arrays (SQL/MDA)". DIS 9075-15 Information technology – Database languages – SQL. ISO/IEC. Retrieved 2018-05-27.
  9. "Management of Multidimensional Discrete Data" (PDF). VLDB. Retrieved 2017-09-21.
  10. "EarthServer - Big Datacube Analytics at Your Fingertips". Earth server. EU. Retrieved 2017-03-31.
  11. Kopp, Steve; Becker, Peter; Doshi, Abhijit; Wright, Dawn J.; Zhang, Kaixi; Xu, Hong (2019). "Achieving the Full Vision of Earth Observation Data Cubes". Data. 4 (3): 94. doi:10.3390/data4030094.