Data orientation

Data orientation refers to how tabular data is represented in a linear memory model, whether on disk or in memory. The two most common representations are column-oriented (columnar format) and row-oriented (row format). [1] [2]

The choice of data orientation is a trade-off and an architectural decision in databases, query engines, and numerical simulations. [1] As a result of these trade-offs, row-oriented formats are more commonly used in online transaction processing (OLTP) and column-oriented formats in online analytical processing (OLAP). [2]

Examples of column-oriented formats include Apache ORC, [3] Apache Parquet, [4] Apache Arrow, [5] and the formats used by BigQuery, Amazon Redshift, and Snowflake. Prominent examples of row-oriented formats include CSV, the formats used by most relational databases, the in-memory format of Apache Spark, and Apache Avro. [6]

Description

Tabular data is two-dimensional in nature: data is represented in rows and columns. However, modern operating systems logically represent data in a linear memory model, both on disk and in memory. [7] [8] [9] Storing a table in a linear memory model therefore requires projecting its two-dimensional items onto a one-dimensional space. Data orientation refers to the choice made in this projection. There are two prominent orientations: row-oriented and column-oriented. [1] [2]

Row-oriented

In the row-oriented layout, the elements of the table

column 1   column 2   column 3
item 11    item 12    item 13
item 21    item 22    item 23

are stored linearly as

item 11, item 12, item 13, item 21, item 22, item 23

That is, the rows of the table are placed one after the other. In this orientation, values in the same row are close together in memory (i.e. have nearby addresses in an addressable space).
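
The layout can be sketched in a few lines of Python (a toy illustration; no particular format is assumed):

  # Row-major flattening of the 2-by-3 table above: the rows are
  # concatenated, and element (r, c) ends up at offset r * n_cols + c.
  table = [["item 11", "item 12", "item 13"],
           ["item 21", "item 22", "item 23"]]
  n_rows, n_cols = len(table), len(table[0])

  row_oriented = [item for row in table for item in row]

  def offset(r, c):
      return r * n_cols + c   # same-row neighbours are adjacent in memory

  assert row_oriented[offset(1, 0)] == "item 21"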

Column-oriented

In the column-oriented layout, the elements of the table

column 1   column 2   column 3
item 11    item 12    item 13
item 21    item 22    item 23

are stored linearly as

item 11, item 21, item 12, item 22, item 13, item 23

That is, the columns of the table are placed one after the other. In this orientation, values in the same column are close together in memory (i.e. have nearby addresses in an addressable space).
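
The mirror-image sketch of the row-oriented example above (again a toy illustration):

  # Column-major flattening of the same table: the columns are
  # concatenated, and element (r, c) ends up at offset c * n_rows + r.
  table = [["item 11", "item 12", "item 13"],
           ["item 21", "item 22", "item 23"]]
  n_rows, n_cols = len(table), len(table[0])

  column_oriented = [table[r][c] for c in range(n_cols) for r in range(n_rows)]

  def offset(r, c):
      return c * n_rows + r   # same-column neighbours are adjacent in memory

  assert column_oriented[offset(0, 2)] == "item 13"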

Examples

See the list of column-oriented DBMSes for more examples.

Tradeoff

Data orientation is an important architectural decision for systems that handle data, because it results in significant trade-offs in performance and storage. [8] Below are selected dimensions of this trade-off.

Random access

Row-oriented formats benefit from fast random access to rows. Column-oriented formats benefit from fast random access to columns. In both cases, this is the result of fewer page or cache misses when accessing the data. [8]
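
A toy sketch of the access patterns (plain Python, no particular engine assumed): reading a row from the row-oriented buffer is one contiguous slice, while reading the same row from the column-oriented buffer requires strided jumps; the situation is symmetric for columns.

  row_major = ["item 11", "item 12", "item 13", "item 21", "item 22", "item 23"]
  col_major = ["item 11", "item 21", "item 12", "item 22", "item 13", "item 23"]
  n_rows, n_cols = 2, 3

  def row_from_row_major(r):
      return row_major[r * n_cols:(r + 1) * n_cols]   # one contiguous slice

  def row_from_col_major(r):
      return col_major[r::n_rows]                     # n_cols strided reads

  assert row_from_row_major(1) == row_from_col_major(1)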

Insert

Row-oriented formats benefit from fast insertion of a new row. Column-oriented formats benefit from fast insertion of a new column.

This dimension is an important reason why row-oriented formats are more commonly used in online transaction processing (OLTP), as it results in faster transactions than column-oriented formats. [2]
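
A toy sketch of the asymmetry (plain Python containers, no particular database assumed): appending a row is a single write at the end of a row store, but touches every column array in a column store.

  rows = [("alice", 34), ("bob", 41)]          # row store: list of records
  columns = {"name": ["alice", "bob"],         # column store: one array
             "age": [34, 41]}                  # per column

  new_row = ("carol", 29)

  rows.append(new_row)                         # row-oriented: one append

  for name, value in zip(columns, new_row):    # column-oriented: one write
      columns[name].append(value)              # per column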

Conditional access

Row-oriented formats benefit from fast access under a filter (selecting whole rows). Column-oriented formats benefit from fast access under a projection (selecting whole columns). [4] [3]
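
A toy sketch of the two access shapes (the names and data are illustrative): a filter returns whole records, which are contiguous in the row layout, while a projection returns whole columns, which are contiguous in the columnar layout.

  rows = [("alice", 34), ("bob", 41), ("carol", 29)]
  columns = {"name": ["alice", "bob", "carol"], "age": [34, 41, 29]}

  # Filter (e.g. WHERE age > 30): each matching record is one contiguous row.
  over_30 = [row for row in rows if row[1] > 30]

  # Projection (e.g. SELECT age): the requested column is one contiguous array.
  ages = columns["age"]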

Compute performance

Column-oriented formats benefit from fast analytics operations. This is the result of being able to leverage SIMD instructions over contiguous, homogeneous values. [5]
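
As a minimal illustration (NumPy is used here as a stand-in for any vectorised kernel; it is not implied by the cited formats), a contiguous, homogeneous column can be aggregated without stepping over unrelated fields:

  import numpy as np

  age_column = np.array([34, 41, 29, 57, 38], dtype=np.int32)  # contiguous column
  total = age_column.sum()    # vectorised reduction over contiguous memory
  mean = age_column.mean()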

Uncompressed size

Column-oriented formats benefit from a smaller uncompressed size, because this orientation makes it possible to represent certain data types with dedicated encodings. [4] [3]

For example, a table of 128 rows with a boolean column requires 128 bytes in a row-oriented format (one byte per boolean) but only 128 bits (16 bytes) in a column-oriented format (via a bitmap). Another example is the use of run-length encoding to encode a column.
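
Both encodings can be worked in a few lines (numpy.packbits and itertools.groupby are used purely for illustration):

  import numpy as np
  from itertools import groupby

  # Bitmap: 128 booleans stored as one bit each instead of one byte each.
  flags = np.zeros(128, dtype=bool)
  assert flags.astype(np.uint8).nbytes == 128   # one byte per boolean
  assert np.packbits(flags).nbytes == 16        # one bit per boolean

  # Run-length encoding of a column with runs of repeated values.
  country = ["US", "US", "US", "FR", "FR"]
  rle = [(v, len(list(g))) for v, g in groupby(country)]   # [('US', 3), ('FR', 2)]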

Compressed size

Column-oriented formats benefit from a smaller compressed size, because values within a single column are more homogeneous than values across multiple rows. [4] [3]
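
A toy demonstration of the effect (zlib stands in for any general-purpose codec; the data is synthetic): the same bytes, grouped by column rather than interleaved by row, typically compress to a smaller size.

  import random
  import zlib

  random.seed(0)
  names = [random.choice(["alice", "bob", "carol", "dave"]) for _ in range(1000)]
  ages = [str(random.randint(20, 60)) for _ in range(1000)]

  row_layout = "".join(n + a for n, a in zip(names, ages)).encode()
  col_layout = ("".join(names) + "".join(ages)).encode()
  assert len(row_layout) == len(col_layout)     # identical uncompressed size

  print(len(zlib.compress(row_layout)), len(zlib.compress(col_layout)))
  # On data like this, the column-grouped bytes usually compress smaller.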

Conversion and interchange

Because both orientations represent the same data, a row-oriented dataset can be converted to a column-oriented dataset, and vice versa, at the cost of additional computation. In particular, advanced query engines often leverage each orientation's advantages and convert from one orientation to the other as part of their execution; a minimal conversion sketch follows the list below. As an example, an Apache Spark query may:

  1. read data from Apache Parquet (column-oriented)
  2. load it into Spark's internal in-memory format (row-oriented)
  3. convert it to Apache Arrow for a specific computation (column-oriented)
  4. write it to Apache Avro for streaming (row-oriented)
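
The conversion itself is a transposition; below is a minimal pure-Python sketch, independent of the Spark pipeline above (the record data is illustrative):

  rows = [("alice", 34), ("bob", 41), ("carol", 29)]
  column_names = ("name", "age")

  # Row-oriented -> column-oriented: transpose the records.
  columns = dict(zip(column_names, (list(col) for col in zip(*rows))))
  # {'name': ['alice', 'bob', 'carol'], 'age': [34, 41, 29]}

  # Column-oriented -> row-oriented: transpose back (round trip).
  assert list(zip(*(columns[c] for c in column_names))) == rows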

References

  1. "Column-stores vs. row-stores: how different are they really?". SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. doi:10.1145/1376616.1376712.
  2. "Compacting Transactional Data in Hybrid OLTP&OLAP Databases". Proceedings of the VLDB Endowment. doi:10.14778/2350229.2350258.
  3. "Apache ORC". Retrieved 2024-05-21.
  4. "Apache Parquet". Retrieved 2024-05-21.
  5. "Apache Arrow". Retrieved 2024-05-21.
  6. "Apache Avro". Retrieved 2024-05-21.
  7. "In lieu of swap: Analyzing compressed RAM in Mac OS X and Linux". Digital Investigation. doi:10.1016/j.diin.2014.05.011.
  8. M. Frans Kaashoek, Jerome H. Saltzer. Principles of Computer System Design. ISBN 978-0-12-374957-4.
  9. "Chapter 4 Process Address Space (Linux kernel documentation)". Retrieved 2024-05-21.