Data orientation

Data orientation refers to how tabular data is represented in a linear memory model, whether on disk or in memory. The two most common representations are column-oriented (columnar format) and row-oriented (row format). [1] [2]

The choice of data orientation is a trade-off and an architectural decision in databases, query engines, and numerical simulations. [1] As a result of these trade-offs, row-oriented formats are more commonly used in online transaction processing (OLTP) and column-oriented formats in online analytical processing (OLAP). [2]

Examples of column-oriented formats include Apache ORC, [3] Apache Parquet, [4] Apache Arrow, [5] and the formats used by BigQuery, Amazon Redshift, and Snowflake. Prominent examples of row-oriented formats include CSV, the formats used by most relational databases, the in-memory format of Apache Spark, and Apache Avro. [6]

Description

Tabular data is two-dimensional in nature: data is represented in rows and columns. However, modern operating systems logically represent data in a linear memory model, both on disk and in memory. [7] [8] [9] Storing a table in a linear memory model therefore requires projecting its two-dimensional items onto a one-dimensional space. Data orientation refers to the choice made in this projection. There are two prominent orientations: row-oriented and column-oriented. [1] [2]

Row-oriented

In the row-oriented layout, the elements of the table

column 1   column 2   column 3
item 11    item 12    item 13
item 21    item 22    item 23

are stored linearly as

item 11, item 12, item 13, item 21, item 22, item 23

That is, the rows of the table are placed one after the other. In this orientation, values in the same row are close together in memory (i.e. have nearby addresses in an addressable space).
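
The layout can be sketched in a few lines of Python (a toy illustration; no particular format is assumed):

  # Row-major flattening of the 2-by-3 table above: the rows are
  # concatenated, and element (r, c) ends up at offset r * n_cols + c.
  table = [["item 11", "item 12", "item 13"],
           ["item 21", "item 22", "item 23"]]
  n_rows, n_cols = len(table), len(table[0])

  row_oriented = [item for row in table for item in row]

  def offset(r, c):
      return r * n_cols + c   # same-row neighbours are adjacent in memory

  assert row_oriented[offset(1, 0)] == "item 21"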

Column-oriented

In the column-oriented layout, the elements of the table

column 1   column 2   column 3
item 11    item 12    item 13
item 21    item 22    item 23

are stored linearly as

item 11, item 21, item 12, item 22, item 13, item 23

That is, the columns of the table are placed one after the other. In this orientation, values in the same column are close together in memory (i.e. have nearby addresses in an addressable space).
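
The mirror-image sketch of the row-oriented example above (again a toy illustration):

  # Column-major flattening of the same table: the columns are
  # concatenated, and element (r, c) ends up at offset c * n_rows + r.
  table = [["item 11", "item 12", "item 13"],
           ["item 21", "item 22", "item 23"]]
  n_rows, n_cols = len(table), len(table[0])

  column_oriented = [table[r][c] for c in range(n_cols) for r in range(n_rows)]

  def offset(r, c):
      return c * n_rows + r   # same-column neighbours are adjacent in memory

  assert column_oriented[offset(0, 2)] == "item 13"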

Examples

See the list of column-oriented DBMSes for more examples.

Tradeoff

Data orientation is an important architectural decision for systems that handle data, because it results in significant trade-offs in performance and storage. [8] Below are selected dimensions of this trade-off.

Random access

Row-oriented formats benefit from fast random access to rows. Column-oriented formats benefit from fast random access to columns. In both cases, this is the result of fewer page or cache misses when accessing the data. [8]
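
A toy sketch of the access patterns (plain Python, no particular engine assumed): reading a row from the row-oriented buffer is one contiguous slice, while reading the same row from the column-oriented buffer requires strided jumps; the situation is symmetric for columns.

  row_major = ["item 11", "item 12", "item 13", "item 21", "item 22", "item 23"]
  col_major = ["item 11", "item 21", "item 12", "item 22", "item 13", "item 23"]
  n_rows, n_cols = 2, 3

  def row_from_row_major(r):
      return row_major[r * n_cols:(r + 1) * n_cols]   # one contiguous slice

  def row_from_col_major(r):
      return col_major[r::n_rows]                     # n_cols strided reads

  assert row_from_row_major(1) == row_from_col_major(1)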

Insert

Row-oriented formats benefit from fast insertion of a new row. Column-oriented formats benefit from fast insertion of a new column.

This dimension is an important reason why row-oriented formats are more commonly used in online transaction processing (OLTP), as it results in faster transactions than column-oriented formats. [2]
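
A toy sketch of the asymmetry (plain Python containers, no particular database assumed): appending a row is a single write at the end of a row store, but touches every column array in a column store.

  rows = [("alice", 34), ("bob", 41)]          # row store: list of records
  columns = {"name": ["alice", "bob"],         # column store: one array
             "age": [34, 41]}                  # per column

  new_row = ("carol", 29)

  rows.append(new_row)                         # row-oriented: one append

  for name, value in zip(columns, new_row):    # column-oriented: one write
      columns[name].append(value)              # per column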

Conditional access

Row-oriented formats benefit from fast access under a filter (selecting whole rows). Column-oriented formats benefit from fast access under a projection (selecting whole columns). [4] [3]
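
A toy sketch of the two access shapes (the names and data are illustrative): a filter returns whole records, which are contiguous in the row layout, while a projection returns whole columns, which are contiguous in the columnar layout.

  rows = [("alice", 34), ("bob", 41), ("carol", 29)]
  columns = {"name": ["alice", "bob", "carol"], "age": [34, 41, 29]}

  # Filter (e.g. WHERE age > 30): each matching record is one contiguous row.
  over_30 = [row for row in rows if row[1] > 30]

  # Projection (e.g. SELECT age): the requested column is one contiguous array.
  ages = columns["age"]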

Compute performance

Column-oriented formats benefit from fast analytics operations. This is the result of being able to leverage SIMD instructions over contiguous, homogeneous values. [5]
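
As a minimal illustration (NumPy is used here as a stand-in for any vectorised kernel; it is not implied by the cited formats), a contiguous, homogeneous column can be aggregated without stepping over unrelated fields:

  import numpy as np

  age_column = np.array([34, 41, 29, 57, 38], dtype=np.int32)  # contiguous column
  total = age_column.sum()    # vectorised reduction over contiguous memory
  mean = age_column.mean()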

Uncompressed size

Column-oriented formats benefit from a smaller uncompressed size, because this orientation makes it possible to represent certain data types with dedicated encodings. [4] [3]

For example, a table of 128 rows with a boolean column requires 128 bytes in a row-oriented format (one byte per boolean) but only 128 bits (16 bytes) in a column-oriented format (via a bitmap). Another example is the use of run-length encoding to encode a column.
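
Both encodings can be worked in a few lines (numpy.packbits and itertools.groupby are used purely for illustration):

  import numpy as np
  from itertools import groupby

  # Bitmap: 128 booleans stored as one bit each instead of one byte each.
  flags = np.zeros(128, dtype=bool)
  assert flags.astype(np.uint8).nbytes == 128   # one byte per boolean
  assert np.packbits(flags).nbytes == 16        # one bit per boolean

  # Run-length encoding of a column with runs of repeated values.
  country = ["US", "US", "US", "FR", "FR"]
  rle = [(v, len(list(g))) for v, g in groupby(country)]   # [('US', 3), ('FR', 2)]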

Compressed size

Column-oriented formats benefit from a smaller compressed size, because values within a single column are more homogeneous than values across multiple rows. [4] [3]
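
A toy demonstration of the effect (zlib stands in for any general-purpose codec; the data is synthetic): the same bytes, grouped by column rather than interleaved by row, typically compress to a smaller size.

  import random
  import zlib

  random.seed(0)
  names = [random.choice(["alice", "bob", "carol", "dave"]) for _ in range(1000)]
  ages = [str(random.randint(20, 60)) for _ in range(1000)]

  row_layout = "".join(n + a for n, a in zip(names, ages)).encode()
  col_layout = ("".join(names) + "".join(ages)).encode()
  assert len(row_layout) == len(col_layout)     # identical uncompressed size

  print(len(zlib.compress(row_layout)), len(zlib.compress(col_layout)))
  # On data like this, the column-grouped bytes usually compress smaller.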

Conversion and interchange

Because both orientations represent the same data, a row-oriented dataset can be converted to a column-oriented dataset, and vice versa, at the cost of additional computation. In particular, advanced query engines often leverage each orientation's advantages and convert from one orientation to the other as part of their execution; a minimal conversion sketch follows the list below. As an example, an Apache Spark query may:

  1. read data from Apache Parquet (column-oriented)
  2. load it into Spark's internal in-memory format (row-oriented)
  3. convert it to Apache Arrow for a specific computation (column-oriented)
  4. write it to Apache Avro for streaming (row-oriented)
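
The conversion itself is a transposition; below is a minimal pure-Python sketch, independent of the Spark pipeline above (the record data is illustrative):

  rows = [("alice", 34), ("bob", 41), ("carol", 29)]
  column_names = ("name", "age")

  # Row-oriented -> column-oriented: transpose the records.
  columns = dict(zip(column_names, (list(col) for col in zip(*rows))))
  # {'name': ['alice', 'bob', 'carol'], 'age': [34, 41, 29]}

  # Column-oriented -> row-oriented: transpose back (round trip).
  assert list(zip(*(columns[c] for c in column_names))) == rows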

References

  1. "Column-stores vs. row-stores: how different are they really?". SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data. doi:10.1145/1376616.1376712.
  2. "Compacting Transactional Data in Hybrid OLTP&OLAP Databases". Proceedings of the VLDB Endowment. doi:10.14778/2350229.2350258.
  3. "Apache ORC". Retrieved 2024-05-21.
  4. "Apache Parquet". Retrieved 2024-05-21.
  5. "Apache Arrow". Retrieved 2024-05-21.
  6. "Apache Avro". Retrieved 2024-05-21.
  7. "In lieu of swap: Analyzing compressed RAM in Mac OS X and Linux". Digital Investigation. doi:10.1016/j.diin.2014.05.011.
  8. M. Frans Kaashoek, Jerome H. Saltzer. Principles of Computer System Design. ISBN 978-0-12-374957-4.
  9. "Chapter 4 Process Address Space (Linux kernel documentation)". Retrieved 2024-05-21.