RCFile

Within database management systems, the record columnar file [1] or RCFile is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure includes a data storage format, a data compression approach, and optimization techniques for data reading. It is able to meet all four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptivity to dynamic data access patterns.

RCFile is the result of research and collaborative efforts from Facebook, The Ohio State University, and the Institute of Computing Technology at the Chinese Academy of Sciences.

Summary

Data storage format

For example, suppose a table in a database consists of four columns (c1 to c4):

c1   c2   c3   c4
11   12   13   14
21   22   23   24
31   32   33   34
41   42   43   44
51   52   53   54

To serialize the table, RCFile partitions it first horizontally and then vertically, instead of only partitioning the table horizontally as a row-oriented DBMS (row-store) does. The horizontal partitioning first splits the table into multiple row groups based on the row-group size, a user-specified value that determines the size of each row group. For example, the table above can be partitioned into two row groups if the user specifies three rows as the size of each row group.

Row Group 1
c1   c2   c3   c4
11   12   13   14
21   22   23   24
31   32   33   34

Row Group 2
c1   c2   c3   c4
41   42   43   44
51   52   53   54

Then, within every row group, RCFile partitions the data vertically, like a column-store. Thus, the table will be serialized as:

Row Group 1        Row Group 2
11, 21, 31;        41, 51;
12, 22, 32;        42, 52;
13, 23, 33;        43, 53;
14, 24, 34;        44, 54;
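
The following minimal Python sketch (not the actual RCFile implementation; all names are illustrative) shows the horizontal-then-vertical partitioning described above, applied to the example table with a row-group size of three rows.

from typing import Dict, List, Sequence

def to_row_groups(rows: Sequence[Sequence[int]],
                  columns: Sequence[str],
                  row_group_size: int) -> List[Dict[str, List[int]]]:
    """Split rows horizontally into row groups, then lay each group out column by column."""
    row_groups = []
    for start in range(0, len(rows), row_group_size):
        group = rows[start:start + row_group_size]            # horizontal partitioning
        row_groups.append({col: [row[i] for row in group]     # vertical partitioning
                           for i, col in enumerate(columns)})
    return row_groups

table = [[11, 12, 13, 14],
         [21, 22, 23, 24],
         [31, 32, 33, 34],
         [41, 42, 43, 44],
         [51, 52, 53, 54]]

# With a row-group size of 3, the table yields the two row groups shown above:
# [{'c1': [11, 21, 31], 'c2': [12, 22, 32], 'c3': [13, 23, 33], 'c4': [14, 24, 34]},
#  {'c1': [41, 51], 'c2': [42, 52], 'c3': [43, 53], 'c4': [44, 54]}]
print(to_row_groups(table, ["c1", "c2", "c3", "c4"], row_group_size=3))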

Column data compression

Within each row group, columns are compressed to reduce storage space usage. Since the data of a column are stored adjacently, patterns within the column can be detected, and a suitable compression algorithm can be selected to achieve a high compression ratio.
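
As an illustration only, the sketch below compresses each column of a row group independently, so that similar values stored adjacently are compressed together. RCFile itself relies on configurable Hadoop compression codecs; Python's zlib is used here purely for demonstration, and the function names are hypothetical.

import zlib

def compress_row_group(row_group):
    """Compress each column of a row group independently (illustrative, not RCFile's codec)."""
    return {column: zlib.compress(",".join(str(v) for v in values).encode("utf-8"))
            for column, values in row_group.items()}

def read_column(compressed_row_group, column):
    """Decompress a single column without touching the others."""
    return zlib.decompress(compressed_row_group[column]).decode("utf-8").split(",")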

Performance Benefits

A column-store is more efficient when a query requires only a subset of columns, because it reads only the necessary columns from disk, whereas a row-store must read entire rows.

RCFile combines the merits of row-store and column-store via horizontal-vertical partitioning. With horizontal partitioning, RCFile places all columns of a row on a single machine and thus eliminates the extra network costs of reconstructing a row. With vertical partitioning, RCFile reads only the columns a query needs from disk and thus eliminates unnecessary local I/O costs. Moreover, within every row group, data can be compressed using the compression algorithms employed by column-stores.
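
A sketch of the resulting read path, reusing the hypothetical to_row_groups helper from the earlier example: a query that projects a subset of columns touches only those columns in each row group, while the other columns are never read.

def project(row_groups, wanted_columns):
    """Yield tuples containing only the requested columns of each row group."""
    for group in row_groups:
        selected = {c: group[c] for c in wanted_columns}   # untouched columns are skipped
        n_rows = len(next(iter(selected.values())))
        for i in range(n_rows):
            yield tuple(selected[c][i] for c in wanted_columns)

# e.g. list(project(to_row_groups(table, ["c1", "c2", "c3", "c4"], 3), ["c1", "c4"]))
# -> [(11, 14), (21, 24), (31, 34), (41, 44), (51, 54)]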

For example, a database might have this table:

EmpId   Lastname   Firstname   Salary
10      Smith      Joe         40000
12      Jones      Mary        50000
11      Johnson    Cathy       44000
22      Jones      Bob         55000

This simple table includes an employee identifier (EmpId), name fields (Lastname and Firstname) and a salary (Salary). This two-dimensional format exists only in theory; in practice, storage hardware requires the data to be serialized into one form or another.

In MapReduce-based systems, data is normally stored on a distributed file system, such as the Hadoop Distributed File System (HDFS), and different data blocks might be stored on different machines. Thus, for a column-store on MapReduce, different groups of columns might be stored on different machines, which introduces extra network costs when a query projects columns placed on different machines. For MapReduce-based systems, the merit of a row-store is that there are no extra network costs to construct a row during query processing, and the merit of a column-store is that there are no unnecessary local I/O costs when reading data from disk.

Row-oriented systems

The common solution to the storage problem is to serialize each row of data, like this:

001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;003:11,Johnson,Cathy,44000;004:22,Jones,Bob,55000;
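
For illustration, here is a short Python sketch that produces exactly this row-wise serialization from the example records; the record-number prefix format is taken from the string above.

rows = [(10, "Smith", "Joe", 40000),
        (12, "Jones", "Mary", 50000),
        (11, "Johnson", "Cathy", 44000),
        (22, "Jones", "Bob", 55000)]

# Write each record in full, prefixed with its record number.
row_serialized = "".join(
    f"{i + 1:03d}:" + ",".join(str(field) for field in row) + ";"
    for i, row in enumerate(rows))
# -> 001:10,Smith,Joe,40000;002:12,Jones,Mary,50000;...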

Row-based systems are designed to efficiently return data for an entire row, or an entire record, in as few operations as possible. This matches use-cases where the system is attempting to retrieve all the information about a particular object, say the full information about one contact in a rolodex system, or the complete information about one product in an online shopping system.

Row-based systems are not efficient at performing operations that apply to the entire data set, as opposed to a specific record. For instance, in order to find all the records in the example table that have salaries between 40,000 and 50,000, the row-based system would have to seek through the entire data set looking for matching records. While the example table shown above may fit in a single disk block, a table with even a few hundred rows would not; therefore, multiple disk operations would be needed to retrieve the data.

Column-oriented systems

A column-oriented system serializes all of the values of a column together, then the values of the next column. For our example table, the data would be stored in this fashion:

10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;Joe:001,Mary:002,Cathy:003,Bob:004;40000:001,50000:002,44000:003,55000:004;
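
Continuing the sketch above, the same records serialized column by column, with each value tagged with the record number it belongs to:

# Transpose the records so each tuple holds one column's values.
columns = list(zip(*rows))

column_serialized = "".join(
    ",".join(f"{value}:{i + 1:03d}" for i, value in enumerate(column)) + ";"
    for column in columns)
# -> 10:001,12:002,11:003,22:004;Smith:001,Jones:002,Johnson:003,Jones:004;...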

The difference can be more clearly seen in this common modification:

...;Smith:001,Jones:002,004,Johnson:003;...

Two of the records store the same value, "Jones", therefore it is now possible to store this in the column-oriented system only once instead of twice. For many common searches, like "find all the people with the last name Jones", the answer can now be retrieved in a single operation.
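
A minimal sketch of that deduplication: grouping record numbers by value stores "Jones" only once and turns the "find all the people with the last name Jones" search into a single lookup. The dictionary-based index here is illustrative, not any particular system's on-disk format.

from collections import defaultdict

lastnames = ["Smith", "Jones", "Johnson", "Jones"]

index = defaultdict(list)
for record_no, value in enumerate(lastnames, start=1):
    index[value].append(record_no)

print(dict(index))     # {'Smith': [1], 'Jones': [2, 4], 'Johnson': [3]}
print(index["Jones"])  # [2, 4] -- all "Jones" records in one lookup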

Whether or not a column-oriented system will be more efficient in operation depends heavily on the operations being automated. Operations that retrieve data for whole objects would be slower, requiring numerous disk operations to assemble data from different columns into a whole-row record. However, such whole-row operations are generally rare. In the majority of cases, only a limited subset of data is retrieved. In a rolodex application, for instance, operations that collect the first and last names from many rows to build a list of contacts are far more common than operations that read all the data for a single home address.

Adoption

RCFile has been adopted in real-world systems for big data analytics.

  1. RCFile became the default data placement structure in Facebook's production Hadoop cluster. [2] By 2010 it was the world's largest Hadoop cluster, [3] to which 40 terabytes of compressed data sets are added every day. [4] In addition, all the data sets stored in HDFS before RCFile have also been transformed to use RCFile. [2]
  2. RCFile has been adopted in Apache Hive (since v0.4), [5] which is an open-source data warehouse system running on top of Hadoop and is widely used in various companies around the world, [6] including several Internet services, such as Facebook, Taobao, and Netflix. [7]
  3. RCFile has been adopted in Apache Pig (since v0.7), [8] which is another open-source data processing system widely used in many organizations, [9] including several major Web service providers, such as Twitter, Yahoo, LinkedIn, AOL, and Salesforce.com.
  4. RCFile became the de facto standard data storage structure in the Hadoop software environment supported by the Apache HCatalog project (formerly known as Howl [10] ), which is the table and storage management service for Hadoop. [11] RCFile is also supported by the open-source Elephant Bird library used at Twitter for daily data analytics. [12]

Over the following years, other Hadoop data formats also became popular. In February 2013, an Optimized Row Columnar (ORC) file format was announced by Hortonworks. [13] A month later, the Apache Parquet format was announced, developed by Cloudera and Twitter. [14]

Related Research Articles

A column-oriented DBMS or columnar DBMS is a database management system (DBMS) that stores data tables by column rather than by row. Benefits include more efficient access to data when only querying a subset of columns, and more options for data compression. However, they are typically less efficient for inserting new data.

SAP IQ is a column-based, petabyte scale, relational database software system used for business intelligence, data warehousing, and data marts. Produced by Sybase Inc., now an SAP company, its primary function is to analyze large amounts of data in a low-cost, highly available environment. SAP IQ is often credited with pioneering the commercialization of column-store technology.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

Cassandra is a free and open-source, distributed, wide-column store, NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. Cassandra offers support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low latency operations for all clients. Cassandra was designed to implement a combination of Amazon's Dynamo distributed storage and replication techniques combined with Google's Bigtable data and storage engine model.

Vertica is an analytic database management software company. Vertica was founded in 2005 by the database researcher Michael Stonebraker with Andrew Palmer as the founding CEO. Ralph Breslauer and Christopher P. Lynch served as CEOs later on.

Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark. Pig Latin abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to that of SQL for relational database management systems. Pig Latin can be extended using user-defined functions (UDFs) which the user can write in Java, Python, JavaScript, Ruby or Groovy and then call directly from the language.

Apache Hive is a data warehouse software project, built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data. Hive provides the necessary SQL abstraction to integrate SQL-like queries into the underlying Java without the need to implement queries in the low-level Java API. Since most data warehousing applications work with SQL-based querying languages, Hive aids the portability of SQL-based applications to Hadoop. While initially developed by Facebook, Apache Hive is used and developed by other companies such as Netflix and the Financial Industry Regulatory Authority (FINRA). Amazon maintains a software fork of Apache Hive included in Amazon Elastic MapReduce on Amazon Web Services.

SingleStore is a proprietary, cloud-native database designed for data-intensive applications. A distributed, relational, SQL database management system (RDBMS) that features ANSI SQL support, it is known for speed in data ingest, transaction processing, and query processing.

Oracle NoSQL Database is a NoSQL-type distributed key-value database from Oracle Corporation. It provides transactional semantics for data manipulation, horizontal scalability, and simple administration and monitoring.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

A wide-column store is a column-oriented DBMS and therefore a special type of NoSQL database. It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a two-dimensional key–value store. Google's Bigtable is one of the prototypical examples of a wide-column store.

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

ClickHouse is an open-source column-oriented DBMS for online analytical processing (OLAP) that allows users to generate analytical reports using SQL queries in real-time. ClickHouse Inc. is headquartered in the San Francisco Bay Area with the subsidiary, ClickHouse B.V., based in Amsterdam, Netherlands.

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.

Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem such as RCFile and Parquet. It is used by most of the major data processing frameworks, including Apache Spark, Apache Hive, Apache Flink and Apache Hadoop.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Trino is an open-source distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Trino can query data lakes that contain open column-oriented data file formats like ORC or Parquet residing on different storage systems like HDFS, AWS S3, Google Cloud Storage, or Azure Blob Storage using the Hive and Iceberg table formats. Trino also has the ability to run federated queries that query tables in different data sources such as MySQL, PostgreSQL, Cassandra, Kafka, MongoDB and Elasticsearch. Trino is released under the Apache License.

Apache IoTDB is a column-oriented, open-source, time-series database (TSDB) management system written in Java. It has both edge and cloud versions, provides an optimized columnar file format for efficient time-series data storage, and offers a TSDB with a high ingestion rate, low-latency queries and data analysis support. It is specially optimized for time-series oriented operations such as aggregation queries, downsampling and sub-sequence similarity search. The name IoTDB comes from Internet of Things (IoT) Database, meaning it was designed as an IoT-native TSDB that resolves the pain points of typical IoT scenarios, including massive data generation, high-frequency sampling, out-of-order data, specific analytics requirements, high costs of storage and operation and maintenance, and the low computational power of IoT devices.

References

  1. Yongqiang He, Rubao Lee, Yin Huai, Zheng Shao, Namit Jain, Xiaodong Zhang, and Zhiwei Xu (2011). "RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems". IEEE 27th International Conference on Data Engineering. pp. 1200–1208.
  2. "Hive integration: HBase and Rcfile__HadoopSummit2010". 2010-06-30.
  3. "Facebook has the world's largest Hadoop cluster!". 2010-05-09.
  4. "Apache Hadoop India Summit 2011 talk "Hive Evolution" by Namit Jain". 2011-02-24.
  5. "Class RCFile". Archived from the original on 2011-11-23. Retrieved 2012-07-21.
  6. "PoweredBy - Apache Hive - Apache Software Foundation".
  7. "Hive user group presentation from Netflix (3/18/2010)". 2010-03-19.
  8. "HiveRCInputFormat (Pig 0.17.0 API)".
  9. "PoweredBy - Apache Pig - Apache Software Foundation".
  10. Howl
  11. "HCatalog". Archived from the original on 2012-07-20. Retrieved 2012-07-21.
  12. "Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.: Kevinweil/elephant-bird". GitHub . 2018-12-15.
  13. Alan Gates (February 20, 2013). "The Stinger Initiative: Making Apache Hive 100 Times Faster". Hortonworks blog. Retrieved May 4, 2017.
  14. Justin Kestelyn (March 13, 2013). "Introducing Parquet: Efficient Columnar Storage for Apache Hadoop". Cloudera blog. Retrieved May 4, 2017.