Apache Pinot

Last updated
Apache Pinot
Original author(s)
  • Kishore Gopalakrishna
  • Xiang Fu
Developer(s) Apache Pinot
Stable release
1.2.0 / 21 August 2024;5 months ago (2024-08-21)
Repository Pinot repository
Written in Java
Operating system Cross-platform
Type
License Apache License 2.0
Website pinot.apache.org

Apache Pinot is a column-oriented, open-source, distributed data store written in Java. Pinot is designed to execute OLAP queries with low latency. [1] [2] [3] [4] [5] It is suited in contexts where fast analytics, such as aggregations, are needed on immutable data, possibly, with real-time data ingestion. [6] [7] [8] The name Pinot comes from the Pinot grape vines that are pressed into liquid that is used to produce a variety of different wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of different file formats or streaming data sources. [9]

Contents

Pinot was first created at LinkedIn after the engineering staff determined that there were no off the shelf solutions that met the social networking site's requirements like predictable low latency, data freshness in seconds, fault tolerance and scalability. [9] [10] Pinot is used in production by technology companies such as Uber, [11] Microsoft, [8] and Factual.

History

Pinot was started as an internal project at LinkedIn in 2013 to power a variety of user-facing and business-facing products. The first analytics product at LinkedIn to use Pinot was a redesign of the social networking site's feature that allows members to see who has viewed their profile in real-time. The project was open-sourced in June 2015 under an Apache 2.0 license and was donated to the Apache Software Foundation by LinkedIn in June 2019. [9] [8]

Architecture

Architecture diagram of Apache Pinot Pinot Architecture.png
Architecture diagram of Apache Pinot

Pinot uses Apache Helix for cluster management. Helix is embedded as an agent within the different components and uses Apache ZooKeeper for coordination and maintaining the overall cluster state and health. All Pinot servers and brokers are managed by Helix. Helix is a generic cluster management framework to manage partitions and replicas in a distributed system.

Query management

Queries are received by brokers—which checks the request against the segment-to-server routing table—scattering the request between real-time and offline servers.

Cluster management

Pinot leverages Apache Helix for cluster management. Helix is a cluster management framework to manage replicated, partitioned resources in a distributed system. Helix uses Zookeeper to store cluster state and metadata.

Features

Pinot shares similar features with comparable OLAP datastores, such as Apache Druid. [12] [13] Like Druid, Pinot is a column-oriented database with various compression schemes such as Run Length and Fixed-Bit Length. Pinot supports pluggable indexing technologies - Sorted Index, Bitmap Index, Inverted Index, Star-Tree Index, and Range Index, which are what primarily differentiates Pinot from other OLAP datastores.

Pinot supports near real-time ingestion from streams such as Kafka, AWS Kinesis and batch ingestion from sources such as Hadoop, S3, Azure, GCS. Like most other OLAP datastores and data warehousing solutions, Pinot supports a SQL-like query language that supports selection, aggregation, filtering, group by, order by, distinct queries on data.

See also

Related Research Articles

<span class="mw-page-title-main">IBM Db2</span> Relational model database server

Db2 is a family of data management products, including database servers, developed by IBM. It initially supported the relational model, but was extended to support object–relational features and non-relational structures like JSON and XML. The brand name was originally styled as DB2 until 2017, when it changed to its present form. In the early days, it was sometimes wrongly styled as DB/2 in a false derivation from the operating system OS/2.

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries. The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP). OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining. Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

<span class="mw-page-title-main">R-tree</span> Data structures used in spatial indexing

R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons. The R-tree was proposed by Antonin Guttman in 1984 and has found significant use in both theoretical and applied contexts. A common real-world usage for an R-tree might be to store spatial objects such as restaurant locations or the polygons that typical maps are made of: streets, buildings, outlines of lakes, coastlines, etc. and then find answers quickly to queries such as "Find all museums within 2 km of my current location", "retrieve all road segments within 2 km of my location" or "find the nearest gas station". The R-tree can also accelerate nearest neighbor search for various distance metrics, including great-circle distance.

<span class="mw-page-title-main">Vertica</span> Software company

Vertica is an analytic database management software company. Vertica was founded in 2005 by the database researcher Michael Stonebraker with Andrew Palmer as the founding CEO. Ralph Breslauer and Christopher P. Lynch served as CEOs later on.

The following tables compare general and technical information for a number of online analytical processing (OLAP) servers. Please see the individual products articles for further information.

<span class="mw-page-title-main">Pentaho</span> Business intelligence software

Pentaho is the brand name for several Data Management software products that make up the Pentaho+ Data Platform. These include Pentaho Data Integration, Pentaho Business Analytics, Pentaho Data Catalog, and Pentaho Data Optimiser.

<span class="mw-page-title-main">Apache ZooKeeper</span> System for distributed coordination

Apache ZooKeeper is an open-source server for highly reliable distributed coordination of cloud applications. It is a project of the Apache Software Foundation.

<span class="mw-page-title-main">Apache Hive</span> Database engine

Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.

Data-intensive computing is a class of parallel computing applications which use a data parallel approach to process large volumes of data typically terabytes or petabytes in size and typically referred to as big data. Computing applications that devote most of their execution time to computational requirements are deemed compute-intensive, whereas applications are deemed data-intensive if they require large volumes of data and devote most of their processing time to input/output and manipulation of data.

HPCC, also known as DAS, is an open source, data-intensive computing system platform developed by LexisNexis Risk Solutions. The HPCC platform incorporates a software architecture implemented on commodity computing clusters to provide high-performance, data-parallel processing for applications utilizing big data. The HPCC platform includes system configurations to support both parallel batch data processing (Thor) and high-performance online query applications using indexed data files (Roxie). The HPCC platform also includes a data-centric declarative programming language for parallel data processing called ECL.

A cloud database is a database that typically runs on a cloud computing platform and access to the database is provided as-a-service. There are two common deployment models: users can run databases on the cloud independently, using a virtual machine image, or they can purchase access to a database service, maintained by a cloud database provider. Of the databases available on the cloud, some are SQL-based and some use a NoSQL data model.

NewSQL is a class of relational database management systems that seek to provide the scalability of NoSQL systems for online transaction processing (OLTP) workloads while maintaining the ACID guarantees of a traditional database system.

Druid is a column-oriented, open-source, distributed data store written in Java. Druid is designed to quickly ingest massive quantities of event data, and provide low-latency queries on top of the data. The name Druid comes from the shapeshifting Druid class in many role-playing games, to reflect that the architecture of the system can shift to solve different types of data problems.

<span class="mw-page-title-main">Apache Spark</span> Open-source data analytics cluster computing framework

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

<span class="mw-page-title-main">Martin L. Kersten</span> Dutch computer scientist (born 1953)

Martin L. Kersten was a computer scientist with research focus on database architectures, query optimization and their use in scientific databases. He was an architect of the MonetDB system, an open-source column store for data warehouses, online analytical processing (OLAP) and geographic information systems (GIS). He has been (co-) founder of several successful spin-offs of the Centrum Wiskunde & Informatica (CWI).

<span class="mw-page-title-main">Apache Kylin</span> Open-source distributed analytics engine

Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.

Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. He is best known for his work on Apache Spark, a leading open-source Big Data project. He was designer and lead developer of the GraphX, Project Tungsten, and Structured Streaming components and he co-designed DataFrames, all of which are part of the core Apache Spark distribution; he also served as the release manager for Spark's 2.0 release.

Azure Data Explorer is a fully-managed big data analytics cloud platform and data-exploration service, developed by Microsoft, that ingests structured, semi-structured and unstructured data. The service then stores this data and answers analytic ad hoc queries on it with seconds of latency. It is a full-text indexing and retrieval database, including time series analysis capabilities and regular expression evaluation and text parsing.

References

  1. Cui, Tingting; Peng, Lijun; Pardoe, David; Liu, Kun; Agarwal, Deepak; Kumar, Deepak (14 August 2017). "Data-Driven Reserve Prices for Social Advertising Auctions at LinkedIn". Proceedings of the ADKDD'17. Association for Computing Machinery. pp. 1–7. doi:10.1145/3124749.3124759. ISBN   9781450351942. S2CID   12327343.
  2. Rosa, Marcello La (2021). ADVANCED INFORMATION SYSTEMS ENGINEERING: 33rd International Conference. Springer Nature. ISBN   978-3-030-79382-1.
  3. Chin, Francis Y. L.; Chen, C. L. Philip; Khan, Latifur; Lee, Kisung; Zhang, Liang-Jie (20 June 2018). Big Data – BigData 2018: 7th International Congress, Held as Part of the Services Conference Federation, SCF 2018, Seattle, WA, USA, June 25–30, 2018, Proceedings. Springer. p. 153. ISBN   978-3-319-94301-5.
  4. Im, Jean-François; Gopalakrishna, Kishore; Subramaniam, Subbu; Shrivastava, Mayank; Tumbde, Adwait; Jiang, Xiaotian; Dai, Jennifer; Lee, Seunghyun; Pawar, Neha; Li, Jialiang; Aringunram, Ravi (2018-05-27). "Pinot: Realtime OLAP for 530 Million Users". Proceedings of the 2018 International Conference on Management of Data. Sigmod '18. Association for Computing Machinery. pp. 583–594. doi:10.1145/3183713.3190661. ISBN   9781450347037. S2CID   44083085.
  5. "The Apache Software Foundation Announces Apache® Pinot™ as a Top-Level Project". blogs.apache.org. 2 August 2021.
  6. Rogers, Ryan; Subramaniam, Subbu; Peng, Sean; Durfee, David; Lee, Seunghyun; Kancha, Santosh Kumar; Sahay, Shraddha; Ahammad, Parvez (16 November 2020). "LinkedIn's Audience Engagements API: A Privacy Preserving Data Analytics System at Scale". arXiv: 2002.05839 [cs.CR].
  7. Javadi, Seyyed Ahmad; Gupta, Harsh; Manhas, Robin; Sahu, Shweta; Gandhi, Anshul (July 2018). "EASY: Efficient Segment Assignment Strategy for Reducing Tail Latencies in Pinot". 2018 IEEE 38th International Conference on Distributed Computing Systems (ICDCS). pp. 1432–1437. doi:10.1109/ICDCS.2018.00144. ISBN   978-1-5386-6871-9. S2CID   21659844.
  8. 1 2 3 Pawar, Neha. "Pinot Joins Apache Incubator" Archived 2019-04-02 at the Wayback Machine , LinkedIn Engineering, 01 April 2019
  9. 1 2 3 Gopalakrishna, Kishore. "Open Sourcing Pinot: Scaling the Wall of Real-Time Analytics". engineering.linkedin.com. LinkedIn. Archived from the original on 10 September 2015. Retrieved 3 September 2020.
  10. Yegulalp, Serdar (2015-06-11). "LinkedIn fills another SQL-on-Hadoop niche". InfoWorld.
  11. Fu, Yupeng; Soman, Chinmay (9 June 2021). "Real-time Data Infrastructure at Uber". Proceedings of the 2021 International Conference on Management of Data. Sigmod/Pods '21. Association for Computing Machinery. pp. 2503–2516. arXiv: 2104.00087 . doi:10.1145/3448016.3457552. ISBN   9781450383431. S2CID   232478317.
  12. Ordonez, Carlos; Song, Il-Yeol; Anderst-Kotsis, Gabriele; Tjoa, A. Min; Khalil, Ismail (2 October 2019). Big Data Analytics and Knowledge Discovery: 21st International Conference, DaWaK 2019, Linz, Austria, August 26–29, 2019, Proceedings. Springer. p. 170. ISBN   978-3-030-27520-4.
  13. Uttamchandani, Sandeep (10 September 2020). The Self-Service Data Roadmap. "O'Reilly Media, Inc.". ISBN   978-1-4920-7520-2.