Apache Kylin

Developer(s)	Apache Kylin Committee
Initial release	June 10, 2015
Stable release
3.x	3.1.3 / 5 January 2022
4.x	4.0.1 / 5 January 2022
Repository	Kylin Repository
Written in	Java
License	Apache License 2.0
Website	kylin.apache.org

Last updated December 23, 2023

Apache Kylin is an open-source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio, supporting extremely large datasets.
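
As a rough illustration of the SQL interface, a client might submit a query over HTTP. The sketch below only builds the request and does not send it. The endpoint path /kylin/api/query and the JSON fields sql, project, offset and limit are taken from the REST API described in Kylin's documentation for the 2.x and 3.x lines; the host, port, project name, table, column names and credentials are assumptions for illustration and should be checked against the actual deployment.

```python
import base64
import json
import urllib.request


def build_query_request(base_url, project, sql, user, password, limit=100):
    """Build (but do not send) a POST request carrying one SQL statement."""
    payload = {"sql": sql, "project": project, "offset": 0, "limit": limit}
    # HTTP Basic authentication: base64 of "user:password".
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/kylin/api/query",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_query_request(
    base_url="http://localhost:7070",  # assumed host and port
    project="learn_kylin",             # assumed sample project name
    sql="SELECT part_dt, SUM(price) FROM kylin_sales GROUP BY part_dt",  # assumed table and columns
    user="ADMIN",                      # assumed credentials
    password="KYLIN",
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request),
# which requires a reachable Kylin instance.
```

Building the request separately from sending it keeps the example runnable without a cluster and makes the payload easy to inspect.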

History

The Kylin project was started in 2013 at eBay's R&D in Shanghai, China. In October 2014, Kylin v0.6 was open-sourced on github.com under the name "KylinOLAP".^[4]

In November 2014, Kylin joined the Apache Software Foundation incubator.

In December 2015, Apache Kylin graduated to become a Top-Level Project.^[3]

In March 2016, Kyligence, Inc. was founded by the creators of Apache Kylin.^[5]^[6] Kyligence provides a commercial analytics platform based on Apache Kylin for on-premises and cloud-based datasets.^[7]

Architecture

Apache Kylin is built on top of Apache Hadoop, Apache Hive, Apache HBase, Apache Parquet, Apache Calcite, Apache Spark and other technologies.^[8] These technologies enable Kylin to easily scale to support massive data loads.^[9]

Kylin has the following core components:^[10]^[8]

REST Server: Receives and responds to user or API requests;
Metadata: Persists and manages system metadata, especially cube metadata;
Query Engine: Parses SQL queries into execution plans and then communicates with the storage engine;
Storage Engine: Pushes down and scans the underlying cube storage (HBase by default);
Job Engine: Generates and executes MapReduce or Spark jobs that build source data into cubes.
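
The division of labour between the Job Engine (which builds cubes ahead of time) and the Query Engine (which answers from them) can be illustrated with a toy sketch. It is not Kylin's implementation: it keeps everything in memory, computes all 2^N dimension combinations without pruning, and supports only SUM. The rows, dimension names and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: three dimensions and one measure.
ROWS = [
    {"region": "APAC", "category": "books", "year": 2015, "price": 10.0},
    {"region": "APAC", "category": "toys", "year": 2015, "price": 5.0},
    {"region": "EMEA", "category": "books", "year": 2016, "price": 7.5},
    {"region": "APAC", "category": "books", "year": 2016, "price": 2.5},
]
DIMENSIONS = ("region", "category", "year")
MEASURE = "price"


def build_cube(rows, dimensions, measure):
    """Precompute SUM(measure) for every subset of dimensions (2^N cuboids)."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def query(cube, dimensions, group_by, filters=None):
    """Answer SUM(measure) GROUP BY group_by WHERE filters, using one cuboid."""
    filters = filters or {}
    needed = set(group_by) | set(filters)
    # Keep the canonical dimension order so the tuple matches a cuboid key.
    dims = tuple(d for d in dimensions if d in needed)
    result = defaultdict(float)
    for key, total in cube[dims].items():
        values = dict(zip(dims, key))
        if all(values[d] == v for d, v in filters.items()):
            result[tuple(values[d] for d in group_by)] += total
    return dict(result)


cube = build_cube(ROWS, DIMENSIONS, MEASURE)
by_region = query(cube, DIMENSIONS, ["region"])
books_2016 = query(cube, DIMENSIONS, ["category"], {"year": 2016})
```

The point of the sketch is that a query scans a small precomputed aggregate rather than the raw rows; the trade-off is the build time and storage spent on the cuboids.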

Users

Apache Kylin has been adopted by many companies as their OLAP platform in production. Typical users include eBay, Meituan, Xiaomi, NetEase, Beike, and Yahoo! Japan.

Roadmap

Apache Kylin roadmap (from Kylin website^[11]):

Hadoop 3.0 support (Erasure Coding) - completed (v2.5)
Fully on Spark Cube engine - completed (v2.5)
Connect more data sources (MySQL, Oracle, SparkSQL, etc.) - completed (v2.6)
Real-time analytics with Lambda Architecture - completed (v3.0)
Cloud-native storage (Parquet) - in progress (v4.0.0-alpha)
Ad hoc queries without Cubing

References

[1] "Previous Release". v0.7.1-incubating (First Apache Release). Retrieved 15 June 2019.
[2] "Apache Kylin - Release Notes". Retrieved 27 September 2022.
[3] Apache Software Foundation. "The Apache Software Foundation Announces Apache Kylin as a Top-Level Project". 8 December 2015.
[4] "Announcing Kylin: Extreme OLAP Engine for Big Data". www.ebayinc.com. 20 October 2014. Retrieved 8 November 2018.
[5] "Apache Kylin Through the Eyes of the Founders - Part One". Kyligence. 12 June 2020. Retrieved 30 September 2020.
[6] "Big Data Analytics Platform | Learn More About Kyligence". Kyligence. Retrieved 30 September 2020.
[7] "Big Data Analytics Platform: Apache Kylin vs. Kyligence". Kyligence. Retrieved 30 September 2020.
[8] "Apache Kylin | Analytical Data Warehouse for Big Data". kylin.apache.org. Retrieved 30 September 2020.
[9] Knorr, Eric (7 March 2016). "What eBay looks like under the hood". InfoWorld. Retrieved 30 September 2020.
[10] "Apache Kylin Adds Real-time OLAP". www.i-programmer.info. Retrieved 30 September 2020.
[11] Apache Kylin. "Apache Kylin | Development Quick Guide". kylin.apache.org. Retrieved 30 September 2020.
