Original author(s) | Haoyuan Li |
---|---|
Developer(s) | UC Berkeley AMPLab |
Initial release | April 8, 2013 |
Stable release | v2.9.0 / November 16, 2022 [1] |
Repository | https://github.com/Alluxio/alluxio |
Written in | Java |
Operating system | macOS, Linux |
Available in | Java |
License | Apache License 2.0 |
Website | www |
Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis, [2] advised by Professor Scott Shenker & Professor Ion Stoica. Alluxio sits between computation and storage in the big data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface. The software is published under the Apache License.
Data Driven Applications, such as Data Analytics, Machine Learning, and AI, use APIs (such as Hadoop HDFS API, S3 API, FUSE API) provided by Alluxio to interact with data from various storage systems at a fast speed. Popular frameworks running on top of Alluxio include Apache Spark, Presto, TensorFlow, Trino, Apache Hive, and PyTorch, etc.
Alluxio can be deployed on-premise, in the cloud (e.g. Microsoft Azure, AWS, Google Compute Engine), or a hybrid cloud environment. It can run on bare-metal or in a containerized environments such as Kubernetes, Docker, Apache Mesos.
Alluxio was initially started by Haoyuan Li at UC Berkeley's AMPLab in 2013, and open sourced in 2014. Alluxio had in excess of 1000 contributors in 2018, [3] making it one of the most active projects in the data eco-system.
Version | Original release date | Latest version | Release date |
---|---|---|---|
0.2 | 2013-04-08 | 0.2.1 | 2013-04-25 |
0.3 | 2013-10-21 | 0.3.0 | 2013-10-21 |
0.4 | 2014-02-02 | 0.4.1 | 2014-02-25 |
0.5 | 2014-07-20 | 0.5.0 | 2014-07-20 |
0.6 | 2015-03-01 | 0.6.4 | 2015-04-23 |
0.7 | 2015-07-17 | 0.7.1 | 2015-08-10 |
0.8 | 2015-10-21 | 0.8.2 | 2015-11-10 |
1.0 | 2016-02-23 | 1.0.1 | 2016-03-27 |
1.1 | 2016-06-06 | 1.1.1 | 2016-07-04 |
1.2 | 2016-07-17 | 1.2.0 | 2016-07-17 |
1.3 | 2016-10-05 | 1.3.0 | 2016-10-05 |
1.4 | 2017-01-12 | 1.4.0 | 2017-01-12 |
1.5 | 2017-06-11 | 1.5.0 | 2017-06-11 |
1.6 | 2017-09-24 | 1.6.1 | 2017-11-02 |
1.7 | 2018-01-14 | 1.7.1 | 2018-03-26 |
1.8 | 2018-07-07 | 1.8.2 | 2019-08-05 |
2.0 | 2019-06-27 | 2.0.1 | 2019-09-03 |
2.1 | 2019-11-06 | 2.1.2 | 2020-02-04 |
2.2 | 2020-03-11 | 2.2.2 | 2020-06-24 |
2.3 | 2020-06-30 | 2.3.0 | 2020-06-30 |
2.4 | 2020-10-19 | 2.4.1 | 2020-11-20 |
2.5 | 2021-03-10 | 2.5.0 | 2021-03-10 |
2.6 | 2021-06-23 | 2.6.2 | 2021-09-17 |
2.7 | 2021-11-16 | 2.7.4 | 2022-04-19 |
2.8 | 2022-05-04 | 2.8.1 | 2022-08-17 |
2.9 | 2022-11-16 | 2.9.3 | 2023-03-27 |
Legend: Old version Older version, still maintained Latest version Latest preview version |
The following is a list of notable enterprises that have used or are using Alluxio:
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
The following tables compare general and technical information for a number of online analytical processing (OLAP) servers. Please see the individual products articles for further information.
Ion Stoica is a Romanian–American computer scientist specializing in distributed systems, cloud computing and computer networking. He is a professor of computer science at the University of California, Berkeley and co-director of AMPLab. He co-founded Conviva and Databricks with other original developers of Apache Spark.
In computing, a distributed file system (DFS) or network file system is any file system that allows access to files from multiple hosts sharing via a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources.
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Drill is an Apache top-level project. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.
Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."
In computing, Hazelcast is a unified real-time data platform based on Java that combines a fast data store with stream processing. It is also the name of the company developing the product. The Hazelcast company is funded by venture capital and headquartered in Palo Alto, California.
Matei Zaharia is a Romanian-Canadian computer scientist, educator and the creator of Apache Spark.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.
Databricks, Inc. is an American software company founded by the creators of Apache Spark. Databricks develops a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks. The company develops Delta Lake, an open-source project to bring reliability to data lakes for machine learning and other data science use cases.
Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.
AMPLAB was a University of California, Berkeley lab focused on big data analytics located in Soda Hall. The name stands for the Algorithms, Machines and People Lab. It has been publishing papers since 2008 and was officially launched in 2011. The AMPLab was co-directed by Professor Michael J. Franklin, Michael I. Jordan, and Ion Stoica.
Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. He is best known for his work on Apache Spark, a leading open-source Big Data project. He was designer and lead developer of the GraphX, Project Tungsten, and Structured Streaming components and he co-designed DataFrames, all of which are part of the core Apache Spark distribution; he also served as the release manager for Spark's 2.0 release.
Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.
Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
Haoyuan (H.Y.) Li is a computer scientist and entrepreneur specializing in distributed systems, big data, and cloud computing. He is best known for proposing Virtual Distributed File System (VDFS), and creating an open-source data orchestration system, Alluxio. He is the Founder, Chairman, and CEO of Alluxio, Inc, a company commercializing the Alluxio Data Orchestration Technology. He is also an adjunct professor at Peking University. He is a frequent speaker on the topic of AI, Big Data, Cloud Computing, and Open Source at conferences.
Apache Pinot is a column-oriented, open-source, distributed data store written in Java. Pinot is designed to execute OLAP queries with low latency. It is suited in contexts where fast analytics, such as aggregations, are needed on immutable data, possibly, with real-time data ingestion. The name Pinot comes from the Pinot grape vines that are pressed into liquid that is used to produce a variety of different wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of different file formats or streaming data sources.