Alluxio

Original author(s): Haoyuan Li
Developer(s): UC Berkeley AMPLab
Initial release: April 8, 2013
Stable release: v2.9.0 / November 16, 2022 [1]
Repository: https://github.com/Alluxio/alluxio
Written in: Java
Operating system: macOS, Linux
Available in: Java
License: Apache License 2.0
Website: www.alluxio.io

Alluxio is an open-source virtual distributed file system (VDFS). Initially a research project called "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. thesis,[2] advised by Professors Scott Shenker and Ion Stoica. Alluxio sits between computation and storage in the big data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface. The software is published under the Apache License.

Data-driven applications, such as data analytics, machine learning, and AI, use APIs provided by Alluxio (such as the Hadoop HDFS API, the S3 API, and the FUSE API) to interact with data from various storage systems at high speed. Popular frameworks running on top of Alluxio include Apache Spark, Presto, Trino, Apache Hive, TensorFlow, and PyTorch.
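To illustrate the idea of a common interface over multiple storage systems, the following is a toy sketch (plain Python, not Alluxio's actual API): a unified namespace routes logical paths to different mounted backing stores through a single read interface, with dicts standing in for S3 or HDFS backends.

```python
class UnifiedNamespace:
    """Toy sketch of a unified namespace: one read() interface in front of
    several mounted backing stores (plain dicts stand in for S3, HDFS, ...)."""

    def __init__(self):
        self.mounts = {}  # mount prefix -> backing store

    def mount(self, prefix, store):
        self.mounts[prefix] = store

    def read(self, path):
        # Longest-prefix match picks the backing store, as in a mount table.
        for prefix in sorted(self.mounts, key=len, reverse=True):
            if path.startswith(prefix):
                key = path[len(prefix):]
                store = self.mounts[prefix]
                if key in store:
                    return store[key]
        raise FileNotFoundError(path)


ns = UnifiedNamespace()
ns.mount("/s3/", {"logs/a.txt": b"from-s3"})      # stand-in for an S3 bucket
ns.mount("/hdfs/", {"tables/t1": b"from-hdfs"})   # stand-in for an HDFS cluster
print(ns.read("/s3/logs/a.txt"))
print(ns.read("/hdfs/tables/t1"))
```

In the real system, an application reads one logical path and Alluxio resolves which underlying store (and cache tier) serves it; this sketch only captures the mount-table routing idea, not caching or the actual client APIs.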

Alluxio can be deployed on-premises, in the cloud (e.g., Microsoft Azure, AWS, Google Compute Engine), or in a hybrid cloud environment. It can run on bare metal or in containerized environments such as Kubernetes, Docker, and Apache Mesos.

History

Alluxio was started by Haoyuan Li at UC Berkeley's AMPLab in 2013 and open sourced in 2014. By 2018, Alluxio had more than 1,000 contributors,[3] making it one of the most active projects in the data ecosystem.

| Version | Original release date | Latest minor release | Release date | Status |
| ------- | --------------------- | -------------------- | ------------ | ------ |
| 0.2 | 2013-04-08 | 0.2.1 | 2013-04-25 | No longer maintained |
| 0.3 | 2013-10-21 | 0.3.0 | 2013-10-21 | No longer maintained |
| 0.4 | 2014-02-02 | 0.4.1 | 2014-02-25 | No longer maintained |
| 0.5 | 2014-07-20 | 0.5.0 | 2014-07-20 | No longer maintained |
| 0.6 | 2015-03-01 | 0.6.4 | 2015-04-23 | No longer maintained |
| 0.7 | 2015-07-17 | 0.7.1 | 2015-08-10 | No longer maintained |
| 0.8 | 2015-10-21 | 0.8.2 | 2015-11-10 | No longer maintained |
| 1.0 | 2016-02-23 | 1.0.1 | 2016-03-27 | No longer maintained |
| 1.1 | 2016-06-06 | 1.1.1 | 2016-07-04 | No longer maintained |
| 1.2 | 2016-07-17 | 1.2.0 | 2016-07-17 | No longer maintained |
| 1.3 | 2016-10-05 | 1.3.0 | 2016-10-05 | No longer maintained |
| 1.4 | 2017-01-12 | 1.4.0 | 2017-01-12 | No longer maintained |
| 1.5 | 2017-06-11 | 1.5.0 | 2017-06-11 | No longer maintained |
| 1.6 | 2017-09-24 | 1.6.1 | 2017-11-02 | No longer maintained |
| 1.7 | 2018-01-14 | 1.7.1 | 2018-03-26 | No longer maintained |
| 1.8 | 2018-07-07 | 1.8.2 | 2019-08-05 | Older version, still maintained |
| 2.0 | 2019-06-27 | 2.0.1 | 2019-09-03 | Older version, still maintained |
| 2.1 | 2019-11-06 | 2.1.2 | 2020-02-04 | Older version, still maintained |
| 2.2 | 2020-03-11 | 2.2.2 | 2020-06-24 | Older version, still maintained |
| 2.3 | 2020-06-30 | 2.3.0 | 2020-06-30 | Older version, still maintained |
| 2.4 | 2020-10-19 | 2.4.1 | 2020-11-20 | Older version, still maintained |
| 2.5 | 2021-03-10 | 2.5.0 | 2021-03-10 | Older version, still maintained |
| 2.6 | 2021-06-23 | 2.6.2 | 2021-09-17 | Older version, still maintained |
| 2.7 | 2021-11-16 | 2.7.4 | 2022-04-19 | Older version, still maintained |
| 2.8 | 2022-05-04 | 2.8.1 | 2022-08-17 | Older version, still maintained |
| 2.9 | 2022-11-16 | 2.9.3 | 2023-03-27 | Current stable version |

Enterprises that use Alluxio

Notable enterprises that have used or are using Alluxio include China Unicom,[6] Cray,[8] Didi,[9] Huawei,[12] JD.com,[15] Lenovo,[16] Samsung,[17] Tencent,[18] and VIPShop.[19]

Related Research Articles

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.

The following tables compare general and technical information for a number of online analytical processing (OLAP) servers. Please see the individual products articles for further information.

Ion Stoica

Ion Stoica is a Romanian–American computer scientist specializing in distributed systems, cloud computing and computer networking. He is a professor of computer science at the University of California, Berkeley and co-director of AMPLab. He co-founded Conviva and Databricks with other original developers of Apache Spark.

In computing, a distributed file system (DFS) or network file system is any file system that allows access to files from multiple hosts via a computer network. This makes it possible for multiple users on multiple machines to share files and storage resources.

Apache Drill

Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by developers from MapR, Drill is inspired by Google's Dremel system. Tom Shiran founded the project, which was designated an Apache Software Foundation top-level project in December 2016.

Apache Kafka is a distributed event store and stream-processing platform. It is an open-source system developed by the Apache Software Foundation written in Java and Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. Kafka can connect to external systems via Kafka Connect, and provides the Kafka Streams libraries for stream processing applications. Kafka uses a binary TCP-based protocol that is optimized for efficiency and relies on a "message set" abstraction that naturally groups messages together to reduce the overhead of the network roundtrip. This "leads to larger network packets, larger sequential disk operations, contiguous memory blocks [...] which allows Kafka to turn a bursty stream of random message writes into linear writes."

In computing, Hazelcast is a unified real-time data platform based on Java that combines a fast data store with stream processing. It is also the name of the company developing the product. The Hazelcast company is funded by venture capital and headquartered in Palo Alto, California.

Matei Zaharia is a Romanian-Canadian computer scientist, educator and the creator of Apache Spark.

Apache Spark

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Deeplearning4j

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

Databricks

Databricks, Inc. is an American software company founded by the creators of Apache Spark. Databricks develops a web-based platform for working with Spark that provides automated cluster management and IPython-style notebooks. The company also develops Delta Lake, an open-source project to bring reliability to data lakes for machine learning and other data science use cases.

Apache Kylin

Apache Kylin is an open-source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio, supporting extremely large datasets.

AMPLab was a University of California, Berkeley lab focused on big data analytics, located in Soda Hall. The name stands for the Algorithms, Machines and People Lab. It had been publishing papers since 2008 and was officially launched in 2011. The AMPLab was co-directed by Professors Michael J. Franklin, Michael I. Jordan, and Ion Stoica.

Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. He is best known for his work on Apache Spark, a leading open-source Big Data project. He was designer and lead developer of the GraphX, Project Tungsten, and Structured Streaming components and he co-designed DataFrames, all of which are part of the core Apache Spark distribution; he also served as the release manager for Spark's 2.0 release.

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that is able to represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware. This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.

Apache CarbonData is a free and open-source column-oriented data storage format of the Apache Hadoop ecosystem. It is similar to the other columnar-storage file formats available in Hadoop namely RCFile and ORC. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Haoyuan (H.Y.) Li is a computer scientist and entrepreneur specializing in distributed systems, big data, and cloud computing. He is best known for proposing Virtual Distributed File System (VDFS), and creating an open-source data orchestration system, Alluxio. He is the Founder, Chairman, and CEO of Alluxio, Inc, a company commercializing the Alluxio Data Orchestration Technology. He is also an adjunct professor at Peking University. He is a frequent speaker on the topic of AI, Big Data, Cloud Computing, and Open Source at conferences.

Apache Pinot

Apache Pinot is a column-oriented, open-source, distributed data store written in Java. Pinot is designed to execute OLAP queries with low latency. It is suited to contexts where fast analytics, such as aggregations, are needed on immutable data, possibly with real-time data ingestion. The name Pinot comes from the Pinot grape vines that are pressed into liquid used to produce a variety of different wines. The founders of the database chose the name as a metaphor for analyzing vast quantities of data from a variety of different file formats or streaming data sources.

References

  1. "Releases · Alluxio/alluxio". github.com. Retrieved 2022-11-16.
  2. Li, Haoyuan (7 May 2018). Alluxio: A Virtual Distributed File System (Technical report). EECS Department, University of California, Berkeley. UCB/EECS-2018-29.
  3. Open HUB Alluxio development activity
  4. "This New Open Source Project Is 100X Faster than Spark SQL In Petabyte-Scale Production".
  5. "Making the Impossible Possible with Tachyon: Accelerate Spark Jobs from Hours to Seconds".
  6. "China Unicom's big bet on open source".
  7. "Operationalizing Machine Learning—Managing Provenance from Raw Data to Predictions".
  8. "Cray Analytics and Alluxio – Wrangling Enterprise Storage". Archived from the original on 2019-07-14. Retrieved 2019-02-19.
  9. "Alluxio's Use and Practice in Didi".
  10. "Data Transformation in Financial Services".
  11. "ArcGIS and Alluxio - Using Alluxio to enhance ArcGIS data capability and get faster insights from all your data".
  12. "Huawei hugs open-sourcey Alluxio: Thanks for the memories". The Register.
  13. "How Alluxio is Accelerating Apache Spark Workloads". Archived from the original on 2019-07-14. Retrieved 2019-02-19.
  14. "Getting Started with Tachyon by Use Cases".
  15. "Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks".
  16. "World's Largest Computer Maker Lenovo Selects Alluxio for Data Management of Worldwide Smartphone Data".
  17. "Enhancing the Value of Alluxio with Samsung NVMe SSDs".
  18. "Tencent Delivering Customized News to Over 100 Million Users per Month with Alluxio".
  19. "The Practice of Alluxio in Near Real-Time Data Platform at VIPShop".
  20. "Bringing Data to Life - Data Management and Visualization Techniques".