Haoyuan Li

Alma mater: UC Berkeley (Ph.D.); Cornell University (M.S.); Peking University (B.S.)
Known for: Alluxio
Scientific career
Fields: Computer Science
Thesis: Alluxio: A Virtual Distributed File System (2018)
Doctoral advisors: Ion Stoica; Scott Shenker
Website: haoyuanli.com

Haoyuan (H.Y.) Li is a computer scientist and entrepreneur specializing in distributed systems, big data, and cloud computing. He is best known for proposing the Virtual Distributed File System (VDFS) [1] and creating Alluxio, an open-source data orchestration system. He is the founder, chairman, and CEO of Alluxio, Inc., [2] [3] a company commercializing the Alluxio data orchestration technology. He is also an adjunct professor at Peking University and frequently speaks at conferences on AI, big data, cloud computing, and open source.

Biography

Li was born and raised in China. He attended Peking University, where he received a BS in Computer Science. While at university, he represented Peking University in programming contests, placing 11th worldwide (bronze medal) in ACM ICPC 2005 and 13th worldwide in 2006. He then studied at Cornell University, where he received an MS in Computer Science.

He received his PhD in Computer Science [1] from the UC Berkeley AMPLab, under the supervision of Prof. Ion Stoica and Prof. Scott Shenker. During his PhD, he co-created the open-source Alluxio project (formerly known as Tachyon), [4] which was commercialized by Alluxio, Inc., a venture-backed company in the San Francisco Bay Area. [1] [5] [6] [7] [8] [9] He was a co-founder of Alluxio, Inc.

During his PhD, he also co-created the Apache Spark Streaming project [10] and became an Apache Spark committer. [11]

References

  1. Li, Haoyuan (7 May 2018). Alluxio: A Virtual Distributed File System (Technical report). EECS Department, University of California, Berkeley. UCB/EECS-2018-29.
  2. "Alluxio launches its memory-centric storage system for big data workloads". techcrunch.com. TechCrunch.
  3. Woodie, Alex (3 July 2019). "Celebrating Data Independence". datanami.com. Tabor Communications.
  4. Li, Haoyuan; Ghodsi, Ali; Zaharia, Matei; Shenker, Scott; Stoica, Ion. "Tachyon: Reliable, Memory Speed Storage for Cluster Computing Frameworks" (PDF).
  5. Gage, Deborah (17 March 2015). "Andreessen Horowitz Invests $7.5M in Big-Data Startup Tachyon". wsj.com. The Wall Street Journal.
  6. Brust, Andrew (15 July 2019). "Alluxio 2.0 seeks to unify fragmented data ecosystem". ZDNet. CBS Interactive.
  7. Gillin, Paul (11 July 2019). "Alluxio's data orchestration platform now spans multiple clouds". siliconangle.com. SiliconANGLE Media Inc.
  8. Mellor, Chris (12 July 2019). "You need access to those big data silos – fast? No problem, says Alluxio". blocksandfiles.com. Blocks & Files.
  9. Wells, Joyce (11 July 2019). "Breaking Down Data Silos with Data Orchestration". dbta.com. Information Today Inc.
  10. Zaharia, Matei; Das, Tathagata; Li, Haoyuan; Hunter, Timothy; Shenker, Scott; Stoica, Ion. "Discretized Streams: Fault-Tolerant Streaming Computation at Scale" (PDF).
  11. "Apache Spark Committer List".