Reynold Xin

Alma mater: UC Berkeley (Ph.D.); University of Toronto (B.A.Sc.)
Known for: Apache Spark, Databricks
Scientific career
Fields: Computer science
Doctoral advisor: Michael J. Franklin

Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. [1] He is best known for his work on Apache Spark, a leading open-source big data project. [2] He was the designer and lead developer of GraphX, Project Tungsten, and Structured Streaming, and he co-designed DataFrames; all four are part of the core Apache Spark distribution. He also served as the release manager for Spark's 2.0 release. [3]

Biography

Berkeley

Xin started his work on the Spark open source project while he was a doctoral candidate at the AMPLab at the University of California, Berkeley. He received his Ph.D. in computer science from Berkeley, where his advisors were Michael J. Franklin and Ion Stoica. [4]

His first research project, Shark, [5] was a system that could efficiently execute SQL and advanced analytics workloads at scale. Shark won the Best Demo Award at SIGMOD 2012. [6] It was one of the first open-source interactive SQL-on-Hadoop systems, with claims that it ran between 10 and 100 times faster than Apache Hive. Shark was used by technology companies such as Yahoo, [7] but it was replaced in 2014 by a newer system, Spark SQL. [8]

His second research project, GraphX, [9] built a graph processing system on top of Spark, a general data-parallel system. At the same time, GraphX challenged the notion that specialized systems are necessary for graph computation. GraphX was released as an open-source project and merged into Spark in 2014 as Spark's graph processing library.

Databricks

In 2013, along with Matei Zaharia and other key Spark contributors, Xin co-founded Databricks, a venture-backed company based in San Francisco that offers a data platform as a service built on Spark.

In 2014, Xin led a team of Databricks engineers that competed in the Sort Benchmark and won the 2014 world record in Daytona GraySort using Spark, beating the previous record, held by Apache Hadoop, by a factor of 30. [10] Xin claimed that Spark was the fastest open-source engine for sorting a petabyte of data. [11]

While at Databricks, he also started the DataFrames project, [12] Project Tungsten, [13] and Structured Streaming. [14] DataFrames has become Spark's foundational API, while Tungsten has become its new execution engine.

References

  1. "Reynold Xin: Executive Profile & Biography - Businessweek". bloomberg.com. Bloomberg Businessweek. Retrieved 21 September 2016.
  2. Woodie, Alex (8 June 2016). "Apache Spark Adoption by the Numbers". datanami.com. Tabor Communications. Retrieved 21 September 2016.
  3. "Apache Spark Developers List - [ANNOUNCE] Announcing Apache Spark 2.0.0". apache-spark-developers-list.1001551.n3.nabble.com. Retrieved 2016-08-04.
  4. "Speaker Reynold Xin". engsci.utoronto.ca. 5 October 2020.
  5. Xin, Reynold S.; Rosen, Josh; Zaharia, Matei; Franklin, Michael J.; Shenker, Scott; Stoica, Ion (2013-01-01). "Shark". Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data. SIGMOD '13. New York, NY, USA: ACM. pp. 13–24. doi:10.1145/2463676.2465288. ISBN 9781450320375. S2CID 1597960.
  6. "Shark Wins Best Demo Award at SIGMOD 2012". AMPLab - UC Berkeley. 24 May 2012. Retrieved 2016-08-04.
  7. Tully. "Analytics on Spark & Shark @Yahoo" (PDF).
  8. "Shark, Spark SQL, Hive on Spark, and the future of SQL on Apache Spark". 2014-07-01. Retrieved 2016-08-04.
  9. Gonzalez, Joseph E.; Xin, Reynold S.; Dave, Ankur; Crankshaw, Daniel; Franklin, Michael J.; Stoica, Ion (2014-01-01). "GraphX: Graph Processing in a Distributed Dataflow Framework". Proceedings of the 11th USENIX Conference on Operating Systems Design and Implementation. OSDI'14. Berkeley, CA, USA: USENIX Association: 599–613. ISBN 9781931971164.
  10. Finley, Klint. "Startup Crunches 100 Terabytes of Data in a Record 23 Minutes". Wired. Retrieved 2016-08-04.
  11. "Apache Spark the fastest open source engine for sorting a petabyte". 2014-10-10. Retrieved 2016-08-04.
  12. "Introducing DataFrames in Apache Spark for Large Scale Data Science". 2015-02-17. Retrieved 2016-08-04.
  13. Woodie, Alex (4 May 2015). "Deep Dive Into Databricks' Big Speedup Plans for Apache Spark". datanami.com. Tabor Communications. Retrieved 21 September 2016.
  14. Woodie, Alex (25 February 2016). "Spark 2.0 to Introduce New 'Structured Streaming' Engine". datanami.com. Tabor Communications. Retrieved 21 September 2016.