Ali Ghodsi | |
---|---|
Born | December 1978 (age 45) |
Citizenship | Swedish |
Alma mater | KTH Royal Institute of Technology (Ph.D.) |
Known for | Apache Mesos, Apache Spark, Co-founding Databricks |
Title | CEO of Databricks |
Scientific career | |
Fields | Computer Science |
Institutions | Databricks UC Berkeley |
Thesis | Distributed k-ary System: Algorithms for Distributed Hash Tables (2006) |
Doctoral advisor | Seif Haridi |
Website | www |
Ali Ghodsi (born December 1978) [3] is a Swedish-American computer scientist and entrepreneur [4] of Persian origin, specializing in distributed systems and big data. He is a co-founder and CEO of Databricks [5] [6] [7] and an adjunct professor at UC Berkeley. He coauthored several influential papers, including Apache Mesos [8] and Apache Spark SQL. [9]
Ghodsi received his PhD from KTH Royal Institute of Technology in Sweden, advised by Seif Haridi. He was a co-founder of Peerialism AB, a Stockholm-based company developing a peer-to-peer data transfer system. He was also an assistant professor at KTH from 2008 to 2009.
He joined UC Berkeley in 2009 as a visiting scholar and worked with Scott Shenker, Ion Stoica, Michael Franklin, and Matei Zaharia on research projects in distributed systems, database systems, and networking. During this period, he helped start the Apache Mesos and Apache Spark projects. He also co-invented the concept of Dominant resource fairness, [10] in a paper that heavily influenced resource management and scheduling design in distributed systems such as Hadoop. [11]
In 2013, he co-founded Databricks, a company that commercializes Spark, and became chief executive in 2016. [12] [13]
Scott J. Shenker is an American computer scientist, and professor of computer science at the University of California, Berkeley. He is also the leader of the Extensible Internet Group at the International Computer Science Institute in Berkeley, California.
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
Apache Hive is a data warehouse software project. It is built on top of Apache Hadoop for providing data query and analysis. Hive gives an SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop. Traditional SQL queries must be implemented in the MapReduce Java API to execute SQL applications and queries over distributed data.
Ion Stoica is a Romanian–American computer scientist specializing in distributed systems, cloud computing and computer networking. He is a professor of computer science at the University of California, Berkeley and co-director of AMPLab. He co-founded Conviva and Databricks with other original developers of Apache Spark.
Apache Drill is an open-source software framework that supports data-intensive distributed applications for interactive analysis of large-scale datasets. Built chiefly by contributions from developers from MapR, Drill is inspired by Google's Dremel system. Drill is an Apache top-level project. Tom Shiran is the founder of the Apache Drill Project. It was designated an Apache Software Foundation top-level project in December 2016.
Matei Zaharia is a Romanian-Canadian computer scientist, educator and the creator of Apache Spark.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Databricks, Inc. is a global data, analytics and artificial intelligence company founded by the original creators of Apache Spark.
Apache Mesos is an open-source project to manage computer clusters. It was developed at the University of California, Berkeley.
A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics, and machine learning. A data lake can include structured data from relational databases, semi-structured data, unstructured data, and binary data. A data lake can be established "on premises" or "in the cloud".
Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio supporting extremely large datasets.
Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. He is best known for his work on Apache Spark, a leading open-source Big Data project. He was designer and lead developer of the GraphX, Project Tungsten, and Structured Streaming components and he co-designed DataFrames, all of which are part of the core Apache Spark distribution; he also served as the release manager for Spark's 2.0 release.
Azure Data Lake is a scalable data storage and analytics service. The service is hosted in Azure, Microsoft's public cloud.
Alluxio is an open-source virtual distributed file system (VDFS). Initially as research project "Tachyon", Alluxio was created at the University of California, Berkeley's AMPLab as Haoyuan Li's Ph.D. Thesis, advised by Professor Scott Shenker & Professor Ion Stoica. Alluxio sits between computation and storage in the big data analytics stack. It provides a data abstraction layer for computation frameworks, enabling applications to connect to numerous storage systems through a common interface. The software is published under the Apache License.
Haoyuan (H.Y.) Li is a computer scientist and entrepreneur specializing in distributed systems, big data, and cloud computing. He is best known for proposing Virtual Distributed File System (VDFS), and creating an open-source data orchestration system, Alluxio. He is the Founder, Chairman, and CEO of Alluxio, Inc, a company commercializing the Alluxio Data Orchestration Technology. He is also an adjunct professor at Peking University. He is a frequent speaker on the topic of AI, big data, cloud computing, and open source at conferences.
Dominant resource fairness (DRF) is a rule for fair division. It is particularly useful for dividing computing resources in among users in cloud computing environments, where each user may require a different combination of resources. DRF was presented by Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker and Ion Stoica in 2011.