Type | Subsidiary |
---|---|
Industry | Computer software |
Founded | 2011 |
Headquarters | Santa Clara, California, United States |
Products | Hortonworks Data Platform, Hortonworks DataFlow, Hortonworks DataPlane |
Number of employees | ~1,110 (2017) [1] |
Parent | Cloudera |
Website | Hortonworks.com |
Hortonworks was a data software company based in Santa Clara, California, that developed and supported open-source software (primarily around Apache Hadoop) designed to manage big data and associated processing.
Hortonworks software was used to build enterprise data services and applications such as IoT (connected cars, for example), single view of X (such as customer, risk, or patient), and advanced analytics and machine learning (such as next best action and real-time cybersecurity). Hortonworks had three interoperable product lines: Hortonworks Data Platform (HDP), Hortonworks DataFlow (HDF), and Hortonworks DataPlane.
In January 2019, Hortonworks completed its merger with Cloudera. [3]
Hortonworks was formed in June 2011 as an independent company, funded by $23 million in venture capital from Yahoo! and Benchmark Capital. Its first office was in Sunnyvale, California. [4] The company employed contributors to the open-source software project Apache Hadoop. [5] The Hortonworks Data Platform (HDP) product, first released in June 2012, [6] included Apache Hadoop and was used for storing, processing, and analyzing large volumes of data. The platform was designed to deal with data from many sources and formats. The platform included Hadoop technology such as the Hadoop Distributed File System, MapReduce, Pig, Hive, HBase, ZooKeeper, and additional components. [7]
Eric Baldeschweiler (from Yahoo) was the initial chief executive, and Rob Bearden, formerly of SpringSource, was chief operating officer. Benchmark partner Peter Fenton was a board member. The company name refers to the character Horton the Elephant, since the elephant is the symbol for Hadoop. [4] [8]
In October 2018, Hortonworks and Cloudera announced they would be merging in an all-stock merger of equals. [9] After the merger, Hortonworks' Apache-based products became part of the Cloudera Data Platform.
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
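To illustrate the MapReduce programming model mentioned above, the following is a minimal sketch of the canonical word-count job against Hadoop's Java `org.apache.hadoop.mapreduce` API. The input and output paths are supplied on the command line; class names are otherwise illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework handles splitting the input across the cluster, rerunning failed tasks, and shuffling intermediate (word, count) pairs to reducers, which is how the hardware-failure assumption described above is made transparent to the programmer.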
Douglass Read Cutting is a software designer, advocate, and creator of open-source search technology. He founded two technology projects, Lucene and Nutch, with Mike Cafarella. Both projects are now managed through the Apache Software Foundation. Cutting and Cafarella are also the co-founders of Apache Hadoop.
Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.
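As a sketch of how an application might use the features listed above, the following queries Solr from Java with the SolrJ client. The `articles` collection and the `title` and `category` fields are hypothetical.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrSearchExample {
  public static void main(String[] args) throws Exception {
    // Collection name and fields are assumptions for this sketch.
    try (SolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {
      SolrQuery query = new SolrQuery("title:hadoop"); // full-text search
      query.setHighlight(true);                        // hit highlighting
      query.addFacetField("category");                 // faceted search
      query.setRows(10);

      QueryResponse response = client.query(query);
      for (SolrDocument doc : response.getResults()) {
        System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("title"));
      }
    }
  }
}
```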
WANdisco plc develops technology that moves large Internet of Things (IoT) datasets, edge data, and on-premises Hadoop data lakes to the cloud at scale, so organizations can activate their data for machine learning, artificial intelligence, and data analytics on modern cloud platforms, including Microsoft Azure, Amazon Web Services, Google, Oracle, Databricks, and Snowflake.
Cloudera, Inc. is an American software company providing enterprise data management systems that make significant use of Apache Hadoop. As of January 31, 2021, the company had approximately 1,800 customers.
Within database management systems, the RCFile is a data placement structure that determines how to store relational tables on computer clusters. It is designed for systems using the MapReduce framework. The RCFile structure includes a data storage format, a data compression approach, and optimization techniques for data reading. It is able to meet all four requirements of data placement: (1) fast data loading, (2) fast query processing, (3) highly efficient storage space utilization, and (4) strong adaptability to dynamic data access patterns.
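The placement idea can be sketched in a few lines. This is a conceptual illustration only, not the actual RCFile implementation or API: rows are first partitioned horizontally into row groups, and within each group the values are laid out column by column, so queries touching only some columns can skip the bytes of the others.

```java
import java.util.ArrayList;
import java.util.List;

/** Conceptual sketch of the "horizontally partition, then store
 *  column-wise" layout; real RCFile row groups are far larger. */
public class RowColumnarSketch {
  static final int ROW_GROUP_SIZE = 4; // illustrative; RCFile defaults to megabytes per group

  public static List<String[][]> toRowGroups(String[][] rows, int numCols) {
    List<String[][]> groups = new ArrayList<>();
    for (int start = 0; start < rows.length; start += ROW_GROUP_SIZE) {
      int n = Math.min(ROW_GROUP_SIZE, rows.length - start);
      // Transpose the slice of rows so each column is contiguous;
      // on disk, each column array would be compressed independently.
      String[][] columns = new String[numCols][n];
      for (int c = 0; c < numCols; c++)
        for (int r = 0; r < n; r++)
          columns[c][r] = rows[start + r][c];
      groups.add(columns);
    }
    return groups;
  }
}
```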
MapR was a business software company headquartered in Santa Clara, California. MapR software provided access to a variety of data sources from a single computer cluster, including big data workloads such as Apache Hadoop and Apache Spark, a distributed file system, a multi-model database management system, and event stream processing, combining real-time analytics with operational applications. Its technology ran on both commodity hardware and public cloud computing services. In August 2019, following financial difficulties, the technology and intellectual property of the company were sold to Hewlett Packard Enterprise.
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
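Applications typically reach Impala's SQL interface over JDBC or ODBC. The following is a minimal sketch assuming a HiveServer2-compatible JDBC driver on the classpath and an unsecured cluster; the host, port, and `web_logs` table are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ImpalaQueryExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical endpoint; 21050 is a commonly used Impala JDBC port,
    // but the exact URL depends on the driver and cluster security setup.
    String url = "jdbc:hive2://impala-host:21050/default;auth=noSasl";
    try (Connection conn = DriverManager.getConnection(url);
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + ": " + rs.getLong("hits"));
      }
    }
  }
}
```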
WibiData was a software company that developed big data applications for enterprises to personalize their customer experiences. It developed applications based on the open-source technologies Apache Hadoop, Apache Cassandra, Apache HBase, Apache Avro, and the Kiji Project. WibiData was founded under the name Odiago in 2010 by Christophe Bisciglia, Aaron Kimball, and Garrett Wu. Based in San Francisco, California, WibiData was backed by investors such as Canaan Partners, New Enterprise Associates, SV Angel, and Eric Schmidt.
Platfora, Inc. is a big data analytics company based in San Mateo, California. The firm’s software works with the open-source software framework Apache Hadoop to assist with data analysis, data visualization, and sharing.
PSSC Labs is a California-based company that provides supercomputing solutions in the United States and internationally. Its products include "high-performance" servers, clusters, workstations, and RAID storage systems for scientific research, government and military, entertainment content creators, developers, and private clouds. The company has implemented clustering software from NASA Goddard's Beowulf project in its supercomputers designed for bioinformatics, medical imaging, computational chemistry and other scientific applications.
Big Data Partnership was a specialist big data professional services company based in London, UK. It provided consultancy, certified training, and support to enterprises based in Europe, the Middle East, and Africa.
Apache Phoenix is an open source, massively parallel, relational database engine supporting OLTP for Hadoop, using Apache HBase as its backing store. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; insert and delete rows singly and in bulk; and query data through SQL. Phoenix compiles queries and other statements into native NoSQL store APIs rather than using MapReduce, enabling the construction of low-latency applications on top of NoSQL stores.
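A minimal sketch of the JDBC usage described above, assuming a ZooKeeper quorum on localhost and a hypothetical `users` table. Note Phoenix's UPSERT statement in place of INSERT, and the explicit commit (Phoenix connections do not auto-commit by default).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixExample {
  public static void main(String[] args) throws Exception {
    // The URL names the ZooKeeper quorum that locates the HBase cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
      try (Statement stmt = conn.createStatement()) {
        stmt.execute("CREATE TABLE IF NOT EXISTS users (id BIGINT PRIMARY KEY, name VARCHAR)");
      }
      // Phoenix uses UPSERT for both inserts and updates.
      try (PreparedStatement ps = conn.prepareStatement("UPSERT INTO users VALUES (?, ?)")) {
        ps.setLong(1, 1L);
        ps.setString(2, "Horton");
        ps.executeUpdate();
      }
      conn.commit(); // writes are buffered until committed

      try (Statement stmt = conn.createStatement();
           ResultSet rs = stmt.executeQuery("SELECT id, name FROM users")) {
        while (rs.next()) {
          System.out.println(rs.getLong(1) + " " + rs.getString(2));
        }
      }
    }
  }
}
```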
Apache Kylin is an open source distributed analytics engine designed to provide a SQL interface and multi-dimensional analysis (OLAP) on Hadoop and Alluxio, supporting extremely large datasets.
Apache NiFi is a software project from the Apache Software Foundation designed to automate the flow of data between software systems. Leveraging the concept of extract, transform, load (ETL), it is based on the "NiagaraFiles" software previously developed by the US National Security Agency (NSA), which is also the source of part of its present name, NiFi. It was open-sourced as a part of NSA's technology transfer program in 2014.
Reynold Xin is a computer scientist and engineer specializing in big data, distributed systems, and cloud computing. He is a co-founder and Chief Architect of Databricks. He is best known for his work on Apache Spark, a leading open-source Big Data project. He was designer and lead developer of the GraphX, Project Tungsten, and Structured Streaming components and he co-designed DataFrames, all of which are part of the core Apache Spark distribution; he also served as the release manager for Spark's 2.0 release.
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
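As a sketch of writing Parquet from Java through the parquet-avro bindings: the Avro schema and field names below are illustrative, and the builder API has shifted across Parquet versions (newer releases prefer an `OutputFile` over a `Path`), so this assumes an older, widely used form.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical two-field record schema.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"long\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // Each column chunk is compressed independently (Snappy here),
    // which is where Parquet's columnar encoding efficiency comes from.
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(new Path("events.parquet"))
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      GenericRecord record = new GenericData.Record(schema);
      record.put("id", 1L);
      record.put("name", "example");
      writer.write(record);
    }
  }
}
```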
Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem, such as RCFile and Parquet. It is used by most of the data processing frameworks in that ecosystem, including Apache Spark, Apache Hive, Apache Flink, and Apache Hadoop.