| Doug Cutting | |
| --- | --- |
| Known for | Open-source software, The Apache Software Foundation |
| Awards | O'Reilly Open Source Award |
Douglass Read Cutting is a software designer, advocate for, and creator of open-source search technology. He founded two technology projects, Lucene and Nutch, with Mike Cafarella. The Apache Software Foundation now manages both projects. Cutting and Cafarella were also co-founders of Apache Hadoop. [1]
Cutting graduated from Stanford University in 1985 with a bachelor's degree. [2] [3]
Prior to developing Lucene, Cutting held search technology positions at Xerox PARC, where he worked on the Scatter/Gather algorithm [4] [5] and on computational stylistics. [6] He also worked at Excite, where he was one of the chief designers of the search engine, and at Apple Inc., where he was the primary author of the V-Twin text search framework. [7]
Lucene, a search indexer, and Nutch, a spider or crawler, are the two key components of an open-source general search platform that first crawls the Web for content and then structures it into a searchable index. Cutting's leadership of these two projects extended the concepts and capabilities of general open-source software projects such as Linux and MySQL into the vertical domain of search. [8] In a 2017 article, Cutting was quoted as saying, "Open source is a requirement for business." [9]
In December 2004, Google Research published a paper on the MapReduce algorithm, which allows very large-scale computations to be trivially parallelized across large clusters of servers. Cutting and Mike Cafarella, realizing the importance of this paper to extending Lucene into the realm of extremely large search problems, created the open-source Hadoop framework. This framework allows applications based on the MapReduce paradigm to be run on large clusters of commodity hardware. Cutting was an employee of Yahoo!, where he led the Hadoop project full-time; he later went on to work for Cloudera. [10]
In July 2009, Cutting was elected to the board of directors of the Apache Software Foundation, and in September 2010, he was elected its chairman. [11]
In 2015, Cutting was awarded the O'Reilly Open Source Award. [12]
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications.
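To illustrate the kind of indexing and querying Lucene provides, here is a minimal sketch against the core Lucene Java API. It builds an in-memory index with a single document and runs a free-text query over it; the field name "title" and the sample text are illustrative, and class names such as ByteBuffersDirectory assume a recent Lucene 8/9-era release, so exact packages and signatures may differ between versions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory index for the example

        // Index one document with a single analyzed, stored text field.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Parse a free-text query against the "title" field and search the index.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new QueryParser("title", analyzer).parse("lucene"), 10);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println("doc=" + hit.doc + " score=" + hit.score);
            }
        }
    }
}
```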
Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.
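As an illustration of the MapReduce programming model that Hadoop implements, the following is a minimal word-count sketch against the Hadoop MapReduce Java API, close to the canonical example in the Hadoop documentation. The class name WordCount and the input/output paths passed on the command line are placeholders; the framework splits the input across the cluster, re-executes failed tasks, and shuffles the mapper output to the reducers.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```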
Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.
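A brief sketch of how a client typically talks to Solr through the SolrJ Java library follows. The server URL, the collection name "articles", and the field names are hypothetical, and the HttpSolrClient builder reflects SolrJ 7/8-era APIs (newer releases favor other client classes), so treat this as an approximation rather than a definitive example.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrSketch {
    public static void main(String[] args) throws Exception {
        // Point the client at a (hypothetical) collection named "articles".
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {
            // Index one document, then commit so it becomes visible to searches.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "Doug Cutting and open-source search");
            solr.add(doc);
            solr.commit();

            // Run a full-text query against the "title" field.
            QueryResponse response = solr.query(new SolrQuery("title:search"));
            for (SolrDocument result : response.getResults()) {
                System.out.println(result.getFieldValue("id") + " -> " + result.getFieldValue("title"));
            }
        }
    }
}
```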
HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
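The sparse, column-family-oriented model can be seen in the HBase Java client API. Below is a minimal sketch that writes and reads a single cell; the table name "webtable", the column family "content", and the row key are hypothetical, and the cluster configuration is assumed to be available on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("webtable"))) {

            // Write one cell: row key, column family, qualifier, value.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("html"), Bytes.toBytes("<html>...</html>"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            byte[] html = result.getValue(Bytes.toBytes("content"), Bytes.toBytes("html"));
            System.out.println(Bytes.toString(html));
        }
    }
}
```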
Many-task computing (MTC) in computational science is an approach to parallel computing that aims to bridge the gap between two computing paradigms: high-throughput computing (HTC) and high-performance computing (HPC).
Sector/Sphere is an open source software suite for high-performance distributed data storage and processing. It can be broadly compared to Google's GFS and MapReduce technology. Sector is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming architecture framework that supports in-storage parallel data processing for data stored in Sector. Sector/Sphere operates in a wide area network (WAN) setting.
Pentaho is business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. Its headquarters are in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015 and in 2017 became part of Hitachi Vantara.
Cloudera, Inc. is an American data lake software company.
Apache Hama is a distributed computing framework based on bulk synchronous parallel computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms. Originally a sub-project of Hadoop, it became an Apache Software Foundation top-level project in 2012. It was created by Edward J. Yoon, who named it after the Korean word for hippopotamus (하마, hama), following the trend of naming Apache projects after animals and zoology. Hama was inspired by Google's Pregel large-scale graph computing framework described in 2010. When executing graph algorithms, Hama showed a fifty-fold performance increase relative to Hadoop.
David Ron Karger is an American computer scientist who is professor and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology.
Hortonworks was a data software company based in Santa Clara, California that developed and supported open-source software designed to manage big data and associated processing.
Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.
Mike Cafarella is a computer scientist specializing in database management systems. He is a principal research scientist of computer science at MIT Computer Science and Artificial Intelligence Laboratory. Before coming to MIT, he was a professor of Computer Science and Engineering at the University of Michigan from 2009 to 2020. Along with Doug Cutting, he is one of the original co-founders of the Hadoop and Nutch open-source projects. Cafarella was born in New York City but moved to Westwood, MA early in his childhood. After completing his bachelor's degree at Brown University, he earned a Ph.D. specializing in database management systems at the University of Washington under Dan Suciu and Oren Etzioni. He was also involved in several notable start-ups, including Tellme Networks, and co-founder of Lattice Data, which was acquired by Apple in 2017.
Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.
Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
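A small sketch of Spark's programming model, using its Java RDD API to count words in a text file, is shown below. The local master URL and the input path "input.txt" are placeholders, and the lambda-based API assumes a Spark 2.x/3.x-era release.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("WordCount").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Transformations are lazy; Spark parallelizes them across the cluster (or local cores here).
            JavaRDD<String> lines = sc.textFile("input.txt");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey(Integer::sum);
            counts.take(10).forEach(pair -> System.out.println(pair._1() + ": " + pair._2()));
        }
    }
}
```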
Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.
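As a small illustration of how Parquet is typically consumed from the surrounding frameworks, the following sketch writes and then reads a Parquet dataset through Spark's Java DataFrame API; the input file "people.json", the output path, and the local SparkSession setup are illustrative assumptions.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParquetExample")
                .master("local[*]")
                .getOrCreate();

        // Read JSON input, then persist it in the columnar Parquet format.
        Dataset<Row> people = spark.read().json("people.json");
        people.write().parquet("people.parquet");

        // Read it back; compression and efficient columnar encoding come from the format itself.
        Dataset<Row> fromParquet = spark.read().parquet("people.parquet");
        fromParquet.printSchema();
        fromParquet.show();

        spark.stop();
    }
}
```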
Apache Kudu is a free and open source column-oriented data store of the Apache Hadoop ecosystem. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides completeness to Hadoop's storage layer to enable fast analytics on fast data.
Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem, such as RCFile and Parquet. It is used by most of the major data processing frameworks, including Apache Spark, Apache Hive, Apache Flink, and Apache Hadoop.