Doug Cutting

Known for: Open-source software, The Apache Software Foundation
Awards: O'Reilly Open Source Award

Douglass Read Cutting is a software designer, advocate, and creator of open-source search technology. He founded two technology projects, Lucene and Nutch, with Mike Cafarella. Both projects are now managed through the Apache Software Foundation. Cutting and Cafarella are also the co-founders of Apache Hadoop. [1]

Education and early career

Cutting graduated from Stanford University in 1985 with a bachelor's degree. [2] [3]

Prior to developing Lucene, Cutting held search technology positions at Xerox PARC where he worked on the Scatter/Gather algorithm [4] [5] and on computational stylistics. [6] He also worked at Excite, where he was one of the chief designers of the search engine, and Apple Inc., where he was the primary author of the V-Twin text search framework. [7]

Open source projects

Lucene, a search indexer, and Nutch, a spider or crawler, are the two key components of an open-source general search platform that first crawls the Web for content, and then structures it into a searchable index. Cutting's leadership of these two projects extended the concepts and capabilities of general open-source software projects such as Linux and MySQL into the vertical domain of search. [8] In a 2017 article, Cutting was quoted as saying, "Open source is a requirement for business." [9]
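
The indexing half of such a platform can be sketched in a few lines against the Lucene library. The example below is illustrative rather than definitive: the field names ("title", "body") and sample text are invented for the example, and class names such as ByteBuffersDirectory reflect recent Lucene releases and have varied across versions.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory index = new ByteBuffersDirectory(); // in-memory index for the demo

        // Index one document; in a full platform, Nutch would supply crawled page content here.
        try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("title", "Apache Lucene", Field.Store.YES));
            doc.add(new TextField("body",
                    "Lucene is a search engine library written in Java.", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Query the index and print the titles of matching documents.
        try (DirectoryReader reader = DirectoryReader.open(index)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("body", analyzer).parse("search engine");
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("title"));
            }
        }
    }
}
```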

Use of MapReduce paradigm

In December 2004, Google Research published a paper on the MapReduce algorithm, which allows very large-scale computations to be trivially parallelized across large clusters of servers. Cutting and Mike Cafarella, realizing the importance of this paper to extending Lucene into the realm of extremely large search problems, created the open-source Hadoop framework. This framework allows applications based on the MapReduce paradigm to be run on large clusters of commodity hardware. Cutting was an employee of Yahoo!, where he led the Hadoop project full-time; he later went on to work for Cloudera. [10]
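
The canonical illustration of the MapReduce paradigm is a word count, in which the map function emits the pair (word, 1) for each word it sees and the reduce function sums the counts for each word; the framework handles splitting the input, shuffling intermediate pairs to reducers, and re-running failed tasks. The sketch below follows the well-known word-count example from the Hadoop MapReduce tutorial.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: runs in parallel over input splits spread across the cluster.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE); // emit (word, 1)
            }
        }
    }

    // Reduce phase: the framework groups intermediate values by key before this runs.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```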

Open source foundations and awards

In July 2009, Cutting was elected to the board of directors of the Apache Software Foundation, and in September 2010, he was elected its chairman. [11]

In 2015, Cutting was awarded the O'Reilly Open Source Award. [12]

Related Research Articles

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a standard foundation for production search applications.

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. Hadoop was originally designed for computer clusters built from commodity hardware, which is still the common use. It has since also found use on clusters of higher-end hardware. All the modules in Hadoop are designed with a fundamental assumption that hardware failures are common occurrences and should be automatically handled by the framework.

Solr is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features and rich document handling. Providing distributed search and index replication, Solr is designed for scalability and fault tolerance. Solr is widely used for enterprise search and analytics use cases and has an active development community and regular releases.
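
Since Solr answers queries over HTTP, several of the features above can be exercised from any HTTP client. In the sketch below, the core name techproducts and the field names name and cat are assumptions made for the example, and Solr is assumed to be running locally on its default port, 8983.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrQuerySketch {
    public static void main(String[] args) throws Exception {
        // Solr's /select handler answers search requests; this query asks for
        // full-text matches on "memory" with hit highlighting and faceting.
        String url = "http://localhost:8983/solr/techproducts/select"
                   + "?q=memory&hl=true&hl.fl=name&facet=true&facet.field=cat";

        HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        System.out.println(response.body()); // JSON results with highlights and facet counts
    }
}
```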

HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS or Alluxio, providing Bigtable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.
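
A minimal sketch of writing and reading a cell through the HBase Java client API is shown below; the table name pages and column family content are hypothetical, and the calls follow the client interface used since HBase 1.x.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("pages"))) { // hypothetical table

            // Write one cell: rows are sparse, so only columns actually stored cost space.
            Put put = new Put(Bytes.toBytes("com.example/index.html"));
            put.addColumn(Bytes.toBytes("content"), Bytes.toBytes("title"),
                          Bytes.toBytes("Example Domain"));
            table.put(put);

            // Read the cell back by row key.
            Result result = table.get(new Get(Bytes.toBytes("com.example/index.html")));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("content"), Bytes.toBytes("title"))));
        }
    }
}
```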

Many-task computing (MTC) in computational science is an approach to parallel computing that aims to bridge the gap between two computing paradigms: high-throughput computing (HTC) and high-performance computing (HPC).

Sector/Sphere is an open source software suite for high-performance distributed data storage and processing. It can be broadly compared to Google's GFS and MapReduce technology. Sector is a distributed file system targeting data storage over a large number of commodity computers. Sphere is the programming architecture framework that supports in-storage parallel data processing for data stored in Sector. Sector/Sphere operates in a wide area network (WAN) setting.

Pentaho is business intelligence (BI) software that provides data integration, OLAP services, reporting, information dashboards, data mining and extract, transform, load (ETL) capabilities. Its headquarters are in Orlando, Florida. Pentaho was acquired by Hitachi Data Systems in 2015 and in 2017 became part of Hitachi Vantara.

Cloudera, Inc. is an American data lake software company.

Apache Hama is a distributed computing framework based on bulk synchronous parallel computing techniques for massive scientific computations, e.g., matrix, graph and network algorithms. Originally a sub-project of Hadoop, it became an Apache Software Foundation top-level project in 2012. It was created by Edward J. Yoon, who named it; hama (하마) means hippopotamus in Yoon's native Korean, following the trend of naming Apache projects after animals. Hama was inspired by Google's Pregel large-scale graph computing framework described in 2010. When executing graph algorithms, Hama showed a fifty-fold performance increase relative to Hadoop.

David Ron Karger is an American computer scientist who is professor and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology.

Hortonworks was a data software company based in Santa Clara, California that developed and supported open-source software designed to manage big data and associated processing.

Sqoop is a command-line interface application for transferring data between relational databases and Hadoop.

Mike Cafarella is a computer scientist specializing in database management systems. He is a principal research scientist in computer science at the MIT Computer Science and Artificial Intelligence Laboratory. Before coming to MIT, he was a professor of Computer Science and Engineering at the University of Michigan from 2009 to 2020. Along with Doug Cutting, he is one of the original co-founders of the Hadoop and Nutch open-source projects. Cafarella was born in New York City but moved to Westwood, Massachusetts early in his childhood. After completing his bachelor's degree at Brown University, he earned a Ph.D. specializing in database management systems at the University of Washington under Dan Suciu and Oren Etzioni. He was also involved in several notable start-ups, including Tellme Networks, and was a co-founder of Lattice Data, which was acquired by Apple in 2017.

Apache Impala is an open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop. Impala has been described as the open-source equivalent of Google F1, which inspired its development in 2012.

Apache Spark is an open-source unified analytics engine for large-scale data processing. Spark provides an interface for programming clusters with implicit data parallelism and fault tolerance. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
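
A short Java sketch of the model Spark exposes: transformations on a Dataset are parallelized across the cluster implicitly, and lost partitions can be recomputed from recorded lineage, giving fault tolerance. The input path input.txt is an assumption made for the example, and local[*] simply runs the job in-process for illustration.

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.SparkSession;

public class SparkSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("spark-sketch")
                .master("local[*]") // local mode; on a cluster, the launcher sets the master
                .getOrCreate();

        // A Dataset is partitioned across the cluster; the filter below is
        // applied to each partition in parallel without explicit threading code.
        Dataset<String> lines = spark.read().textFile("input.txt"); // hypothetical input path
        long count = lines.filter(
                (FilterFunction<String>) l -> l.contains("hadoop")).count();

        System.out.println(count);
        spark.stop();
    }
}
```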

Apache Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem. It is similar to RCFile and ORC, the other columnar-storage file formats in Hadoop, and is compatible with most of the data processing frameworks around Hadoop. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

Apache Kudu is a free and open source column-oriented data store of the Apache Hadoop ecosystem. It is compatible with most of the data processing frameworks in the Hadoop environment. It provides completeness to Hadoop's storage layer to enable fast analytics on fast data.

Apache ORC is a free and open-source column-oriented data storage format. It is similar to the other columnar-storage file formats available in the Hadoop ecosystem such as RCFile and Parquet. It is used by most of the major data processing frameworks, including Apache Spark, Apache Hive, Apache Flink and Apache Hadoop.

References

  1. Cutting, Doug; Cafarella, Mike; Lorica, Ben (2016-03-31). "The next 10 years of Apache Hadoop". O'Reilly Media. Retrieved 2018-04-16.
  2. "Doug Cutting—The Father of Search - Code World". www.codetd.com. Retrieved 18 May 2022.
  3. "Cloudera management team". Cloudera. Retrieved 2016-08-17.
  4. Cutting, Douglass R., David R. Karger, Jan O. Pedersen, and John W. Tukey. "Scatter/gather: A cluster-based approach to browsing large document collections." SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval. (Reprinted in ACM SIGIR Forum, vol. 51, no. 2, pp. 148-159. ACM, 2017.)
  5. Pedersen, Jan O., David Karger, Douglass R. Cutting, and John W. Tukey. "Scatter-gather: a cluster-based method and apparatus for browsing large document collections." U.S. Patent 5,442,778, issued August 15, 1995.
  6. Karlgren, Jussi; Cutting, Douglass. "Recognizing text genres with simple metrics using discriminant analysis". Proceedings of the 15th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1994.
  7. "The Lucene search engine: Powerful, flexible, and free". JavaWorld (published 2000-09-15). 15 September 2000. Retrieved 2017-01-25. Cutting is the primary author of the V-Twin search engine (part of Apple's Copland operating system effort)…
  8. "Wikipedia: Powered by Lucene". Lucene. Retrieved September 5, 2007.
  9. "Doug Cutting, 'father' of Hadoop, talks about big data tech evolution". ComputerWeekly.com. Retrieved June 26, 2018.
  10. Handy, Alex (10 August 2009). "Hadoop creator goes to Cloudera". Software Development Times. Archived from the original on 13 March 2012. Retrieved 2011-03-22.
  11. Sally (2010-07-15). "The Apache Software Foundation Announces New Board Members". The Apache Software Foundation Blog. Retrieved 2023-05-02.
  12. "O'Reilly Open Source Awards - OSCON 2015". YouTube. O'Reilly. Archived from the original on 2021-12-14. Retrieved 27 July 2015.
