Apache OODT

Developer(s) Apache Software Foundation
Stable release 1.9.1 / October 3, 2021 [1]
Repository OODT Repository
Written in Java
Operating system Cross-platform
Type Search and index API
License Apache License 2.0
Website oodt.apache.org

The Apache Object Oriented Data Technology (OODT) is an open-source data management framework managed by the Apache Software Foundation. OODT was originally developed at NASA's Jet Propulsion Laboratory to support the capture, processing, and sharing of data for NASA's scientific archives.

History

The project started as an internal NASA Jet Propulsion Laboratory effort conceived by Daniel J. Crichton, Sean Kelly and Steve Hughes. Its early focus was information integration and search using XML, as described in Crichton et al.'s paper at the 2000 CODATA meeting. [2]

After deployments to the Planetary Data System and to the National Cancer Institute's Early Detection Research Network (EDRN), OODT moved in 2005 into the era of large-scale data processing and management via NASA's Orbiting Carbon Observatory (OCO) project. OODT's role on OCO was to usher in a data management and processing framework that, instead of tens of jobs per day and tens of gigabytes of data, would handle 10,000 jobs per day and hundreds of terabytes of data. Dr. Chris Mattmann at NASA JPL led a team of three to four developers between 2005 and 2009 that completely re-engineered OODT to meet these new requirements.

Influenced by the emerging Apache Nutch and Hadoop efforts, in which Mattmann participated, OODT was overhauled to make it more amenable to Apache Software Foundation-style projects. Mattmann also had a close relationship with Dr. Justin Erenkrantz, the Apache Software Foundation president at the time, and from those discussions the idea of bringing OODT to the foundation emerged. In 2009, Mattmann and his team received approval from NASA and JPL to bring OODT to Apache, making it the first NASA project to be stewarded by the foundation. Seven years later, the project released version 1.0.

Features

OODT focuses on two canonical use cases: big data processing and information integration. Both were described in Mattmann's ICSE 2006 [3] and SMC-IT 2009 [4] papers. It provides three core services.

File Manager

A File Manager is responsible for tracking file locations, their metadata, and for transferring files from a staging area to controlled access storage.

Workflow Manager

A Workflow Manager captures control flow and data flow for complex processes, and allows for reproducibility and the construction of scientific pipelines.
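The control-flow, data-flow, and reproducibility capture described above can be sketched in miniature. This is an illustrative model only; the function and task names below are hypothetical and do not reflect the actual OODT Workflow Manager API.

```python
# Illustrative sketch of what a workflow manager captures: ordered tasks
# (control flow) whose outputs feed later tasks (data flow), plus a
# provenance log for reproducibility. Names here are hypothetical and do
# not reflect the actual OODT Workflow Manager API.

def run_workflow(tasks, data):
    """Run named tasks in order, threading data through each one and
    recording a snapshot of the state after every step."""
    provenance = []
    for name, task in tasks:
        data = task(data)
        provenance.append((name, dict(data)))
    return data, provenance

# A toy two-step scientific pipeline: calibrate readings, then average them.
pipeline = [
    ("calibrate", lambda d: {**d, "values": [v - d["bias"] for v in d["values"]]}),
    ("average", lambda d: {**d, "mean": sum(d["values"]) / len(d["values"])}),
]
result, log = run_workflow(pipeline, {"bias": 1.0, "values": [2.0, 4.0]})
```

Because every intermediate state is recorded, the same pipeline can be re-run and audited step by step, which is the essence of the reproducibility a workflow manager provides.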

Resource Manager

A Resource Manager handles allocation of Workflow Tasks and other jobs to underlying resources: for example, Python jobs go to nodes with Python installed, and jobs that require a large disk or CPU are sent to nodes that fulfill those requirements.
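This kind of requirement-based placement can be sketched as follows. The job and node descriptions are hypothetical data structures for illustration, not the OODT Resource Manager interface.

```python
# Minimal sketch of resource-manager-style job placement: each job declares
# its requirements, each node advertises its capabilities, and the job is
# assigned to the first node satisfying every requirement. The data
# structures are hypothetical, not the OODT Resource Manager API.

def assign(job, nodes):
    """Return the name of the first node that meets the job's software
    and disk requirements, or None if no node qualifies."""
    for name, caps in nodes.items():
        if job["software"] in caps["software"] and caps["disk_gb"] >= job["disk_gb"]:
            return name
    return None

nodes = {
    "node-a": {"software": {"java"}, "disk_gb": 100},
    "node-b": {"software": {"java", "python"}, "disk_gb": 500},
}
python_job = {"software": "python", "disk_gb": 200}
```

Here the Python job skips node-a (no Python) and lands on node-b, which satisfies both the software and disk requirements.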

In addition to the three core services, OODT provides three client-oriented frameworks that build on these services.

File Crawler

A File Crawler automatically extracts metadata, using Apache Tika to identify file types, and ingests the associated information into the File Manager.

Catalog and Archive Crawling Framework

A Push/Pull framework acquires remote files and makes them available to the system.

Catalog and Archive Service Production Generation Executive (CAS-PGE)

A scientific algorithm wrapper (called CAS-PGE, for Catalog and Archive Service Production Generation Executive) encapsulates scientific code and allows it to run independently of its environment, capturing provenance along the way and making the algorithms easy to integrate into a production system.

CAS RESTful Services

A set of RESTful APIs that exposes the capabilities of the File Manager, Workflow Manager, and Resource Manager components.
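As a sketch of how a client might address such services, the snippet below builds a query URL for a product-lookup call. The host, port, path, and parameter name are hypothetical placeholders, not documented OODT routes.

```python
# Build a query URL for a hypothetical product-lookup endpoint of the CAS
# REST services. The base URL, path, and parameter name are illustrative
# placeholders; the real routes are defined by the OODT REST API docs.
from urllib.parse import urlencode, urljoin

def product_query_url(base, product_id):
    """Join a (hypothetical) endpoint path onto the service base URL and
    append the product identifier as a query parameter."""
    return urljoin(base, "product") + "?" + urlencode({"productID": product_id})

url = product_query_url("http://localhost:8080/fmprod/", "1234")
```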

OPSUI Monitor Dashboard

A web application that exposes services from the underlying OODT product, workflow, and resource managing control systems via the JAX-RS specification. It is currently built using Apache Wicket components.

The overall motivation for OODT's re-architecting was described in Mattmann's 2013 Nature paper, "A Vision for Data Science". [5]

OODT is written in Java and, through its REST API, [6] is used from other languages including Python.

Notable uses

OODT has recently been highlighted as contributing to NASA missions including Soil Moisture Active Passive [7] and New Horizons. [8] OODT also helps power the Square Kilometre Array telescope, [9] extending its use from Earth science and planetary science to radio astronomy and other sectors. OODT is also used within bioinformatics and is part of the Knowledgent Big Data Platform. [10]


References

  1. "[ANNOUNCE] Apache OODT 1.9.1 released". Retrieved 27 September 2022.
  2. Crichton, Daniel; Hughes, John; Hyon, Jason; Kelly, Sean (2000). "Science Search and Retrieval using XML". The Second National Conference on Scientific and Technical Data, US National Committee for CODATA, National Research Council.
  3. Mattmann, Chris A.; Crichton, Daniel J.; Medvidovic, Nenad; Hughes, Steve (2006-01-01). "A software architecture-based framework for highly distributed and data intensive scientific applications". Proceedings of the 28th International Conference on Software Engineering. ICSE '06. New York, NY, USA: ACM. pp. 721–730. doi:10.1145/1134285.1134400. ISBN 978-1595933751. S2CID 7699385.
  4. Mattmann, C. A.; Freeborn, D.; Crichton, D.; Foster, B.; Hart, A.; Woollard, D.; Hardman, S.; Ramirez, P.; Kelly, S. (2009-07-01). "A Reusable Process Control System Framework for the Orbiting Carbon Observatory and NPP Sounder PEATE Missions". 2009 Third IEEE International Conference on Space Mission Challenges for Information Technology. pp. 165–172. doi:10.1109/SMC-IT.2009.27. ISBN 978-0-7695-3637-8. S2CID 705732.
  5. Mattmann, Chris A. (2013-01-24). "Computing: A vision for data science". Nature. 493 (7433): 473–475. Bibcode:2013Natur.493..473M. doi:10.1038/493473a. ISSN 0028-0836. PMID 23344342.
  6. "Apache OODT APIs - OODT - Apache Software Foundation". cwiki.apache.org. Retrieved 2016-06-27.
  7. "Apache - The ASF on Twitter". Retrieved 2016-06-27.
  8. "Apache - The ASF on Twitter". Retrieved 2016-06-27.
  9. "Apache - The ASF on Twitter". Retrieved 2016-06-27.
  10. "Q&A on the Advantages of OODT - Object Oriented Data Technology - Knowledgent Perspectives". 2014-07-30. Archived from the original on 2015-04-14. Retrieved 2016-06-27.