Kepler scientific workflow system

Stable release: 2.5 / 2015-10-28 [1]
Written in: Java
Operating system: Linux, Mac OS X, Windows
Type: Scientific workflow system
License: BSD License
Website: kepler-project.org

Kepler is a free software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows. [2] [3] [4] Kepler's facilities provide process and data monitoring, provenance information, and high-speed data movement. Workflows in general, and scientific workflows in particular, are directed graphs in which the nodes represent discrete computational components and the edges represent paths along which data and results flow between components. [5] In Kepler, the nodes are called 'actors' and the edges are called 'channels'. Kepler includes a graphical user interface for composing workflows in a desktop environment, a runtime engine for executing workflows both within the GUI and independently from a command line, and a distributed computing option that allows workflow tasks to be spread across the compute nodes of a cluster or computing grid. The Kepler system principally targets the use of a workflow metaphor for organizing computational tasks that are directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.
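The directed-graph structure described above can be made concrete with a small sketch. The Actor, Channel, and Workflow classes below are hypothetical illustrations of nodes, edges, and the data tokens that flow along them; they are not Kepler's actual API.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Function;

// Hypothetical sketch of a workflow graph: actors are nodes, channels are edges.
// This is NOT Kepler's API; it only illustrates the structure described above.
public class Workflow {

    // An actor consumes one input token and produces one output token.
    static class Actor {
        final String name;
        final Function<Object, Object> step;
        Actor(String name, Function<Object, Object> step) {
            this.name = name;
            this.step = step;
        }
    }

    // A channel is a directed edge carrying data between two actors.
    static class Channel {
        final Actor from;
        final Actor to;
        final Queue<Object> buffer = new ArrayDeque<>();
        Channel(Actor from, Actor to) {
            this.from = from;
            this.to = to;
        }
    }

    public static void main(String[] args) {
        Actor read   = new Actor("ReadValue", x -> 42);                 // data source
        Actor square = new Actor("Square",    x -> (int) x * (int) x);  // computation
        Actor print  = new Actor("Print",     x -> { System.out.println(x); return x; });

        List<Channel> channels = new ArrayList<>();
        channels.add(new Channel(read, square));
        channels.add(new Channel(square, print));

        // Naive execution: push each actor's output along its outgoing channel.
        Object token = read.step.apply(null);
        for (Channel c : channels) {
            c.buffer.add(token);
            token = c.to.step.apply(c.buffer.poll());
        }
    }
}
```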

Scientific workflow

A scientific workflow is the process of combining data and processes into a configurable, structured set of steps that implement semi-automated computational solutions to a scientific problem. Scientific workflow systems often provide graphical user interfaces that combine different technologies with efficient methods for using them, and thus increase scientists' efficiency.

Access to scientific data

Kepler provides direct access to scientific data that has been archived in many of the commonly used data archives. For example, Kepler provides access to data stored in the Knowledge Network for Biocomplexity (KNB) Metacat server [6] and described using Ecological Metadata Language. Additional data sources that are supported include data accessible using the DiGIR protocol, the OPeNDAP protocol, GridFTP, JDBC, SRB, and others.
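Of the access methods listed above, JDBC is the simplest to show in isolation. The following is a generic, minimal JDBC sketch rather than Kepler's database actor; the connection URL, credentials, and query are placeholders for whatever relational source a workflow might read.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Generic JDBC sketch (placeholder URL and query), illustrating how a
// workflow step might read tabular data from a relational source.
public class JdbcSourceExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://example.org:5432/ecology"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rows = stmt.executeQuery("SELECT site, species, count FROM observations")) {
            while (rows.next()) {
                // Each row becomes one data token flowing into downstream actors.
                System.out.printf("%s %s %d%n",
                        rows.getString("site"),
                        rows.getString("species"),
                        rows.getInt("count"));
            }
        }
    }
}
```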

Models of computation

Kepler differs from many of the other bioinformatics workflow management systems in that it separates the structure of the workflow model from its model of computation, such that different models for the computation of the workflow can be bound to a given workflow graph. Kepler inherits several common models of computation from the Ptolemy system, including Synchronous Data Flow (SDF), Continuous Time (CT), Process Network (PN), and Dynamic Data Flow (DDF), among others.
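The separation between the workflow graph and its model of computation can be sketched as a 'director' that decides how the same set of steps is scheduled. The Director interface and the two toy directors below are hypothetical illustrations, not Ptolemy or Kepler classes: one fires the steps in a fixed sequential schedule (in the spirit of SDF), the other runs each step in its own thread connected by blocking queues (in the spirit of PN).

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.UnaryOperator;

// Hypothetical sketch: the same pipeline of steps can be run under different
// "directors" (models of computation). These classes are illustrative only.
public class DirectorExample {

    interface Director {
        int run(int input, List<UnaryOperator<Integer>> steps) throws Exception;
    }

    // SDF-like: a precomputed, sequential schedule that fires each step once.
    static class SequentialDirector implements Director {
        public int run(int input, List<UnaryOperator<Integer>> steps) {
            int token = input;
            for (UnaryOperator<Integer> step : steps) token = step.apply(token);
            return token;
        }
    }

    // PN-like: each step runs in its own thread, connected by blocking queues.
    static class ThreadedDirector implements Director {
        public int run(int input, List<UnaryOperator<Integer>> steps) throws Exception {
            BlockingQueue<Integer> first = new ArrayBlockingQueue<>(1);
            BlockingQueue<Integer> current = first;
            for (UnaryOperator<Integer> step : steps) {
                BlockingQueue<Integer> in = current;
                BlockingQueue<Integer> out = new ArrayBlockingQueue<>(1);
                new Thread(() -> {
                    try {
                        out.put(step.apply(in.take()));
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }).start();
                current = out;
            }
            first.put(input);       // inject the input token
            return current.take();  // wait for the final result
        }
    }

    public static void main(String[] args) throws Exception {
        List<UnaryOperator<Integer>> steps =
                List.of(x -> x + 1, x -> x * x, x -> x - 3);

        // The same workflow (list of steps) under two models of computation.
        System.out.println(new SequentialDirector().run(5, steps)); // 33
        System.out.println(new ThreadedDirector().run(5, steps));   // 33
    }
}
```

Both directors produce the same result for this simple pipeline; the choice of director matters when steps have differing data rates, must run concurrently, or require continuous-time or dynamic dataflow semantics.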

Hierarchical workflows

Kepler supports hierarchy in workflows, which allows complex tasks to be composed of simpler components. This feature allows workflow authors to build re-usable, modular components that can be saved for use across many different workflows.
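A minimal sketch of this idea, using hypothetical names rather than Kepler's classes, is a composite step that wraps a sub-workflow and can be used anywhere an ordinary step is used.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of hierarchy: a composite step wraps a sub-workflow
// and can be used anywhere an ordinary step is used.
public class CompositeExample {

    // A composite step: a sequence of inner steps exposed as one step.
    static class CompositeStep implements UnaryOperator<Double> {
        private final List<UnaryOperator<Double>> inner;
        CompositeStep(List<UnaryOperator<Double>> inner) { this.inner = inner; }
        public Double apply(Double x) {
            for (UnaryOperator<Double> step : inner) x = step.apply(x);
            return x;
        }
    }

    public static void main(String[] args) {
        // A reusable "normalize" component built from two simpler steps.
        CompositeStep normalize = new CompositeStep(
                List.of(x -> x - 10.0, x -> x / 2.0));

        // The composite is used like any other step in an outer workflow.
        List<UnaryOperator<Double>> outer = List.of(normalize, x -> x * x);
        double value = 14.0;
        for (UnaryOperator<Double> step : outer) value = step.apply(value);
        System.out.println(value); // ((14 - 10) / 2)^2 = 4.0
    }
}
```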

Workflow semantics

Kepler provides a model for the semantic annotation of workflow components using terms drawn from an ontology. These annotations support many advanced features, including improved search capabilities, automated workflow validation, and improved workflow editing. [7]

Sharing workflows

Kepler components can be shared by exporting the workflow or component into a Kepler Archive (KAR) file, which is an extension of the JAR file format from Java. Once a KAR file is created, it can be emailed to colleagues, shared on web sites, or uploaded to the Kepler Component Repository. The Component Repository is a centralized system for sharing Kepler workflows that is accessible via both a web portal and a web service interface. Users can search for and use components from the repository directly within the Kepler workflow composition GUI.
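Because a KAR file uses the JAR format, a KAR-like archive can be produced with Java's standard jar classes. The sketch below writes a manifest and a single entry; the entry name, manifest attributes, and file contents are placeholders, not the actual KAR layout.

```java
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.JarEntry;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

// Writes a JAR-format archive with one entry, illustrating the packaging
// mechanism behind KAR files. Entry names and attributes are placeholders.
public class KarLikeArchive {
    public static void main(String[] args) throws Exception {
        Manifest manifest = new Manifest();
        manifest.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");

        try (JarOutputStream kar =
                     new JarOutputStream(new FileOutputStream("component.kar"), manifest)) {
            kar.putNextEntry(new JarEntry("component.xml")); // placeholder entry name
            kar.write("<workflow-component/>".getBytes(StandardCharsets.UTF_8));
            kar.closeEntry();
        }
    }
}
```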

Provenance

Provenance is a critical concept in scientific workflows, since it allows scientists to understand the origin of their results, to repeat their experiments, and to validate the processes that were used to derive data products. [8] For a workflow to be reproducible, provenance information must be recorded that indicates where the data originated, how it was altered, and which components and parameter settings were used. Recording this information allows other scientists to re-run the experiment and confirm the results. [9]
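A minimal provenance record for a single step would capture which component ran, with which parameter settings, on which input, and what it produced. The record structure below is a hypothetical sketch of that idea, not Kepler's provenance schema.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a provenance record for a single workflow step:
// which component ran, with which parameters, on which input, producing what.
public class ProvenanceRecord {
    final String component;
    final Map<String, String> parameters;
    final String inputId;
    final String outputId;
    final Instant executedAt = Instant.now();

    ProvenanceRecord(String component, Map<String, String> parameters,
                     String inputId, String outputId) {
        this.component = component;
        this.parameters = parameters;
        this.inputId = inputId;
        this.outputId = outputId;
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("binWidth", "0.5");
        ProvenanceRecord rec = new ProvenanceRecord(
                "Histogram", params, "dataset:knb/123", "derived:histogram/123");
        System.out.println(rec.component + " " + rec.parameters
                + " " + rec.inputId + " -> " + rec.outputId + " at " + rec.executedAt);
    }
}
```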

Little support exists in current systems to allow end-users to query provenance information in scientifically meaningful ways, in particular when advanced workflow execution models go beyond simple DAGs (as in process networks). [10]

Kepler history

The Kepler Project was created in 2002 by members of the Science Environment for Ecological Knowledge (SEEK) project [4] and the Scientific Data Management (SDM) project. The project was founded by researchers at the National Center for Ecological Analysis and Synthesis (NCEAS) at the University of California, Santa Barbara and the San Diego Supercomputer Center at the University of California, San Diego. Kepler extends Ptolemy II, a software system for modeling, simulation, and design of concurrent, real-time, embedded systems developed at UC Berkeley. Collaboration on Kepler quickly grew as members of various scientific disciplines realized the benefits of scientific workflows for analysis and modeling and began contributing to the system. As of 2008, Kepler collaborators come from many science disciplines, including ecology, molecular biology, genetics, physics, chemistry, conservation science, oceanography, hydrology, library science, computer science, and others. In this way Kepler functions as a workflow orchestration engine, simplifying analytical work by composing it from reusable actors.

See also

Related Research Articles

Provenance

Provenance is the chronology of the ownership, custody or location of an historical object. The term was originally mostly used in relation to works of art, but is now used in similar senses in a wide range of fields, including archaeology, paleontology, archival science, economy, computing, and scientific inquiry in general.

UNICORE (UNiform Interface to COmputing REsources) is a grid computing technology for resources such as supercomputers or cluster systems and information stored in databases. UNICORE was developed in two projects funded by the German ministry for education and research (BMBF). In European-funded projects, UNICORE evolved into a middleware system used at several supercomputer centers. UNICORE has served as a basis for other research projects. The UNICORE technology is open source under the BSD licence and available at SourceForge.

The Earth System Modeling Framework (ESMF) is open-source software for building climate, numerical weather prediction, data assimilation, and other Earth science software applications. These applications are computationally demanding and usually run on supercomputers. The ESMF is considered a technical layer, integrated into a sophisticated common modeling infrastructure for interoperability. Other aspects of interoperability and shared infrastructure include: common experimental protocols, common analytic methods, common documentation standards for data and data provenance, shared workflow, and shared model components.

Ecological Metadata Language (EML) is a metadata standard developed by and for the ecology discipline. It is based on prior work done by the Ecological Society of America and others, including the Knowledge Network for Biocomplexity. EML is a set of XML schema documents that allow for the structural expression of metadata. It was developed specifically to allow researchers to document a typical data set in the ecological sciences.

The Ptolemy Project is an ongoing project aimed at modeling, simulating, and designing concurrent, real-time, embedded systems. The focus of the Ptolemy Project is on assembling concurrent components. The principal product of the project is the Ptolemy II model-based design and simulation tool. The Ptolemy Project is conducted in the Industrial Cyber-Physical Systems Center (iCyPhy) in the Department of Electrical Engineering and Computer Sciences of the University of California at Berkeley, and is directed by Prof. Edward A. Lee.

Apache Taverna

Apache Taverna was an open source software tool for designing and executing workflows, initially created by the myGrid project under the name Taverna Workbench, then a project under the Apache incubator. Taverna allowed users to integrate many different software components, including WSDL SOAP or REST web services, such as those provided by the National Center for Biotechnology Information, the European Bioinformatics Institute, the DNA Databank of Japan (DDBJ), SoapLab, BioMOBY and EMBOSS. The set of available services was not fixed, and users could import new service descriptions into the Taverna Workbench.

CNGrid is the Chinese national high performance computing network supported by the 863 Program.

VisTrails

VisTrails is a scientific workflow management system developed at the Scientific Computing and Imaging Institute at the University of Utah that provides support for data exploration and visualization. It is written in Python and employs Qt via PyQt bindings. The system is open source, released under the GPL v2 license. The pre-compiled versions for Windows, Mac OS X, and Linux come with an installer and several packages, including VTK, matplotlib, and ImageMagick. VisTrails also supports user-defined packages.

LONI Pipeline

The LONI Pipeline is a free distributed system for designing, executing, monitoring and sharing scientific workflows on grid computing architectures. Pipeline allows users to connect and run any number of different software tools, and conveniently visualize and download the results.

gUSE

The Grid and Cloud User Support Environment (gUSE), also known as WS-PGRADE/gUSE, is an open source science gateway framework that enables users to access grid and cloud infrastructures. gUSE is developed by the Laboratory of Parallel and Distributed Systems (LPDS) at the Institute for Computer Science and Control (SZTAKI) of the Hungarian Academy of Sciences.

Discovery Net is one of the earliest examples of a scientific workflow system allowing users to coordinate the execution of remote services based on Web service and Grid Services standards. The system was designed and implemented at Imperial College London as part of the Discovery Net pilot project funded by the UK e-Science Programme. Many of the concepts pioneered by Discovery Net have been later incorporated into a variety of other scientific workflow systems.

A scientific workflow system is a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application.

DataONE

DataONE is a network of interoperable data repositories facilitating data sharing, data discovery, and open science. Originally supported by $21.2 million in funding from the US National Science Foundation as one of the initial DataNet programs in 2009, DataONE received an additional $15 million when funding was renewed in 2014 through 2020. DataONE supports the preservation, access, use, and reuse of multi-discipline scientific data through the construction of primary cyberinfrastructure and an education and outreach program. DataONE provides scientific data archiving for ecological and environmental data produced by scientists. DataONE's goal is to preserve and provide access to multi-scale, multi-discipline, and multi-national data. Users include scientists, ecosystem managers, policy makers, students, educators, librarians, and the public.

The SHIWA project within grid computing was a project led by the LPDS of MTA Computer and Automation Research Institute. The project coordinator was Prof. Dr. Peter Kacsuk. It started on 1 July 2010 and lasted two years. SHIWA was supported by a grant from the European Commission's FP7 INFRASTRUCTURES-2010-2 call under grant agreement n°261585.

iPlant Collaborative

The iPlant Collaborative, renamed Cyverse in 2017, is a virtual organization created by a cooperative agreement funded by the US National Science Foundation (NSF) to create cyberinfrastructure for the plant sciences (botany). The NSF compared cyberinfrastructure to physical infrastructure, "... the distributed computer, information and communication technologies combined with the personnel and integrating components that provide a long-term platform to empower the modern scientific research endeavor". In September 2013 it was announced that the National Science Foundation had renewed iPlant's funding for a second 5-year term with an expansion of scope to all non-human life science research.

Arthur Harry Maria ter Hofstede is a Dutch computer scientist, and professor of information systems at the Queensland University of Technology in Australia, and professor at the Eindhoven University of Technology, known for his work in workflow patterns, YAWL, and business process management.

The High-performance Integrated Virtual Environment (HIVE) is a distributed computing environment used for healthcare-IT and biological research, including analysis of Next Generation Sequencing (NGS) data, preclinical, clinical and post-market data, adverse events, metagenomic data, etc. It is currently supported and continuously developed by the US Food and Drug Administration, George Washington University, DNA-HIVE, WHISE-Global, and Embleema. HIVE operates fully functionally within the US FDA, supporting a wide variety (60+) of regulatory research and regulatory review projects as well as MDEpiNet medical device postmarket registries. Academic deployments of HIVE are used for research activities and publications in NGS analytics, cancer research, microbiome research, and in educational programs for students at GWU. Commercial enterprises use HIVE for oncology, microbiology, vaccine manufacturing, gene editing, healthcare-IT, harmonization of real-world data, preclinical research, and clinical studies.

Cuneiform (programming language)

Cuneiform is an open-source workflow language for large-scale scientific data analysis. It is a statically typed functional programming language promoting parallel computing. It features a versatile foreign function interface allowing users to integrate software from many external programming languages. At the organizational level, Cuneiform provides facilities like conditional branching and general recursion, making it Turing-complete. In this respect, Cuneiform attempts to close the gap between scientific workflow systems like Taverna, KNIME, or Galaxy and large-scale data analysis programming models like MapReduce or Pig Latin while offering the generality of a functional programming language.

The BioCompute Object (BCO) project is a community-driven initiative to build a framework for standardizing and sharing computations and analyses generated from high-throughput sequencing (HTS). The project has since been standardized as IEEE 2791-2020, and the project files are maintained in an open source repository. The July 22, 2020 edition of the Federal Register announced that the FDA now supports the use of BioCompute in regulatory submissions, and the inclusion of the standard in the Data Standards Catalog for the submission of HTS data in NDAs, ANDAs, BLAs, and INDs to CBER, CDER, and CFSAN.

Originally started as a collaborative contract between the George Washington University and the Food and Drug Administration, the project has grown to include over 20 universities, biotechnology companies, public-private partnerships and pharmaceutical companies including Seven Bridges and Harvard Medical School. The BCO aims to ease the exchange of HTS workflows between various organizations, such as the FDA, pharmaceutical companies, contract research organizations, bioinformatic platform providers, and academic researchers. Due to the sensitive nature of regulatory filings, few direct references to material can be published. However, the project is currently funded to train FDA Reviewers and administrators to read and interpret BCOs, and currently has 4 publications either submitted or nearly submitted.

Ilkay Altintas (born 1977)

Ilkay Altintas is a Turkish-American data and computer scientist, and researcher in the domain of supercomputing and high-performance computing applications. Since 2015, Altintas has served as chief data science officer of the San Diego Supercomputer Center (SDSC), at the University of California, San Diego (UCSD), where she has also served as founder and director of the Workflows for Data Science Center of Excellence (WorDS) since 2014, as well as founder and director of the WIFIRE lab. Altintas is also the co-initiator of the Kepler scientific workflow system, an open-source platform that endows research scientists with the ability to readily collaborate, share, and design scientific workflows.

References

  1. "Archived copy". Archived from the original on 2015-11-20. Retrieved 2015-11-19.{{cite web}}: CS1 maint: archived copy as title (link)
  2. Ludäscher B., Altintas I., Berkley C., Higgins D., Jaeger-Frank E., Jones M., Lee E., Tao J., Zhao Y. 2006. Scientific Workflow Management and the Kepler System. Special Issue: Workflow in Grid Systems. Concurrency and Computation: Practice & Experience 18(10): 1039-1065.
  3. Altintas I, Berkley C, Jaeger E, Jones M, Ludäscher B, Mock S. 2004. Kepler: An Extensible System for Design and Execution of Scientific Workflows. Proceedings of the Future of Grid Data Environments, Global Grid Forum 10.
  4. Michener, William K., James H. Beach, Matthew B. Jones, Bertram Ludaescher, Deana D. Pennington, Ricardo S. Pereira, Arcot Rajasekar, and Mark Schildhauer. 2007. "A Knowledge Environment for the Biodiversity and Ecological Sciences", Journal of Intelligent Information Systems, 29(1): 111-126. doi:10.1007/s10844-006-0034-8
  5. Taylor, I.J.; Deelman, E.; Gannon, D.B.; Shields, M. (Eds.), "Workflows for e-Science: Scientific Workflows for Grids", 530 p., Springer. ISBN 978-1-84628-519-6.
  6. Jones, Matthew B., C. Berkley, J. Bojilova, M. Schildhauer. 2001. Managing Scientific Metadata. IEEE Internet Computing 5 (5): 59-68.
  7. Berkley, Chad, Shawn Bowers, Matthew B. Jones, Bertram Ludaescher, Mark Schildhauer, Jing Tao. 2005. Incorporating Semantics in Scientific Workflow Authoring. 17th International Conference on Scientific and Statistical Database Management. IEEE Computer Society.
  8. "WebHome < Challenge < TWiki". Archived from the original on 2008-07-06. Retrieved 2009-04-06.
  9. Barker, A.; van Hemert, J (2008). "Scientific Workflow: A Survey and Research Directions". Parallel Processing and Applied Mathematics. PPAM 2007. Lecture Notes in Computer Science. 4967. doi:10.1007/978-3-540-68111-3_78.
  10. Shawn Bowers, Timothy McPhillips, Bertram Ludascher, Shirley Cohen, Susan B. Davidson 2006. A Model for User-Oriented Data Provenance in Pipelined Scientific Workflows.