Scientific workflow system

A scientific workflow system is a specialized form of a workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or workflow, in a scientific application. [1]

Applications

Scientists at different locations can collaborate on large-scale scientific experiments and knowledge discovery applications using distributed systems of computing resources, data sets, and devices. Scientific workflow systems play an important role in enabling this vision.

More specialized scientific workflow systems provide a visual programming front end that lets users construct their applications as a visual graph by connecting nodes together; tools have also been developed to build such applications in a platform-independent manner. [2] Each directed edge in the graph of a workflow typically represents a connection from the output of one application to the input of the next. A sequence of such edges may be called a pipeline.
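As a minimal sketch of this graph view, the following Python models a workflow as named nodes joined by directed edges, where each edge feeds the output of one step into the input of the next. The step functions (`load`, `clean`, `summarize`) and the `run_pipeline` helper are hypothetical names chosen for illustration and do not belong to any particular workflow system.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical steps standing in for real scientific applications.
def load(_: Any) -> List[float]:
    return [1.0, 2.0, 3.0, 4.0]

def clean(values: List[float]) -> List[float]:
    # Drop values outside an assumed valid range.
    return [v for v in values if v < 4.0]

def summarize(values: List[float]) -> float:
    return sum(values) / len(values)

# Nodes are named steps; each directed edge (src, dst) connects
# the output of src to the input of dst.
nodes: Dict[str, Callable[[Any], Any]] = {
    "load": load,
    "clean": clean,
    "summarize": summarize,
}
edges: List[Tuple[str, str]] = [("load", "clean"), ("clean", "summarize")]

def run_pipeline(nodes, edges, start, initial=None):
    """Follow a chain of edges from `start`, passing each output onward."""
    # Valid for a linear pipeline only: one outgoing edge per node.
    successor = dict(edges)
    current, value = start, initial
    trace = []
    while current is not None:
        value = nodes[current](value)
        trace.append(current)
        current = successor.get(current)
    return value, trace

result, order = run_pipeline(nodes, edges, "load")
print(order, result)
```

The sketch handles only a single chain of edges; a workflow with branching or merging edges would need a scheduler that waits for all of a node's inputs before running it.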

A bioinformatics workflow management system is a specialized scientific workflow system focused on bioinformatics.

Scientific workflows

The simplest computerized scientific workflows are scripts that call in data, programs, and other inputs and produce outputs that might include visualizations and analytical results. These may be implemented in programs such as R or MATLAB, using a scripting language such as Python with a command-line interface, or more recently using open-source web applications such as Jupyter Notebook.
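A script-style workflow of this kind might look like the following sketch, which reads data, computes summary statistics, and produces a small text report. The inline CSV, the column names, and the function names are assumptions made for illustration; a real script would typically read from files and might produce a plot instead.

```python
import csv
import io
import statistics

# Inline CSV stands in for an input data file (hypothetical measurements).
RAW_DATA = """sample,temperature
a,20.5
b,21.0
c,19.5
"""

def read_measurements(text):
    """Step 1: call in the data."""
    reader = csv.DictReader(io.StringIO(text))
    return [float(row["temperature"]) for row in reader]

def analyze(values):
    """Step 2: compute analytical results."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

def report(summary):
    """Step 3: produce an output (a text report here)."""
    return (
        f"n={summary['n']} "
        f"mean={summary['mean']:.2f} "
        f"stdev={summary['stdev']:.2f}"
    )

summary = analyze(read_measurements(RAW_DATA))
report_text = report(summary)
print(report_text)
```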

There are many motives for differentiating scientific workflows from traditional business process workflows.

Focusing on the scientists shifts the design of a scientific workflow system away from workflow scheduling, which grid computing environments typically address to optimize the execution of complex computations on predefined resources, and toward a domain-specific view: which data types, tools, and distributed resources should be made available to scientists, and how they can be made easily accessible under specific quality-of-service requirements. [3]

Scientific workflows are now recognized as a crucial element of the cyberinfrastructure, facilitating e-Science. Typically sitting on top of a middleware layer, scientific workflows are a means by which scientists can model, design, execute, debug, re-configure, and re-run their analysis and visualization pipelines. Part of the established scientific method is to create a record of the origins of a result: how it was obtained, the experimental methods used, machine calibrations and parameters, and so on. The same holds in e-Science, where provenance data record the workflow activities invoked, the services and databases accessed, the data sets used, and so forth. Such information helps scientists interpret their own workflow results and helps other scientists establish trust in the experimental result. [4]
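One way to picture such a provenance record is the sketch below, which wraps each workflow activity so that its name, parameters, and fingerprints of its input and output are logged. This is an illustrative assumption about what a record could contain, not the format used by any particular system; `run_with_provenance`, `fingerprint`, and the two activities are hypothetical names.

```python
import hashlib
import json
from datetime import datetime, timezone

def fingerprint(obj):
    """Stable SHA-256 digest of a JSON-serializable value."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

provenance_log = []

def run_with_provenance(name, func, data, **params):
    """Run one workflow activity and append a provenance record for it."""
    result = func(data, **params)
    provenance_log.append({
        "activity": name,
        "parameters": params,
        "input_sha256": fingerprint(data),
        "output_sha256": fingerprint(result),
        "finished_at": datetime.now(timezone.utc).isoformat(),
    })
    return result

# Hypothetical activities.
def scale(values, factor=1.0):
    return [v * factor for v in values]

def threshold(values, minimum=0.0):
    return [v for v in values if v >= minimum]

raw = [0.5, 1.5, 2.5]
scaled = run_with_provenance("scale", scale, raw, factor=2.0)
kept = run_with_provenance("threshold", threshold, scaled, minimum=2.0)

print(json.dumps(provenance_log, indent=2))
```

Because the output fingerprint of one activity equals the input fingerprint of the next, a reader of the log can trace a result back through the chain of steps that produced it.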

Sharing workflows

Social networking communities such as myExperiment have been developed to facilitate sharing and collaborative development of scientific workflows. Galaxy provides collaborative mechanisms for editing and publishing workflow definitions and workflow results directly on the Galaxy installation.

Analysis

A key assumption underlying all scientific workflow systems is that the scientists themselves will be able to use a workflow system to develop their applications based on visual flowcharting, logic diagramming, or, as a last resort, writing code to describe the workflow logic. Powerful workflow systems make it easy for non-programmers to first sketch out workflow steps using simple flowcharting tools, and then hook in various data acquisition, analysis, and reporting tools. For maximum productivity, details of the underlying programming code should normally be hidden.

Workflow analysis techniques can be used to verify certain properties of such workflows before they are executed. One example is a formal analysis framework for verifying and profiling the control-flow and data-flow aspects of scientific workflows in the Discovery Net system, described in the paper "The design and implementation of a workflow analysis tool" by Curcin et al. [5]

The authors note that introducing program analysis and verification into the workflow world requires a detailed understanding of the execution semantics of the workflow language, including the execution properties of nodes and arcs in the workflow graph and the functional equivalences between workflow patterns, among many other issues. Such analysis is difficult; addressing these issues requires building on formal methods used in computer science research (e.g. Petri nets) to develop user-level tools for reasoning about the properties of both workflows and workflow systems. The lack of such tools in the past kept automated workflow management solutions from maturing from nice-to-have academic toys into production-level tools used outside the narrow circle of early adopters and workflow enthusiasts.
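To give a flavor of checking a workflow before running it, the sketch below performs two simple structural checks on a workflow graph: every edge must refer to declared nodes, and the graph must be acyclic. This is far simpler than the formal, Petri-net-style analysis discussed above and is not the method of Curcin et al.; it also assumes a workflow model in which cycles are errors, which does not hold for every system. The function name `check_workflow` and the node names are hypothetical, and node names are assumed to be unique.

```python
from collections import deque

def check_workflow(nodes, edges):
    """Return a list of structural problems found before execution."""
    problems = []
    known = set(nodes)

    # Check 1: every edge endpoint must be a declared node.
    for src, dst in edges:
        for name in (src, dst):
            if name not in known:
                problems.append(
                    f"edge ({src} -> {dst}) refers to unknown node '{name}'"
                )
    valid_edges = [(s, d) for s, d in edges if s in known and d in known]

    # Check 2: Kahn's algorithm. If some node can never be ordered,
    # the graph contains a cycle.
    indegree = {n: 0 for n in nodes}
    for _, dst in valid_edges:
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    ordered = 0
    while ready:
        node = ready.popleft()
        ordered += 1
        for src, dst in valid_edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    if ordered != len(nodes):
        problems.append("workflow contains a cycle")
    return problems

ok = check_workflow(
    ["fetch", "align", "plot"],
    [("fetch", "align"), ("align", "plot")],
)
cyclic = check_workflow(["a", "b"], [("a", "b"), ("b", "a")])
dangling = check_workflow(["a"], [("a", "missing")])
print(ok, cyclic, dangling)
```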

Notable systems

Notable scientific workflow systems include: [6]

More than 280 computational data analysis workflow systems have been identified, [11] although the distinction between data analysis workflows and scientific workflows is fluid, as not all analysis workflow systems are used for scientific purposes.

Related Research Articles

A workflow pattern is a specialized form of design pattern as defined in the area of software engineering or business process engineering. Workflow patterns refer specifically to recurrent problems and proven solutions related to the development of workflow applications in particular, and more broadly, process-oriented applications.

Carole Goble: British computer scientist

Carole Anne Goble is a British academic who is Professor of Computer Science at the University of Manchester. She is principal investigator (PI) of the myGrid, BioCatalogue and myExperiment projects and co-leads the Information Management Group (IMG) with Norman Paton.

The myGrid consortium produces and uses a suite of tools designed to “help e-Scientists get on with science and get on with scientists”. The tools support the creation of e-laboratories and have been used in domains as diverse as systems biology, social science, music, astronomy, multimedia and chemistry.

Apache Taverna

Apache Taverna was an open source software tool for designing and executing workflows, initially created by the myGrid project under the name Taverna Workbench, then a project under the Apache incubator. Taverna allowed users to integrate many different software components, including WSDL SOAP or REST Web services, such as those provided by the National Center for Biotechnology Information, the European Bioinformatics Institute, the DNA Databank of Japan (DDBJ), SoapLab, BioMOBY and EMBOSS. The set of available services was not finite and users could import new service descriptions into the Taverna Workbench.

Galaxy (computational biology)

Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make computational biology accessible to research scientists who do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system.

Kepler is a free software system for designing, executing, reusing, evolving, archiving, and sharing scientific workflows. Kepler's facilities provide process and data monitoring, provenance information, and high-speed data movement. Workflows in general, and scientific workflows in particular, are directed graphs where the nodes represent discrete computational components, and the edges represent paths along which data and results can flow between components. In Kepler, the nodes are called 'Actors' and the edges are called 'channels'. Kepler includes a graphical user interface for composing workflows in a desktop environment, a runtime engine for executing workflows within the GUI and independently from a command-line, and a distributed computing option that allows workflow tasks to be distributed among compute nodes in a computer cluster or computing grid. The Kepler system principally targets the use of a workflow metaphor for organizing computational tasks that are directed towards particular scientific analysis and modeling goals. Thus, Kepler scientific workflows generally model the flow of data from one step to another in a series of computations that achieve some scientific goal.

VisTrails: Scientific workflow management system

VisTrails is a scientific workflow management system developed at the Scientific Computing and Imaging Institute at the University of Utah that provides support for data exploration and visualization. It is written in Python and employs Qt via PyQt bindings. The system is open source, released under the GPL v2 license. The pre-compiled versions for Windows, Mac OS X, and Linux come with an installer and several packages, including VTK, matplotlib, and ImageMagick. VisTrails also supports user-defined packages.

LONI Pipeline: Scientific workflow software

The LONI Pipeline is a free distributed system for designing, executing, monitoring and sharing scientific workflows on grid computing architectures. Pipeline allows users to connect and run any number of different software tools, and conveniently visualize and download the results.

gUSE: Grid computing framework

The Grid and Cloud User Support Environment (gUSE), also known as WS-PGRADE/gUSE, is an open source science gateway framework that enables users to access grid and cloud infrastructures. gUSE is developed by the Laboratory of Parallel and Distributed Systems (LPDS) at the Institute for Computer Science and Control (SZTAKI) of the Hungarian Academy of Sciences.

Discovery Net is one of the earliest examples of a scientific workflow system allowing users to coordinate the execution of remote services based on Web service and Grid Services standards. The system was designed and implemented at Imperial College London as part of the Discovery Net pilot project funded by the UK e-Science Programme. Many of the concepts pioneered by Discovery Net were later incorporated into a variety of other scientific workflow systems.

LabKey Server is a software suite available for scientists to integrate, analyze, and share biomedical research data. The platform provides a secure data repository that allows web-based querying, reporting, and collaborating across a range of data sources. Specific scientific applications and workflows can be added on top of the basic platform and leverage a data processing pipeline.

Apache OODT

The Apache Object Oriented Data Technology (OODT) is an open source data management system framework that is managed by the Apache Software Foundation. OODT was originally developed at NASA's Jet Propulsion Laboratory to support capturing, processing and sharing of data for NASA's scientific archives.

A bioinformatics workflow management system is a specialized form of workflow management system designed specifically to compose and execute a series of computational or data manipulation steps, or a workflow, that relate to bioinformatics.

OnlineHPC

OnlineHPC was a free public web service that supplied tools for working with high-performance computers, along with an online workflow editor. It allowed users to design and execute workflows using the online workflow designer and to work with high-performance computers, both clusters and clouds. Access to high-performance resources was available both directly from the service user interface and from workflow components. The workflow engine of the OnlineHPC service was Taverna, traditionally used for scientific workflow execution in domains such as bioinformatics, cheminformatics, medicine, astronomy, social science, music, and digital preservation.

BisQue is a free, open source web-based platform for the exchange and exploration of large, complex datasets. It is being developed at the Vision Research Lab at the University of California, Santa Barbara. BisQue specifically supports large-scale, multidimensional, multimodal images and image analysis. Metadata is stored as arbitrarily nested and linked tag/value pairs, allowing for domain-specific data organization. Image analysis modules can be added to perform complex analysis tasks on compute clusters. Analysis results are stored within the database for further querying and processing. The data and analysis provenance is maintained for reproducibility of results. BisQue can be easily deployed in cloud computing environments or on computer clusters for scalability. BisQue has been integrated into the NSF Cyberinfrastructure project CyVerse. The user interacts with BisQue via any modern web browser.

The BioCompute Object (BCO) project is a community-driven initiative to build a framework for standardizing and sharing computations and analyses generated from high-throughput sequencing (HTS). The project has since been standardized as IEEE 2791-2020, and the project files are maintained in an open source repository. The July 22, 2020 edition of the Federal Register announced that the FDA now supports the use of BioCompute in regulatory submissions, and the inclusion of the standard in the Data Standards Catalog for the submission of HTS data in NDAs, ANDAs, BLAs, and INDs to CBER, CDER, and CFSAN.

Originally started as a collaborative contract between the George Washington University and the Food and Drug Administration, the project has grown to include over 20 universities, biotechnology companies, public-private partnerships and pharmaceutical companies, including Seven Bridges and Harvard Medical School. The BCO aims to ease the exchange of HTS workflows between various organizations, such as the FDA, pharmaceutical companies, contract research organizations, bioinformatic platform providers, and academic researchers. Due to the sensitive nature of regulatory filings, few direct references to material can be published. However, the project is currently funded to train FDA reviewers and administrators to read and interpret BCOs, and currently has four publications either submitted or nearly submitted.

Apache Airflow: Open-source workflow management platform

Apache Airflow is an open-source workflow management platform for data engineering pipelines. It started at Airbnb in October 2014 as a solution to manage the company's increasingly complex workflows. Creating Airflow allowed Airbnb to programmatically author and schedule their workflows and monitor them via the built-in Airflow user interface. From the beginning, the project was made open source, becoming an Apache Incubator project in March 2016 and a top-level Apache Software Foundation project in January 2019.

Pegasus is an open-source workflow management system. It provides the necessary abstractions for scientists to create scientific workflows and allows for transparent execution of these workflows on a range of computing platforms including high performance computing clusters, clouds, and national cyberinfrastructure. In Pegasus, workflows are described abstractly as directed acyclic graphs (DAGs) using a provided API for Jupyter Notebooks, Python, R, or Java. During execution, Pegasus translates the constructed abstract workflow into an executable workflow which is executed and managed by HTCondor.

Nextflow is a scientific workflow system predominantly used for bioinformatic data analyses. It imposes standards on how to programmatically author a sequence of dependent compute steps and enables their execution on various local and cloud resources. Nextflow was conceived at the Centre for Genomic Regulation in Barcelona, Spain, but has since found world-wide adoption in biomedical and genomics research facilities and laboratories.

References

  1. Liew, Chee Sun; Atkinson, Malcolm P.; Galea, Michelle; Ang, Tan Fong; Martin, Paul; Van Hemert, Jano I. (2016-12-12). "Scientific Workflows". ACM Computing Surveys. 49 (4): 1–39. doi:10.1145/3012429. hdl:20.500.11820/774ef69e-a499-4bd2-a609-09f050e682ae. S2CID 9408644.
  2. D. Johnson; et al. (December 2009). "A middleware independent Grid workflow builder for scientific applications" (PDF). 2009 5th IEEE International Conference on E-Science Workshops. pp. 86–91. doi:10.1109/ESCIW.2009.5407993. ISBN 978-1-4244-5946-9. S2CID 3339794.
  3. Kyriazis, Dimosthenis; Tserpes, Konstantinos; Menychtas, Andreas; Litke, Antonis; Varvarigou, Theodora (2008). "An innovative workflow mapping mechanism for Grids in the frame of Quality of Service". Future Generation Computer Systems. 24 (6): 498–511. doi:10.1016/j.future.2007.07.009.
  4. "Automatic capture and efficient storage of e-Science experiment provenance". Concurrency and Computation: Practice and Experience. 2008; 20: 419–429.
  5. Curcin, V.; Ghanem, M.; Guo, Y. (2010). "The design and implementation of a workflow analysis tool". Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 368 (1926): 4193–4208. Bibcode:2010RSPTA.368.4193C. doi:10.1098/rsta.2010.0157. PMID 20679131.
  6. Barker, Adam; Van Hemert, Jano (2008), "Scientific Workflow: A Survey and Research Directions", Parallel Processing and Applied Mathematics, 7th International Conference, PPAM 2007, Revised Selected Papers, Lecture Notes in Computer Science, vol. 4967, Gdansk, Poland: Springer Berlin / Heidelberg, pp. 746–753, CiteSeerX 10.1.1.105.4605, doi:10.1007/978-3-540-68111-3_78, ISBN 978-3-540-68105-2
  7. Marru, Suresh; Gardler, Ross; Slominski, Aleksander; Douma, Ate; Perera, Srinath; Weerawarana, Sanjiva; Gunathilake, Lahiru; Herath, Chathura; Tangchaisin, Patanachai; Pierce, Marlon; Mattmann, Chris; Singh, Raminder; Gunarathne, Thilina; Chinthaka, Eran (2011-11-18). Proceedings of the 2011 ACM workshop on Gateway computing environments - GCE '11. p. 21. doi:10.1145/2110486.2110490. ISBN 9781450311236. S2CID 18341808.
  8. Reich, Michael; Liefeld, Ted; Gould, Joshua; Lerner, Jim; Tamayo, Pablo; Mesirov, Jill P (2006). "GenePattern 2.0". Nature Genetics. 38 (5): 500–501. doi:10.1038/ng0506-500. PMID 16642009. S2CID 5503897.
  9. Deelman, Ewa; Vahi, Karan; Juve, Gideon; Rynge, Mats; Callaghan, Scott; Maechling, Philip J.; Mayani, Rajiv; Chen, Weiwei; Ferreira da Silva, Rafael; Livny, Miron; Wenger, Kent (May 2015). "Pegasus, a workflow management system for science automation". Future Generation Computer Systems. 46: 17–35. doi:10.1016/j.future.2014.10.008.
  10. "BIOVIA Pipeline Pilot | Scientific Workflow Authoring Application for Data Analysis". Accelrys.com. Retrieved 2016-12-04.
  11. "Existing Workflow systems". Common Workflow Language wiki. Archived from the original on 2019-10-17.