Common Workflow Language

The Common Workflow Language standards
Abbreviation: CWL
Status: Published
Year started: 10 July 2014
Latest version: 1.2 (7 August 2020)
Related standards: BioCompute Object
License: Apache 2.0
Website: commonwl.org

The Common Workflow Language (CWL) is a standard for describing computational data-analysis workflows. [1] Development of CWL is focused particularly on serving the data-intensive sciences, such as bioinformatics, [2] medical imaging, astronomy, physics, and chemistry.

Standard

A key goal of the CWL is to allow the creation of a workflow that is portable and thus may be run reproducibly in different computational environments. [3]
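
As an illustration of what such a portable description can look like, the following is a minimal sketch rather than an excerpt from the standard. It assumes a hypothetical tool that wraps the `echo` command with a single string input named `message`; the helper names `build_echo_tool` and `to_cwl_json` are invented for this example. CWL documents can be written in YAML or JSON, so the sketch builds the description as a Python dictionary and serializes it to JSON.

```python
import json


def build_echo_tool() -> dict:
    """Return a minimal CWL CommandLineTool description as a plain dict.

    The tool and its input name ("message") are hypothetical and exist
    only for illustration.
    """
    return {
        "cwlVersion": "v1.2",          # version of the CWL standard targeted
        "class": "CommandLineTool",    # describes a single command-line program
        "baseCommand": "echo",         # program the runner should invoke
        "inputs": {
            "message": {
                "type": "string",
                # place the value as the first positional argument
                "inputBinding": {"position": 1},
            }
        },
        "outputs": [],                 # this sketch declares no output files
    }


def to_cwl_json(tool: dict) -> str:
    """Serialize the description to JSON text that can be saved to a file."""
    return json.dumps(tool, indent=2)


if __name__ == "__main__":
    print(to_cwl_json(build_echo_tool()))
```

Because the description names the command, its inputs, and its outputs explicitly rather than embedding them in a platform-specific script, the same file can in principle be handed to any conforming CWL runner.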

The CWL originated from discussions in 2014 between Peter Amstutz, John Chilton, Nebojša Tijanić, and Michael R. Crusoe (whose respective affiliations at the time were Arvados, Galaxy, Seven Bridges, and Michigan State University) at the Open Bioinformatics Foundation BOSC 2014 codefest.

CWL is supported by multiple analysis runners and platforms, [4] such as Apache Airflow (via CWL-Airflow [5]), Arvados, Rabix, [6] the Cromwell workflow engine, Toil, REANA (Reusable Analyses), and CWLEXEC for IBM Spectrum LSF. It was identified in 2017 as one of the future trends for bioinformatics pipeline development. [2] Several additional analysis environments, including Pegasus [7] and Galaxy, [8] are implementing support for CWL.

Availability

The CWL Project [9] is a multi-stakeholder working group consisting of both organizations and individuals. A member project of Software Freedom Conservancy, it publishes the CWL standards freely via its GitHub repository under the permissive Apache License 2.0.

References

  1. Amstutz, Peter; Crusoe, Michael R.; Tijanić, Nebojša; Chapman, Brad; Chilton, John; Heuer, Michael; Kartashov, Andrey; Leehr, Dan; Ménager, Hervé (2016-07-08). "Common Workflow Language, v1.0". Figshare. doi:10.6084/m9.figshare.3115156.v2.
  2. Leipzig, Jeremy (2017-05-01). "A review of bioinformatic pipeline frameworks". Briefings in Bioinformatics. 18 (3): 530–536. doi:10.1093/bib/bbw020. ISSN 1467-5463. PMC 5429012. PMID 27013646.
  3. Perkel, Jeffrey M. (2019). "Workflow systems turn raw data into scientific knowledge". Nature. 573 (7772): 149–150. Bibcode:2019Natur.573..149P. doi:10.1038/d41586-019-02619-z. ISSN 0028-0836. PMID 31477884. S2CID 201713827.
  4. "CWL Implementations". Common Workflow Language (CWL). Retrieved 10 October 2021.
  5. Barski, Artem; Kartashov, Andrey V.; Kotliar, Michael (2019-07-01). "CWL-Airflow: a lightweight pipeline manager supporting Common Workflow Language". GigaScience. 8 (7). doi:10.1093/gigascience/giz084. PMC 6639121. PMID 31321430.
  6. Kaushik, Gaurav; Ivković, Sinisa; Simonović, Janko; Tijanić, Nebojša; Davis-Dusenbery, Brandi; Kural, Deniz (January 2017). "Rabix: An Open-Source Workflow Executor Supporting Recomputability and Interoperability of Workflow Descriptions". Pacific Symposium on Biocomputing 2017. Proceedings of the Pacific Symposium. Vol. 22. pp. 154–165. doi:10.1142/9789813207813_0016. ISBN 978-981-320-780-6. PMC 5166558. PMID 27896971.
  7. "11.6. pegasus-cwl-converter — Pegasus WMS 5.0.1 documentation". pegasus.isi.edu. Retrieved 10 October 2021.
  8. Chilton, John; Soranzo, Nicola. "Implement a subset of the Common Workflow Language. by jmchilton · Pull Request #47 · common-workflow-language/galaxy". GitHub. Retrieved 10 October 2021.
  9. Crusoe, Michael R.; Abeln, Sanne; Iosup, Alexandru; Amstutz, Peter; Chilton, John; Tijanić, Nebojša; Ménager, Hervé; Soiland-Reyes, Stian; Gavrilović, Bogdan; Goble, Carole; The CWL Community (2022). "Methods Included: Standardizing Computational Reuse and Portability with the Common Workflow Language". Communications of the ACM. 65: 54–63. arXiv:2105.07028. doi:10.1145/3486897. S2CID 234742536.