Anduril (workflow engine)

Anduril
Developer(s): Systems Biology Laboratory, University of Helsinki
Initial release: 1 July 2010
Stable release: 2.0.0 / 1 July 2016 [1]
Written in: Java
Operating system: Linux, Microsoft Windows, Mac OS X
Type: Workflow engine
License: GPL (v.1.x), BSD (v.2.x)
Website: www.anduril.org

Anduril is an open source component-based workflow framework for scientific data analysis [2] developed at the Systems Biology Laboratory, University of Helsinki.

Anduril is designed to enable systematic, flexible and efficient data analysis, particularly in the field of high-throughput experiments in biomedical research. The workflow system currently provides components for several types of analysis such as sequencing, gene expression, SNP, ChIP-on-chip, comparative genomic hybridization and exon microarray analysis as well as cytometry and cell imaging analysis.

Architecture and features

A workflow is a series of processing steps connected so that the output of one step is used as the input of another. Processing steps implement data analysis tasks such as data importing, statistical tests and report generation. In Anduril, processing steps are implemented as components: reusable pieces of executable code that can be written in any programming language. Components are wired together into a workflow, or component network, which is executed by the Anduril workflow engine. Workflows are configured in a dedicated scripting language, AndurilScript. Workflow configuration and execution can be done from Eclipse, a popular multipurpose IDE, or from the command line.
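
The idea of a component network can be illustrated with a minimal Python sketch. This is not Anduril code: the component functions, the step names and the `run` helper are hypothetical stand-ins, and the sketch only assumes that an engine runs each step after the steps it depends on.

```python
from graphlib import TopologicalSorter

# Hypothetical components: plain functions standing in for Anduril components.
def import_data():
    return [3.0, 1.0, 2.0]

def normalize(values):
    total = sum(values)
    return [v / total for v in values]

def report(values):
    return "max share: {:.2f}".format(max(values))

# Each step names its component and the upstream steps whose outputs it consumes.
workflow = {
    "raw": (import_data, []),
    "norm": (normalize, ["raw"]),
    "summary": (report, ["norm"]),
}

def run(workflow):
    """Execute every step once, after all of its inputs are available."""
    graph = {name: deps for name, (_, deps) in workflow.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = workflow[name]
        results[name] = func(*(results[d] for d in deps))
    return results

results = run(workflow)
print(results["summary"])
```

Ordering the steps topologically means the order in which steps are declared does not matter, only the connections between them.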

The core Anduril engine is written in Java and components are written in a variety of programming languages, including Java, R, MATLAB, Lua, Perl and Python. Components may also have dependencies on third-party libraries, such as Bioconductor. Components for cell imaging and microarray analysis are provided but additional components can be implemented by users. The Anduril core has been tested on Linux and Windows.

Anduril 1.0: AndurilScript language

A "Hello world" program in AndurilScript is a single line:

std.echo("Hello world!")

Commenting follows the syntax of Java:

// A simple comment
/* Another simple comment */
/** A description that will be included in component description */

Components are called by assigning their calls to named component instances. A name cannot be reused within a single workflow. Special input components bring external files into the script. The supported atomic types are integer, float, boolean and string, and types are inferred implicitly.

in1 = INPUT(path="myFile.csv")
constant1 = 1
componentInstance1 = MyComponent(inputPort1=in1, inputParam1=constant1)

Workflows are constructed by assigning outputs of component instances to inputs of following components.

componentInstance2=AnotherComponent(inputPort1=componentInstance1.outputPort1)
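
The two rules described above (instance names are unique, and inputs are bound to output ports of earlier instances) can be modelled with a small Python sketch. The `Workflow` class, its `add` method and the port name `out` on the input instance are hypothetical illustrations, not part of Anduril's API.

```python
class Workflow:
    """Toy registry of named component instances (not Anduril's real API)."""

    def __init__(self):
        self.instances = {}

    def add(self, name, component, outputs, **inputs):
        # Instance names must be unique within one workflow.
        if name in self.instances:
            raise ValueError(f"instance name already used: {name}")
        # An input given as (instance, port) must refer to an existing output port.
        for value in inputs.values():
            if isinstance(value, tuple):
                source, source_port = value
                known = self.instances.get(source, {}).get("outputs", [])
                if source_port not in known:
                    raise KeyError(f"unknown output port: {source}.{source_port}")
        self.instances[name] = {
            "component": component,
            "outputs": outputs,
            "inputs": inputs,
        }
        return name

wf = Workflow()
wf.add("in1", "INPUT", outputs=["out"], path="myFile.csv")
wf.add("componentInstance1", "MyComponent", outputs=["outputPort1"],
       inputPort1=("in1", "out"), inputParam1=1)
wf.add("componentInstance2", "AnotherComponent", outputs=["outputPort1"],
       inputPort1=("componentInstance1", "outputPort1"))
```

In this sketch, adding a second instance called `in1`, or referring to a port that no earlier instance declares, fails immediately instead of producing a broken network.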

Component instances can also be wrapped as functions.

function MyFunction(InType1 in1, ..., optional InTypeM inM, ParType1 param1, ..., ParTypeP paramP=defaultP) -> (OutType1 out1, ..., OutTypeN outN) {
    ... statements ...
    return record(out1=x1, ..., outN=xN)
}

In addition to standard if-else and switch-case statements, AndurilScript also includes for-loops.

// Iterates over 1, 2, ..., 10
array = record()
for i: std.range(1, 10) {
    array[i] = SomeComponent(k=i)
}

Extensibility

Anduril can be extended on multiple levels. Users can add new components to existing component bundles. However, if the new component or components carry out tasks that are not related to existing bundles, users can also create new bundles.

Moksiskaan

The upset face of the Moksiskaan logo

Moksiskaan is a data integration framework for cancer research and molecular biology. [3] The framework provides a relational database that represents a graph of biological entities such as genes, proteins, drugs, pathways, diseases, biological processes, cellular components, and molecular functions. In addition, a wide set of analysis and accession tools is built on top of this data. The great majority of these tools are implemented as Anduril components and functions.

Moksiskaan is used mainly to interpret lists of candidate genes obtained from genomic studies. Its tools can be used to generate graphs of biological entities related to the input genes. The exact form of these graphs may vary from drug target predictions to time series of signalling cascades. Some of the goals of these tools are closely related to IPA.
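
As a rough illustration of how a relational database can represent a graph of biological entities and answer questions about a list of input genes, the following Python sketch uses an in-memory SQLite database. The table layout, the `neighbours` function and all entity names are made-up placeholders; this is not Moksiskaan's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table for graph nodes, one for graph edges.
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, kind TEXT NOT NULL, name TEXT NOT NULL);
    CREATE TABLE link (source INTEGER REFERENCES entity(id),
                       target INTEGER REFERENCES entity(id),
                       relation TEXT NOT NULL);
""")
# Placeholder rows, invented for this example only.
conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", [
    (1, "gene", "GENE_A"),
    (2, "gene", "GENE_B"),
    (3, "pathway", "PATHWAY_X"),
    (4, "drug", "DRUG_Y"),
])
conn.executemany("INSERT INTO link VALUES (?, ?, ?)", [
    (1, 3, "member_of"),
    (2, 3, "member_of"),
    (4, 1, "targets"),
])

def neighbours(conn, gene_names):
    """List entities directly linked to the given genes, in either direction."""
    placeholders = ",".join("?" for _ in gene_names)
    query = f"""
        SELECT g.name, l.relation, n.kind, n.name
        FROM entity g
        JOIN link l ON g.id IN (l.source, l.target)
        JOIN entity n ON n.id IN (l.source, l.target) AND n.id != g.id
        WHERE g.kind = 'gene' AND g.name IN ({placeholders})
        ORDER BY g.name, n.name
    """
    return conn.execute(query, list(gene_names)).fetchall()

rows = neighbours(conn, ["GENE_A"])
for row in rows:
    print(row)
```

Storing nodes and edges in separate tables lets one query return every kind of related entity (here a drug and a pathway) for a candidate gene list.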

References

  1. "anduril-dev / anduril / doc / ChangeLog.txt — Bitbucket". bitbucket.org. Retrieved 2021-03-25.
  2. Ovaska, K.; Laakso, M.; Haapa-Paananen, S.; Louhimo, R.; Chen, P.; Aittomäki, V.; Valo, E.; Núñez-Fontarnau, J.; Rantanen, V.; Karinen, S.; Nousiainen, K.; Lahesmaa-Korpinen, A. M.; Miettinen, M.; Saarinen, L.; Kohonen, P.; Wu, J.; Westermarck, J.; Hautaniemi, S. (2010). "Large-scale data integration framework provides a comprehensive view on glioblastoma multiforme". Genome Medicine. 2 (9): 65. doi:10.1186/gm186. PMC 3092116. PMID 20822536.
  3. Laakso, M.; Hautaniemi, S. (2010). "Integrative platform to translate gene sets to networks". Bioinformatics. 26 (14): 1802–1803. doi:10.1093/bioinformatics/btq277. PMID 20507894.
