Galaxy (computational biology)

Galaxy
Developer(s) Galaxy Community
Initial release 16 September 2005 (2005-09-16)
Stable release 24.1.1 / July 2024 (2024-07)
Repository github.com/galaxyproject/galaxy
Written in Python, JavaScript
Operating system Unix-like
Platform Linux, macOS
Available in English
Type Scientific workflow, data integration, analysis and data publishing
License MIT and Academic Free License [1]
Website galaxyproject.org

Galaxy [2] is a platform for scientific workflows, data integration, [3] [4] and the persistence and publishing of data and analyses. It aims to make computational biology accessible to research scientists who do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system. [5]

Functionality

Galaxy is a scientific workflow system. These systems provide a means to build multi-step computational analyses akin to a recipe. They typically provide a graphical user interface [6] for specifying what data to operate on, what steps to take, and what order to do them in.

Galaxy is also a data integration platform for biological data. It supports data uploads from the user's computer, by URL, and directly from many online resources (such as the UCSC Genome Browser, BioMart and InterMine). Galaxy supports a range of widely used biological data formats and translation between those formats. Galaxy provides a web interface to many text manipulation utilities, enabling researchers to do their own custom reformatting and manipulation without having to do any programming. Galaxy also includes interval manipulation utilities for set-theoretic operations (e.g., intersection and union) on intervals. Many biological file formats include genomic interval data (a frame of reference, e.g., a chromosome or contig name, plus start and stop positions), and these shared coordinates allow data from different sources to be integrated.
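As an illustration of the interval operations described above, the following is a minimal Python sketch. It is not Galaxy's own implementation: the `Interval` type and the `merge` and `intersect` functions are hypothetical names, and the 0-based, half-open coordinates are an assumed convention (the one used by the BED format).

```python
from collections import defaultdict
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str  # frame of reference, e.g. chromosome or contig name
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def merge(intervals):
    """Union: collapse overlapping or touching intervals per chromosome."""
    by_chrom = defaultdict(list)
    for iv in intervals:
        by_chrom[iv.chrom].append(iv)
    merged = []
    for chrom in sorted(by_chrom):
        current = None
        for iv in sorted(by_chrom[chrom], key=lambda i: (i.start, i.end)):
            if current is not None and iv.start <= current.end:
                # Overlaps (or touches) the running interval: extend it.
                current = Interval(chrom, current.start, max(current.end, iv.end))
            else:
                if current is not None:
                    merged.append(current)
                current = iv
        if current is not None:
            merged.append(current)
    return merged


def intersect(a, b):
    """Intersection: regions covered by both interval sets."""
    result = []
    for x in merge(a):
        for y in merge(b):
            if x.chrom == y.chrom:
                start, end = max(x.start, y.start), min(x.end, y.end)
                if start < end:
                    result.append(Interval(x.chrom, start, end))
    return sorted(result)


# Made-up example data: two annotation tracks on the same reference.
genes = [Interval("chr1", 100, 200), Interval("chr1", 150, 300), Interval("chr2", 50, 80)]
peaks = [Interval("chr1", 250, 400), Interval("chr2", 10, 60)]

print(merge(genes))             # chr1:100-300 and chr2:50-80
print(intersect(genes, peaks))  # chr1:250-300 and chr2:50-60
```

Because both tracks share the same frame of reference, the two datasets can be combined without any knowledge of what the intervals represent. The nested loop in `intersect` is quadratic and chosen only for brevity.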

Galaxy was originally written for biological data analysis, particularly genomics. The set of available tools has been greatly expanded over the years, and Galaxy is now also used for gene expression, genome assembly, proteomics, epigenomics, transcriptomics and a host of other disciplines in the life sciences. The platform itself is domain agnostic and can, in principle, be applied to any scientific domain, such as cheminformatics. [7] For example, Galaxy servers exist for image analysis, [8] computational chemistry [9] and drug design, [10] cosmology, climate modeling, social science, [11] and linguistics.

Finally, Galaxy also supports data and analysis persistence and publishing. See Reproducibility and Transparency below.

Project Goals

Galaxy is "an open, web-based platform for performing accessible, reproducible, and transparent genomic science." [12]

Accessibility

Computational biology is a specialized domain that often requires knowledge of computer programming. Galaxy aims to give biomedical researchers access to computational biology without also requiring them to understand computer programming. [13] [14] Galaxy does this by stressing a simple user interface [15] over the ability to build complex workflows. This design choice makes it relatively easy to build typical analyses, but more difficult to build complex workflows that include, for example, looping constructs. (See Apache Taverna for an example of a data-driven workflow system that supports looping. [16] )

Reproducibility

Reproducibility is a key goal of science: when scientific results are published, the publications should include enough information for others to repeat the experiment and get the same results. There have been many efforts to extend this goal from the bench (the "wet lab") to computational experiments (the "dry lab") as well. This has proved to be a more difficult task than initially expected. [17]

Galaxy supports reproducibility by capturing sufficient information about every step in a computational analysis so that the analysis can be repeated exactly at any point in the future. This includes keeping track of all input, intermediate, and final datasets, as well as the parameters provided to each step and the order in which the steps were run.
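The kind of record keeping described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Galaxy's data model: `StepRecord`, `Provenance`, `replay`, and the toy tools in `TOOLS` are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class StepRecord:
    tool: str        # which tool was run
    params: dict     # parameters provided to the tool
    input_ids: list  # datasets the step consumed
    output_id: int   # dataset the step produced


@dataclass
class Provenance:
    datasets: dict = field(default_factory=dict)  # dataset id -> content
    steps: list = field(default_factory=list)     # StepRecords in run order

    def add_input(self, content):
        dataset_id = len(self.datasets) + 1
        self.datasets[dataset_id] = content
        return dataset_id

    def run_step(self, tool, tools, input_ids, **params):
        inputs = [self.datasets[i] for i in input_ids]
        output_id = self.add_input(tools[tool](*inputs, **params))
        self.steps.append(StepRecord(tool, params, list(input_ids), output_id))
        return output_id


def replay(original, tools):
    """Re-run every recorded step, in order, starting from the original inputs."""
    produced = {s.output_id for s in original.steps}
    rerun = Provenance()
    rerun.datasets = {i: d for i, d in original.datasets.items() if i not in produced}
    for s in original.steps:
        inputs = [rerun.datasets[i] for i in s.input_ids]
        rerun.datasets[s.output_id] = tools[s.tool](*inputs, **s.params)
    return rerun


# Toy tools standing in for real analysis programs.
TOOLS = {
    "filter_min": lambda values, threshold: [v for v in values if v >= threshold],
    "total": lambda values: sum(values),
}

prov = Provenance()
raw = prov.add_input([3, 8, 1, 9])
kept = prov.run_step("filter_min", TOOLS, [raw], threshold=5)
result = prov.run_step("total", TOOLS, [kept])

print(prov.datasets[result])                          # 17
print(replay(prov, TOOLS).datasets == prov.datasets)  # True
```

The replay reproduces every intermediate and final dataset only because the tools here are deterministic and the inputs, parameters, and step order were all recorded. Exact repetition of a real analysis would also depend on factors this sketch ignores, such as tool versions.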

Transparency

Galaxy supports transparency in scientific research by enabling researchers to share any of their Galaxy Objects, either publicly or with specific individuals. Shared items can be examined in detail, rerun at will, and copied and modified to test hypotheses.

Galaxy Objects: Histories, Workflows, Datasets and Pages

Galaxy objects are anything that can be saved, persisted, and shared in Galaxy:

Histories
Histories are computational analyses (recipes) run with specified input datasets, computational steps and parameters. Histories include all intermediate and output datasets as well.
Workflows
Workflows are computational analyses that specify all the steps (and parameters) in the analysis, but none of the data. Workflows are used to run the same analysis against multiple sets of input data.
Datasets
Datasets include any input, intermediate, or output dataset used or produced in an analysis.
Pages
Histories, workflows and datasets can include user-provided annotation. Galaxy Pages enables the creation of a virtual paper that describes the how and why of the overall experiment. Tight integration of Pages with Histories, Workflows, and Datasets supports this goal.
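The distinction between a workflow (steps and parameters, but no data) and a history (one run, including every dataset) can be illustrated with a small Python sketch. The `Step`, `Workflow`, and `History` classes below are hypothetical and much simpler than Galaxy's actual objects. They assume a linear chain of steps in which each step consumes the previous step's output.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    name: str
    func: Callable
    params: dict = field(default_factory=dict)


@dataclass
class History:
    datasets: list  # input, intermediate, and output datasets, in order


@dataclass
class Workflow:
    steps: list  # steps and parameters only; no data

    def run(self, input_dataset):
        datasets = [input_dataset]
        for step in self.steps:
            # Each step reads the most recent dataset and adds a new one.
            datasets.append(step.func(datasets[-1], **step.params))
        return History(datasets)


workflow = Workflow([
    Step("uppercase", lambda seq: seq.upper()),
    Step("trim", lambda seq, length: seq[:length], {"length": 4}),
])

# The same workflow applied to two different inputs yields two histories.
histories = [workflow.run(seq) for seq in ["acgtacgt", "ttgaca"]]
for history in histories:
    print(history.datasets)
# ['acgtacgt', 'ACGTACGT', 'ACGT']
# ['ttgaca', 'TTGACA', 'TTGA']
```

Keeping data out of the workflow is what makes it reusable: the same steps and parameters can be run against any number of input datasets, and each run leaves behind its own complete history.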

Availability

Galaxy is available:

  1. As a free public web server, [18] supported by the Galaxy Project. [19] This server includes many bioinformatics tools that are widely useful in many areas of genomics research. Users can create logins, and save histories, workflows, and datasets on the server. These saved items can also be shared with others.
  2. As open-source software that can be downloaded, installed and customized to address specific needs. [20] Galaxy can be installed locally or using a computing cloud. [21]
  3. As public web servers hosted by other organizations. [22] Several organizations with their own Galaxy installations have also opted to make those servers available to others.

Implementation

Galaxy is open-source software implemented using the Python programming language. It is developed by the Galaxy team [23] at Penn State, Johns Hopkins University, Oregon Health & Science University, and the Galaxy Community. [24]

Galaxy is extensible, as new command line tools can be integrated and shared within the Galaxy ToolShed. [25]

An example of extending Galaxy is Galaxy-P from the University of Minnesota Supercomputing Institute, which is customized as a data analysis platform for mass spectrometry-based proteomics. [26]

Community

Galaxy is an open-source project whose community includes users, organizations that install their own instances, Galaxy developers, and bioinformatics tool developers. The Galaxy project has mailing lists, [27] a community hub, [28] and annual meetings. [29]

References

  1. "Project Licenses". GitHub.
  2. The Galaxy Community (20 May 2024). "The Galaxy platform for accessible, reproducible, and collaborative data analyses: 2024 update". Nucleic Acids Research (Web Server Issue): 1–12. doi:10.1093/nar/gkae410. PMC 11223835.
  3. Blankenberg, D.; Coraor, N.; Von Kuster, G.; Taylor, J.; Nekrutenko, A.; Galaxy, T. (2011). "Integrating diverse databases into an unified analysis framework: A Galaxy approach". Database. 2011: bar011. doi:10.1093/database/bar011. PMC 3092608. PMID 21531983.
  4. Blankenberg, D.; Gordon, A.; Von Kuster, G.; Coraor, N.; Taylor, J.; Nekrutenko, A.; Galaxy, T. (2010). "Manipulation of FASTQ data with Galaxy". Bioinformatics. 26 (14): 1783–1785. doi:10.1093/bioinformatics/btq281. PMC 2894519. PMID 20562416.
  5. "Galaxy Community Hub - Galaxy Community Hub".
  6. Schatz, M. C. (2010). "The missing graphical user interface for genomics". Genome Biology. 11 (8): 128–201. doi:10.1186/gb-2010-11-8-128. PMC 2945776. PMID 20804568.
  7. Bray, Simon A.; Lucas, Xavier; Kumar, Anup; Grüning, Björn A. (1 June 2020). "The ChemicalToolbox: reproducible, user-friendly cheminformatics analysis on the Galaxy platform". Journal of Cheminformatics. 12 (1): 40. doi:10.1186/s13321-020-00442-7. PMC 7268608. PMID 33431029.
  8. "biotools Galaxy Image Analysis".
  9. Hildebrandt, A. K.; Stöckel, D; Fischer, N. M.; de la Garza, L; Krüger, J; Nickels, S; Röttig, M; Schärfe, C; Schumann, M; Thiel, P; Lenhof, H. P.; Kohlbacher, O; Hildebrandt, A (2014). "Ballaxy: Web services for structural bioinformatics". Bioinformatics. 31 (1): 121–2. doi:10.1093/bioinformatics/btu574. PMID 25183489.
  10. "OSDDlinux". Archived from the original on 2016-05-07. Retrieved 2014-11-17.
  11. "Galaxy".
  12. Goecks, J.; Nekrutenko, A.; Taylor, J.; Galaxy Team, T. (2010). "Galaxy: A comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences". Genome Biology. 11 (8): R86. doi:10.1186/gb-2010-11-8-r86. PMC 2945788. PMID 20738864.
  13. Blankenberg, D.; Taylor, J.; Nekrutenko, A.; The Galaxy, T. (2011). "Making whole genome multiple alignments usable for biologists". Bioinformatics. 27 (17): 2426–8. doi:10.1093/bioinformatics/btr398. PMC 3157923. PMID 21775304.
  14. Blankenberg, D.; Taylor, J.; Schenck, I.; He, J.; Zhang, Y.; Ghent, M.; Veeraraghavan, N.; Albert, I.; Miller, W.; Makova, K. D.; Hardison, R. C.; Nekrutenko, A. (2007). "A framework for collaborative analysis of ENCODE data: Making large-scale analyses biologist-friendly". Genome Research. 17 (6): 960–964. doi:10.1101/gr.5578007. PMC 1891355. PMID 17568012.
  15. Schatz, M. C. (2010). "The missing graphical user interface for genomics". Genome Biology. 11 (8): 128–201. doi:10.1186/gb-2010-11-8-128. PMC 2945776. PMID 20804568.
  16. Soiland-Reyes, S (2010-12-13). "Looping". The Taverna Knowledge Blog. knowledgeblog.org. Archived from the original on 30 December 2016. Retrieved 28 January 2015.
  17. Ioannidis, J. P. A.; Allison, D. B.; Ball, C. A.; Coulibaly, I.; Cui, X.; Culhane, A. N. C.; Falchi, M.; Furlanello, C.; Game, L.; Jurman, G.; Mangion, J.; Mehta, T.; Nitzberg, M.; Page, G. P.; Petretto, E.; Van Noort, V. (2008). "Repeatability of published microarray gene expression analyses". Nature Genetics. 41 (2): 149–155. doi:10.1038/ng.295. PMID 19174838. S2CID 5153795.
  18. "usegalaxy.org: Main instance of Galaxy in the United States".
  19. "galaxyproject.org: Galaxy Community Hub".
  20. "getgalaxy.org: How to get Galaxy".
  21. Afgan, E.; Baker, D.; Coraor, N.; Chapman, B.; Nekrutenko, A.; Taylor, J. (2010). "Galaxy CloudMan: Delivering cloud compute clusters". BMC Bioinformatics. 11 (Suppl 12): S4. doi:10.1186/1471-2105-11-S12-S4. PMC 3040530. PMID 21210983.
  22. "Galaxy Community Hub - Galaxy Community Hub".
  23. "Galaxy Community Hub - Galaxy Community Hub".
  24. Lazarus, R.; Taylor, J.; Qiu, W.; Nekrutenko, A. (2008). "Toward the commoditization of translational genomic research: Design and implementation features of the Galaxy genomic workbench". Summit on Translational Bioinformatics. 2008: 56–60. PMC 3041519. PMID 21347127.
  25. Blankenberg, Daniel; Von Kuster, Gregory; Bouvier, Emil; Baker, Dannon; Afgan, Enis; Stoler, Nicholas; Taylor, James; Nekrutenko, Anton (2014). "Dissemination of scientific software with Galaxy ToolShed". Genome Biology. 15 (2): 403. doi:10.1186/gb4161. PMC 4038738. PMID 25001293.
  26. Sheynkman, GM; Johnson, JE; Jagtap, PD; Shortreed, MR; Onsongo, G; Frey, BL; Griffin, TJ; Smith, LM (22 August 2014). "Using Galaxy-P to leverage RNA-Seq for the discovery of novel protein variations". BMC Genomics. 15 (703): 703. doi:10.1186/1471-2164-15-703. PMC 4158061. PMID 25149441.
  27. "Galaxy Mailing Lists".
  28. "galaxyproject.org: Galaxy Community Hub".
  29. "Galaxy Community Conferences (GCCS)".