Bioconductor

Stable release	3.19 / 1 May 2024
Operating system	Linux, macOS, Windows
Platform	R programming language
Type	Bioinformatics
License	Artistic License 2.0
Website	www.bioconductor.org

Last updated May 02, 2024

Bioconductor is a free, open-source, open-development software project for the analysis and comprehension of genomic data generated by wet-lab experiments in molecular biology.

Bioconductor is based primarily on the statistical R programming language, but also contains contributions in other programming languages. It has two releases each year, each tied to a specific version of R. At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs. In addition, many genome annotation packages are available; these are mainly, but not solely, oriented towards different types of microarrays.

While computational methods for interpreting biological data continue to be developed, the Bioconductor project serves as an open-source software repository that hosts a wide range of statistical tools developed in the R programming environment. Drawing on R's rich array of statistical and graphical features, many Bioconductor packages have been developed to meet various data analysis needs. Using these packages calls for a basic understanding of the R programming and command language. R and Bioconductor packages, which are grounded in statistical computing, are accordingly used by many biologists, who benefit significantly from the ability to analyze their own datasets. Together, these tools give biologists ready access to the analysis of genomic data without requiring extensive programming expertise.

The project was started in the fall of 2001 and is overseen by the Bioconductor core team, based primarily at the Fred Hutchinson Cancer Research Center, with other members coming from international institutions.

Packages

Most Bioconductor components are distributed as R packages, which are add-on modules for R. Initially, most Bioconductor software packages focused on the analysis of single-channel Affymetrix microarrays and cDNA/oligo microarrays with two or more channels. As the project has matured, the functional scope of the software packages has broadened to include the analysis of all types of genomic data, such as SAGE, sequence, or SNP data.

Goals

The broad goals of the project are to:

- Provide widespread access to a broad range of powerful statistical and graphical methods for the analysis of genomic data.
- Facilitate the inclusion of biological metadata in the analysis of genomic data, e.g. literature data from PubMed, annotation data from LocusLink/Entrez.
- Provide a common software platform that enables the rapid development and deployment of pluggable, scalable, and interoperable software.
- Further scientific understanding by producing high-quality documentation and reproducible research.
- Train researchers on computational and statistical methods for the analysis of genomic data.

Main features

Documentation and reproducible research. Each Bioconductor package contains at least one vignette, a document that provides a textual, task-oriented description of the package's functionality. These vignettes come in several forms. Many are simple "how-to" guides designed to demonstrate how a particular task can be accomplished with that package's software. Others provide a more thorough overview of the package or discuss general issues related to it. The Bioconductor project also aims to provide vignettes that are not tied to a specific package but instead demonstrate more complex concepts. As with all aspects of the Bioconductor project, users are encouraged to participate in this effort.
Statistical and graphical methods. The Bioconductor project aims to provide access to a wide range of powerful statistical and graphical methods for the analysis of genomic data. Analysis packages are available for pre-processing Affymetrix, Illumina, and cDNA array data; identifying differentially expressed genes; graph-theoretical analyses; and plotting genomic data. In addition, the R package system itself provides implementations of a broad range of state-of-the-art statistical and graphical techniques, including linear and non-linear modeling, cluster analysis, prediction, resampling, survival analysis, and time series analysis.
Genome annotation. The Bioconductor project provides software for associating microarray and other genomic data in real time with biological metadata from web databases such as GenBank, LocusLink and PubMed (annotate package). Functions are also provided for incorporating the results of statistical analysis into HTML reports with links to annotation web resources. With the AnnotationDbi package, software tools are available for assembling and processing genomic annotation data from databases such as GenBank, the Gene Ontology Consortium, LocusLink, UniGene, the UCSC Human Genome Project and others. Data packages are distributed to provide mappings between different probe identifiers (e.g. Affy IDs, LocusLink, PubMed). Customized annotation libraries can also be assembled. The project also includes functions for genomic and phylogenetic analysis (e.g. the ggtree and phytools packages).
Open source. The Bioconductor project has a commitment to full open source discipline, with distribution via a SourceForge.net-like platform. All contributions are expected to exist under an open source license such as Artistic 2.0, GPL2, or BSD. There are many different reasons why open-source software is beneficial to the analysis of microarray data and to computational biology in general. The reasons include:
- To provide full access to algorithms and their implementation
- To facilitate software improvements through bug fixing and plug-ins
- To encourage good scientific computing and statistical practice by providing appropriate tools and instruction
- To provide a workbench of tools that allow researchers to explore and expand the methods used to analyze biological data
- To ensure that the international scientific community is the owner of the software tools needed to carry out research
- To lead and encourage commercial support and development of those tools that are successful
- To promote reproducible research by providing open and accessible tools with which to carry out that research (reproducible research is distinct from independent verification)
Open development. Users are encouraged to become developers by contributing either Bioconductor-compliant packages or documentation. Additionally, Bioconductor provides a mechanism for linking together different groups with common goals to foster collaboration on software, possibly at the level of shared development.

Milestones

Each release of Bioconductor is developed to work best with a chosen version of R.[1] In addition to bug fixes and updates, a new release typically adds packages. The table below maps each listed Bioconductor release to an R version and shows the number of Bioconductor software packages available for that release.

Version	Release date	Package count	R dependency
3.19	1 May 2024	2300	R 4.4
3.18	25 Oct 2023	2266	R 4.3
3.16	2 Nov 2022	2183	R 4.2
3.14	27 Oct 2021	2083	R 4.1
3.11	28 Apr 2020	1903	R 4.0
3.10	30 Oct 2019	1823	R 3.6
3.8	31 Oct 2018	1649	R 3.5
3.6	31 Oct 2017	1473	R 3.4
3.4	18 Oct 2016	1296	R 3.3
3.2	14 Oct 2015	1104	R 3.2
3.0	14 Oct 2014	934	R 3.1
2.13	15 Oct 2013	749	R 3.0
2.11	3 Oct 2012	610	R 2.15
2.9	1 Nov 2011	517	R 2.14
2.8	14 Apr 2011	466	R 2.13
2.7	18 Nov 2010	418	R 2.12
2.6	23 Apr 2010	389	R 2.11
2.5	28 Oct 2009	352	R 2.10
2.4	21 Apr 2009	320	R 2.9
2.3	22 Oct 2008	294	R 2.8
2.2	1 May 2008	260	R 2.7
2.1	8 Oct 2007	233	R 2.6
2.0	26 Apr 2007	214	R 2.5
1.9	4 Oct 2006	188	R 2.4
1.8	27 Apr 2006	172	R 2.3
1.7	14 Oct 2005	141	R 2.2
1.6	18 May 2005	123	R 2.1
1.5	25 Oct 2004	100	R 2.0
1.4	17 May 2004	81	R 1.9
1.3	30 Oct 2003	49	R 1.8
1.2	29 May 2003	30	R 1.7
1.1	19 Oct 2002	20	R 1.6
1.0	1 May 2002	15	R 1.5
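
The table can be read as a lookup from a Bioconductor release to the R version it was built for. As a minimal sketch, the following Python snippet encodes a few rows copied from the table above and queries them. Python is used here purely for illustration, and the names `RELEASES`, `r_version_for` and `packages_added` are hypothetical; they are not part of Bioconductor or of any Bioconductor tooling.

```python
# A few rows copied from the table above:
# Bioconductor version -> (R version, software package count).
# Only a subset is included; further rows can be added in the same form.
RELEASES = {
    "3.19": ("4.4", 2300),
    "3.18": ("4.3", 2266),
    "3.16": ("4.2", 2183),
    "3.14": ("4.1", 2083),
    "3.11": ("4.0", 1903),
    "1.0": ("1.5", 15),
}


def r_version_for(bioc_version: str) -> str:
    """Return the R version paired with a Bioconductor release in the table."""
    try:
        r_version, _count = RELEASES[bioc_version]
    except KeyError:
        raise KeyError(
            f"Bioconductor {bioc_version} is not in the lookup table"
        ) from None
    return r_version


def packages_added(newer: str, older: str) -> int:
    """Return the difference in package counts between two listed releases."""
    return RELEASES[newer][1] - RELEASES[older][1]


if __name__ == "__main__":
    # R version paired with release 3.19 in the table
    print(r_version_for("3.19"))
    # Change in package count between releases 3.18 and 3.19
    print(packages_added("3.19", "3.18"))
```

A plain dictionary is enough here because the table is small and keyed by a unique version string; releases missing from the dictionary raise a `KeyError` rather than returning a guess.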

Resources

Gentleman, R.; Carey, V.; Huber, W.; Irizarry, R.; Dudoit, S. (2005). Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer. ISBN 978-0-387-25146-2.
Gentleman, R. (2008). R Programming for Bioinformatics. Chapman & Hall/CRC. ISBN 978-1-4200-6367-7.
Hahne, F.; Huber, W.; Gentleman, R.; Falcon, S. (2008). Bioconductor Case Studies. Springer. ISBN 978-0-387-77239-4.
Gentleman, Robert C.; Carey, Vincent J.; Bates, Douglas M.; Bolstad, Ben; Dettling, Marcel; Dudoit, Sandrine; Ellis, Byron; Gautier, Laurent; Ge, Yongchao; Gentry, Jeff; Hornik, Kurt; Hothorn, Torsten; Huber, Wolfgang; Iacus, Stefano; Irizarry, Rafael; Leisch, Friedrich; Li, Cheng; Maechler, Martin; Rossini, Anthony J.; Sawitzki, Gunther; Smith, Colin; Smyth, Gordon; Tierney, Luke; Yang, Jean Y. H.; Zhang, Jianhua (2004). "Bioconductor: open software development for computational biology and bioinformatics". Genome Biology. 5 (10): R80. doi:10.1186/gb-2004-5-10-r80. PMC 545600. PMID 15461798.

References

[1] "Bioconductor – Release Announcements". bioconductor.org. Bioconductor. Retrieved 28 May 2019.

External links

Official website
The R Project: GNU R is a programming language for statistical computing.
Bioconductor Releases
The Debian GNU/Linux community is working towards automated building of Bioconductor packages for its distribution (archived 2007-08-11 at the Wayback Machine). BioKnoppix and Quantian are projects extending Knoppix that have contributed bootable Debian GNU/Linux CDs providing Bioconductor installations.
