Bioconductor

Last updated
Bioconductor
Stable release
3.19 / 1 May 2024;0 days ago (2024-05-01)
Operating system Linux, macOS, Windows
Platform R programming language
Type Bioinformatics
License Artistic License 2.0
Website www.bioconductor.org

Bioconductor is a free, open source and open development software project for the analysis and comprehension of genomic data generated by wet lab experiments in molecular biology.

Contents

Bioconductor is based primarily on the statistical R programming language, but does contain contributions in other programming languages. It has two releases each year that follow the semiannual releases of R. At any one time there is a release version, which corresponds to the released version of R, and a development version, which corresponds to the development version of R. Most users will find the release version appropriate for their needs. In addition there are many genome annotation packages available that are mainly, but not solely, oriented towards different types of microarrays.

While computational methods continue to be developed to interpret biological data, the Bioconductor project is an open source software repository that hosts a wide range of statistical tools developed in the R programming environment. Utilizing a rich array of statistical and graphical features in R, many Bioconductor packages have been developed to meet various data analysis needs. The use of these packages provides a basic understanding of the R programming / command language. As a result, R and Bioconductor packages, which have a strong computing background, are used by most biologists who will benefit significantly from their ability to analyze datasets. All these results provide biologists with easy access to the analysis of genomic data without requiring programming expertise.

The project was started in the Fall of 2001 and is overseen by the Bioconductor core team, based primarily at the Fred Hutchinson Cancer Research Center, with other members coming from international institutions.

Packages

Most Bioconductor components are distributed as R packages, which are add-on modules for R. Initially most of the Bioconductor software packages focused on the analysis of single channel Affymetrix and two or more channel cDNA/Oligo microarrays. As the project has matured, the functional scope of the software packages broadened to include the analysis of all types of genomic data, such as SAGE, sequence, or SNP data.

Goals

The broad goals of the projects are to:

Main features

Milestones

Each release of Bioconductor is developed to work best with a chosen version of R. [1] In addition to bugfixes and updates, a new release typically adds packages. The table below maps a Bioconductor release to a R version and shows the number of available Bioconductor software packages for that release.

VersionRelease datePackage countR dependency
3.191 May 20242300 R 4.4
3.1825 Oct 20232266 R 4.3
3.162 Nov 20222183 R 4.2
3.1427 Oct 20212083 R 4.1
3.1128 Apr 20201903 R 4.0
3.1030 Oct 20191823 R 3.6
3.831 Oct 20181649 R 3.5
3.631 Oct 20171473 R 3.4
3.418 Oct 20161296 R 3.3
3.214 Oct 20151104 R 3.2
3.014 Oct 2014934 R 3.1
2.1315 Oct 2013749 R 3.0
2.113 Oct 2012610 R 2.15
2.91 Nov 2011517 R 2.14
2.814 Apr 2011466 R 2.13
2.718 Nov 2010418 R 2.12
2.623 Apr 2010389 R 2.11
2.528 Oct 2009352 R 2.10
2.421 Apr 2009320 R 2.9
2.322 Oct 2008294 R 2.8
2.21 May 2008260 R 2.7
2.18 Oct 2007233 R 2.6
2.026 Apr 2007214 R 2.5
1.94 Oct 2006188 R 2.4
1.827 Apr 2006172 R 2.3
1.714 Oct 2005141 R 2.2
1.618 May 2005123 R 2.1
1.525 Oct 2004100 R 2.0
1.417 May 200481 R 1.9
1.330 Oct 200349 R 1.8
1.229 May 200330 R 1.7
1.119 Oct 200220 R 1.6
1.01 May 200215 R 1.5

Resources

See also

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

<span class="mw-page-title-main">Computational biology</span> Branch of biology

Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has foundations in applied mathematics, chemistry, and genetics. It differs from biological computing, a subfield of computer science and engineering which uses bioengineering to build computers.

<span class="mw-page-title-main">Biocomplexity Institute of Virginia Tech</span> Research organization for computational biology and related fields

The Biocomplexity Institute of Virginia Tech was a research institute specializing in bioinformatics, computational biology, and systems biology. The institute had more than 250 personnel, including over 50 tenured and research faculty. Research at the institute involved collaboration in diverse disciplines such as mathematics, computer science, biology, plant pathology, biochemistry, systems biology, statistics, economics, synthetic biology and medicine. The institute developed -omic and bioinformatic tools and databases that can be applied to the study of human, animal and plant diseases as well as the discovery of new vaccine, drug and diagnostic targets.

The completion of the human genome sequencing in the early 2000s was a turning point in genomics research. Scientists have conducted series of research into the activities of genes and the genome as a whole. The human genome contains around 3 billion base pairs nucleotide, and the huge quantity of data created necessitates the development of an accessible tool to explore and interpret this information in order to investigate the genetic basis of disease, evolution, and biological processes. The field of genomics has continued to grow, with new sequencing technologies and computational tool making it easier to study the genome.

<span class="mw-page-title-main">Microarray analysis techniques</span>

Microarray analysis techniques are used in interpreting the data generated from experiments on DNA, RNA, and protein microarrays, which allow researchers to investigate the expression state of a large number of genes – in many cases, an organism's entire genome – in a single experiment. Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult – if not impossible – to analyze without the help of computer programs.

lumi is a free, open source and open development software project for the analysis and comprehension of Illumina expression and methylation microarray data. The project was started in the summer of 2006 and set out to provide algorithms and data management tools of Illumina in the framework of Bioconductor. It is based on the statistical R programming language.

Within computational biology, an MA plot is an application of a Bland–Altman plot for visual representation of genomic data. The plot visualizes the differences between measurements taken in two samples, by transforming the data onto M and A scales, then plotting these values. Though originally applied in the context of two channel DNA microarray gene expression data, MA plots are also used to visualise high-throughput sequencing analysis.

<span class="mw-page-title-main">Galaxy (computational biology)</span>

Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make computational biology accessible to research scientists that do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system.

<span class="mw-page-title-main">UGENE</span> Computer software for bioinformatics

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

<span class="mw-page-title-main">Integrated Genome Browser</span>

Integrated Genome Browser (IGB) is an open-source genome browser, a visualization tool used to observe biologically-interesting patterns in genomic data sets, including sequence data, gene models, alignments, and data from DNA microarrays.

Rmetrics is a free and open-source software project for teaching computational finance. Rmetrics is based primarily on the statistical R programming language, but does contain contributions in other programming languages, such as Fortran, C, and C++. The project was started in 2001 by Diethelm Wuertz, based at the Swiss Federal Institute of Technology in Zurich.

Robert Clifford Gentleman is a Canadian statistician and bioinformatician who is currently the founding executive director of the Center for Computational Biomedicine at Harvard Medical School. He was previously the vice president of computational biology at 23andMe. Gentleman is recognized, along with Ross Ihaka, as one of the originators of the R programming language and the Bioconductor project.

GeneNetwork is a combined database and open-source bioinformatics data analysis software resource for systems genetics. This resource is used to study gene regulatory networks that link DNA sequence differences to corresponding differences in gene and protein expression and to variation in traits such as health and disease risk. Data sets in GeneNetwork are typically made up of large collections of genotypes and phenotypes from groups of individuals, including humans, strains of mice and rats, and organisms as diverse as Drosophila melanogaster, Arabidopsis thaliana, and barley. The inclusion of genotypes makes it practical to carry out web-based gene mapping to discover those regions of genomes that contribute to differences among individuals in mRNA, protein, and metabolite levels, as well as differences in cell function, anatomy, physiology, and behavior.

The phenotype microarray approach is a technology for high-throughput phenotyping of cells. A phenotype microarray system enables one to monitor simultaneously the phenotypic reaction of cells to environmental challenges or exogenous compounds in a high-throughput manner. The phenotypic reactions are recorded as either end-point measurements or respiration kinetics similar to growth curves.

<span class="mw-page-title-main">Gene set enrichment analysis</span> Bioinformatics method

Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with different phenotypes (e.g. different organism growth patterns or diseases). The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics technologies and proteomics results often identify thousands of genes, which are used for the analysis.

<span class="mw-page-title-main">Pathway analysis</span>

Pathway is the term from molecular biology for a curated schematic representation of a well characterized segment of the molecular physiological machinery, such as a metabolic pathway describing an enzymatic process within a cell or tissue or a signaling pathway model representing a regulatory process that might, in its turn, enable a metabolic or another regulatory process downstream. A typical pathway model starts with an extracellular signaling molecule that activates a specific receptor, thus triggering a chain of molecular interactions. A pathway is most often represented as a relatively small graph with gene, protein, and/or small molecule nodes connected by edges of known functional relations. While a simpler pathway might appear as a chain, complex pathway topologies with loops and alternative routes are much more common. Computational analyses employ special formats of pathway representation. In the simplest form, however, a pathway might be represented as a list of member molecules with order and relations unspecified. Such a representation, generally called Functional Gene Set (FGS), can also refer to other functionally characterised groups such as protein families, Gene Ontology (GO) and Disease Ontology (DO) terms etc. In bioinformatics, methods of pathway analysis might be used to identify key genes/ proteins within a previously known pathway in relation to a particular experiment / pathological condition or building a pathway de novo from proteins that have been identified as key affected elements. By examining changes in e.g. gene expression in a pathway, its biological activity can be explored. However most frequently, pathway analysis refers to a method of initial characterization and interpretation of an experimental condition that was studied with omics tools or genome-wide association study. Such studies might identify long lists of altered genes. A visual inspection is then challenging and the information is hard to summarize, since the altered genes map to a broad range of pathways, processes, and molecular functions. In such situations, the most productive way of exploring the list is to identify enrichment of specific FGSs in it. The general approach of enrichment analyses is to identify FGSs, members of which were most frequently or most strongly altered in the given condition, in comparison to a gene set sampled by chance. In other words, enrichment can map canonical prior knowledge structured in the form of FGSs to the condition represented by altered genes.

<span class="mw-page-title-main">Rafael Irizarry (scientist)</span> American professor of biostatistics

Rafael Irizarry is a professor of biostatistics at the Harvard T.H. Chan School of Public Health and professor of biostatistics and computational biology at the Dana–Farber Cancer Institute. Irizarry is known as one of the founders of the Bioconductor project.

Sandrine Dudoit is a professor of statistics and public health at the University of California, Berkeley. Her research applies statistics to microarray and genetic data; she is known as one of the founders of the open-source Bioconductor project for the development of bioinformatics software.

<span class="mw-page-title-main">R package</span> Extensions to the R statistical programming language

R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN. The large number of packages available for R, and the ease of installing and using them, has been cited as a major factor driving the widespread adoption of the language in data science.

References

  1. "Bioconductor – Release Announcements". bioconductor.org. Bioconductor. Retrieved 28 May 2019.