BioPerl

Initial release 11 June 2002
Stable release 1.7.7 / 7 December 2019
Written in Perl
Type Bioinformatics
License Artistic License and GPL
Website bioperl.org

BioPerl[1][2] is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It has played an integral role in the Human Genome Project.[3]

Background

BioPerl is an active open source software project supported by the Open Bioinformatics Foundation. The first Perl code for BioPerl was written by Tim Hubbard and Jong Bhak[citation needed] at MRC Centre Cambridge, where the first genome sequencing was carried out by Fred Sanger. MRC Centre was one of the hubs and birthplaces of modern bioinformatics, as it held a large quantity of DNA sequences and 3D protein structures. Hubbard was using the th_lib.pl Perl library, which contained many useful Perl subroutines for bioinformatics. Bhak, Hubbard's first PhD student, created jong_lib.pl and then merged the two subroutine libraries into Bio.pl. The name BioPerl was coined jointly by Bhak and Steven Brenner at the Centre for Protein Engineering (CPE).

In 1995, Brenner organized a BioPerl session at the Intelligent Systems for Molecular Biology conference, held in Cambridge. BioPerl gained some users in the following months, including Georg Fuellen, who organized a training course in Germany. Fuellen's colleagues and students greatly extended BioPerl; it was further expanded by others, including Steve Chervitz, who was actively developing Perl code for his yeast genome database. The major expansion came when Cambridge student Ewan Birney joined the development team.[citation needed]

The first stable release was on 11 June 2002; the most recent stable (in terms of API) release is 1.7.7 from 7 December 2019. Developer releases are also produced periodically. The 1.7.x version series is considered the most stable (in terms of bugs) version of BioPerl and is recommended for everyday use.

To take advantage of BioPerl, the user needs a basic understanding of the Perl programming language, including how to use Perl references, modules, objects, and methods.
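As a rough illustration of those prerequisites, the following sketch uses only core Perl. The Toy::Seq package and its seq_length method are hypothetical names invented for this example and are not part of BioPerl; they only mimic the named-argument constructor style (-id, -seq) that BioPerl classes commonly use.

```perl
use strict;
use warnings;

# A tiny hypothetical class; a real module would live in its own .pm file.
package Toy::Seq;

# Constructor: stores the named arguments in a hash reference
# and blesses that reference into the class, turning it into an object.
sub new {
    my ($class, %args) = @_;
    my $self = { id => $args{-id}, seq => $args{-seq} };
    return bless $self, $class;
}

# Methods: each receives the object as its first argument.
sub id         { my ($self) = @_; return $self->{id} }
sub seq_length { my ($self) = @_; return CORE::length($self->{seq}) }

package main;

# Create objects and call methods with the arrow operator.
my $seq  = Toy::Seq->new(-id => 'demo1', -seq => 'ACGTAC');
my @seqs = ($seq, Toy::Seq->new(-id => 'demo2', -seq => 'GGC'));

# Take a reference to the array and dereference it in a loop.
my $seqs_ref = \@seqs;
for my $s (@{$seqs_ref}) {
    printf "%s\t%d\n", $s->id, $s->seq_length;
}
```

BioPerl scripts follow the same pattern: load a module with use, construct an object with new, and call methods on it with the arrow operator.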

Features and examples

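BioPerl provides software modules for many of the typical tasks of bioinformatics programming, such as retrieving sequences from remote databases, converting records between file formats, and computing statistics on individual sequences.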

Example of accessing GenBank to retrieve a sequence:

    use Bio::DB::GenBank;

    # Accession number supplied on the command line
    my $accession = shift or die "usage: $0 accession\n";

    my $db_obj  = Bio::DB::GenBank->new;
    my $seq_obj = $db_obj->get_Seq_by_acc($accession);

Example code for transforming formats:

    use Bio::SeqIO;

    my $usage     = "all2y.pl informat outfile outfileformat";
    my $informat  = shift or die $usage;
    my $outfile   = shift or die $usage;
    my $outformat = shift or die $usage;

    # Read sequences from standard input in the given input format
    my $seqin = Bio::SeqIO->new(
        -fh     => *STDIN,
        -format => $informat,
    );

    # Write sequences to the output file in the given output format
    my $seqout = Bio::SeqIO->new(
        -file   => ">$outfile",
        -format => $outformat,
    );

    while (my $inseq = $seqin->next_seq) {
        $seqout->write_seq($inseq);
    }

Example of gathering statistics for a given sequence:

    use Bio::Tools::SeqStats;

    # $seqobj is assumed to be an existing sequence object, such as a Bio::Seq
    my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

    my $weight      = $seq_stats->get_mol_wt();
    my $monomer_ref = $seq_stats->count_monomers();

    # for nucleic acid sequence
    my $codon_ref = $seq_stats->count_codons();
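The statistics example takes an existing sequence object as input. A minimal self-contained sketch, assuming BioPerl is installed, might build one in memory first. The identifier and sequence below are made up for illustration, and the comments on return values reflect the author's reading of the Bio::Tools::SeqStats documentation rather than anything stated in this article.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::Tools::SeqStats;

# Build a short DNA sequence object in memory (illustrative values only).
my $seqobj = Bio::Seq->new(
    -id       => 'demo',
    -seq      => 'ACGTACGTAA',
    -alphabet => 'dna',
);

my $seq_stats = Bio::Tools::SeqStats->new(-seq => $seqobj);

# count_monomers returns a reference to a hash of monomer => count.
my $monomer_ref = $seq_stats->count_monomers();
for my $base (sort keys %{$monomer_ref}) {
    print "$base: $monomer_ref->{$base}\n";
}

# get_mol_wt returns a reference to a two-element array holding the
# lower and upper bounds of the molecular weight.
my $weight = $seq_stats->get_mol_wt();
print "Molecular weight between $weight->[0] and $weight->[1]\n";
```

In a real script the sequence object would more likely come from Bio::SeqIO or a database query than from a literal string.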

Usage

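In addition to being used directly by end-users,[4] BioPerl has also provided the base for a wide variety of bioinformatic tools, including, among others, SynBrowse,[5] GeneComber,[6] TFBS,[7] MIMOX,[8] BioParser,[9] degenerate primer design,[10] querying the public databases,[11] and the Current Comparative Table.[12]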

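New tools and algorithms from external developers are often integrated directly into BioPerl itself, for example methods for dealing with phylogenetic trees and nested taxa[13] and the FPC Web tools.[14]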

Advantages

BioPerl was one of the first repositories of biological modules, and it made such modules considerably easier to use. Its modules are easy to install, and it is backed by a flexible global repository. BioPerl also ships test modules that cover a large variety of processes.

Disadvantages

There are many ways to use BioPerl, from simple scripting to very complex object-oriented programming. This flexibility can make BioPerl code unclear and sometimes hard to understand. In addition, although BioPerl has many modules, some do not always work as intended.[citation needed]

Several related bioinformatics libraries implemented in other programming languages exist as part of the Open Bioinformatics Foundation, including Biopython and BioJava.

References

  1. Stajich, J. E.; Block, D.; Boulez, K.; Brenner, S.; Chervitz, S.; Dagdigian, C.; Fuellen, G.; Gilbert, J.; Korf, I.; Lapp, H.; Lehväslaiho, H.; Matsalla, C.; Mungall, C. J.; Osborne, B. I.; Pocock, M. R.; Schattner, P.; Senger, M.; Stein, L. D.; Stupka, E.; Wilkinson, M. D.; Birney, E. (2002). "The BioPerl Toolkit: Perl Modules for the Life Sciences". Genome Research. 12 (10): 1611–1618. doi:10.1101/gr.361602. PMC 187536. PMID 12368254.
  2. "BioPerl publications - BioPerl". Archived from the original on 2007-02-02. Retrieved 2007-01-21. A complete, up-to-date list of BioPerl references.
  3. Stein, Lincoln (1996). "How Perl saved the human genome project". The Perl Journal. 1 (2). Archived from the original on 2007-02-02. Retrieved 2009-02-25.
  4. Khaja, R.; MacDonald, J.; Zhang, J.; Scherer, S. (2006). "Methods for identifying and mapping recent segmental and gene duplications in eukaryotic genomes". Gene Mapping, Discovery, and Expression. Methods Mol Biol. Vol. 338. Totowa, N.J.: Humana Press. pp. 9–20. doi:10.1385/1-59745-097-9:9. ISBN 978-1-59745-097-3. PMID 16888347.
  5. Pan, X.; Stein, L.; Brendel, V. (2005). "SynBrowse: A synteny browser for comparative sequence analysis". Bioinformatics. 21 (17): 3461–3468. doi:10.1093/bioinformatics/bti555. PMID 15994196.
  6. Shah, S. P.; McVicker, G. P.; MacKworth, A. K.; Rogic, S.; Ouellette, B. F. F. (2003). "GeneComber: Combining outputs of gene prediction programs for improved results". Bioinformatics. 19 (10): 1296–1297. doi:10.1093/bioinformatics/btg139. PMID 12835277.
  7. Lenhard, B.; Wasserman, W. W. (2002). "TFBS: Computational framework for transcription factor binding site analysis". Bioinformatics. 18 (8): 1135–1136. doi:10.1093/bioinformatics/18.8.1135. PMID 12176838.
  8. Huang, J.; Gutteridge, A.; Honda, W.; Kanehisa, M. (2006). "MIMOX: A web tool for phage display based epitope mapping". BMC Bioinformatics. 7: 451. doi:10.1186/1471-2105-7-451. PMC 1618411. PMID 17038191.
  9. Catanho, M.; Mascarenhas, D.; Degrave, W.; De Miranda, A. B. ?L. (2006). "BioParser". Applied Bioinformatics. 5 (1): 49–53. doi:10.2165/00822942-200605010-00007. PMID 16539538.
  10. Wei, X.; Kuhn, D. N.; Narasimhan, G. (2003). "Degenerate primer design via clustering". Proceedings. IEEE Computer Society Bioinformatics Conference. 2: 75–83. PMID 16452781.
  11. Croce, O.; Lamarre, M. L.; Christen, R. (2006). "Querying the public databases for sequences using complex keywords contained in the feature lines". BMC Bioinformatics. 7: 45. doi:10.1186/1471-2105-7-45. PMC 1403806. PMID 16441875.
  12. Landsteiner, B. R.; Olson, M. R.; Rutherford, R. (2005). "Current Comparative Table (CCT) automates customized searches of dynamic biological databases". Nucleic Acids Research. 33 (Web Server issue): W770–W773. doi:10.1093/nar/gki432. PMC 1160193. PMID 15980582.
  13. Llabrés, M.; Rocha, J.; Rosselló, F.; Valiente, G. (2006). "On the Ancestral Compatibility of Two Phylogenetic Trees with Nested Taxa". Journal of Mathematical Biology. 53 (3): 340–364. arXiv:cs/0505086. doi:10.1007/s00285-006-0011-4. PMID 16823581. S2CID 1704494.
  14. Pampanwar, V.; Engler, F.; Hatfield, J.; Blundy, S.; Gupta, G.; Soderlund, C. (2005). "FPC Web Tools for Rice, Maize, and Distribution". Plant Physiology. 138 (1): 116–126. doi:10.1104/pp.104.056291. PMC 1104167. PMID 15888684.