Biopython

Last updated
Biopython
Original author(s) Chapman B, Chang J [1]
Initial releaseDecember 17, 2002;21 years ago (2002-12-17)
Stable release
1.81 [2]   OOjs UI icon edit-ltr-progressive.svg
Repository https://github.com/biopython/biopython [3]   OOjs UI icon edit-ltr-progressive.svg
Written in Python and C
Platform Cross-platform
Type Bioinformatics
License Biopython License
Website biopython.org

The Biopython project is an open-source collection of non-commercial Python tools for computational biology and bioinformatics, created by an international association of developers. [1] [4] [5] It contains classes to represent biological sequences and sequence annotations, and it is able to read and write to a variety of file formats. It also allows for a programmatic means of accessing online databases of biological information, such as those at NCBI. Separate modules extend Biopython's capabilities to sequence alignment, protein structure, population genetics, phylogenetics, sequence motifs, and machine learning. Biopython is one of a number of Bio* projects designed to reduce code duplication in computational biology. [6]

Contents

History

Biopython development began in 1999 and it was first released in July 2000. [7] It was developed during a similar time frame and with analogous goals to other projects that added bioinformatics capabilities to their respective programming languages, including BioPerl, BioRuby and BioJava. Early developers on the project included Jeff Chang, Andrew Dalke and Brad Chapman, though over 100 people have made contributions to date. [8] In 2007, a similar Python project, namely PyCogent, was established. [9]

The initial scope of Biopython involved accessing, indexing and processing biological sequence files. While this is still a major focus, over the following years added modules have extended its functionality to cover additional areas of biology (see Key features and examples).

As of version 1.77, Biopython no longer supports Python 2. [10]

Design

Wherever possible, Biopython follows the conventions used by the Python programming language to make it easier for users familiar with Python. For example, Seq and SeqRecord objects can be manipulated via slicing, in a manner similar to Python's strings and lists. It is also designed to be functionally similar to other Bio* projects, such as BioPerl. [7]

Biopython is able to read and write most common file formats for each of its functional areas, and its license is permissive and compatible with most other software licenses, which allow Biopython to be used in a variety of software projects. [5]

Key features and examples

Sequences

A core concept in Biopython is the biological sequence, and this is represented by the Seq class. [11] A Biopython Seq object is similar to a Python string in many respects: it supports the Python slice notation, can be concatenated with other sequences and is immutable. In addition, it includes sequence-specific methods and specifies the particular biological alphabet used.

>>> # This script creates a DNA sequence and performs some typical manipulations>>> fromBio.SeqimportSeq>>> dna_sequence=Seq("AGGCTTCTCGTA",IUPAC.unambiguous_dna)>>> dna_sequenceSeq('AGGCTTCTCGTA', IUPACUnambiguousDNA())>>> dna_sequence[2:7]Seq('GCTTC', IUPACUnambiguousDNA())>>> dna_sequence.reverse_complement()Seq('TACGAGAAGCCT', IUPACUnambiguousDNA())>>> rna_sequence=dna_sequence.transcribe()>>> rna_sequenceSeq('AGGCUUCUCGUA', IUPACUnambiguousRNA())>>> rna_sequence.translate()Seq('RLLV', IUPACProtein())

Sequence annotation

The SeqRecord class describes sequences, along with information such as name, description and features in the form of SeqFeature objects. Each SeqFeature object specifies the type of the feature and its location. Feature types can be ‘gene’, ‘CDS’ (coding sequence), ‘repeat_region’, ‘mobile_element’ or others, and the position of features in the sequence can be exact or approximate.

>>> # This script loads an annotated sequence from file and views some of its contents.>>> fromBioimportSeqIO>>> seq_record=SeqIO.read("pTC2.gb","genbank")>>> seq_record.name'NC_019375'>>> seq_record.description'Providencia stuartii plasmid pTC2, complete sequence.'>>> seq_record.features[14]SeqFeature(FeatureLocation(ExactPosition(4516), ExactPosition(5336), strand=1), type='mobile_element')>>> seq_record.seqSeq("GGATTGAATATAACCGACGTGACTGTTACATTTAGGTGGCTAAACCCGTCAAGC...GCC", IUPACAmbiguousDNA())

Input and output

Biopython can read and write to a number of common sequence formats, including FASTA, FASTQ, GenBank, Clustal, PHYLIP and NEXUS. When reading files, descriptive information in the file is used to populate the members of Biopython classes, such as SeqRecord. This allows records of one file format to be converted into others.

Very large sequence files can exceed a computer's memory resources, so Biopython provides various options for accessing records in large files. They can be loaded entirely into memory in Python data structures, such as lists or dictionaries, providing fast access at the cost of memory usage. Alternatively, the files can be read from disk as needed, with slower performance but lower memory requirements.

>>> # This script loads a file containing multiple sequences and saves each one in a different format.>>> fromBioimportSeqIO>>> genomes=SeqIO.parse("salmonella.gb","genbank")>>> forgenomeingenomes:... SeqIO.write(genome,genome.id+".fasta","fasta")

Accessing online databases

Through the Bio.Entrez module, users of Biopython can download biological data from NCBI databases. Each of the functions provided by the Entrez search engine is available through functions in this module, including searching for and downloading records.

>>> # This script downloads genomes from the NCBI Nucleotide database and saves them in a FASTA file.>>> fromBioimportEntrez>>> fromBioimportSeqIO>>> output_file=open("all_records.fasta","w")>>> Entrez.email="my_email@example.com">>> records_to_download=["FO834906.1","FO203501.1"]>>> forrecord_idinrecords_to_download:... handle=Entrez.efetch(db="nucleotide",id=record_id,rettype="gb")... seqRecord=SeqIO.read(handle,format="gb")... handle.close()... output_file.write(seqRecord.format("fasta"))

Phylogeny

Figure 1: A rooted phylogenetic tree created by Bio.Phylo showing the relationship between different organisms' Apaf-1 homologs Phylo.draw.png
Figure 1: A rooted phylogenetic tree created by Bio.Phylo showing the relationship between different organisms' Apaf-1 homologs
Figure 2: The same tree as above, drawn unrooted using Graphviz via Bio.Phylo Phylo.draw graphviz.png
Figure 2: The same tree as above, drawn unrooted using Graphviz via Bio.Phylo

The Bio.Phylo module provides tools for working with and visualising phylogenetic trees. A variety of file formats are supported for reading and writing, including Newick, NEXUS and phyloXML. Common tree manipulations and traversals are supported via the Tree and Clade objects. Examples include converting and collating tree files, extracting subsets from a tree, changing a tree's root, and analysing branch features such as length or score. [13]

Rooted trees can be drawn in ASCII or using matplotlib (see Figure 1), and the Graphviz library can be used to create unrooted layouts (see Figure 2).

Genome diagrams

Figure 3: A diagram of the genes on the pKPS77 plasmid, visualised using the GenomeDiagram module in Biopython PKPS77.png
Figure 3: A diagram of the genes on the pKPS77 plasmid, visualised using the GenomeDiagram module in Biopython

The GenomeDiagram module provides methods of visualising sequences within Biopython. [15] Sequences can be drawn in a linear or circular form (see Figure 3), and many output formats are supported, including PDF and PNG. Diagrams are created by making tracks and then adding sequence features to those tracks. By looping over a sequence's features and using their attributes to decide if and how they are added to the diagram's tracks, one can exercise much control over the appearance of the final diagram. Cross-links can be drawn between different tracks, allowing one to compare multiple sequences in a single diagram.

Macromolecular structure

The Bio.PDB module can load molecular structures from PDB and mmCIF files, and was added to Biopython in 2003. [16] The Structure object is central to this module, and it organises macromolecular structure in a hierarchical fashion: Structure objects contain Model objects which contain Chain objects which contain Residue objects which contain Atom objects. Disordered residues and atoms get their own classes, DisorderedResidue and DisorderedAtom, that describe their uncertain positions.

Using Bio.PDB, one can navigate through individual components of a macromolecular structure file, such as examining each atom in a protein. Common analyses can be carried out, such as measuring distances or angles, comparing residues and calculating residue depth.

Population genetics

The Bio.PopGen module adds support to Biopython for Genepop, a software package for statistical analysis of population genetics. [17] This allows for analyses of Hardy–Weinberg equilibrium, linkage disequilibrium and other features of a population's allele frequencies.

This module can also carry out population genetic simulations using coalescent theory with the fastsimcoal2 program. [18]

Wrappers for command line tools

Many of Biopython's modules contain command line wrappers for commonly used tools, allowing these tools to be used from within Biopython. These wrappers include BLAST, Clustal, PhyML, EMBOSS and SAMtools. Users can subclass a generic wrapper class to add support for any other command line tool.

See also

Related Research Articles

<span class="mw-page-title-main">Bioinformatics</span> Computational analysis of large, complex sets of biological data

Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.

<span class="mw-page-title-main">Sequence alignment</span> Process in bioinformatics that identifies equivalent sites within molecular sequences

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences. Aligned sequences of nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps are inserted between the residues so that identical or similar characters are aligned in successive columns. Sequence alignments are also used for non-biological sequences such as calculating the distance cost between strings in a natural language, or to display financial data.

<span class="mw-page-title-main">National Center for Biotechnology Information</span> Database branch of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by US Congressman Claude Pepper.

In bioinformatics, BLAST is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. A BLAST search enables a researcher to compare a subject protein or nucleotide sequence with a library or database of sequences, and identify database sequences that resemble the query sequence above a certain threshold. For example, following the discovery of a previously unknown gene in the mouse, a scientist will typically perform a BLAST search of the human genome to see if humans carry a similar gene; BLAST will identify sequences in the human genome that resemble the mouse gene based on similarity of sequence.

In bioinformatics and biochemistry, the FASTA format is a text-based format for representing either nucleotide sequences or amino acid (protein) sequences, in which nucleotides or amino acids are represented using single-letter codes.

BioJava is an open-source software project dedicated to provide Java tools to process biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a huge range of data, starting from DNA and protein sequences to the level of 3D protein structures. The BioJava libraries are useful for automating many daily and mundane bioinformatics tasks such as to parsing a Protein Data Bank (PDB) file, interacting with Jmol and many more. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats and enables rapid application development and analysis.

<span class="mw-page-title-main">BioPerl</span> Collection of Perl modules for bioinformatics

BioPerl is a collection of Perl modules that facilitate the development of Perl scripts for bioinformatics applications. It has played an integral role in the Human Genome Project.

FASTA is a DNA and protein sequence alignment software package first described by David J. Lipman and William R. Pearson in 1985. Its legacy is the FASTA format which is now ubiquitous in bioinformatics.

The European Bioinformatics Institute (EMBL-EBI) is an intergovernmental organization (IGO) which, as part of the European Molecular Biology Laboratory (EMBL) family, focuses on research and services in bioinformatics. It is located on the Wellcome Genome Campus in Hinxton near Cambridge, and employs over 600 full-time equivalent (FTE) staff. Institute leaders such as Rolf Apweiler, Alex Bateman, Ewan Birney, and Guy Cochrane, an adviser on the National Genomics Data Center Scientific Advisory Board, serve as part of the international research network of the BIG Data Center at the Beijing Institute of Genomics.

A phylogenetic network is any graph used to visualize evolutionary relationships between nucleotide sequences, genes, chromosomes, genomes, or species. They are employed when reticulation events such as hybridization, horizontal gene transfer, recombination, or gene duplication and loss are believed to be involved. They differ from phylogenetic trees by the explicit modeling of richly linked networks, by means of the addition of hybrid nodes instead of only tree nodes. Phylogenetic trees are a subset of phylogenetic networks. Phylogenetic networks can be inferred and visualised with software such as SplitsTree, the R-package, phangorn, and, more recently, Dendroscope. A standard format for representing phylogenetic networks is a variant of Newick format which is extended to support networks as well as trees.

<span class="mw-page-title-main">Galaxy (computational biology)</span>

Galaxy is a scientific workflow, data integration, and data and analysis persistence and publishing platform that aims to make computational biology accessible to research scientists that do not have computer programming or systems administration experience. Although it was initially developed for genomics research, it is largely domain agnostic and is now used as a general bioinformatics workflow management system.

<span class="mw-page-title-main">UGENE</span>

UGENE is computer software for bioinformatics. It works on personal computer operating systems such as Windows, macOS, or Linux. It is released as free and open-source software, under a GNU General Public License (GPL) version 2.

<span class="mw-page-title-main">RNA-Seq</span> Lab technique in cellular biology

RNA-Seq is a technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.

The Variant Call Format (VCF) is a standard text file format used in bioinformatics for storing gene sequence variations. The format was developed in 2010 for the 1000 Genomes Project and has since been used by other large-scale genotyping and DNA sequencing projects. VCF is a common output format for variant calling programs due to its relative simplicity and scalability. Many tools have been developed for editing and manipulating VCF files, including VCFtools, which was released in conjunction with the VCF format in 2011, and BCFtools, which was included as part of SAMtools until being split into an independent package in 2014.

<span class="mw-page-title-main">Sequence Read Archive</span>

The Sequence Read Archive is a bioinformatics database that provides a public repository for DNA sequencing data, especially the "short reads" generated by high-throughput sequencing, which are typically less than 1,000 base pairs in length. The archive is part of the International Nucleotide Sequence Database Collaboration (INSDC), and run as a collaboration between the NCBI, the European Bioinformatics Institute (EBI), and the DNA Data Bank of Japan (DDBJ).

<span class="mw-page-title-main">.NET Bio</span> Microsoft free software

.NET Bio is an open source bioinformatics and genomics library created to enable simple loading, saving and analysis of biological data. It was designed for .NET Standard 2.0 and was part of the Microsoft Biology Initiative in the eScience division.

Sequence Alignment Map (SAM) is a text-based format originally for storing biological sequences aligned to a reference sequence developed by Heng Li and Bob Handsaker et al. It was developed when the 1000 Genomes Project wanted to move away from the MAQ mapper format and decided to design a new format. The overall TAB-delimited flavour of the format came from an earlier format inspired by BLAT’s PSL. The name of SAM came from Gabor Marth from University of Utah, who originally had a format under the same name but with a different syntax more similar to a BLAST output. It is widely used for storing data, such as nucleotide sequences, generated by next generation sequencing technologies, and the standard has been broadened to include unmapped sequences. The format supports short and long reads (up to 128 Mbp) produced by different sequencing platforms and is used to hold mapped data within the Genome Analysis Toolkit (GATK) and across the Broad Institute, the Wellcome Sanger Institute, and throughout the 1000 Genomes Project.

Bloom filters are space-efficient probabilistic data structures used to test whether an element is a part of a set. Bloom filters require much less space than other data structures for representing sets, however the downside of Bloom filters is that there is a false positive rate when querying the data structure. Since multiple elements may have the same hash values for a number of hash functions, then there is a probability that querying for a non-existent element may return a positive if another element with the same hash values has been added to the Bloom filter. Assuming that the hash function has equal probability of selecting any index of the Bloom filter, the false positive rate of querying a Bloom filter is a function of the number of bits, number of hash functions and number of elements of the Bloom filter. This allows the user to manage the risk of a getting a false positive by compromising on the space benefits of the Bloom filter.

References

  1. 1 2 Chapman, Brad; Chang, Jeff (August 2000). "Biopython: Python tools for computational biology". ACM SIGBIO Newsletter. 20 (2): 15–19. doi: 10.1145/360262.360268 . S2CID   9417766.
  2. "Release biopython-181: Commit Release 1.81 (#4233)" . Retrieved 22 April 2023.
  3. Error: Unable to display the reference properly. See the documentation for details.
  4. Cock, Peter JA; Antao, Tiago; Chang, Jeffery T; Chapman, Brad A; Cox, Cymon J; Dalke, Andrew; Friedberg, Iddo; Hamelryck, Thomas; Kauff, Frank; Wilczynski, Bartek; de Hoon, Michiel JL (20 March 2009). "Biopython: freely available Python tools for computational molecular biology and bioinformatics". Bioinformatics. 25 (11): 1422–3. doi:10.1093/bioinformatics/btp163. PMC   2682512 . PMID   19304878.
  5. 1 2 Refer to the Biopython website for other papers describing Biopython, and a list of over one hundred publications using/citing Biopython.
  6. Mangalam, Harry (September 2002). "The Bio* toolkits—a brief overview". Briefings in Bioinformatics. 3 (3): 296–302. doi: 10.1093/bib/3.3.296 . PMID   12230038.
  7. 1 2 Chapman, Brad (11 March 2004), The Biopython Project: Philosophy, functionality and facts (PDF), retrieved 11 September 2014
  8. List of Biopython contributors, archived from the original on 11 September 2014, retrieved 11 September 2014
  9. Knight, R; Maxwell, P; Birmingham, A; Carnes, J; Caporaso, J. G.; Easton, B. C.; Eaton, M; Hamady, M; Lindsay, H; Liu, Z; Lozupone, C; McDonald, D; Robeson, M; Sammut, R; Smit, S; Wakefield, M. J.; Widmann, J; Wikman, S; Wilson, S; Ying, H; Huttley, G. A. (2007). "Py Cogent: A toolkit for making sense from sequence". Genome Biology. 8 (8): R171. doi: 10.1186/gb-2007-8-8-r171 . PMC   2375001 . PMID   17708774.
  10. Daley, Chris, Biopython 1.77 released , retrieved 6 October 2021
  11. Chang, Jeff; Chapman, Brad; Friedberg, Iddo; Hamelryck, Thomas; de Hoon, Michiel; Cock, Peter; Antao, Tiago; Talevich, Eric; Wilczynski, Bartek (29 May 2014), Biopython Tutorial and Cookbook , retrieved 28 August 2014
  12. Zmasek, Christian M; Zhang, Qing; Ye, Yuzhen; Godzik, Adam (24 October 2007). "Surprising complexity of the ancestral apoptosis network". Genome Biology. 8 (10): R226. doi: 10.1186/gb-2007-8-10-r226 . PMC   2246300 . PMID   17958905.
  13. Talevich, Eric; Invergo, Brandon M; Cock, Peter JA; Chapman, Brad A (21 August 2012). "Bio.Phylo: A unified toolkit for processing, analyzing and visualizing phylogenetic trees in Biopython". BMC Bioinformatics. 13 (209): 209. doi: 10.1186/1471-2105-13-209 . PMC   3468381 . PMID   22909249.
  14. "Klebsiella pneumoniae strain KPS77 plasmid pKPS77, complete sequence". NCBI. Retrieved 10 September 2014.
  15. Pritchard, Leighton; White, Jennifer A; Birch, Paul RJ; Toth, Ian K (March 2006). "GenomeDiagram: a python package for the visualization of large-scale genomic data". Bioinformatics. 22 (5): 616–617. doi: 10.1093/bioinformatics/btk021 . PMID   16377612.
  16. Hamelryck, Thomas; Manderick, Bernard (10 May 2003). "PDB file parser and structure class implemented in Python". Bioinformatics. 19 (17): 2308–2310. doi: 10.1093/bioinformatics/btg299 . PMID   14630660.
  17. Rousset, François (January 2008). "GENEPOP'007: a complete re-implementation of the GENEPOP software for Windows and Linux". Molecular Ecology Resources. 8 (1): 103–106. doi:10.1111/j.1471-8286.2007.01931.x. PMID   21585727. S2CID   25776992.
  18. Excoffier, Laurent; Foll, Matthieu (1 March 2011). "fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios". Bioinformatics. 27 (9): 1332–1334. doi: 10.1093/bioinformatics/btr124 . PMID   21398675.