BioJava

Original author(s) Andreas Prlić
Developer(s) Amr ALHOSSARY, Andreas Prlic, Dmytro Guzenko, Hannes Brandstätter-Müller, Jose Manuel Duarte, Thomas Down, Michael L Heuer, Peter Troshin, JianJiong Gao, Aleix Lafita, Peter Rose, Spencer Bliven
Initial release 2002
Stable release 6.0.3 / December 19, 2021
Repository github.com/biojava
Written in Java
Platform Web browser with Java SE
Available in English
Type Bioinformatics
License Lesser GPL 2.1
Website biojava.org

BioJava is an open-source software project dedicated to providing Java tools for processing biological data. [1] [2] [3] It is a set of library functions written in the Java programming language for manipulating sequences and protein structures, and it includes file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS) support, access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a range of data, from DNA and protein sequences to 3D protein structures. The BioJava libraries are useful for automating many routine bioinformatics tasks, such as parsing a Protein Data Bank (PDB) file or interacting with Jmol. [4] The application programming interface (API) provides file parsers, data models and algorithms that facilitate working with standard data formats and enable rapid application development and analysis.

Additional projects from BioJava include rcsb-sequenceviewer, biojava-http, biojava-spark, and rcsb-viewers.

Features

BioJava provides software modules for many of the typical tasks of bioinformatics programming. These include:

History and publications

The BioJava project grew out of work by Thomas Down and Matthew Pocock to create an API to simplify development of Java-based bioinformatics tools. BioJava is an active open-source project that has been developed over more than 12 years by more than 60 developers. BioJava is one of a number of Bio* projects designed to reduce code duplication. [5] Other Bio* projects include BioPython, [6] BioPerl, [7] BioRuby [8] and EMBOSS. [9]

In October 2012, the first paper on BioJava was published. [10] This paper detailed BioJava's modules, functionalities, and purpose.

As of November 2018 Google Scholar counts more than 130 citations. [11]

The most recent paper on BioJava was published in February 2017. [12] It described a new tool named BioJava-ModFinder, which can be used to identify protein modifications and map them onto 3D structures in the Protein Data Bank (PDB). The package was also integrated with the RCSB PDB web application, where it added protein modification annotations to the sequence diagram and structure display. More than 30,000 structures with protein modifications were identified using BioJava-ModFinder and can be found on the RCSB PDB website.

BioJava's first application note was published in 2008. [2] The project was migrated from its original CVS repository to GitHub in April 2013. [13] The original code base has been moved to a separate repository, BioJava-legacy, and is still maintained for minor changes and bug fixes. [14]

Version 3 was released in December 2010. It was a major update to the prior versions. The aim of this release was to rewrite BioJava so that it could be modularized into small, reusable components. This allowed developers to contribute more easily and reduced dependencies. The new approach seen in BioJava 3 was modeled after the Apache Commons.

Version 4 was released in January 2015. This version brought many new features and improvements to the packages biojava-core, biojava-structure, biojava-structure-gui and biojava-phylo, among others. BioJava 4.2.0 was the first release to be available through Maven from Maven Central.

Version 5 was released in March 2018 and represents a major milestone for the project. BioJava 5.0.0 is the first release based on Java 8, which introduces the use of lambda functions and stream API calls. There were also major changes to the biojava-structure module: the previous data models for macromolecular structures were adapted to more closely represent the mmCIF data model. This was the first release in over two years. Other improvements include optimizations in the biojava-structure module to improve symmetry detection and added support for the MMTF format. General improvements include Javadoc updates, updated dependency versions, and the migration of all tests to JUnit 4. The release contains 1,170 commits from 19 contributors.

Modules

During 2014-2015, large parts of the original code base were rewritten. BioJava 3 is a clear departure from the version 1 series. It now consists of several independent modules built using an automation tool called Apache Maven. [15] These modules provide state-of-the-art tools for protein structure comparison, pairwise and multiple sequence alignments, working with DNA and protein sequences, analysis of amino acid properties, detecting protein modifications, predicting disordered regions in proteins, and parsers for common file formats using a biologically meaningful data model. The original code has been moved into a separate BioJava legacy project, which is still available for backward compatibility. [16]

BioJava 5 introduced new features to two modules, biojava-alignment and biojava-structure.

The following sections will describe several of the new modules and highlight some of the new features that are included in the latest version of BioJava.

Figure: BioJava 5 module layout.

Core Module

This module provides Java classes to model amino acid or nucleotide sequences. The classes were designed so that the names are familiar and make sense to biologists and also provide a concrete representation of the steps in going from a gene sequence to a protein sequence for computer scientists and programmers.

A major change between the legacy BioJava project and BioJava 3 lies in the way the framework has been designed to exploit then-new innovations in Java. A sequence is defined as a generic interface, allowing the rest of the modules to create any utility that operates on all sequences. Specific classes for common sequences such as DNA and proteins have been defined to improve usability for biologists. The translation engine builds on this work by allowing conversions between DNA, RNA and amino acid sequences. This engine can handle details such as choosing the codon table, converting start codons to methionine, trimming stop codons, specifying the reading frame and handling ambiguous sequences.
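The translation steps described above can be illustrated with a small standalone sketch. It does not use BioJava's classes; the class name TranslationSketch, its method names and the choice of AUG, CUG and UUG as initiator codons are assumptions made for this example, and only the standard genetic code is supported.

```java
import java.util.Arrays;
import java.util.List;

/** Standalone sketch of DNA -> RNA -> protein translation. This is not BioJava's API. */
class TranslationSketch {

    // Standard genetic code, with codons ordered by base U, C, A, G at each position.
    private static final String BASES = "UCAG";
    private static final String AMINO_ACIDS =
            "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG";

    // Codons treated as initiators in this sketch (an assumption for illustration).
    private static final List<String> START_CODONS = Arrays.asList("AUG", "CUG", "UUG");

    /** Transcribes a DNA coding strand into RNA by replacing T with U. */
    static String transcribe(String dna) {
        return dna.toUpperCase().replace('T', 'U');
    }

    /** Translates one codon; codons containing ambiguous bases become 'X'. */
    static char translateCodon(String codon) {
        int index = 0;
        for (int i = 0; i < 3; i++) {
            int base = BASES.indexOf(codon.charAt(i));
            if (base < 0) {
                return 'X';
            }
            index = index * 4 + base;
        }
        return AMINO_ACIDS.charAt(index);
    }

    /**
     * Translates RNA starting at the given frame offset (0, 1 or 2).
     * initMet converts an initiating start codon to methionine;
     * trimStop drops a trailing stop codon, written as '*'.
     */
    static String translate(String rna, int frame, boolean initMet, boolean trimStop) {
        StringBuilder protein = new StringBuilder();
        for (int i = frame; i + 3 <= rna.length(); i += 3) {
            String codon = rna.substring(i, i + 3);
            boolean isFirst = protein.length() == 0;
            if (isFirst && initMet && START_CODONS.contains(codon)) {
                protein.append('M');
            } else {
                protein.append(translateCodon(codon));
            }
        }
        int last = protein.length() - 1;
        if (trimStop && last >= 0 && protein.charAt(last) == '*') {
            protein.setLength(last);
        }
        return protein.toString();
    }

    public static void main(String[] args) {
        String rna = transcribe("ATGGCCTAA");
        System.out.println(rna);                           // AUGGCCUAA
        System.out.println(translate(rna, 0, true, true)); // MA
    }
}
```

Changing the frame argument to 1 or 2 shifts the reading frame, and passing false for trimStop keeps the trailing '*' in the result.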

Special attention has been paid to designing the storage of sequences to minimize space needs. Special design patterns such as the Proxy pattern allowed the developers to create the framework such that sequences can be stored in memory, fetched on demand from a web service such as UniProt, or read from a FASTA file as needed. The latter two approaches save memory by not loading sequence data until it is referenced in the application. This concept can be extended to handle very large genomic datasets, such as NCBI GenBank or a proprietary database.
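The idea of deferring the load until the data is referenced can be sketched as follows. This is a minimal illustration of the Proxy pattern, not BioJava's implementation; the SequenceReader interface, the two reader classes and the made-up residue string are all hypothetical names and values chosen for this example.

```java
import java.util.function.Supplier;

/** Sketch of proxy-style lazy loading for sequence data. Names are hypothetical. */
class LazySequenceSketch {

    /** Common view of a sequence, regardless of where the residues are stored. */
    interface SequenceReader {
        String getSequence();
        int getLength();
    }

    /** Holds the residues directly in memory. */
    static final class InMemoryReader implements SequenceReader {
        private final String residues;

        InMemoryReader(String residues) {
            this.residues = residues;
        }

        @Override public String getSequence() { return residues; }
        @Override public int getLength() { return residues.length(); }
    }

    /** Proxy that defers loading (from a file or web service) until first access. */
    static final class LazyReader implements SequenceReader {
        private final Supplier<String> loader;
        private String cached;

        LazyReader(Supplier<String> loader) {
            this.loader = loader;
        }

        @Override public synchronized String getSequence() {
            if (cached == null) {
                cached = loader.get(); // the expensive step happens at most once
            }
            return cached;
        }

        @Override public int getLength() { return getSequence().length(); }
    }

    public static void main(String[] args) {
        SequenceReader lazy = new LazyReader(() -> {
            System.out.println("loading...");
            return "MKVLAAGIVG"; // made-up residues standing in for a remote fetch
        });
        System.out.println(lazy.getLength());   // triggers the load, then prints 10
        System.out.println(lazy.getSequence()); // served from the cache
    }
}
```

Code that works against SequenceReader does not need to know which storage strategy is in use, which is what lets the same utilities run on in-memory and on-demand sequences.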

Protein structure modules

Figure: Two proteins with IDs "4hhb.A" and "4hhb.B" aligned against each other. The code is given on the left side. The view is produced using BioJava libraries, which in turn use the Jmol viewer. The FATCAT rigid algorithm is used here to do the alignment.

The protein structure modules provide tools to represent and manipulate 3D biomolecular structures. They focus on protein structure comparison.

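The following algorithms have been implemented and included in BioJava: the FATCAT algorithm for flexible and rigid-body alignment, [17] the Combinatorial Extension (CE) algorithm, [18] and a version of CE that can detect circular permutations in proteins. [19]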

These algorithms are used to provide the RCSB Protein Data Bank (PDB) [20] Protein Comparison Tool as well as systematic comparisons of all proteins in the PDB on a weekly basis. [21]

Parsers for PDB [22] and mmCIF [23] file formats allow the loading of structure data into a reusable data model. This feature is used by the SIFTS project to map between UniProt sequences and PDB structures. [24] Information from the RCSB PDB can be dynamically fetched without the need to manually download data. For visualization, an interface to the 3D viewer Jmol is provided. [4]
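To give a feel for what parsing the PDB format involves, the sketch below reads one fixed-column coordinate record into a small data object. It is a simplified standalone example, not BioJava's parser: the PdbAtomSketch and Atom names are hypothetical, only a few fields are extracted, and the sample record is assembled by hand for illustration.

```java
/** Sketch of fixed-column parsing for PDB coordinate records. Not BioJava's parser. */
class PdbAtomSketch {

    /** Minimal atom model holding only the fields this sketch extracts. */
    static final class Atom {
        final String name;
        final String resName;
        final char chainId;
        final int resSeq;
        final double x;
        final double y;
        final double z;

        Atom(String name, String resName, char chainId, int resSeq,
             double x, double y, double z) {
            this.name = name;
            this.resName = resName;
            this.chainId = chainId;
            this.resSeq = resSeq;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    /** Returns the parsed atom, or null if the line is not a coordinate record. */
    static Atom parseAtomLine(String line) {
        boolean isCoordinate = line.startsWith("ATOM  ") || line.startsWith("HETATM");
        if (!isCoordinate || line.length() < 54) {
            return null;
        }
        // PDB columns are 1-based; substring indices are 0-based and end-exclusive.
        String name = line.substring(12, 16).trim();                  // columns 13-16
        String resName = line.substring(17, 20).trim();               // columns 18-20
        char chainId = line.charAt(21);                               // column 22
        int resSeq = Integer.parseInt(line.substring(22, 26).trim()); // columns 23-26
        double x = Double.parseDouble(line.substring(30, 38).trim()); // columns 31-38
        double y = Double.parseDouble(line.substring(38, 46).trim()); // columns 39-46
        double z = Double.parseDouble(line.substring(46, 54).trim()); // columns 47-54
        return new Atom(name, resName, chainId, resSeq, x, y, z);
    }

    public static void main(String[] args) {
        // Illustrative record assembled field by field so the columns are explicit.
        String line = "ATOM  " + "    1" + " " + " N  " + " " + "VAL" + " " + "A"
                + "   1" + "    " + "   6.204" + "  16.869" + "   4.854";
        Atom atom = parseAtomLine(line);
        System.out.println(atom.resName + " " + atom.chainId + atom.resSeq
                + " " + atom.name + " x=" + atom.x);
    }
}
```

A reusable data model would group such atoms into residues, chains and models; the flat Atom class here is only meant to show the column-based extraction step.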

Genome and Sequencing modules

This module is focused on the creation of gene sequence objects from the core module. This is realized by supporting the parsing of popular standard file formats generated by open source gene prediction applications such as GeneMark, [25] GeneID [26] and Glimmer. [27]

The gene sequence objects are then written out in GFF3 format and imported into GMOD. [28] These file formats are well defined, but what gets written in the file is very flexible.

A separate sequencing module provides input-output support for several common variants of the FASTQ file format produced by next-generation sequencers. [29]
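The FASTQ variants differ mainly in how quality scores are encoded as characters. The sketch below decodes a quality line under two Phred-based encodings, assuming an ASCII offset of 33 for the Sanger variant and 64 for the Illumina 1.3+ variant; the older Solexa score scale is left out. The class and enum names are hypothetical and this is not the BioJava sequencing module's API.

```java
import java.util.Arrays;

/** Sketch of decoding FASTQ quality strings for two Phred-based variants. */
class FastqQualitySketch {

    /** ASCII offsets assumed here: 33 for Sanger, 64 for Illumina 1.3+. */
    enum Variant {
        SANGER(33),
        ILLUMINA_1_3(64);

        final int offset;

        Variant(int offset) {
            this.offset = offset;
        }
    }

    /** Converts each quality character to its Phred score. */
    static int[] decode(String qualityLine, Variant variant) {
        int[] scores = new int[qualityLine.length()];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = qualityLine.charAt(i) - variant.offset;
            if (scores[i] < 0) {
                throw new IllegalArgumentException(
                        "Character below offset for " + variant + ": " + qualityLine.charAt(i));
            }
        }
        return scores;
    }

    /** Phred score Q corresponds to an error probability of 10^(-Q/10). */
    static double errorProbability(int phred) {
        return Math.pow(10.0, -phred / 10.0);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(decode("I5!", Variant.SANGER)));       // [40, 20, 0]
        System.out.println(Arrays.toString(decode("hT@", Variant.ILLUMINA_1_3))); // [40, 20, 0]
    }
}
```

The two calls in main show the same scores written in both encodings, which is why a reader has to know the variant before interpreting a quality line.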

Alignment module

This module contains several classes and methods that allow users to perform pairwise and multiple sequence alignment. Sequences can be aligned in either a single-threaded or a multi-threaded fashion. BioJava implements the Needleman-Wunsch [30] algorithm for optimal global alignments and the Smith-Waterman [31] algorithm for local alignments. The outputs of both local and global alignments are available in standard formats. In addition to these two algorithms, there is an implementation of the Guan–Uberbacher algorithm, [32] which performs global sequence alignment very efficiently since it uses only linear memory.
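The difference between global and local alignment is easiest to see in the dynamic programming recurrences. The following standalone sketch computes only the optimal scores (no traceback) with a simple linear gap penalty, keeping two rows of the matrix at a time. It is not BioJava's alignment API and not the Guan–Uberbacher algorithm; the class name and the scoring parameters are assumptions for illustration.

```java
/** Sketch of Needleman-Wunsch and Smith-Waterman scoring with a linear gap penalty. */
class AlignmentScoreSketch {

    /** Optimal global alignment score (Needleman-Wunsch), using two rows of memory. */
    static int globalScore(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j * gap; // aligning a prefix of b against nothing
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i * gap;
            for (int j = 1; j <= b.length(); j++) {
                int diag = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                curr[j] = Math.max(diag, Math.max(prev[j] + gap, curr[j - 1] + gap));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Optimal local alignment score (Smith-Waterman): cells never drop below zero. */
    static int localScore(String a, String b, int match, int mismatch, int gap) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = 0;
            for (int j = 1; j <= b.length(); j++) {
                int diag = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch);
                int cell = Math.max(diag, Math.max(prev[j] + gap, curr[j - 1] + gap));
                curr[j] = Math.max(0, cell);
                best = Math.max(best, curr[j]);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(globalScore("GATTACA", "GATTACA", 1, -1, -2)); // 7
        System.out.println(localScore("TTTACGTTT", "GGACGGG", 2, -1, -2)); // 6
    }
}
```

Keeping only two rows is enough to obtain the score; recovering the alignment itself in linear memory needs a more elaborate scheme than this sketch shows.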

For multiple sequence alignment, any of the methods discussed above can be used to progressively build the alignment.

ModFinder module

Figure: An example application using the ModFinder module and the protein structure module. Protein modifications are mapped onto the sequence and structure of ferredoxin I (PDB ID 1GAO). Two possible iron–sulfur clusters are shown on the protein sequence (3Fe–4S (F3S): orange triangles/lines; 4Fe–4S (SF4): purple diamonds/lines). The 4Fe–4S cluster is displayed in the Jmol structure window above the sequence display.

The ModFinder module provides new methods to identify and classify protein modifications in protein 3D structures. Over 400 different types of protein modifications, such as phosphorylation, glycosylation, disulfide bonds and metal chelation, were collected and curated based on annotations in PSI-MOD, [34] RESID [35] and RCSB PDB. [36] The module also provides an API for detecting pre-, co-, and post-translational protein modifications within protein structures. This module can also identify phosphorylation and print all pre-loaded modifications from a structure.
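As a rough illustration of detecting a modification from 3D coordinates, the sketch below flags candidate disulfide bonds by looking for pairs of cysteine SG atoms that lie close together. This is a simplified geometric check written for this article, not how ModFinder is implemented; the class names, the toy coordinates and the 2.5 Å cutoff (chosen to sit a little above a typical S-S bond length of about 2 Å) are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a distance-based disulfide check. Not the ModFinder implementation. */
class DisulfideSketch {

    /** Minimal atom record: residue name and number, atom name, coordinates in angstroms. */
    static final class Atom {
        final String resName;
        final int resSeq;
        final String name;
        final double x;
        final double y;
        final double z;

        Atom(String resName, int resSeq, String name, double x, double y, double z) {
            this.resName = resName;
            this.resSeq = resSeq;
            this.name = name;
            this.x = x;
            this.y = y;
            this.z = z;
        }
    }

    static double distance(Atom a, Atom b) {
        double dx = a.x - b.x;
        double dy = a.y - b.y;
        double dz = a.z - b.z;
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }

    /** Returns residue-number pairs whose cysteine SG atoms lie within the cutoff. */
    static List<int[]> findDisulfides(List<Atom> atoms, double cutoff) {
        List<Atom> sulfurs = new ArrayList<>();
        for (Atom atom : atoms) {
            if ("CYS".equals(atom.resName) && "SG".equals(atom.name)) {
                sulfurs.add(atom);
            }
        }
        List<int[]> pairs = new ArrayList<>();
        for (int i = 0; i < sulfurs.size(); i++) {
            for (int j = i + 1; j < sulfurs.size(); j++) {
                if (distance(sulfurs.get(i), sulfurs.get(j)) <= cutoff) {
                    pairs.add(new int[] {sulfurs.get(i).resSeq, sulfurs.get(j).resSeq});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Atom> atoms = new ArrayList<>();
        atoms.add(new Atom("CYS", 3, "SG", 0.0, 0.0, 0.0));   // toy coordinates
        atoms.add(new Atom("CYS", 40, "SG", 2.0, 0.0, 0.0));
        atoms.add(new Atom("CYS", 55, "SG", 10.0, 0.0, 0.0));
        for (int[] pair : findDisulfides(atoms, 2.5)) {
            System.out.println("CYS " + pair[0] + " - CYS " + pair[1]); // CYS 3 - CYS 40
        }
    }
}
```

A curated approach such as the one described above would compare structures against a catalogue of known modification definitions rather than relying on a single hard-coded rule.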

Amino acid properties module

This module attempts to provide accurate physicochemical properties of proteins. The properties that can be calculated using this module are as follows:

The precise molecular weights for common isotopically labelled amino acids are included in this module. There also exists flexibility to define new amino acid molecules with their molecular weights using simple XML configuration files. This can be useful where the precise mass is of high importance such as mass spectrometry experiments.
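The molecular weight calculation and the idea of user-defined residues can be sketched in a few lines. The example below is standalone and does not use BioJava's API or its XML configuration format: the residue table is a plain map, only three residues are included with approximate average masses, and the extra residue 'x' with mass 100.0 is made up purely to show how a custom entry would be added.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of peptide molecular weight from a configurable residue mass table. */
class MolecularWeightSketch {

    // Approximate average mass of water in daltons, added once per chain.
    static final double WATER = 18.01528;

    /** Approximate average residue masses for a small subset of amino acids. */
    static Map<Character, Double> defaultResidueMasses() {
        Map<Character, Double> masses = new HashMap<>();
        masses.put('G', 57.0519); // glycine
        masses.put('A', 71.0788); // alanine
        masses.put('S', 87.0782); // serine
        return masses;
    }

    /** Sums the residue masses and adds one water for the free termini. */
    static double molecularWeight(String sequence, Map<Character, Double> masses) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        double total = WATER;
        for (char residue : sequence.toCharArray()) {
            Double mass = masses.get(residue);
            if (mass == null) {
                throw new IllegalArgumentException("No mass defined for residue: " + residue);
            }
            total += mass;
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Character, Double> masses = defaultResidueMasses();
        System.out.println(molecularWeight("GAS", masses));

        // Hypothetical user-defined residue, e.g. an isotopically labelled variant.
        masses.put('x', 100.0); // made-up mass for illustration only
        System.out.println(molecularWeight("GxS", masses));
    }
}
```

Because the masses live in a table rather than in code, swapping in monoisotopic or isotopically labelled values only requires changing the table entries.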

Protein disorder module

The goal of this module is to provide users with ways to find disordered regions in protein molecules. BioJava includes a Java implementation of the RONN predictor. BioJava 3.0.5 makes use of Java's support for multithreading to improve performance by up to 3.2 times on a modern quad-core machine, as compared to the legacy C implementation. [37]
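The kind of speed-up mentioned above comes from scoring independent sequences on several threads at once. The sketch below shows that pattern with a fixed thread pool. The per-sequence function toyScore (the fraction of charged residues) is a trivial stand-in chosen for this example and is not a disorder predictor; the class and method names are hypothetical and unrelated to BioJava's RONN implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of scoring many sequences in parallel with a fixed thread pool. */
class ParallelScoringSketch {

    /** Stand-in computation: fraction of residues that are D, E, K or R. */
    static double toyScore(String sequence) {
        if (sequence.isEmpty()) {
            return 0.0;
        }
        int charged = 0;
        for (char residue : sequence.toCharArray()) {
            if ("DEKR".indexOf(residue) >= 0) {
                charged++;
            }
        }
        return (double) charged / sequence.length();
    }

    /** Scores every sequence on a pool of worker threads, preserving input order. */
    static List<Double> scoreAll(List<String> sequences, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> futures = new ArrayList<>();
            for (String sequence : sequences) {
                Callable<Double> task = () -> toyScore(sequence);
                futures.add(pool.submit(task));
            }
            List<Double> scores = new ArrayList<>();
            for (Future<Double> future : futures) {
                scores.add(future.get()); // waits for that task to finish
            }
            return scores;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> sequences = Arrays.asList("DEKR", "AAAA", "DA");
        System.out.println(scoreAll(sequences, 4)); // [1.0, 0.0, 0.5]
    }
}
```

The gain from such a scheme depends on how expensive each task is and on the number of cores available, so the figure quoted above should not be expected from this toy workload.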

There are two ways to use this module:

Some features of this module include:

Web service access module

In line with current trends in bioinformatics, web-based tools are gaining popularity. The web service module allows bioinformatics services to be accessed using REST protocols. Currently, two services are implemented: NCBI Blast through the Blast URL API (previously known as QBlast) and the HMMER web service. [38]
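Accessing a REST-style service usually starts with assembling a request from key-value parameters. The sketch below only builds and prints such a request string and does not contact any server. It is not the BioJava web service module; the parameter names (CMD, PROGRAM, DATABASE, QUERY) and the base address reflect one reading of the NCBI Blast URL API and should be checked against NCBI's documentation before use, and the query residues are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of building a URL-encoded query string for a REST-style service. */
class RestQuerySketch {

    /** URL-encodes one value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Joins the parameters as key=value pairs separated by '&', in insertion order. */
    static String buildQuery(Map<String, String> params) {
        StringBuilder query = new StringBuilder();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            if (query.length() > 0) {
                query.append('&');
            }
            query.append(encode(entry.getKey())).append('=').append(encode(entry.getValue()));
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // Assumed parameter names and endpoint; verify against the service documentation.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("CMD", "Put");
        params.put("PROGRAM", "blastp");
        params.put("DATABASE", "nr");
        params.put("QUERY", "MKVLAAGIVG"); // made-up residues
        String base = "https://blast.ncbi.nlm.nih.gov/Blast.cgi";
        System.out.println(base + "?" + buildQuery(params));
    }
}
```

A complete client would also submit the request, poll for the result identifier the service returns and parse the response, none of which is shown here.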

Comparisons with other alternatives

The need for customized software in the field of bioinformatics has been addressed by several groups and individuals. Similar to BioJava, open-source software projects such as BioPerl, BioPython, and BioRuby all provide toolkits with multiple functionalities that make it easier to create customized pipelines or analyses.

As the names suggest, the projects mentioned above use different programming languages. All of these APIs offer similar tools so on what criteria should one base their choice? For programmers who are experienced in only one of these languages, the choice is straightforward. However, for a well-rounded bioinformaticist who knows all of these languages and wants to choose the best language for a job, the choice can be made based on the following guidelines given by a software review done on the Bio* tool-kits. [5]

In general, for small programs (<500 lines) that will be used by only an individual or small group, it is hard to beat Perl and BioPerl. These constraints probably cover the needs of 90 per cent of personal bioinformatics programming.

For beginners, and for writing larger programs in the Bio domain, especially those to be shared and supported by others, Python’s clarity and brevity make it very attractive.

For those who might be leaning towards a career in bioinformatics and who want to learn only one language, Java has the widest general programming support, very good support in the Bio domain with BioJava, and is now the de facto language of business (the new COBOL, for better or worse).

Apart from these Bio* projects, there is another project called STRAP, which uses Java and aims for similar goals. The STRAP toolbox, like BioJava, is a Java toolkit for the design of bioinformatics programs and scripts. The similarities and differences between BioJava and STRAP are as follows:

Similarities

Differences

Projects using BioJava

The following projects make use of BioJava.

References

  1. Prlić A, Yates A, Bliven SE, et al. (October 2012). "BioJava: an open-source framework for bioinformatics in 2012". Bioinformatics. 28 (20): 2693–5. doi:10.1093/bioinformatics/bts494. PMC 3467744. PMID 22877863.
  2. Holland RC, Down TA, Pocock M, Prlić A, Huen D, James K, et al. (2008). "BioJava: an open-source framework for bioinformatics". Bioinformatics. 24 (18): 2096–7. doi:10.1093/bioinformatics/btn397. PMC 2530884. PMID 18689808.
  3. VS Matha and P Kangueane, 2009, Bioinformatics: a concept-based introduction, 2009. p26
  4. Hanson, R.M. (2010) Jmol a paradigm shift in crystallographic visualization.
  5. Mangalam H (2002). "The Bio* toolkits--a brief overview". Briefings in Bioinformatics. 3 (3): 296–302. doi:10.1093/bib/3.3.296. PMID 12230038.
  6. Cock PJ, Antao T, Chang JT, et al. (June 2009). "Biopython: freely available Python tools for computational molecular biology and bioinformatics". Bioinformatics. 25 (11): 1422–3. doi:10.1093/bioinformatics/btp163. PMC 2682512. PMID 19304878.
  7. Stajich JE, Block D, Boulez K, et al. (October 2002). "The Bioperl toolkit: Perl modules for the life sciences". Genome Res. 12 (10): 1611–8. doi:10.1101/gr.361602. PMC 187536. PMID 12368254.
  8. Goto N, Prins P, Nakao M, Bonnal R, Aerts J, Katayama T (October 2010). "BioRuby: bioinformatics software for the Ruby programming language". Bioinformatics. 26 (20): 2617–9. doi:10.1093/bioinformatics/btq475. PMC 2951089. PMID 20739307.
  9. Rice P, Longden I, Bleasby A (June 2000). "EMBOSS: the European Molecular Biology Open Software Suite". Trends Genet. 16 (6): 276–7. doi:10.1016/S0168-9525(00)02024-2. PMID 10827456.
  10. Prlić A, Yates A, Bliven SE, et al. (October 2012). "BioJava: an open-source framework for bioinformatics in 2012". Bioinformatics. 28 (20): 2693–5. doi:10.1093/bioinformatics/bts494. PMC 3467744. PMID 22877863.
  11. "Google Scholar". scholar.google.com. Retrieved 2018-11-22.
  12. Gao, Jianjiong; Prlić, Andreas; Bi, Chunxiao; Bluhm, Wolfgang F.; Dimitropoulos, Dimitris; Xu, Dong; Bourne, Philip E.; Rose, Peter W. (2017-02-17). "BioJava-ModFinder: identification of protein modifications in 3D structures from the Protein Data Bank". Bioinformatics. 33 (13): 2047–2049. doi:10.1093/bioinformatics/btx101. ISSN 1367-4803. PMC 5870676. PMID 28334105.
  13. "History". Retrieved 30 Jan 2015.
  14. BioJava-legacy. Archived 2013-01-09 at the Wayback Machine.
  15. Maven, Apache. "Maven". Apache.
  16. BioJava legacy project. Archived 2013-01-09 at the Wayback Machine.
  17. Ye Y, Godzik A (October 2003). "Flexible structure alignment by chaining aligned fragment pairs allowing twists". Bioinformatics. 19 (Suppl 2): ii246–55. doi:10.1093/bioinformatics/btg1086. PMID 14534198.
  18. Shindyalov IN, Bourne PE (September 1998). "Protein structure alignment by incremental combinatorial extension (CE) of the optimal path". Protein Eng. 11 (9): 739–47. doi:10.1093/protein/11.9.739. PMID 9796821.
  19. Bliven S, Prlić A (2012). "Circular permutation in proteins". PLOS Comput. Biol. 8 (3): e1002445. Bibcode:2012PLSCB...8E2445B. doi:10.1371/journal.pcbi.1002445. PMC 3320104. PMID 22496628.
  20. Rose PW, Beran B, Bi C, et al. (January 2011). "The RCSB Protein Data Bank: redesigned web site and web services". Nucleic Acids Res. 39 (Database issue): D392–401. doi:10.1093/nar/gkq1021. PMC 3013649. PMID 21036868.
  21. Prlić A, Bliven S, Rose PW, et al. (December 2010). "Pre-calculated protein structure alignments at the RCSB PDB website". Bioinformatics. 26 (23): 2983–5. doi:10.1093/bioinformatics/btq572. PMC 3003546. PMID 20937596.
  22. Bernstein FC, Koetzle TF, Williams GJ, et al. (May 1977). "The Protein Data Bank: a computer-based archival file for macromolecular structures". J. Mol. Biol. 112 (3): 535–42. doi:10.1016/s0022-2836(77)80200-3. PMID 875032.
  23. Fitzgerald, P.M.D. et al. (2006) Macromolecular dictionary (mmCIF). In Hall, S.R.
  24. Velankar S, McNeil P, Mittard-Runte V, et al. (January 2005). "E-MSD: an integrated data resource for bioinformatics". Nucleic Acids Res. 33 (Database issue): D262–5. doi:10.1093/nar/gki058. PMC 540012. PMID 15608192.
  25. Besemer J, Borodovsky M (July 2005). "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses". Nucleic Acids Res. 33 (Web Server issue): W451–4. doi:10.1093/nar/gki487. PMC 1160247. PMID 15980510.
  26. Blanco E, Abril JF (2009). "Computational Gene Annotation in New Genome Assemblies Using GeneID". Bioinformatics for DNA Sequence Analysis. Methods in Molecular Biology. Vol. 537. pp. 243–61. doi:10.1007/978-1-59745-251-9_12. ISBN 978-1-58829-910-9. PMID 19378148.
  27. Kelley DR, Liu B, Delcher AL, Pop M, Salzberg SL (January 2012). "Gene prediction with Glimmer for metagenomic sequences augmented by classification and clustering". Nucleic Acids Res. 40 (1): e9. doi:10.1093/nar/gkr1067. PMC 3245904. PMID 22102569.
  28. Stein LD, Mungall C, Shu S, et al. (October 2002). "The generic genome browser: a building block for a model organism system database". Genome Res. 12 (10): 1599–610. doi:10.1101/gr.403602. PMC 187535. PMID 12368253.
  29. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (April 2010). "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants". Nucleic Acids Res. 38 (6): 1767–71. doi:10.1093/nar/gkp1137. PMC 2847217. PMID 20015970.
  30. Needleman SB, Wunsch CD (March 1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins". J. Mol. Biol. 48 (3): 443–53. doi:10.1016/0022-2836(70)90057-4. PMID 5420325.
  31. Smith TF, Waterman MS (March 1981). "Identification of common molecular subsequences". J. Mol. Biol. 147 (1): 195–7. CiteSeerX 10.1.1.63.2897. doi:10.1016/0022-2836(81)90087-5. PMID 7265238.
  32. Guan X, Uberbacher EC (February 1996). "Alignments of DNA and protein sequences containing frameshift errors". Comput. Appl. Biosci. 12 (1): 31–40. doi:10.1093/bioinformatics/12.1.31. PMID 8670617.
  33. Chen K, Jung YS, Bonagura CA, et al. (February 2002). "Azotobacter vinelandii ferredoxin I: a sequence and structure comparison approach to alteration of [4Fe-4S]2+/+ reduction potential". J. Biol. Chem. 277 (7): 5603–10. doi:10.1074/jbc.M108916200. PMID 11704670.
  34. Montecchi-Palazzi L, Beavis R, Binz PA, et al. (August 2008). "The PSI-MOD community standard for representation of protein modification data". Nat. Biotechnol. 26 (8): 864–6. doi:10.1038/nbt0808-864. PMID 18688235. S2CID 205270043.
  35. Garavelli JS (June 2004). "The RESID Database of Protein Modifications as a resource and annotation tool". Proteomics. 4 (6): 1527–33. doi:10.1002/pmic.200300777. PMID 15174122. S2CID 25712150.
  36. Berman HM, Westbrook J, Feng Z, et al. (January 2000). "The Protein Data Bank". Nucleic Acids Res. 28 (1): 235–42. doi:10.1093/nar/28.1.235. PMC 102472. PMID 10592235.
  37. Yang ZR, Thomson R, McNeil P, Esnouf RM (August 2005). "RONN: the bio-basis function neural network technique applied to the detection of natively disordered regions in proteins". Bioinformatics. 21 (16): 3369–76. doi:10.1093/bioinformatics/bti534. PMID 15947016.
  38. Finn RD, Clements J, Eddy SR (July 2011). "HMMER web server: interactive sequence similarity searching". Nucleic Acids Res. 39 (Web Server issue): W29–37. doi:10.1093/nar/gkr367. PMC 3125773. PMID 21593126.
  39. Paterson T, Law A (November 2012). "JEnsembl: a version-aware Java API to Ensembl data systems". Bioinformatics. 28 (21): 2724–31. doi:10.1093/bioinformatics/bts525. PMC 3476335. PMID 22945789.
  40. Kim T, Tyndel MS, Huang H, et al. (March 2012). "MUSI: an integrated system for identifying multiple specificity from very large peptide or nucleic acid data sets". Nucleic Acids Res. 40 (6): e47. doi:10.1093/nar/gkr1294. PMC 3315295. PMID 22210894.
  41. Gront D, Kolinski A (February 2008). "Utility library for structural bioinformatics". Bioinformatics. 24 (4): 584–5. doi:10.1093/bioinformatics/btm627. PMID 18227118.