This is a list of software tools and web portals used for gene prediction.
Name | Description | Species | References |
---|---|---|---|
FINDER | Automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences | Eukaryotes | [1] |
FragGeneScan | Predicts genes in complete genomes and sequencing reads | Prokaryotes, Metagenomes | [2] |
ATGpr | Identifies translational initiation sites in cDNA sequences | Human | [3] |
Prodigal | Its name stands for Prokaryotic Dynamic Programming Genefinding Algorithm. It is based on log-likelihood functions and does not use Hidden or Interpolated Markov Models. | Prokaryotes, Metagenomes (metaProdigal) | [4] |
AUGUSTUS | Eukaryote gene predictor | Eukaryotes | [5] |
BGF | Hidden Markov model (HMM) and dynamic programming based ab initio gene prediction program | | [6] |
DIOGENES | Fast detection of coding regions in short genome sequences | | |
Dragon Promoter Finder | Program to recognize vertebrate RNA polymerase II promoters | Vertebrates | [7] |
EasyGene | The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. | Prokaryotes | [8] [9] |
EuGene | Integrative gene finding | Prokaryotes, Eukaryotes | [10] [11] |
FGENESH | HMM-based gene structure prediction: multiple genes, both chains | Eukaryotes | [12] |
FrameD | Finds genes and frameshifts in G+C rich prokaryote sequences | Prokaryotes, Eukaryotes | [13] |
GeMoMa | Homology-based gene prediction based on amino acid and intron position conservation as well as RNA-Seq data | | [14] [15] |
GENIUS II | Links ORFs in complete genomes to protein 3D structures | Prokaryotes, Eukaryotes | [16] |
geneid | Program to predict genes, exons, splice sites, and other signals along DNA sequences | Eukaryotes | [17] |
GeneParser | Parse DNA sequences into introns and exons | Eukaryotes | [18] |
GeneMark | Family of self-training gene prediction programs | Prokaryotes, Eukaryotes, Metagenomes | [19] [20] [21] [22] |
GeneTack | Predicts genes with frameshifts in prokaryote genomes | Prokaryotes | [23] |
GenomeScan | Predicts the locations and exon-intron structures of genes in genome sequences from a variety of organisms; the GENSCAN server is GenomeScan's predecessor | Vertebrate, Arabidopsis, Maize | [24] |
GENSCAN | Predicts the locations and exon-intron structures of genes in genome sequences from a variety of organisms | Vertebrate, Arabidopsis, Maize | [25] [26] [27] |
GLIMMER | Finds genes in microbial DNA | Prokaryotes | [28] [29] [30] |
GLIMMERHMM | Eukaryotic gene-finding system | Eukaryotes | [31] |
GrailEXP | Predicts exons, genes, promoters, polyA sites, CpG islands, EST similarities, and repeat elements in DNA sequence | Human, Mus musculus, Arabidopsis thaliana, Drosophila melanogaster | [32] [33] |
mGene | Support-vector machine (SVM) based system to find genes | Eukaryotes | [34] |
mGene.ngs | SVM based system to find genes using heterogeneous information: RNA-seq, tiling arrays | Eukaryotes | [35] |
MORGAN | Decision tree system to find genes in vertebrate DNA | Eukaryotes | [36] |
BioNIX | Web tool to combine results from different programs: GRAIL, FEX, HEXON, MZEF, GENEMARK, GENEFINDER, FGENE, BLAST, POLYAH, REPEATMASKER, TRNASCAN | Prokaryotes, Eukaryotes | [37] |
NNPP | Neural network promoter prediction | Prokaryotes, Eukaryotes | [38] |
NNSPLICE | Neural network splice site prediction | Drosophila, Human | [39] |
ORFfinder | Graphical analysis tool to find all open reading frames | Prokaryotes, Eukaryotes | [40] |
Regulatory Sequence Analysis Tools | Series of modular computer programs to detect regulatory signals in non-coding sequences | Fungi, Prokaryotes, Metazoa, Protist, Plants | [41] [42] |
PHANOTATE | A tool to annotate phage genomes. | Phages | [43] |
SplicePredictor | Method to identify potential splice sites in (plant) pre-mRNA by sequence inspection using Bayesian statistical models | Eukaryotes | [44] |
VEIL | Hidden Markov model to find genes in vertebrate DNA | Eukaryotes | [45] |
BioJava is an open-source software project dedicated to providing Java tools for processing biological data. BioJava is a set of library functions written in the programming language Java for manipulating sequences, protein structures, file parsers, Common Object Request Broker Architecture (CORBA) interoperability, Distributed Annotation System (DAS), access to AceDB, dynamic programming, and simple statistical routines. BioJava supports a wide range of data, from DNA and protein sequences to 3D protein structures. The BioJava libraries are useful for automating routine bioinformatics tasks, such as parsing a Protein Data Bank (PDB) file or interacting with Jmol. This application programming interface (API) provides various file parsers, data models and algorithms to facilitate working with the standard data formats, and enables rapid application development and analysis.
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
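As a rough illustration of the simplest form of gene finding, the sketch below scans all six reading frames of a DNA string for open reading frames (ORFs), that is, runs that begin at an ATG and end at the first in-frame stop codon. It is a minimal sketch, not the method of any tool listed above: the function names `find_orfs` and `reverse_complement`, the `min_codons` threshold, and the restriction to the standard start and stop codons are all assumptions made for the example. Real gene predictors add statistical models of coding sequence, splice sites and other signals.

```python
from typing import List, Tuple

# Assumption: standard genetic code, ATG as the only start codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq: str, min_codons: int = 2) -> List[Tuple[str, int, str]]:
    """Return (strand, frame, orf_sequence) for each ATG...stop ORF.

    frame is the 0-based offset of the reading frame on the scanned strand;
    min_codons counts the start and stop codons (hypothetical threshold).
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first ATG of the ORF being extended
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, frame, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    for strand, frame, orf in find_orfs("CCATGAAATAGCC"):
        print(strand, frame, orf)
```

ORFs that run off the end of the sequence without a stop codon are ignored here; whether to report such partial ORFs is a design choice that matters for short reads and metagenomic fragments.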
Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).
Nucleic acid structure prediction is a computational method to determine secondary and tertiary nucleic acid structure from its sequence. Secondary structure can be predicted from one or several nucleic acid sequences. Tertiary structure can be predicted from the sequence, or by comparative modeling.
Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.
Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.
In bioinformatics, k-mers are substrings of length k contained within a biological sequence. Primarily used within the context of computational genomics and sequence analysis, in which k-mers are composed of nucleotides, k-mers are used to assemble DNA sequences, improve heterologous gene expression, identify species in metagenomic samples, and create attenuated vaccines. Usually, the term k-mer refers to all of a sequence's subsequences of length k, such that the sequence AGAT would have four monomers, three 2-mers, two 3-mers and one 4-mer (AGAT). More generally, a sequence of length L will have L − k + 1 k-mers and n^k total possible k-mers, where n is the number of possible monomers (four in the case of DNA).
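The counting argument can be checked with a short sketch. A sequence of length L yields L − k + 1 overlapping k-mers, and an alphabet of n monomers allows n^k distinct k-mers. The helper names `kmers` and `kmer_counts` are chosen for this example and are not taken from any particular library.

```python
from collections import Counter
from typing import List


def kmers(seq: str, k: int) -> List[str]:
    """Return every overlapping substring of length k, in order."""
    if k <= 0:
        raise ValueError("k must be positive")
    # range() is empty when k > len(seq), so the result is [] in that case.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq: str, k: int) -> Counter:
    """Count how many times each k-mer occurs in seq."""
    return Counter(kmers(seq, k))


if __name__ == "__main__":
    seq = "AGAT"
    for k in range(1, len(seq) + 1):
        found = kmers(seq, k)
        # Number of k-mers in the sequence versus number of possible DNA k-mers.
        print(k, found, len(found) == len(seq) - k + 1, 4 ** k)
```

For AGAT this lists four monomers, three 2-mers, two 3-mers and one 4-mer, matching the example in the text.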
A ribosome binding site, or ribosomal binding site (RBS), is a sequence of nucleotides upstream of the start codon of an mRNA transcript that is responsible for the recruitment of a ribosome during the initiation of translation. Mostly, RBS refers to bacterial sequences, although internal ribosome entry sites (IRES) have been described in mRNAs of eukaryotic cells or viruses that infect eukaryotes. Ribosome recruitment in eukaryotes is generally mediated by the 5' cap present on eukaryotic mRNAs.
MUMmer is a bioinformatics software system for sequence alignment. It is based on the suffix tree data structure and is one of the fastest and most efficient systems available for this task, enabling it to be applied to very long sequences. It has been widely used for comparing different genomes to one another. In recent years, it has become a popular algorithm for comparing genome assemblies to one another, which allows scientists to determine how a genome has changed after adding more DNA sequence or after running a different genome assembly program. The name "MUMmer" comes from "Maximal Unique Matches", or MUMs. The original algorithms in the MUMmer software package were designed by Art Delcher, Simon Kasif and Steven Salzberg. MUMmer was the first whole-genome comparison system developed in bioinformatics. It was originally applied to the comparison of two related strains of bacteria.
GeneMark is a generic name for a family of ab initio gene prediction programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as a primary gene prediction tool for annotation of the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in two DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type. The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of its being "protein-coding" in each of six possible reading frames or being "non-coding". The original GeneMark is an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known from HMM theory, applied to an appropriately defined HMM.
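The idea of scoring a fragment under six frame-specific coding hypotheses and one non-coding hypothesis can be sketched with a deliberately simplified model. The example below is not GeneMark's implementation: it uses a zero-order model (independent nucleotides whose probabilities depend only on codon position) instead of higher-order Markov chains, and the probability tables `CODING` and `NONCODING`, the `prior_coding` value and the function names are invented for illustration. It only shows how Bayes' rule turns seven likelihoods into posterior probabilities.

```python
import math
from typing import Dict

# Made-up toy parameters (assumption): P(nucleotide | codon position).
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},  # codon position 1
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},  # codon position 2
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # codon position 3
]
# Made-up homogeneous background model for non-coding sequence.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def log_lik_coding(seq: str, frame: int) -> float:
    """Log-likelihood of seq when its first base sits at codon position `frame`."""
    return sum(math.log(CODING[(i + frame) % 3][base]) for i, base in enumerate(seq))


def frame_posteriors(seq: str, prior_coding: float = 0.5) -> Dict[str, float]:
    """Posterior over six coding frames and 'non-coding' (input must be ACGT only)."""
    seq = seq.upper()
    rev = seq.translate(_COMPLEMENT)[::-1]  # reverse strand, read 5' to 3'
    log_joint = {}
    for f in range(3):
        # The coding prior is split evenly over the six frames (assumption).
        log_joint[f"+{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = math.log(prior_coding / 6) + log_lik_coding(rev, f)
    log_joint["non-coding"] = math.log(1 - prior_coding) + sum(
        math.log(NONCODING[base]) for base in seq
    )
    # Normalise in log space to avoid underflow on long fragments.
    top = max(log_joint.values())
    total = sum(math.exp(v - top) for v in log_joint.values())
    return {h: math.exp(v - top) / total for h, v in log_joint.items()}


if __name__ == "__main__":
    posteriors = frame_posteriors("GAG" * 10)
    for hypothesis, p in sorted(posteriors.items(), key=lambda kv: -kv[1]):
        print(f"{hypothesis:>10s}  {p:.4f}")
```

With these toy tables, a fragment whose composition matches the codon-position preferences in one phase receives most of the posterior mass for that frame. A practical implementation would estimate the tables from training sequences and slide the computation along the genome in windows.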
Anders Krogh is a bioinformatician at the University of Copenhagen, where he leads the university's bioinformatics center. He is known for his pioneering work on the use of hidden Markov models in bioinformatics, and is co-author of a widely used textbook in bioinformatics. He also co-authored one of the early textbooks on neural networks. His current research interests include promoter analysis, non-coding RNA, gene prediction and protein structure prediction.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Sean Roberts Eddy is Professor of Molecular & Cellular Biology and of Applied Mathematics at Harvard University. Previously, from 2006 to 2015, he was based at the Janelia Research Campus in Virginia. His research interests are in bioinformatics, computational biology and biological sequence analysis. As of 2016, his projects include the use of hidden Markov models in HMMER, Infernal, Pfam and Rfam.
Molecular recognition features (MoRFs) are small intrinsically disordered regions in proteins that undergo a disorder-to-order transition upon binding to their partners. MoRFs are implicated in protein-protein interactions, which serve as the initial step in molecular recognition. MoRFs are disordered prior to binding to their partners, whereas they form a common 3D structure after interacting with their partners. As MoRF regions tend to resemble disordered proteins with some characteristics of ordered proteins, they can be classified as existing in an extended semi-disordered state.
SEA-PHAGES stands for Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science; it was formerly called the National Genomics Research Initiative. It was the first initiative of the Howard Hughes Medical Institute (HHMI) Science Education Alliance (SEA), launched in 2008 under its director Tuajuanda C. Jordan to improve the retention of science, technology, engineering, and mathematics (STEM) students. SEA-PHAGES is a two-semester undergraduate research program administered by Graham Hatfull's group at the University of Pittsburgh and the Howard Hughes Medical Institute's Science Education Division. Students from over 100 universities nationwide engage in authentic individual research that includes a wet-bench laboratory and a bioinformatics component.
Wojciech Maciej Karlowski is a Polish biologist specializing in molecular biology and bioinformatics, and a full professor in biological sciences. He is Head of the Department of Computational Biology at the Faculty of Biology at the Adam Mickiewicz University in Poznan. His major scientific interests include identification of non-coding RNAs, genomics, high-throughput analyses, and functional annotation of biological sequences.