This is a list of software tools and web portals used for gene prediction.
Name | Description | Species | References |
---|---|---|---|
FINDER | Automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences | Eukaryotes | [1] |
FragGeneScan | Predicts genes in complete genomes and sequencing reads | Prokaryotes, Metagenomes | [2] |
ATGpr | Identifies translational initiation sites in cDNA sequences | Human | [3] |
Prodigal | Its name stands for Prokaryotic Dynamic Programming Genefinding Algorithm. It is based on log-likelihood functions and does not use hidden or interpolated Markov models. | Prokaryotes, Metagenomes (metaProdigal) | [4] |
AUGUSTUS | Eukaryote gene predictor | Eukaryotes | [5] |
BGF | Hidden Markov model (HMM) and dynamic programming based ab initio gene prediction program | | [6] |
DIOGENES | Fast detection of coding regions in short genome sequences | | |
Dragon Promoter Finder | Program to recognize vertebrate RNA polymerase II promoters | Vertebrates | [7] |
EasyGene | The gene finder is based on a hidden Markov model (HMM) that is automatically estimated for a new genome. | Prokaryotes | [8] [9] |
EuGene | Integrative gene finding | Prokaryotes, Eukaryotes | [10] [11] |
FGENESH | HMM-based gene structure prediction: multiple genes, both chains | Eukaryotes | [12] |
FrameD | Finds genes and frameshifts in G+C-rich prokaryote sequences | Prokaryotes, Eukaryotes | [13] |
GeMoMa | Homology-based gene prediction based on amino acid and intron position conservation as well as RNA-Seq data | | [14] [15] |
GENIUS II | Links ORFs in complete genomes to protein 3D structures | Prokaryotes, Eukaryotes | [16] |
geneid | Program to predict genes, exons, splice sites, and other signals along DNA sequences | Eukaryotes | [17] |
GeneParser | Parse DNA sequences into introns and exons | Eukaryotes | [18] |
GeneMark | Family of self-training gene prediction programs | Prokaryotes, Eukaryotes, Metagenomes | [19] [20] [21] [22] |
GeneTack | Predicts genes with frameshifts in prokaryote genomes | Prokaryotes | [23] |
GenomeScan | Predicts the locations and exon-intron structures of genes in genome sequences from a variety of organisms; successor to the GENSCAN server | Vertebrate, Arabidopsis, Maize | [24] |
GENSCAN | Predicts the locations and exon-intron structures of genes in genome sequences from a variety of organisms | Vertebrate, Arabidopsis, Maize | [25] [26] [27] |
GLIMMER | Finds genes in microbial DNA | Prokaryotes | [28] [29] [30] |
GLIMMERHMM | Eukaryotic gene-finding system | Eukaryotes | [31] |
GrailEXP | Predicts exons, genes, promoters, poly(A) sites, CpG islands, EST similarities, and repeat elements in DNA sequence | Human, Mus musculus, Arabidopsis thaliana, Drosophila melanogaster | [32] [33] |
mGene | Support-vector machine (SVM) based system to find genes | Eukaryotes | [34] |
mGene.ngs | SVM based system to find genes using heterogeneous information: RNA-seq, tiling arrays | Eukaryotes | [35] |
MORGAN | Decision tree system to find genes in vertebrate DNA | Eukaryotes | [36] |
BioNIX | Web tool to combine results from different programs: GRAIL, FEX, HEXON, MZEF, GENEMARK, GENEFINDER, FGENE, BLAST, POLYAH, REPEATMASKER, TRNASCAN | Prokaryotes, Eukaryotes | [37] |
NNPP | Neural network promoter prediction | Prokaryotes, Eukaryotes | [38] |
NNSPLICE | Neural network splice site prediction | Drosophila, Human | [39] |
ORFfinder | Graphical analysis tool to find all open reading frames | Prokaryotes, Eukaryotes | [40] |
Regulatory Sequence Analysis Tools | Series of modular computer programs to detect regulatory signals in non-coding sequences | Fungi, Prokaryotes, Metazoa, Protist, Plants | [41] [42] |
PHANOTATE | A tool to annotate phage genomes. | Phages | [43] |
SplicePredictor | Method to identify potential splice sites in (plant) pre-mRNA by sequence inspection using Bayesian statistical models | Eukaryotes | [44] |
VEIL | Hidden Markov model to find genes in vertebrate DNA | Eukaryotes | [45] |
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is often referred to as computational biology, though the distinction between the two terms is often disputed.
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
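As a rough illustration of the simplest form of gene finding, the sketch below scans all six reading frames of a DNA string for open reading frames that run from an ATG to an in-frame stop codon. It is a naive, hypothetical example (the function name `find_orfs` and the `min_codons` threshold are assumptions made here) and does not reproduce the method of any tool in the table, most of which rely on statistical models rather than plain ORF scanning.

```python
# Naive six-frame ORF scan; illustrative only, not the algorithm of any listed tool.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (strand, start, end, dna) tuples for ATG...stop ORFs.

    Coordinates are 0-based, end-exclusive, and refer to the strand that
    was scanned (for "-", the reverse complement of the input).
    The codon count includes the start and stop codons.
    """
    seq = seq.upper()
    strands = {"+": seq, "-": seq.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, s in strands.items():
        for frame in range(3):
            start = None  # position of the first ATG seen since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

An ORF scan of this kind only proposes candidates; it cannot handle introns and does not score how gene-like a candidate is, which is why practical gene finders add probabilistic or homology-based evidence.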
UniProt is a freely accessible database of protein sequence and functional information, many entries being derived from genome sequencing projects. It contains a large amount of information about the biological function of proteins derived from the research literature. It is maintained by the UniProt consortium, which consists of several European bioinformatics organisations and a foundation from Washington, DC, USA.
Steven Lloyd Salzberg is an American computational biologist and computer scientist who is a Bloomberg Distinguished Professor of Biomedical Engineering, Computer Science, and Biostatistics at Johns Hopkins University, where he is also Director of the Center for Computational Biology.
Rfam is a database containing information about non-coding RNA (ncRNA) families and other structured RNA elements. It is an annotated, open access database originally developed at the Wellcome Trust Sanger Institute in collaboration with Janelia Farm, and currently hosted at the European Bioinformatics Institute. Rfam is designed to be similar to the Pfam database for annotating protein families.
A ribosome binding site, or ribosomal binding site (RBS), is a sequence of nucleotides upstream of the start codon of an mRNA transcript that is responsible for the recruitment of a ribosome during the initiation of translation. Mostly, RBS refers to bacterial sequences, although internal ribosome entry sites (IRES) have been described in mRNAs of eukaryotic cells or viruses that infect eukaryotes. Ribosome recruitment in eukaryotes is generally mediated by the 5' cap present on eukaryotic mRNAs.
MUMmer is a bioinformatics software system for sequence alignment. It is based on the suffix tree data structure. It has been used for comparing different genome assemblies to one another, which allows scientists to determine how a genome has changed. The acronym "MUMmer" comes from "Maximal Unique Matches", or MUMs.
MicrobesOnline is a publicly and freely accessible website that hosts multiple comparative genomic tools for comparing microbial species at the genomic, transcriptomic and functional levels. MicrobesOnline was developed by the Virtual Institute for Microbial Stress and Survival, which is based at the Lawrence Berkeley National Laboratory in Berkeley, California. The site was launched in 2005, with regular updates until 2011.
GeneMark is a generic name for a family of ab initio gene prediction algorithms and software programs developed at the Georgia Institute of Technology in Atlanta. Developed in 1993, the original GeneMark was used in 1995 as the primary gene prediction tool for annotating the first completely sequenced bacterial genome, that of Haemophilus influenzae, and in 1996 for the first archaeal genome, that of Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to predicting genes in both DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type. The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of it being "protein-coding" in each of six possible reading frames or being "non-coding". The original GeneMark was an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known from HMM theory, applied to an appropriately defined HMM of DNA sequence.
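The posterior computation described above can be sketched in a heavily simplified form. The example below is not GeneMark's implementation: it assumes a zero-order three-periodic model (one nucleotide distribution per codon position) instead of higher-order Markov chains, and the probability tables, the `prior_coding` value, and all function names are hypothetical. It only shows how Bayes' rule turns per-frame likelihoods into posteriors over six coding frames plus a non-coding hypothesis.

```python
import math

# Hypothetical toy parameters: P(nucleotide | codon position) in coding DNA.
CODING = [
    {"A": 0.30, "C": 0.20, "G": 0.35, "T": 0.15},  # codon position 1
    {"A": 0.30, "C": 0.25, "G": 0.15, "T": 0.30},  # codon position 2
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # codon position 3
]
# Hypothetical non-coding model: uniform nucleotide frequencies.
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def coding_log_likelihood(seq, offset):
    """Log-likelihood of seq if codons start at index `offset` (0, 1 or 2)."""
    return sum(math.log(CODING[(i - offset) % 3][base])
               for i, base in enumerate(seq))


def frame_posteriors(seq, prior_coding=0.5):
    """Posterior probability of each of seven hypotheses for a DNA fragment.

    Keys "+1".."+3" are forward-strand frames, "-1".."-3" are frames on the
    reverse complement, and "non-coding" is the background hypothesis.
    The coding prior is split evenly over the six frames (an assumption).
    """
    seq = seq.upper()
    rev = seq.translate(COMPLEMENT)[::-1]
    log_joint = {}
    for offset in range(3):
        log_prior = math.log(prior_coding / 6)
        log_joint[f"+{offset + 1}"] = log_prior + coding_log_likelihood(seq, offset)
        log_joint[f"-{offset + 1}"] = log_prior + coding_log_likelihood(rev, offset)
    log_joint["non-coding"] = math.log(1 - prior_coding) + sum(
        math.log(NONCODING[base]) for base in seq)
    # Normalise in log space to avoid underflow on long fragments.
    peak = max(log_joint.values())
    total = sum(math.exp(v - peak) for v in log_joint.values())
    return {k: math.exp(v - peak) / total for k, v in log_joint.items()}


if __name__ == "__main__":
    for hypothesis, p in frame_posteriors("GAG" * 20).items():
        print(f"{hypothesis:>10}: {p:.4f}")
```

With these toy tables, a fragment made of repeated `GAG` codons matches the codon-position preferences best in forward frame 1, so that hypothesis receives most of the posterior mass. In practice the parameters would be estimated from training sequences, as the paragraph above notes.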
Anders Krogh is a bioinformatician at the University of Copenhagen, where he leads the university's bioinformatics center. He is known for his pioneering work on the use of hidden Markov models in bioinformatics, and is co-author of a widely used textbook in bioinformatics. In addition, he also co-authored one of the early textbooks on neural networks. His current research interests include promoter analysis, non-coding RNA, gene prediction and protein structure prediction.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
In molecular biology and genetics, DNA annotation or genome annotation is the process of describing the structure and function of the components of a genome, by analyzing and interpreting them in order to extract their biological significance and understand the biological processes in which they participate. Among other things, it identifies the locations of genes and all the coding regions in a genome and determines what those genes do.
Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.
SEA-PHAGES stands for Science Education Alliance-Phage Hunters Advancing Genomics and Evolutionary Science; it was formerly called the National Genomics Research Initiative. It was the first initiative of the Howard Hughes Medical Institute (HHMI) Science Education Alliance (SEA), launched in 2008 under its director Tuajuanda C. Jordan to improve the retention of science, technology, engineering, and mathematics (STEM) students. SEA-PHAGES is a two-semester undergraduate research program administered by Graham Hatfull's group at the University of Pittsburgh and the Howard Hughes Medical Institute's Science Education Division. Students from over 100 universities nationwide engage in authentic individual research that includes a wet-bench laboratory and a bioinformatics component.
Wojciech Maciej Karlowski is a Polish biologist specializing in molecular biology and bioinformatics, and a full professor in biological sciences. He is Head of the Department of Computational Biology at the Faculty of Biology at the Adam Mickiewicz University in Poznan. His major scientific interests include identification of non-coding RNAs, genomics, high-throughput analyses, and functional annotation of biological sequences.
Genome mining describes the exploitation of genomic information for the discovery of biosynthetic pathways of natural products and their possible interactions. It depends on computational technology and bioinformatics tools. The mining process relies on a huge amount of data accessible in genomic databases. By applying data mining algorithms, the data can be used to generate new knowledge in several areas of medicinal chemistry, such as discovering novel natural products.