GeneMark

Original author(s)	Bioinformatics group of Mark Borodovsky
Developer(s)	Georgia Institute of Technology
Initial release	1993
Operating system	Linux, Windows, and Mac OS
License	Free binary-only for academic, non-profit or U.S. Government use
Website	https://exon.gatech.edu

Last updated December 14, 2024

GeneMark is a generic name for a family of ab initio gene prediction algorithms and software programs developed at the Georgia Institute of Technology in Atlanta. The original GeneMark, developed in 1993, was used in 1995 as a primary gene prediction tool for annotation of the first completely sequenced bacterial genome, Haemophilus influenzae, and in 1996 for the first archaeal genome, Methanococcus jannaschii. The algorithm introduced inhomogeneous three-periodic Markov chain models of protein-coding DNA sequence, which became standard in gene prediction, as well as a Bayesian approach to gene prediction in both DNA strands simultaneously. Species-specific parameters of the models were estimated from training sets of sequences of known type (protein-coding and non-coding). The major step of the algorithm computes, for a given DNA fragment, the posterior probabilities of the fragment being "protein-coding" (carrying genetic code) in each of six possible reading frames (including three frames in the complementary DNA strand) or being "non-coding". The original GeneMark, developed before the advent of HMM applications in bioinformatics, was an HMM-like algorithm; it can be viewed as an approximation to the posterior decoding algorithm known from HMM theory, applied to an appropriately defined HMM of DNA sequence.
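The posterior computation described above can be illustrated with a minimal Python sketch. It is not the GeneMark implementation: it assumes a zero-order three-periodic model (one nucleotide distribution per codon position) in place of the higher-order Markov chains used in practice, and the parameter tables, the prior, the frame labels, and the function names are all hypothetical values chosen for illustration.

```python
import math

# Hypothetical toy parameters: P(nucleotide | codon position) for a
# zero-order three-periodic model. Real models are higher order and trained.
CODING = [
    {"A": 0.25, "C": 0.20, "G": 0.40, "T": 0.15},  # codon position 1
    {"A": 0.35, "C": 0.25, "G": 0.15, "T": 0.25},  # codon position 2
    {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},  # codon position 3
]
NONCODING = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def log_lik_coding(seq, frame):
    """Log-likelihood when seq[0] sits at codon position index `frame` (0, 1 or 2)."""
    return sum(math.log(CODING[(i + frame) % 3][nt]) for i, nt in enumerate(seq))


def log_lik_noncoding(seq):
    return sum(math.log(NONCODING[nt]) for nt in seq)


def posteriors(seq, prior_noncoding=0.4):
    """Bayes rule over seven hypotheses: six coding frames plus non-coding."""
    prior_frame = (1.0 - prior_noncoding) / 6.0
    rc = reverse_complement(seq)
    log_joint = {"noncoding": math.log(prior_noncoding) + log_lik_noncoding(seq)}
    for f in range(3):
        # "+" labels frames on the given strand, "-" on the complementary strand.
        log_joint[f"+{f + 1}"] = math.log(prior_frame) + log_lik_coding(seq, f)
        log_joint[f"-{f + 1}"] = math.log(prior_frame) + log_lik_coding(rc, f)
    top = max(log_joint.values())  # subtract the maximum for numerical stability
    weights = {h: math.exp(v - top) for h, v in log_joint.items()}
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}
```

Calling `posteriors(fragment)` returns a dictionary with seven entries that sum to one. In this sketch the three-periodicity is what lets the reading frames be told apart: shifting the frame changes which position-specific distribution scores each nucleotide.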

Further improvements in the algorithms for gene prediction in prokaryotic genomes
Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes
Eukaryotic gene prediction
GeneMark Family of Gene Prediction Programs
Bacteria, Archaea
Metagenomes and Metatranscriptomes
Eukaryotes
Viruses, phages and plasmids
Transcripts assembled from RNA-Seq reads
See also
References
External links

Further improvements in the algorithms for gene prediction in prokaryotic genomes

The GeneMark.hmm algorithm (1998) was designed to improve the accuracy of predicting short genes and gene starts. The idea was to use the inhomogeneous Markov chain models introduced in GeneMark to compute likelihoods of the sequences emitted by the states of a hidden Markov model (more precisely, a semi-Markov or generalized HMM, GHMM) describing the genomic sequence. The borders between coding and non-coding regions were formally interpreted as transitions between hidden states. Additionally, a ribosome binding site model was added to the GHMM to improve the accuracy of gene start prediction. The next important step in the algorithm development was the introduction of self-training (unsupervised training) of the model parameters in the gene prediction tool GeneMarkS (2001). The rapid accumulation of prokaryotic genomes in the following years showed that the structure of sequence patterns related to gene expression regulation signals near gene starts may vary. It was also observed that a prokaryotic genome may exhibit GC content variability due to lateral gene transfer. The newer algorithm, GeneMarkS-2, was designed to adjust automatically to the types of gene expression patterns and to the GC content changes along the genomic sequence. GeneMarkS and, later, GeneMarkS-2 have been used in the NCBI pipeline for prokaryotic genome annotation (PGAP; www.ncbi.nlm.nih.gov/genome/annotation_prok/process).
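The idea of treating coding/non-coding borders as transitions between hidden states can be sketched with a plain two-state HMM decoded by the Viterbi algorithm. This is a deliberate simplification and an assumption of the sketch: GeneMark.hmm uses a generalized (semi-Markov) HMM with state durations, strand-specific states, and a ribosome binding site model, none of which appear here. All probabilities below are hypothetical toy values.

```python
import math

STATES = ("noncoding", "coding")

# Hypothetical toy parameters, for illustration only.
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most likely state path for a non-empty seq under the toy HMM."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for nt in seq[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(TRANS[r][s]))
            pointers[s] = prev
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][nt])
            )
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

In the decoded path, each change of state marks a predicted border between a non-coding and a coding region. A generalized HMM would replace the per-nucleotide emissions with likelihoods of whole segments, which is where the three-periodic Markov chain models enter.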

Heuristic Models and Gene Prediction in Metagenomes and Metatranscriptomes

Accurate identification of the species-specific parameters of a gene finding algorithm is a necessary condition for making accurate gene predictions. However, in studies of viral genomes one needs to estimate parameters from a rather short sequence that has no large genomic context. Importantly, starting in 2004, the same question had to be addressed for gene prediction in short metagenomic sequences. A surprisingly accurate answer was found by introducing parameter-generating functions that depend on a single variable, the sequence G+C content (the "heuristic method", 1999). Subsequently, analysis of several hundred prokaryotic genomes led to the development of a more advanced heuristic method in 2010 (implemented in MetaGeneMark). Later, the need to predict genes in RNA transcripts led to the development of GeneMarkS-T (2015), a tool that identifies intron-less genes in long transcript sequences assembled from RNA-Seq reads.
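A parameter-generating function of G+C content can be sketched as follows. The sketch assumes the simplest possible forms: a strand-symmetric zero-order non-coding model, and straight lines mapping global G+C content to the G+C fraction at each codon position. The function names and the slope and intercept values are hypothetical placeholders, not the fitted values of the heuristic method.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous nucleotides of seq."""
    seq = seq.upper()
    gc = sum(seq.count(n) for n in "GC")
    acgt = sum(seq.count(n) for n in "ACGT")
    return gc / acgt if acgt else 0.0


def noncoding_frequencies(gc):
    """Zero-order non-coding model assuming strand symmetry: G = C and A = T."""
    return {"A": (1 - gc) / 2, "C": gc / 2, "G": gc / 2, "T": (1 - gc) / 2}


# Hypothetical (slope, intercept) pairs for the G+C fraction at codon
# positions 1, 2 and 3. Placeholders only; not fitted to any genome.
POSITION_GC_LINES = [(0.6, 0.28), (0.4, 0.20), (1.8, -0.40)]


def coding_position_gc(gc):
    """Map global G+C content to a per-codon-position G+C fraction in [0, 1]."""
    return [min(1.0, max(0.0, a * gc + b)) for a, b in POSITION_GC_LINES]
```

For a short fragment, `gc_content(fragment)` yields the single variable from which both the non-coding frequencies and the position-specific coding parameters are generated, so no training set from the same genome is needed.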

Eukaryotic gene prediction

In eukaryotic genomes, modeling the borders of exons with introns and intergenic regions presents a major challenge. The GHMM architecture of eukaryotic GeneMark.hmm includes hidden states for initial, internal, and terminal exons, introns, intergenic regions, and single-exon genes located in both DNA strands. The initial version of eukaryotic GeneMark.hmm needed manual compilation of training sets of protein-coding sequences for estimation of the algorithm parameters. In 2005, however, the first self-training eukaryotic gene finder, GeneMark-ES, was developed. A fungal version of GeneMark-ES, developed in 2008, features a more complex intron model and a hierarchical strategy of self-training. In 2014, GeneMark-ET aided the self-training of parameters with extrinsic hints generated by mapping short RNA-Seq reads to the genome. Extrinsic evidence is not limited to the 'native' RNA sequences. Cross-species proteins collected in the vast protein databases can be a source of external hints if homologous relationships are established between the already known proteins and the proteins encoded by yet unknown genes in the novel genome. This task was solved by the development of a new algorithm, GeneMark-EP+ (2020). Integration of the RNA and protein sources of extrinsic hints was done in GeneMark-ETP (2023). The versatility and accuracy of the eukaryotic gene finders of the GeneMark family have led to their incorporation into a number of genome annotation pipelines. Also, since 2016, the pipelines BRAKER1, BRAKER2, and BRAKER3 have been developed to combine the strongest features of GeneMark and AUGUSTUS.

Notably, gene prediction in eukaryotic transcripts can be done with GeneMarkS-T (2015).

GeneMark Family of Gene Prediction Programs

Bacteria, Archaea

GeneMark
GeneMarkS
GeneMarkS-2

Metagenomes and Metatranscriptomes

MetaGeneMark
GeneMarkS-T

Eukaryotes

GeneMark
GeneMark.hmm ^[1]
GeneMark-ES: ab initio gene finding algorithm for eukaryotic genomes with automatic (unsupervised) training.^[2]
GeneMark-ET: augments GeneMark-ES by integrating RNA-Seq read alignments into the self-training procedure.^[3]
GeneMark-EP+: augments GeneMark-ES with an iterative procedure that finds genes in a novel genome, detects similarities of the predicted genes to known proteins, splice-aligns the known proteins to the genome, generates hints for the next round of prediction, and corrects predictions based on the external evidence.
GeneMark-ETP: integrates genomic, transcript, and protein evidence into gene prediction.

Viruses, phages and plasmids

Heuristic models

Transcripts assembled from RNA-Seq reads

GeneMarkS-T

References

Borodovsky M. and McIninch J. "GeneMark: parallel gene recognition for both DNA strands." Computers & Chemistry (1993) 17 (2): 123–133.
Lukashin A. and Borodovsky M. "GeneMark.hmm: new solutions for gene finding." Nucleic Acids Research (1998) 26 (4): 1107–1115.
Besemer J. and Borodovsky M. "Heuristic approach to deriving models for gene finding." Nucleic Acids Research (1999) 27 (19): 3911–3920.
Besemer J., Lomsadze A., and Borodovsky M. "GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions." Nucleic Acids Research (2001) 29 (12): 2607–2618.
Mills R., Rozanov M., Lomsadze A., Tatusova T., and Borodovsky M. "Improving gene annotation in complete viral genomes." Nucleic Acids Research (2003) 31 (23): 7041–7055.
Besemer J. and Borodovsky M. "GeneMark: web software for gene finding in prokaryotes, eukaryotes and viruses." Nucleic Acids Research (2005) 33 (Web Server Issue): W451–W454.
Lomsadze A., Ter-Hovhannisyan V., Chernoff Y., and Borodovsky M. "Gene identification in novel eukaryotic genomes by self-training algorithm." Nucleic Acids Research (2005) 33 (20): 6494–6506.
Ter-Hovhannisyan V., Lomsadze A., Chernoff Y., and Borodovsky M. "Gene prediction in novel fungal genomes using an ab initio algorithm with unsupervised training." Genome Research (2008) 18 (12): 1979–1990.
Zhu W., Lomsadze A., and Borodovsky M. "Ab initio gene identification in metagenomic sequences." Nucleic Acids Research (2010) 38 (12): e132.
Lomsadze A., Burns P.D., and Borodovsky M. "Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm." Nucleic Acids Research (2014) 42 (15): e119.
Tang S., Lomsadze A., and Borodovsky M. "Identification of protein coding regions in RNA transcripts." Nucleic Acids Research (2015) 43 (12): e78.
Tatusova T., DiCuccio M., Badretdin A., Chetvernin V., Nawrocki E., Zaslavsky L., Lomsadze A., Pruitt K., Borodovsky M., and Ostell J. "NCBI prokaryotic genome annotation pipeline." Nucleic Acids Research (2016) 44 (14): 6614–6624.
Hoff K., Lange S., Lomsadze A., Borodovsky M., and Stanke M. "BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS." Bioinformatics (2016) 32 (5): 767–769.
Lomsadze A., Gemayel K., Tang S., and Borodovsky M. "Modeling leaderless transcription and atypical genes results in more accurate gene prediction in prokaryotes." Genome Research (2018) 28 (7): 1079–1089.
Bruna T., Hoff K., Lomsadze A., Stanke M., and Borodovsky M. "BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database." NAR Genomics and Bioinformatics (2021) 3 (1): lqaa108.
Bruna T., Lomsadze A., and Borodovsky M. "GeneMark-EP+: eukaryotic gene prediction with self-training in the space of genes and proteins." NAR Genomics and Bioinformatics (2020) 2 (2): lqaa026.
Bruna T., Lomsadze A., and Borodovsky M. "GeneMark-ETP: Automatic Gene Finding in Eukaryotic Genomes in Consistence with Extrinsic Data." bioRxiv (Jan 5, 2023).
Gabriel L., Brůna T., Hoff K., Ebel M., Lomsadze A., Borodovsky M., and Stanke M. "BRAKER3: Fully automated genome annotation using RNA-Seq and protein evidence with GeneMark-ETP, AUGUSTUS and TSEBRA." bioRxiv (Nov 27, 2023).

External links

Official website: https://exon.gatech.edu

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] "GeneMark.HMM eukaryotic".

[2] "GeneMark-ES".

[3] "GeneMark-ET – gene finding algorithm for eukaryotic genomes | RNA-Seq Blog". 9 July 2014.
