GENSCAN

Developer(s): Christopher Burge
Available in: English
Type: Bioinformatics tool
Website: genes.mit.edu/GENSCANinfo.html

In bioinformatics, GENSCAN is a program that identifies complete gene structures in genomic DNA. It is based on a generalized hidden Markov model (GHMM) and can be used to predict the locations of genes and their exon-intron boundaries in genomic sequences from a variety of organisms. The GENSCAN web server is hosted at MIT. [1]

GENSCAN was developed by Christopher Burge in the research group of Samuel Karlin at Stanford University. [2] [3] [4]

History

In 2001, human gene prediction began to draw on comparative genomics. This led to the development of TWINSCAN, an adaptation of GENSCAN with higher accuracy. Other programs, such as N-SCAN, were later developed by further adapting the GHMM. [5]

As of 2002, GENSCAN remained a popular tool in bioinformatics, and its predictions had become a standard feature for genomes released on the University of California, Santa Cruz (UCSC) and Ensembl genome browsers. [5]

Implementation

Genomic Model

The primary goal in developing a genomic sequence model for GENSCAN was to capture both the general and the specific properties of the individual functional units of eukaryotic genes (e.g. exons, introns, splice sites, promoters). Particular focus was placed on features recognized by the general transcriptional, splicing and translational machinery that processes most protein-coding genes (e.g. the TATA box), as opposed to specialized signals associated with the transcription or splicing of particular genes or gene families. Coding regions are described by a general three-periodic fifth-order Markov model rather than by models of specific protein motifs or by database homology information. The model also accounts for differences in gene structure and gene density between regions of the human genome that differ in C + G composition. [3]
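The idea of a three-periodic Markov model of coding sequence can be illustrated with a minimal sketch. It keeps one conditional probability table per codon position, each conditioned on the preceding `order` bases (five in GENSCAN's case). The function names, the pseudocount smoothing, and the toy training data are assumptions made for illustration; GENSCAN's actual parameters, training sets, and the way this component is embedded in the full gene model are not reproduced here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_periodic_markov(coding_seqs, order=5, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, codon position).

    Each training sequence is assumed to start at codon position 0.
    Returns a function log_prob(context, base, phase).
    """
    # One count table per codon position (phase 0, 1, 2).
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(3)]
    for seq in coding_seqs:
        for i in range(order, len(seq)):
            counts[i % 3][seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context, base, phase):
        row = counts[phase][context]
        total = sum(row.values()) + pseudocount * len(ALPHABET)
        return math.log((row[base] + pseudocount) / total)

    return log_prob

def score_coding(seq, log_prob, frame=0, order=5):
    """Log-likelihood of `seq`, assuming seq[0] sits at codon position `frame`."""
    return sum(
        log_prob(seq[i - order:i], seq[i], (i + frame) % 3)
        for i in range(order, len(seq))
    )

# Toy demonstration with a low order so that the tiny training set suffices.
toy_model = train_periodic_markov(["GCTGAA" * 20], order=2)
in_frame = score_coding("GCTGAA" * 5, toy_model, frame=0, order=2)
shifted = score_coding("GCTGAA" * 5, toy_model, frame=1, order=2)
```

Because the tables differ by codon position, the same sequence scores higher in the reading frame it was trained on than in a shifted frame, which is what lets a periodic model discriminate coding frames.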

Because it relies on these elements, GENSCAN does not need to reference similar genes in protein sequence databases. Its predictions are therefore complementary to those obtained with homology-based gene identification methods (e.g. querying protein databases with BLASTX). Overall, the structure of the model used in GENSCAN is similar to a generalized hidden Markov model (GHMM). [3]

Features

GENSCAN's implementation differs from that of other programs in several ways. Notably, its genomic sequence model explicitly represents double-stranded DNA, so that genes on both strands are analyzed simultaneously. GENSCAN can also handle sequences that contain partial genes, multiple genes, or no genes at all, whereas other programs of its time could analyze only sequences containing a single complete gene. These two factors make GENSCAN particularly useful for analyzing long human genomic sequences. In addition, GENSCAN employs Maximal Dependence Decomposition (MDD), a method for modeling functional signals in DNA and protein sequences that allows dependencies between signal positions to be taken into account. GENSCAN uses it to build a model of the donor splice signal that captures dependencies associated with the mechanisms by which donor splice sites are recognized in pre-mRNA sequences. [3]

GENSCAN can also estimate how reliable each of its predictions is, using the forward-backward algorithm to compute a probability for each prediction. [3]
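As an illustration of the forward-backward algorithm, the sketch below computes per-position posterior state probabilities for an ordinary two-state hidden Markov model. This is a simplification: GENSCAN applies the algorithm to its generalized model, and the states, transition values, and emission values used here are made up for the example. No scaling is applied, so the sketch is only suitable for short sequences.

```python
def forward_backward(obs, states, start_p, trans_p, emit_p):
    """Return a list of dicts: posterior[t][s] = P(state_t = s | obs)."""
    n = len(obs)
    fwd = [{} for _ in range(n)]
    bwd = [{} for _ in range(n)]

    # Forward pass: probability of the prefix obs[0..t] ending in state s.
    for s in states:
        fwd[0][s] = start_p[s] * emit_p[s][obs[0]]
    for t in range(1, n):
        for s in states:
            fwd[t][s] = emit_p[s][obs[t]] * sum(
                fwd[t - 1][r] * trans_p[r][s] for r in states
            )

    # Backward pass: probability of the suffix obs[t+1..] given state s at t.
    for s in states:
        bwd[n - 1][s] = 1.0
    for t in range(n - 2, -1, -1):
        for s in states:
            bwd[t][s] = sum(
                trans_p[s][r] * emit_p[r][obs[t + 1]] * bwd[t + 1][r]
                for r in states
            )

    total = sum(fwd[n - 1][s] for s in states)  # P(obs)
    return [
        {s: fwd[t][s] * bwd[t][s] / total for s in states} for t in range(n)
    ]

# Hypothetical toy model: "coding" favors G/C, "noncoding" favors A/T.
states = ("coding", "noncoding")
start_p = {"coding": 0.5, "noncoding": 0.5}
trans_p = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
emit_p = {
    "coding": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "noncoding": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}
posterior = forward_backward("AATTGCGCGC", states, start_p, trans_p, emit_p)
```

The posterior at each position plays the role of a confidence value: a position whose "coding" posterior is close to 1 is one the model is confident about given the whole sequence, not just the bases seen so far.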

GENSCAN is also useful for predicting the structure and overall composition of human genes, that is, the locations of exons and genes, in longer sequences. Several features contribute to this. The first is the ability to capture differences in gene structure and composition between regions of the human genome with different C + G content, using sets of empirically derived model parameters. The second, as mentioned above, is the ability to predict multiple genes in a sequence and to work with partial genes and double-stranded DNA. Lastly, new models of donor and acceptor splice sites allow GENSCAN to capture dependencies between signal positions. [3]

Efficiency

GENSCAN's run time scales almost linearly with sequence length for realistically sized sequences (several kilobases or more), but is quadratic in the worst case. [3]

Supplemental Usage

Like other gene prediction programs, GENSCAN does not produce results that match those of other programs exactly. This is due to many factors, including differences in algorithms, parameters, and training sets. GENSCAN has therefore been used in approaches that combine the results of two gene prediction programs: if either program is confident in a prediction, that prediction is used, whereas if neither program is confident, a prediction is used only if both programs agree on it. [6]
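The combination rule described above can be sketched as follows. This is a simplified illustration, not the published method: exons are represented as `(start, end)` coordinate pairs mapped to a confidence value, "agreement" is taken to mean identical coordinates, and the `threshold` of 0.9 is an arbitrary assumption.

```python
def combine_predictions(pred_a, pred_b, threshold=0.9):
    """Combine exon predictions from two programs.

    pred_a, pred_b: dicts mapping (start, end) -> confidence in [0, 1].
    An exon is kept if either program is confident in it (confidence at or
    above `threshold`), or if both programs predicted it.
    """
    combined = set()
    for exon in set(pred_a) | set(pred_b):
        conf_a = pred_a.get(exon)
        conf_b = pred_b.get(exon)
        confident = any(
            c is not None and c >= threshold for c in (conf_a, conf_b)
        )
        agreed = conf_a is not None and conf_b is not None
        if confident or agreed:
            combined.add(exon)
    return sorted(combined)

# Hypothetical predictions from two programs on the same sequence.
program_a = {(100, 200): 0.95, (300, 400): 0.50, (500, 600): 0.40}
program_b = {(300, 400): 0.60, (700, 800): 0.30}
combined = combine_predictions(program_a, program_b)
# combined == [(100, 200), (300, 400)]
```

Here the first exon is kept because one program is confident in it, the second because both programs agree on it, and the last two are dropped because they are low-confidence predictions made by only one program.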

Accuracy

Tests were conducted to evaluate the accuracy of GENSCAN on data sets of short sequences. One test used the Burset/Guigó data set, which contains 570 vertebrate multi-exon gene sequences. The results are shown in the table below, along with the results obtained by testing other programs on the same data set. The table shows GENSCAN to be generally more accurate than its competitors at both the nucleotide level and the exon level. [3]

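GENSCAN Accuracy vs. Other Programs [3]

| Program | Sequences | Nucleotide Sensitivity | Nucleotide Specificity | Nucleotide Approximate Correlation | Nucleotide Correlation Coefficient | Exon Sensitivity | Exon Specificity | Exon Average | Missed Exons | Wrong Exons |
|---|---|---|---|---|---|---|---|---|---|---|
| GENSCAN | 570 | 0.93 | 0.93 | 0.91 | 0.92 | 0.78 | 0.81 | 0.80 | 0.09 | 0.05 |
| FGENEH | 569 | 0.77 | 0.88 | 0.78 | 0.80 | 0.61 | 0.64 | 0.64 | 0.15 | 0.12 |
| GeneID | 570 | 0.63 | 0.81 | 0.67 | 0.65 | 0.44 | 0.46 | 0.45 | 0.28 | 0.24 |
| Genie | 570 | 0.76 | 0.77 | 0.72 | n/a | 0.55 | 0.48 | 0.51 | 0.17 | 0.33 |
| GenLang | 570 | 0.72 | 0.79 | 0.69 | 0.71 | 0.51 | 0.52 | 0.52 | 0.21 | 0.22 |
| GeneParser2 | 562 | 0.66 | 0.79 | 0.67 | 0.65 | 0.35 | 0.40 | 0.37 | 0.34 | 0.17 |
| GRAIL2 | 570 | 0.72 | 0.87 | 0.75 | 0.76 | 0.36 | 0.43 | 0.40 | 0.25 | 0.11 |
| SORFIND | 561 | 0.71 | 0.85 | 0.73 | 0.72 | 0.42 | 0.47 | 0.45 | 0.24 | 0.14 |
| Xpound | 570 | 0.61 | 0.87 | 0.68 | 0.69 | 0.15 | 0.18 | 0.17 | 0.33 | 0.13 |
| GeneID+ | 478 | 0.91 | 0.91 | 0.88 | 0.88 | 0.73 | 0.70 | 0.71 | 0.07 | 0.13 |
| GeneParser3 | 478 | 0.86 | 0.91 | 0.86 | 0.85 | 0.56 | 0.58 | 0.57 | 0.14 | 0.09 |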
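For readers unfamiliar with the nucleotide-level columns, the sketch below shows how such measures are conventionally computed in gene-finder evaluations from per-nucleotide counts of true positives, false positives, true negatives, and false negatives (coding vs. non-coding). It assumes the conventions commonly used for this kind of benchmark, in which "specificity" means the fraction of predicted coding nucleotides that are truly coding; the article itself does not define the measures, and the function name and example counts are illustrative only.

```python
import math

def nucleotide_accuracy(tp, fp, tn, fn):
    """Nucleotide-level accuracy measures from confusion counts.

    Assumes every denominator is non-zero (each class is both present
    and predicted at least once).
    """
    sensitivity = tp / (tp + fn)   # fraction of coding bases found
    specificity = tp / (tp + fp)   # fraction of predicted coding bases that are correct
    correlation = (tp * tn - fn * fp) / math.sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    # Approximate correlation: average of four conditional probabilities,
    # rescaled from the range [0, 1] to [-1, 1].
    acp = 0.25 * (
        tp / (tp + fn) + tp / (tp + fp) + tn / (tn + fp) + tn / (tn + fn)
    )
    approx_correlation = 2.0 * (acp - 0.5)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "approximate_correlation": approx_correlation,
        "correlation_coefficient": correlation,
    }

# Hypothetical counts for a 1,000-nucleotide sequence.
scores = nucleotide_accuracy(tp=90, fp=10, tn=890, fn=10)
```

The correlation coefficient is undefined when a program predicts no coding nucleotides at all, which is one reason an approximate correlation is reported alongside it.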

The table below breaks down GENSCAN's accuracy on the same sequences by C + G content and by type of organism. The data show that GENSCAN's accuracy was rather insensitive to both C + G content and organism type, which demonstrates its independence from factors that affected the results of comparable gene prediction programs. [3]

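GENSCAN Accuracy for Sequences Organized by C+G Content and Organism [3]

| Subset | Sequences | Nucleotide Sensitivity | Nucleotide Specificity | Nucleotide Approximate Correlation | Nucleotide Correlation Coefficient | Exon Sensitivity | Exon Specificity | Exon Average | Missed Exons | Wrong Exons |
|---|---|---|---|---|---|---|---|---|---|---|
| C + G <40 | 86 | 0.90 | 0.95 | 0.90 | 0.93 | 0.78 | 0.87 | 0.84 | 0.14 | 0.05 |
| C + G 40-50 | 220 | 0.94 | 0.92 | 0.91 | 0.91 | 0.80 | 0.82 | 0.82 | 0.08 | 0.05 |
| C + G 50-60 | 208 | 0.93 | 0.93 | 0.90 | 0.92 | 0.75 | 0.77 | 0.77 | 0.08 | 0.05 |
| C + G >60 | 56 | 0.97 | 0.89 | 0.90 | 0.90 | 0.76 | 0.77 | 0.76 | 0.07 | 0.08 |
| Primates | 237 | 0.96 | 0.94 | 0.93 | 0.94 | 0.81 | 0.82 | 0.82 | 0.07 | 0.05 |
| Rodents | 191 | 0.90 | 0.93 | 0.89 | 0.91 | 0.75 | 0.80 | 0.78 | 0.11 | 0.05 |
| Non-mamm. Vert. | 72 | 0.93 | 0.93 | 0.90 | 0.93 | 0.81 | 0.85 | 0.84 | 0.11 | 0.06 |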

A separate test of GENSCAN's accuracy used two GeneParser data sets from which all genes sharing more than 25% amino acid identity with those in previous GeneParser test sets had been removed. The results of this test, and of the same test performed on other programs, are shown in the table below. GENSCAN's accuracy varies little between the Burset/Guigó data set and the GeneParser data sets. Larger differences in individual values (e.g. a correlation coefficient of 0.98 on high C + G sequences in GeneParser set II vs. 0.90 on C + G >60 sequences in Burset/Guigó) may be attributed to the much smaller sample sizes of the GeneParser data sets. The tests on these three data sets provided enough information to draw conclusions about each of them. However, the data sets are not of realistic size, so the reliability and scope of those conclusions have justifiably been questioned. [3]

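GENSCAN vs. Other Programs Prediction Accuracy Under Data Sets I and II [3]

| Subset | Measure | GeneID I | GeneID II | GRAIL3 I | GRAIL3 II | GeneParser2 I | GeneParser2 II | GENSCAN I | GENSCAN II |
|---|---|---|---|---|---|---|---|---|---|
| All sequences | Correlation | 0.69 | 0.55 | 0.83 | 0.75 | 0.78 | 0.80 | 0.93 | 0.93 |
| All sequences | Sensitivity | 0.69 | 0.50 | 0.83 | 0.68 | 0.87 | 0.82 | 0.98 | 0.95 |
| All sequences | Specificity | 0.77 | 0.75 | 0.87 | 0.91 | 0.76 | 0.86 | 0.90 | 0.94 |
| All sequences | Exons Correct | 0.42 | 0.33 | 0.52 | 0.31 | 0.47 | 0.46 | 0.79 | 0.76 |
| All sequences | Exons Overlapped | 0.73 | 0.64 | 0.81 | 0.58 | 0.87 | 0.76 | 0.96 | 0.91 |
| High C + G | Correlation | 0.65 | 0.73 | 0.88 | 0.80 | 0.89 | 0.71 | 0.94 | 0.98 |
| High C + G | Sensitivity | 0.72 | 0.85 | 0.87 | 0.80 | 0.90 | 0.65 | 1.00 | 0.98 |
| High C + G | Specificity | 0.73 | 0.73 | 0.95 | 0.88 | 0.93 | 0.87 | 0.91 | 0.98 |
| High C + G | Exons Correct | 0.38 | 0.43 | 0.67 | 0.50 | 0.64 | 0.57 | 0.76 | 0.64 |
| High C + G | Exons Overlapped | 0.80 | 0.86 | 0.89 | 0.79 | 0.96 | 0.79 | 1.00 | 0.93 |
| Medium C + G | Correlation | 0.67 | 0.52 | 0.83 | 0.75 | 0.75 | 0.82 | 0.93 | 0.94 |
| Medium C + G | Sensitivity | 0.65 | 0.47 | 0.86 | 0.68 | 0.86 | 0.84 | 0.97 | 0.95 |
| Medium C + G | Specificity | 0.77 | 0.76 | 0.84 | 0.91 | 0.70 | 0.87 | 0.90 | 0.95 |
| Medium C + G | Exons Correct | 0.37 | 0.29 | 0.51 | 0.32 | 0.41 | 0.46 | 0.79 | 0.79 |
| Medium C + G | Exons Overlapped | 0.67 | 0.62 | 0.83 | 0.28 | 0.84 | 0.79 | 0.96 | 0.93 |
| Low C + G | Correlation | 0.81 | 0.62 | 0.62 | 0.62 | 0.72 | 0.67 | 0.92 | 0.81 |
| Low C + G | Sensitivity | 0.82 | 0.56 | 0.51 | 0.45 | 0.79 | 0.71 | 0.93 | 0.80 |
| Low C + G | Specificity | 0.85 | 0.71 | 0.87 | 0.89 | 0.75 | 0.67 | 0.94 | 0.84 |
| Low C + G | Exons Correct | 0.80 | 0.47 | 0.25 | 0.16 | 0.40 | 0.37 | 0.85 | 0.68 |
| Low C + G | Exons Overlapped | 0.85 | 0.63 | 0.55 | 0.42 | 0.85 | 0.58 | 0.85 | 0.74 |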

In 1997, GENSCAN was found to be more accurate than previous gene prediction programs. However, further work was needed, as GENSCAN was shown to predict only 10-15% of genes accurately on realistic data sets. [5] Because of such inaccuracies, any prediction made by GENSCAN or by other programs must be verified by comparing it to a complementary DNA (cDNA) sequence, an expressed sequence tag (EST) sequence, or a known protein sequence. [6]

References

  1. The GENSCAN Web Server at MIT. http://genes.mit.edu/GENSCAN.html (archived 2013-09-06 at the Wayback Machine).
  2. Burge, C. B. (1998). "Modeling dependencies in pre-mRNA splicing signals". In Salzberg, S.; Searls, D.; Kasif, S. (eds.). Computational Methods in Molecular Biology. Amsterdam: Elsevier Science. pp. 127-163. ISBN 978-0-444-50204-9.
  3. Burge, Christopher; Karlin, Samuel (1997). "Prediction of complete gene structures in human genomic DNA" (PDF). Journal of Molecular Biology. 268 (1): 78–94. CiteSeerX 10.1.1.115.3107. doi:10.1006/jmbi.1997.0951. PMID 9149143. Archived from the original (PDF) on 2015-06-20.
  4. Burge, C.; Karlin, S. (1998). "Finding the genes in genomic DNA". Current Opinion in Structural Biology. 8 (3): 346–354. doi:10.1016/S0959-440X(98)80069-9. PMID 9666331.
  5. Flicek, Paul (2007). "Gene prediction: compare and CONTRAST". Genome Biology. 8 (12): 233. doi:10.1186/gb-2007-8-12-233. ISSN 1474-760X. PMC 2246255. PMID 18096089.
  6. Rogic, S.; Ouellette, B. F. F.; Mackworth, A. K. (2002-08-01). "Improving gene recognition accuracy by combining predictions from two gene-finding programs". Bioinformatics. 18 (8): 1034–1045. doi:10.1093/bioinformatics/18.8.1034. ISSN 1367-4803. PMID 12176826.