Consensus CDS Project

CCDS Project
Content
Description: Convergence towards a standard set of gene annotations
Contact
Research centers: National Center for Biotechnology Information; European Bioinformatics Institute; University of California, Santa Cruz; Wellcome Trust Sanger Institute
Authors: Kim D. Pruitt
Primary citation: Pruitt KD, et al. (2009) [1]
Release date: 2009
Access
Website: https://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi
Miscellaneous
Version: CCDS Release 24

The Consensus Coding Sequence (CCDS) Project is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies. The project tracks these annotations with a stable identifier (CCDS ID) and ensures that they are consistently represented by the National Center for Biotechnology Information (NCBI), Ensembl, and the UCSC Genome Browser. [1] The integrity of the CCDS dataset is maintained through stringent quality assurance testing and ongoing manual curation. [2]

Motivation and background

Biological and biomedical research has come to rely on accurate and consistent annotation of genes and their products on genome assemblies. Reference annotations of genomes are available from various sources, each with their own independent goals and policies, which results in some annotation variation.

The CCDS project was established to identify a gold standard set of protein-coding gene annotations that are identically annotated on the human and mouse reference genome assemblies by the participating annotation groups. The CCDS gene sets, arrived at by consensus among the partners, [2] now consist of over 18,000 human and over 20,000 mouse genes (see CCDS release history). The dataset also represents more alternative splicing events with each new release. [3]

Contributing groups

Participating annotation groups include: [3]

Manual annotation is provided by:

Defining the CCDS gene set

"Consensus" is defined as protein-coding regions that agree at the start codon, stop codon, and splice junctions, and for which the prediction meets quality assurance benchmarks. [1] A combination of manual and automated genome annotations provided by NCBI and Ensembl (which incorporates manual HAVANA annotations) is compared to identify annotations with matching genomic coordinates.

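The coordinate comparison described above can be illustrated with a minimal sketch. It assumes each annotation is reduced to a chromosome, a strand and a list of coding exon intervals; the function names and the tuple layout are hypothetical and are not taken from the CCDS pipeline. If two annotations have identical coding intervals, they necessarily agree at the start codon, the stop codon and every splice junction.

```python
from typing import List, Tuple

# (start, end) genomic coordinates of one coding exon segment (hypothetical layout)
Exon = Tuple[int, int]


def cds_signature(chrom: str, strand: str, cds_exons: List[Exon]) -> tuple:
    """Build a hashable key from the chromosome, strand and sorted coding intervals."""
    return (chrom, strand, tuple(sorted(cds_exons)))


def is_consensus(annotation_a: tuple, annotation_b: tuple) -> bool:
    """True when both annotations have exactly the same coding intervals."""
    return cds_signature(*annotation_a) == cds_signature(*annotation_b)


# Toy coordinates, invented for illustration only.
ncbi = ("chrX", "+", [(100, 200), (300, 400)])
ensembl = ("chrX", "+", [(300, 400), (100, 200)])  # same intervals, different order
shifted = ("chrX", "+", [(100, 200), (303, 400)])  # one splice acceptor differs

print(is_consensus(ncbi, ensembl))
print(is_consensus(ncbi, shifted))
```

A match of this kind would only make a pair of annotations a candidate; the quality assurance tests still have to be passed.
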
Quality assurance testing

In order to ensure that CDSs are of high quality, multiple quality assurance (QA) tests are performed (Table 1). All tests are performed following the annotation comparison step of each CCDS build and are independent of individual annotation group QA tests performed prior to the annotation comparison. [3]

Table 1: Examples of the types of CCDS QA tests performed prior to acceptance of CCDS candidates [3]

| QA test | Purpose of the test |
| --- | --- |
| Subject to NMD | Checks for transcripts that may be subject to nonsense-mediated decay (NMD) |
| Low quality | Checks for low coding propensity |
| Non-consensus splice sites | Checks for non-canonical splice sites |
| Predicted pseudogene | Checks for genes that are predicted to be pseudogenes by UCSC |
| Too short | Checks for transcripts or proteins that are unusually short, typically <100 amino acids |
| Ortholog not found/not conserved | Checks for genes that are not conserved and/or are not in a HomoloGene cluster |
| CDS start or stop not in alignment | Checks for a start or stop codon in the reference genome sequence |
| Internal stop | Checks for the presence of an internal stop codon in the genomic sequence |
| NCBI:Ensembl protein length different | Checks if the protein encoded by the NCBI RefSeq is the same length as the EBI/WTSI protein |
| NCBI:Ensembl low percent identity | Checks for >99% overall identity between the NCBI and EBI/WTSI proteins |
| Gene discontinued | Checks if the GeneID is no longer valid |

Annotations that fail QA tests undergo a round of manual checking that may improve results or reach a decision to reject annotation matches based on QA failure.
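Two of the simpler checks in Table 1, "Internal stop" and "Too short", can be sketched as follows. This is an illustration under stated assumptions and not the project's actual QA code: it works on a spliced CDS given as a DNA string rather than on the genomic alignment, it uses the standard stop codons, and the function names are hypothetical. The 100-amino-acid default comes from the "typically <100 amino acids" wording in the table.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def codons(cds: str) -> list:
    """Split a CDS into complete codons, ignoring any trailing partial codon."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]


def has_internal_stop(cds: str) -> bool:
    """True if a stop codon occurs anywhere before the final codon."""
    return any(codon in STOP_CODONS for codon in codons(cds)[:-1])


def is_too_short(cds: str, min_aa: int = 100) -> bool:
    """True if the encoded protein has fewer than min_aa residues."""
    parts = codons(cds)
    n_aa = len(parts) - 1 if parts and parts[-1] in STOP_CODONS else len(parts)
    return n_aa < min_aa


def qa_flags(cds: str) -> list:
    """Return the names of the QA tests (from this sketch) that the CDS fails."""
    flags = []
    if has_internal_stop(cds):
        flags.append("Internal stop")
    if is_too_short(cds):
        flags.append("Too short")
    return flags


print(qa_flags("ATG" + "GCC" * 120 + "TAA"))
print(qa_flags("ATGTAAGCCTAA"))
```

In this sketch a flagged CDS is only reported, which mirrors the point that QA failures lead to manual checking rather than automatic rejection.
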

Review process

The CCDS database is unique in that the review process must be carried out by multiple collaborators, and agreement must be reached before any changes can be made. This is made possible with a collaborator coordination system that includes a work process flow and forums for analysis and discussion. The CCDS database operates an internal website that serves multiple purposes including curator communication, collaborator voting, providing special reports and tracking the status of CCDS representations. When a collaborating CCDS group member identifies a CCDS ID that may need review, a voting process is employed to decide on the final outcome.

Manual curation

Coordinated manual curation is supported by a restricted-access website and a discussion e-mail list. CCDS curation guidelines were established to address specific conflicts that were observed at a higher frequency. These guidelines have made the curation process more efficient by reducing the number of conflicting votes and the time spent in discussion to reach a consensus agreement. The CCDS curation guidelines are publicly available.

Curation policies established for the CCDS data set have been integrated into the RefSeq and HAVANA annotation guidelines; thus, new annotations provided by both groups are more likely to be concordant and to result in the addition of a CCDS ID. These standards address specific problem areas, are not a comprehensive set of annotation guidelines, and do not restrict the annotation policies of any collaborating group. [2] Examples include standardized curation guidelines for selection of the initiation codon and for interpretation of upstream ORFs and of transcripts that are predicted to be candidates for nonsense-mediated decay. Curation occurs continuously, and any of the collaborating centers can flag a CCDS ID as a potential update or withdrawal.

Conflicting opinions are addressed by consulting with scientific experts or other annotation curation groups such as the HUGO Gene Nomenclature Committee (HGNC) and Mouse Genome Informatics (MGI). If a conflict cannot be resolved, then collaborators agree to withdraw the CCDS ID until more information becomes available.

Curation challenges and annotation guidelines

Nonsense-mediated decay (NMD): NMD is the most powerful mRNA surveillance process. It eliminates defective mRNA before it can be translated into protein. [4] This is important because, if defective mRNA is translated, the truncated protein may cause disease. Different mechanisms have been proposed to explain NMD; one is the exon junction complex (EJC) model. In this model, if the stop codon is >50 nt upstream of the last exon-exon junction, the transcript is assumed to be an NMD candidate. [2] The CCDS collaborators use a conservative method, based on the EJC model, to screen mRNA transcripts. Any transcripts determined to be NMD candidates are excluded from the CCDS data set except in the following situations: [2]

  1. all transcripts at a particular locus are assessed to be NMD candidates, but the locus is already known to be protein-coding;
  2. there is experimental evidence suggesting that a functional protein is produced from the NMD candidate transcript.
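The 50 nt rule of the EJC model can be expressed as a short check. The sketch below is an assumption-laden illustration, not the collaborators' screening code: it takes exon lengths in 5' to 3' transcript order and the 1-based transcript position of the last base of the stop codon, and it measures the distance from that base to the last base before the final exon-exon junction. The choice of reference base and the function name are hypothetical.

```python
from typing import List


def is_nmd_candidate(exon_lengths: List[int], stop_codon_end: int,
                     threshold: int = 50) -> bool:
    """Apply the >50 nt rule to a spliced transcript.

    exon_lengths: exon lengths in transcript (5' to 3') order.
    stop_codon_end: 1-based transcript position of the stop codon's last base.
    """
    if len(exon_lengths) < 2:
        return False  # a single-exon transcript has no exon-exon junction
    # Transcript position of the last base upstream of the final junction.
    last_junction = sum(exon_lengths[:-1])
    return last_junction - stop_codon_end > threshold


exons = [200, 150, 300]  # final junction lies after transcript position 350

print(is_nmd_candidate(exons, stop_codon_end=250))  # stop is 100 nt upstream
print(is_nmd_candidate(exons, stop_codon_end=320))  # stop is 30 nt upstream
print(is_nmd_candidate(exons, stop_codon_end=500))  # stop is in the last exon
```

A transcript flagged here would still be retained if either of the two exceptions listed above applies, so the check is a screen and not a final decision.
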

Previously, NMD candidate transcripts were considered to be protein-coding transcripts by both RefSeq and HAVANA, so these transcripts were represented in the CCDS data set. The RefSeq group and the HAVANA project have subsequently revised their annotation policies.

Multiple in-frame translation start sites: Multiple factors contribute to translation initiation, such as upstream open reading frames (uORFs), secondary structure and the sequence context around the translation initiation site. In vertebrates, a common start site lies within the Kozak consensus sequence, (GCC)GCCACCAUGG. The bracketed motif (GCC) has unknown biological impact. [5] The Kozak consensus sequence varies; for example, either G or A may occur three nucleotides upstream of the AUG (at position -3). Bases between positions -3 and +4 of the Kozak sequence have the most significant impact on translational efficiency. Hence, the sequence (A/G)NNAUGG is defined as a strong Kozak signal in the CCDS project.
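The strong Kozak pattern (A/G)NNAUGG maps directly onto a regular expression over the seven bases from position -3 to +4. The following sketch assumes the caller already knows the 0-based index of the A of the candidate AUG; the function name is hypothetical, and DNA input is converted to RNA letters for convenience.

```python
import re

# Positions -3..+4: purine at -3, any two bases, AUG, then G at +4.
STRONG_KOZAK = re.compile(r"[AG][ACGU]{2}AUGG")


def has_strong_kozak(mrna: str, aug_index: int) -> bool:
    """Check the (A/G)NNAUGG context around the AUG starting at aug_index (0-based)."""
    seq = mrna.upper().replace("T", "U")
    if aug_index < 3:
        return False  # not enough upstream sequence to inspect position -3
    window = seq[aug_index - 3:aug_index + 4]
    return STRONG_KOZAK.fullmatch(window) is not None


print(has_strong_kozak("GCCACCAUGGCG", 6))  # A at -3 and G at +4
print(has_strong_kozak("GCCUCCAUGGCG", 6))  # U at -3
print(has_strong_kozak("GCCACCAUGACG", 6))  # A at +4
```
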

According to the scanning mechanism, the small ribosomal subunit can initiate translation at the first start codon it reaches. There are exceptions to the scanning model:

  1. when the initiation site is not surrounded by a strong Kozak signal, which results in leaky scanning. Thereby, the ribosome skips this AUG and initiates translation from a downstream start site;
  2. when a shorter ORF can allow the ribosome to re-initiate translation at a downstream ORF. [5]

According to the CCDS annotation guidelines, the longest ORF must be annotated except when there is experimental evidence that an internal start site is used to initiate translation. Additionally, other types of new data, such as ribosome profiling data, [6] can be used to identify start codons. The CCDS data set records one translation initiation site per CCDS ID. Alternative start sites that may be used for translation are stated in a CCDS public note.
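The default "longest ORF" rule can be sketched as a scan over every AUG in a spliced transcript. This is a simplified illustration with a hypothetical function name; it considers only AUG start codons and the standard stop codons, and it does not model the experimental-evidence exception or ribosome profiling data mentioned above.

```python
from typing import Optional, Tuple

STOPS = {"UAA", "UAG", "UGA"}


def longest_orf(mrna: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest AUG-to-stop ORF, 0-based and end-exclusive.

    Returns None when no AUG is followed by an in-frame stop codon.
    """
    seq = mrna.upper().replace("T", "U")
    best = None
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "AUG":
            continue
        # Walk codon by codon from this AUG until the first in-frame stop.
        for i in range(start + 3, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                break
    return best


# Two in-frame AUGs share one stop codon; the upstream AUG gives the longer ORF.
print(longest_orf("AUGAAAAUGCCCUAAGG"))
```

Because an in-frame upstream AUG always yields the longer ORF, the rule amounts to choosing the most 5' in-frame start unless evidence supports an internal one.
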

Upstream open reading frames: AUG initiation codons located within transcript leaders are known as upstream AUGs (uAUGs). Sometimes, uAUGs are associated with uORFs. uORFs are found in approximately 50% of human and mouse transcripts. [7] The existence of uORFs is another challenge for the CCDS data set. The scanning mechanism for translation initiation suggests that small ribosomal subunits (40S) bind at the 5′ end of a nascent mRNA transcript and scan for the first AUG start codon. [5] It is possible that a uAUG is recognised first, and the corresponding uORF is then translated. The translated uORF could be an NMD candidate, although studies have shown that some uORFs can avoid NMD. The average size limit for uORFs that will escape NMD is approximately 35 amino acids. [2] [8] It has also been suggested that uORFs inhibit translation of the downstream gene by trapping a ribosome initiation complex and causing the ribosome to dissociate from the mRNA transcript before it reaches the protein-coding regions. [4] [7] Currently, no studies have reported the global impact of uORFs on translational regulation.

The current CCDS annotation guidelines allow the inclusion of mRNA transcripts containing uORFs if they meet the following two biological requirements: [2]

  1. the mRNA transcript has a strong Kozak signal;
  2. the uORF is either ≥ 35 amino acids long or overlaps with the primary open reading frame.

Read-through transcripts: Read-through transcripts are also known as conjoined genes or co-transcribed genes. They are defined as transcripts combining at least part of one exon from each of two or more distinct known (partner) genes that lie on the same chromosome in the same orientation. [9] The biological function of read-through transcripts and their corresponding protein molecules remains unknown. In the CCDS data set, a read-through gene is defined more specifically: the individual partner genes must be distinct, and the read-through transcripts must share ≥ 1 exon (or ≥ 2 splice sites, except in the case of a shared terminal exon) with each of the distinct shorter loci. [2] Transcripts are not considered to be read-through transcripts in the following circumstances:

  1. when transcripts are produced from overlapping genes but do not share the same splice sites;
  2. when transcripts are translated from genes that have nested structures relative to each other. In this instance, the CCDS collaborators and the HGNC have agreed that the read-through transcript be represented as a separate locus.

Quality of reference genome sequence: Because the CCDS data set is built to represent genomic annotations of human and mouse, quality problems in the human and mouse reference genome sequences are another challenge. Quality problems occur when the reference genome is misassembled; the misassembled sequence may then contain premature stop codons, frame-shift indels, or likely polymorphic pseudogenes. Once these problems are identified, the CCDS collaborators report them to the Genome Reference Consortium, which investigates and makes the necessary corrections.

Access to CCDS data

CCDS data are available from the NCBI CCDS data set page, which provides FTP download links and a query interface for retrieving information about CCDS sequences and locations. CCDS reports can be obtained through the query interface at the top of that page. Users can search by several types of identifier, such as CCDS ID, gene ID, gene symbol, nucleotide ID and protein ID. [1] The CCDS reports (Figure 1) are presented in table format and provide links to specific resources, such as a history report and Entrez Gene, [10] or allow the user to re-query the CCDS data set. The sequence identifiers table presents transcript information in VEGA, Ensembl and BLink. The chromosome location table includes the genomic coordinates of each exon of the coding sequence and links to several genome browsers, which allow the structure of the coding region to be visualised. [1] The exact nucleotide and protein sequences of the coding sequence are displayed in the CCDS sequence data section.

Figure 1. The CCDS data set screenshot showing the report for Itm2a protein (CCDS 30349).

Current applications

The CCDS dataset is an integral part of the GENCODE gene annotation project [11] and it is used as a standard for high-quality coding exon definition in various research fields, including clinical studies, large-scale epigenomic studies, exome projects and exon array design. [3] Due to the consensus annotation of CCDS exons by the independent annotation groups, exome projects in particular have regarded CCDS coding exons as reliable targets for downstream studies (e.g., for single nucleotide variant detection), and these exons have been used as coding region targets in commercially available exome kits. [12]

CCDS release history

The size of the CCDS data set has continued to increase, both through computational genome annotation updates, which integrate new data sets submitted to the International Nucleotide Sequence Database Collaboration (INSDC), and through ongoing curation activities that supplement or improve upon that annotation. Table 2 summarises the key statistics for each CCDS build; Public CCDS IDs are all those that were not under review or pending an update or withdrawal at the time of the release.

Table 2. Summary statistics for past CCDS releases.

| Release | Species | Assembly name | Public CCDS ID count | Gene ID count | Current release date |
| --- | --- | --- | --- | --- | --- |
| 1 | Homo sapiens | NCBI35 | 13,740 | 12,950 | Mar 14, 2007 |
| 2 | Mus musculus | MGSCv36 | 13,218 | 13,012 | Nov 28, 2007 |
| 3 | Homo sapiens | NCBI36 | 17,494 | 15,805 | May 1, 2008 |
| 4 | Mus musculus | MGSCv37 | 17,082 | 16,888 | Jan 24, 2011 |
| 5 | Homo sapiens | NCBI36 | 19,393 | 17,053 | Sep 2, 2009 |
| 6 | Homo sapiens | GRCh37 | 22,912 | 18,174 | Apr 20, 2011 |
| 7 | Mus musculus | MGSCv37 | 21,874 | 19,507 | Aug 14, 2012 |
| 8 | Homo sapiens | GRCh37.p2 | 25,354 | 18,407 | Sep 6, 2011 |
| 9 | Homo sapiens | GRCh37.p5 | 26,254 | 18,474 | Oct 25, 2012 |
| 10 | Mus musculus | GRCm38 | 22,934 | 19,945 | Aug 5, 2013 |
| 11 | Homo sapiens | GRCh37.p9 | 27,377 | 18,535 | Apr 29, 2013 |
| 12 | Homo sapiens | GRCh37.p10 | 27,655 | 18,607 | Oct 24, 2013 |
| 13 | Mus musculus | GRCm38.p1 | 23,010 | 19,990 | Apr 7, 2014 |
| 14 | Homo sapiens | GRCh37.p13 | 28,649 | 18,673 | Nov 29, 2013 |
| 15 | Homo sapiens | GRCh37.p13 | 28,897 | 18,681 | Aug 7, 2014 |
| 16 | Mus musculus | GRCm38.p2 | 23,835 | 20,079 | Sep 10, 2014 |
| 17 | Homo sapiens | GRCh38 | 30,461 | 18,800 | Sep 10, 2014 |
| 18 | Homo sapiens | GRCh38.p2 | 31,371 | 18,826 | May 12, 2015 |
| 19 | Mus musculus | GRCm38.p3 | 24,834 | 20,215 | Jul 30, 2015 |
| 20 | Homo sapiens | GRCh38.p7 | 32,524 | 18,892 | Sep 8, 2016 |
| 21 | Mus musculus | GRCm38.p4 | 25,757 | 20,354 | Dec 8, 2016 |
| 22 | Homo sapiens | GRCh38.p12 | 33,397 | 19,033 | Jun 14, 2018 |
| 23 | Mus musculus | GRCm38.p6 | 27,219 | 20,486 | Oct 24, 2019 |
| 24 | Homo sapiens | GRCh38.p14 | 35,608 | 19,107 | Oct 26, 2022 |

The complete set of release statistics can be found at the official CCDS website on their Releases & Statistics page.

Future prospects

Long-term goals include the addition of attributes that indicate where transcript annotation is also identical (including the UTRs) and that identify splice variants with different UTRs that share the same CCDS ID. It is also anticipated that, as more complete and high-quality genome sequence data become available for other organisms, annotations from these organisms may come into scope for CCDS representation.

The CCDS set will become more complete as the independent curation groups agree on cases where they initially differ, as additional experimental validation of weakly supported genes occurs, and as automatic annotation methods continue to improve. Communication among the CCDS collaborating groups is ongoing and will resolve differences and identify refinements between CCDS update cycles. Human updates are expected to occur roughly every 6 months and mouse releases yearly. [3]

References

  1. Pruitt KD, Harrow J, Harte RA, Wallin C, Diekhans M, Maglott DR, Searle S, Farrell CM, Loveland JE, Ruef BJ, Hart E, Suner MM, Landrum MJ, Aken B, Ayling S, Baertsch R, Fernandez-Banet J, Cherry JL, Curwen V, Dicuccio M, Kellis M, Lee J, Lin MF, Schuster M, Shkeda A, Amid C, Brown G, Dukhanina O, Frankish A, Hart J, Maidak BL, Mudge J, Murphy MR, Murphy T, Rajan J, Rajput B, Riddick LD, Snow C, Steward C, Webb D, Weber JA, Wilming L, Wu W, Birney E, Haussler D, Hubbard T, Ostell J, Durbin R, Lipman D (2009). "The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes". Genome Res. 19 (7): 1316–23. doi:10.1101/gr.080531.108. PMC 2704439. PMID 19498102.
  2. Harte, RA; Farrell, CM; Loveland, JE; Suner, MM; Wilming, L; Aken, B; Barrell, D; Frankish, A; Wallin, C; Searle, S; Diekhans, M; Harrow, J; Pruitt, KD (2012). "Tracking and coordinating an international curation effort for the CCDS project". Database. 2012: bas008. doi:10.1093/database/bas008. PMC 3308164. PMID 22434842.
  3. Farrell, CM; O'Leary, NA; Harte, RA; Loveland, JE; Wilming, LG; Wallin, C; Diekhans, M; Barrell, D; Searle, SM; Aken, B; Hiatt, SM; Frankish, A; Suner, MM; Rajput, B; Steward, CA; Brown, GR; Bennet, R; Murphy, M; Wu, W; Kay, MP; Hart, J; Rajan, J; Weber, J; Snow, C; Riddick, LD; Hunt, T; Webb, D; Thomas, M; Tamez, P; Rangwala, SH; McGarvey, KM; Pujar, S; Shkeda, A; Mudge, JM; Gonzalez, JM; Gilbert, JG; Trevanion, SJ; Baertsch, R; Harrow, JL; Hubbard, T; Ostell, JM; Haussler, D; Pruitt, KD (2014). "Current status and new features of the Consensus Coding Sequence database". Nucleic Acids Res. 42 (D1): D865–D872. doi:10.1093/nar/gkt1059. PMC 3965069. PMID 24217909.
  4. Alberts, B; Johnson, A; Lewis, J; Raff, M; Roberts, K; Walter, P (2002). Molecular Biology of the Cell 5th edn. New York: Garland Science.
  5. Kozak, M (2002). "Pushing the limits of the scanning mechanism for initiation of translation". Gene. 299 (1–2): 1–34. doi:10.1016/S0378-1119(02)01056-9. PMC 7126118. PMID 12459250.
  6. Ingolia, NT; Brar, GA; Rouskin, S; McGeachy, AM; Weissman, JS (2014). "Genome-wide Annotation and Quantitation of Translation by Ribosome Profiling". Curr. Protoc. Mol. Biol. Chapter 4: 4.18.1–4.18.19. doi:10.1002/0471142727.mb0418s103. ISBN 9780471142720. PMC 3775365. PMID 23821443.
  7. Calvo, SE; Pagliarini, DJ; Mootha, VK (2009). "Upstream open reading frames cause widespread reduction of protein expression and are polymorphic among humans". Proc. Natl. Acad. Sci. U.S.A. 106 (18): 7507–12. Bibcode:2009PNAS..106.7507C. doi:10.1073/pnas.0810916106. PMC 2669787. PMID 19372376.
  8. Silva, AL; Pereira, FJC; Morgado, A; Kong, J; Martins, R; Faustino, P; Liebhaber, SA; Romao, L (2006). "The canonical UPF1-dependent nonsense-mediated mRNA decay is inhibited in transcripts carrying a short open reading frame independent of sequence context". RNA. 12 (12): 2160–70. doi:10.1261/rna.201406. PMC 1664719. PMID 17077274.
  9. Prakash, Tulika; Sharma, Vineet K.; Adati, Naoki; Ozawa, Ritsuko; Kumar, Naveen; Nishida, Yuichiro; Fujikake, Takayoshi; Takeda, Tadayuki; Taylor, Todd D.; Michalak, Pawel (12 October 2010). "Expression of Conjoined Genes: Another Mechanism for Gene Regulation in Eukaryotes". PLOS ONE. 5 (10): e13284. Bibcode:2010PLoSO...513284P. doi:10.1371/journal.pone.0013284. PMC 2953495. PMID 20967262.
  10. Maglott, D.; Ostell, J.; Pruitt, K. D.; Tatusova, T. (28 November 2010). "Entrez Gene: gene-centered information at NCBI". Nucleic Acids Res. 39 (Database): D52–D57. doi:10.1093/nar/gkq1237. PMC 3013746. PMID 21115458.
  11. Harrow, J.; Frankish, A.; Gonzalez, J. M.; Tapanari, E.; Diekhans, M.; Kokocinski, F.; Aken, B. L.; Barrell, D.; Zadissa, A.; Searle, S.; Barnes, I.; Bignell, A.; Boychenko, V.; Hunt, T.; Kay, M.; Mukherjee, G.; Rajan, J.; Despacio-Reyes, G.; Saunders, G.; Steward, C.; Harte, R.; Lin, M.; Howald, C.; Tanzer, A.; Derrien, T.; Chrast, J.; Walters, N.; Balasubramanian, S.; Pei, B.; Tress, M.; Rodriguez, J. M.; Ezkurdia, I.; van Baren, J.; Brent, M.; Haussler, D.; Kellis, M.; Valencia, A.; Reymond, A.; Gerstein, M.; Guigo, R.; Hubbard, T. J. (5 September 2012). "GENCODE: The reference human genome annotation for The ENCODE Project". Genome Res. 22 (9): 1760–1774. doi:10.1101/gr.135350.111. PMC 3431492. PMID 22955987.
  12. Parla, Jennifer S; Iossifov, Ivan; Grabill, Ian; Spector, Mona S; Kramer, Melissa; McCombie, W Richard (2011). "A comparative analysis of exome capture". Genome Biol. 12 (9): R97. doi:10.1186/gb-2011-12-9-r97. PMC 3308060. PMID 21958622.