In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. [1] ESTs may be used to identify gene transcripts, and were instrumental in gene discovery and in gene-sequence determination. [2] The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases (e.g. GenBank 1 January 2013, all species). EST approaches have largely been superseded by whole genome and transcriptome sequencing and metagenome sequencing.
An EST results from one-shot sequencing of a cloned cDNA. The cDNAs used for EST generation are typically individual clones from a cDNA library. The resulting sequence is a relatively low-quality fragment whose length is limited by current technology to approximately 500 to 800 nucleotides. Because these clones consist of DNA that is complementary to mRNA, the ESTs represent portions of expressed genes. They may be represented in databases as either cDNA/mRNA sequence or as the reverse complement of the mRNA, the template strand.
One can map ESTs to specific chromosome locations using physical mapping techniques, such as radiation hybrid mapping, HAPPY mapping, or FISH. Alternatively, if the genome of the organism that originated the EST has been sequenced, one can align the EST sequence to that genome using a computer.
The current understanding of the human set of genes (as of 2006 [update] ) includes the existence of thousands of genes based solely on EST evidence. In this respect, ESTs have become a tool to refine the predicted transcripts for those genes, which leads to the prediction of their protein products and ultimately of their function. Moreover, the situation in which those ESTs are obtained (tissue, organ, disease state - e.g. cancer) gives information on the conditions in which the corresponding gene is acting. ESTs contain enough information to permit the design of precise probes for DNA microarrays that then can be used to determine gene expression profiles.
Some authors use the term "EST" to describe genes for which little or no further information exists besides the tag. [3]
In 1979, teams at Harvard and Caltech extended the basic idea of making DNA copies of mRNAs in vitro to amplifying a library of such in bacterial plasmids. [4]
In 1982, the idea of selecting random or semi-random clones from such a cDNA library for sequencing was explored by Greg Sutcliffe and coworkers. [5]
In 1983, Putney et al. sequenced 178 clones from a rabbit muscle cDNA library. [6]
In 1991, Adams and co-workers coined the term EST and initiated more systematic sequencing as a project (starting with 600 brain cDNAs). [2]
The dbEST is a division of Genbank established in 1992. As for GenBank, data in dbEST is directly submitted by laboratories worldwide and is not curated.
Because of the way ESTs are sequenced, many distinct expressed sequence tags are often partial sequences that correspond to the same mRNA of an organism. In an effort to reduce the number of expressed sequence tags for downstream gene discovery analyses, several groups assembled expressed sequence tags into EST contigs. Example of resources that provide EST contigs include: TIGR gene indices, [7] Unigene, [8] and STACK [9]
Constructing EST contigs is not trivial and may yield artifacts (contigs that contain two distinct gene products). When the complete genome sequence of an organism is available and transcripts are annotated, it is possible to bypass contig assembly and directly match transcripts with ESTs. This approach is used in the TissueInfo system (see below) and makes it easy to link annotations in the genomic database to tissue information provided by EST data.
High-throughput analyses of ESTs often encounter similar data management challenges. A first challenge is that tissue provenance of EST libraries is described in plain English in dbEST. [10] This makes it difficult to write programs that can unambiguously determine that two EST libraries were sequenced from the same tissue. Similarly, disease conditions for the tissue are not annotated in a computationally friendly manner. For instance, cancer origin of a library is often mixed with the tissue name (e.g., the tissue name "glioblastoma" indicates that the EST library was sequenced from brain tissue and the disease condition is cancer). [11] With the notable exception of cancer, the disease condition is often not recorded in dbEST entries. The TissueInfo project was started in 2000 to help with these challenges. The project provides curated data (updated daily) to disambiguate tissue origin and disease state (cancer/non cancer), offers a tissue ontology that links tissues and organs by "is part of" relationships (i.e., formalizes knowledge that hypothalamus is part of brain, and that brain is part of the central nervous system) and distributes open-source software for linking transcript annotations from sequenced genomes to tissue expression profiles calculated with data in dbEST. [12]
In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, proteins or non-coding RNA, and ultimately affect a phenotype. These products are often proteins, but in non-protein-coding genes such as transfer RNA (tRNA) and small nuclear RNA (snRNA), the product is a functional non-coding RNA. The process of gene expression is used by all known life—eukaryotes, prokaryotes, and utilized by viruses—to generate the macromolecular machinery for life.
A contig is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads); in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequences and to overlapping physical segments (fragments) contained in clones depending on the context.
In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments (reads) result from shotgun sequencing genomic DNA, or gene transcript (ESTs).
In molecular biology, a library is a collection of genetic material fragments that are stored and propagated in a population of microbes through the process of molecular cloning. There are different types of DNA libraries, including cDNA libraries, genomic libraries and randomized mutant libraries. DNA library technology is a mainstay of current molecular biology, genetic engineering, and protein engineering, and the applications of these libraries depend on the source of the original DNA fragments. There are differences in the cloning vectors and techniques used in library preparation, but in general each DNA fragment is uniquely inserted into a cloning vector and the pool of recombinant DNA molecules is then transferred into a population of bacteria or yeast such that each organism contains on average one construct. As the population of organisms is grown in culture, the DNA molecules contained within them are copied and propagated.
The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells. The term can also sometimes be used to refer to all RNAs, or just mRNA, depending on the particular experiment. The term transcriptome is a portmanteau of the words transcript and genome; it is associated with the process of transcript production during the biological process of transcription.
Serial Analysis of Gene Expression (SAGE) is a transcriptomic technique used by molecular biologists to produce a snapshot of the messenger RNA population in a sample of interest in the form of small tags that correspond to fragments of those transcripts. Several variants have been developed since, most notably a more robust version, LongSAGE, RL-SAGE and the most recent SuperSAGE. Many of these have improved the technique with the capture of longer tags, enabling more confident identification of a source gene.
In the fields of bioinformatics and computational biology, Genome survey sequences (GSS) are nucleotide sequences similar to expressed sequence tags (ESTs) that the only difference is that most of them are genomic in origin, rather than mRNA.
RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.
Paired-end tags (PET) are the short sequences at the 5’ and 3' ends of a DNA fragment which are unique enough that they (theoretically) exist together only once in a genome, therefore making the sequence of the DNA in between them available upon search or upon further sequencing. Paired-end tags (PET) exist in PET libraries with the intervening DNA absent, that is, a PET "represents" a larger fragment of genomic or cDNA by consisting of a short 5' linker sequence, a short 5' sequence tag, a short 3' sequence tag, and a short 3' linker sequence. It was shown conceptually that 13 base pairs are sufficient to map tags uniquely. However, longer sequences are more practical for mapping reads uniquely. The endonucleases used to produce PETs give longer tags but sequences of 50–100 base pairs would be optimal for both mapping and cost efficiency. After extracting the PETs from many DNA fragments, they are linked (concatenated) together for efficient sequencing. On average, 20–30 tags could be sequenced with the Sanger method, which has a longer read length. Since the tag sequences are short, individual PETs are well suited for next-generation sequencing that has short read lengths and higher throughput. The main advantages of PET sequencing are its reduced cost by sequencing only short fragments, detection of structural variants in the genome, and increased specificity when aligning back to the genome compared to single tags, which involves only one end of the DNA fragment.
The human gene Chromosome 3 open reading frame 14 is a gene of uncertain function located at 3p14.2 near fragile site FRBA3—which falls between this gene and the centromere. Its protein is expected to localize to the nucleus and bind DNA. Orthologs have been identified in all of the major animal groups, minus amphibians and insects, tracing as far back as the sea anemone; indicating an origin of over 1000 mya, highlighting its importance in the animal genome.
De novo transcriptome assembly is the de novo sequence assembly method of creating a transcriptome without the aid of a reference genome.
Digital transcriptome subtraction (DTS) is a bioinformatics method to detect the presence of novel pathogen transcripts through computational removal of the host sequences. DTS is the direct in silico analogue of the wet-lab approach representational difference analysis (RDA), and is made possible by unbiased high-throughput sequencing and the availability of a high-quality, annotated reference genome of the host. The method specifically examines the etiological agent of infectious diseases and is best known for discovering Merkel cell polyomavirus, the suspect causative agent in Merkel-cell carcinoma.
Chimeric RNA, sometimes referred to as a fusion transcript, is composed of exons from two or more different genes that have the potential to encode novel proteins. These mRNAs are different from those produced by conventional splicing as they are produced by two or more gene loci.
The Cancer Genome Anatomy Project (CGAP), created by the National Cancer Institute (NCI) in 1997 and introduced by Al Gore, is an online database on normal, pre-cancerous and cancerous genomes. It also provides tools for viewing and analysis of the data, allowing for identification of genes involved in various aspects of tumor progression. The goal of CGAP is to characterize cancer at a molecular level by providing a platform with readily accessible updated data and a set of tools such that researchers can easily relate their findings to existing knowledge. There is also a focus on development of software tools that improve the usage of large and complex datasets. The project is directed by Daniela S. Gerhard, and includes sub-projects or initiatives, with notable ones including the Cancer Chromosome Aberration Project (CCAP) and the Genetic Annotation Initiative (GAI). CGAP contributes to many databases and organisations such as the NCBI contribute to CGAP's databases.
In molecular phylogenetics, relationships among individuals are determined using character traits, such as DNA, RNA or protein, which may be obtained using a variety of sequencing technologies. High-throughput next-generation sequencing has become a popular technique in transcriptomics, which represent a snapshot of gene expression. In eukaryotes, making phylogenetic inferences using RNA is complicated by alternative splicing, which produces multiple transcripts from a single gene. As such, a variety of approaches may be used to improve phylogenetic inference using transcriptomic data obtained from RNA-Seq and processed using computational phylogenetics.
Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.
FANTOM is an international research consortium first established in 2000 as part of the RIKEN research institute in Japan. The original meeting gathered international scientists from diverse backgrounds to help annotate the function of mouse cDNA clones generated by the Hayashizaki group. Since the initial FANTOM1 effort, the consortium has released multiple projects that look to understand the mechanisms governing the regulation of mammalian genomes. Their work has generated a large collection of shared data and helped advance biochemical and bioinformatic methodologies in genomics research.
This publication is provided for historical reference only and the information may be out of date
{{cite journal}}
: CS1 maint: numeric names: authors list (link){{cite web}}
: CS1 maint: unfit URL (link)Curated EST tissue provenance, tissue ontology, open-source software