Digital transcriptome subtraction (DTS) is a bioinformatics method for detecting novel pathogen transcripts by computationally removing host sequences. DTS is the direct in silico analogue of the wet-lab approach representational difference analysis (RDA), and is made possible by unbiased high-throughput sequencing and the availability of a high-quality, annotated reference genome of the host. The method is aimed specifically at identifying the etiological agents of infectious diseases and is best known for the discovery of Merkel cell polyomavirus, the suspected causative agent of Merkel-cell carcinoma. [1]
Using computational subtraction to discover novel pathogens was first proposed in 2002 by Meyerson et al. [2] using human expressed sequence tag (EST) datasets. In a proof of principle experiment, Meyerson et al. demonstrated that it was a feasible approach using Epstein–Barr virus-infected lymphocytes in post-transplant lymphoproliferative disorder (PTLD). [3]
In 2007, the term "Digital Transcriptome Subtraction" was coined by the Chang-Moore group, [4] and the method was used to discover Merkel cell polyomavirus in Merkel-cell carcinoma. [1]
Concurrently with the discovery of Merkel cell polyomavirus (MCV), this approach was used to implicate a novel arenavirus as the cause of death in a case where three patients died of similar illnesses shortly after receiving organ transplants from a single donor. [5]
After treatment with DNase I to eliminate human genomic DNA, total RNA is extracted from primary infected tissue. Messenger RNA is then purified using an oligo-dT column that binds to the poly-A tail, a feature specific to transcribed mRNA. Using random hexamer priming, reverse transcriptase (RT) converts all mRNA into cDNA, which is then cloned into bacterial vectors. Bacteria, usually E. coli, are transformed with the cDNA vectors and selected using a marker; the collection of transformed clones is the cDNA library. This generates a snapshot of tissue mRNA that is stable and can be sequenced at a later stage.
The cDNA library must be sequenced to great depth (i.e. a large number of clones must be sequenced) in order to detect a rare pathogen sequence (Table 1), especially if the foreign sequence is novel. The Chang-Moore group recommends a sequencing depth of 200,000 transcripts or greater using multiple sequencing platforms. [1]
% Viral | 5,000 clones | 10,000 clones | 20,000 clones | 50,000 clones |
---|---|---|---|---|
0.001% | 4.9% | 9.5% | 18.1% | 39.3% |
0.01% | 39.3% | 63.2% | 86.5% | 99.3% |
0.02% | 63.2% | 86.5% | 98.2% | >99.995% |
0.03% | 77.7% | 95.5% | 99.8% | >99.995% |
0.04% | 86.5% | 98.2% | >99.995% | >99.995% |
0.1% | 99.3% | >99.995% | >99.995% | >99.995% |
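The relationship between sequencing depth and detection probability can be sketched with a simple sampling model. This is an illustration rather than the published derivation: it assumes each sequenced clone is an independent draw and that a fixed fraction of clones is viral, so the chance of seeing at least one viral clone is 1 − (1 − f)^n. The function name `detection_probability` is hypothetical.

```python
def detection_probability(viral_fraction: float, clones: int) -> float:
    """Probability of sampling at least one viral clone.

    Assumes every sequenced clone is an independent draw and that
    `viral_fraction` (0..1) of all clones in the library are viral.
    """
    return 1.0 - (1.0 - viral_fraction) ** clones


if __name__ == "__main__":
    depths = (5_000, 10_000, 20_000, 50_000)
    for percent in (0.001, 0.01, 0.02, 0.03, 0.04, 0.1):
        # Table percentages are converted to fractions before use.
        row = [detection_probability(percent / 100, n) for n in depths]
        print(f"{percent}% | " + " | ".join(f"{p:.1%}" for p in row))
```

Under this model, a viral fraction of 0.01% sequenced to 5,000 clones gives roughly a 39.3% chance of detection, in line with the corresponding table entry.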
Stringent quality controls are then applied to the raw sequences to minimize false-positive results. The initial quality screen uses several general parameters to exclude ambiguous sequences, leaving behind a dataset of high-fidelity (Hi-Fi) reads.
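A quality screen of this kind might look like the following minimal sketch. The specific parameters are not given in the text, so the minimum length and the tolerated fraction of ambiguous bases used here are purely illustrative assumptions, as is the function name `is_high_fidelity`.

```python
def is_high_fidelity(
    read: str,
    min_length: int = 50,                 # assumed threshold, not from the source
    max_ambiguous_fraction: float = 0.0,  # assumed threshold, not from the source
) -> bool:
    """Return True if a read passes a simple ambiguity and length screen."""
    read = read.upper()
    if not read or len(read) < min_length:
        return False
    # Any character other than A, C, G or T counts as an ambiguous base call.
    ambiguous = sum(1 for base in read if base not in "ACGT")
    return ambiguous / len(read) <= max_ambiguous_fraction


def filter_reads(reads):
    """Keep only the reads that pass the screen."""
    return [read for read in reads if is_high_fidelity(read)]
```

A real pipeline would typically also use per-base quality scores and vector or adapter trimming; those steps are omitted here.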
Using MEGABLAST, Hi-Fi reads are then matched to sequences in annotated databases, and any positive matches are subtracted from the dataset. The minimum hit length for a positive match to human sequence is typically 30 consecutive identical bases, which equates to a BLAST score of 60; the remaining sequences are generally BLASTed again with less stringent parameters to allow for slight mismatches (1 in 20 nucleotides). The vast majority of sequences (>99%) should be removed from the dataset at this stage.
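The first, stringent subtraction pass can be illustrated with a toy stand-in for MEGABLAST. The sketch below is not MEGABLAST and does not compute BLAST scores: it simply discards any read that shares at least one exact 30-base substring with a host sequence on either strand, which mirrors the "30 consecutive identical bases" criterion. All function names are hypothetical, and an in-memory set is only workable for small reference sets.

```python
from typing import Iterable, List, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (non-ACGT characters are kept)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """All substrings of length k; empty if the sequence is shorter than k."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_sequences: Iterable[str], k: int = 30) -> Set[str]:
    """Index every k-mer of the host sequences on both strands."""
    index: Set[str] = set()
    for seq in host_sequences:
        index |= kmers(seq, k)
        index |= kmers(reverse_complement(seq), k)
    return index


def subtract_host(reads: Iterable[str], host_index: Set[str], k: int = 30) -> List[str]:
    """Keep reads that share no exact k-mer with the host index."""
    return [read for read in reads if kmers(read, k).isdisjoint(host_index)]
```

The second, less stringent pass described above (tolerating about one mismatch in 20 nucleotides) needs a real alignment tool and is not covered by this exact-match sketch.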
Subtracted sequences typically include:
After stringent rounds of subtraction, the remaining sequences are clustered into non-redundant contigs and aligned to known pathogen sequences using low-stringency parameters. Because pathogen genomes mutate quickly, nucleotide-nucleotide alignment (blastn) is usually uninformative: codon degeneracy allows bases to change without altering the encoded amino acid residue. Translating each sequence in silico in all six reading frames and matching the resulting protein sequences to the amino acid sequences of annotated proteins (blastx) is the preferred alignment method, as it increases the likelihood of identifying a novel pathogen by matching it to a related strain or species. [5] Experimental extension of candidate sequences may also be used at this stage to maximize the chances of a positive match. [6]
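The six-frame translation that underlies a blastx-style search can be sketched as follows. This is a minimal illustration, assuming the standard genetic code; it only produces the six conceptual translations and does not perform the protein database search itself. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO_ACIDS)
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; unknown codons (e.g. containing N) become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Translate a nucleotide sequence in three forward and three reverse frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames
```

Two nucleotide sequences that differ at synonymous positions give the same protein in the relevant frame, which is why comparison at the protein level tolerates the rapid nucleotide divergence described above.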
In cases where alignment to known pathogens is uninformative or ambiguous, contigs of candidate sequence can be used as templates for primer walking in primary infected tissue to generate the complete pathogen genome sequence. [1] [5] As viral transcripts are exceedingly rare relative to tissue mRNA (10 transcripts in 1 million), [1] the original candidate sequences alone are unlikely to yield a complete transcriptome because of low coverage.
Once a putative pathogen has been identified in the high-throughput sequencing data, it is imperative to validate the presence of the pathogen in infected patients using more sensitive techniques, such as:
The primary application of DTS lies in the identification of pathogenic viruses in cancer. [1] [4] It can also be used to identify viral pathogens in non-cancer-related disease. [5] Future clinical applications could include routine use of DTS in individuals. DTS could also be applied to agriculture, identifying pathogens that affect output. Computational subtraction has already been used in a metagenomics study that associated infection by Israeli acute paralysis virus (IAPV) with colony collapse disorder in honey bees. [7]