Digital transcriptome subtraction

Fig 1. Digital Transcriptome Subtraction

Digital transcriptome subtraction (DTS) is a bioinformatics method for detecting novel pathogen transcripts through computational removal of host sequences. DTS is the direct in silico analogue of the wet-lab approach representational difference analysis (RDA), and is made possible by unbiased high-throughput sequencing and the availability of a high-quality, annotated reference genome of the host. The method is aimed at identifying the etiological agents of infectious diseases and is best known for the discovery of Merkel cell polyomavirus, the suspected causative agent of Merkel-cell carcinoma. [1]

History

Using computational subtraction to discover novel pathogens was first proposed in 2002 by Meyerson et al. [2] using human expressed sequence tag (EST) datasets. In a proof of principle experiment, Meyerson et al. demonstrated that it was a feasible approach using Epstein–Barr virus-infected lymphocytes in post-transplant lymphoproliferative disorder (PTLD). [3]

In 2007, the term "Digital Transcriptome Subtraction" was coined by the Chang-Moore group, [4] and the method was used to discover Merkel cell polyomavirus (MCV) in Merkel-cell carcinoma. [1]

Concurrently with the MCV discovery, this approach was used to implicate a novel arenavirus as the cause of death in a case where three patients died of similar illnesses shortly after receiving organ transplants from a single donor. [5]

Method

Fig. 2. Raw transcript breakdown from sequencing 20,000 clones derived from virus-infected human tissues. Viral transcripts were present at 0.03% of the total sequence reads.

Construction of cDNA library

After treatment with DNase I to eliminate human genomic DNA, total RNA is extracted from primary infected tissue. Messenger RNA is then purified using an oligo-dT column that binds the poly-A tail, a signal found on the transcripts of expressed genes. Using random hexamer priming, reverse transcriptase (RT) converts the mRNA into cDNA, which is then cloned into bacterial vectors. Bacteria, usually E. coli, are transformed with the cDNA vectors and selected using a marker; the collection of transformed clones is the cDNA library. This provides a stable snapshot of the tissue mRNA that can be sequenced at a later stage.

Sequencing and quality control

The cDNA library must be sequenced to great depth (i.e. a large number of clones must be sequenced) in order to detect a rare pathogen sequence (Table 1), especially if the foreign sequence is novel. The Chang-Moore group recommends a sequencing depth of 200,000 transcripts or greater using multiple sequencing platforms. [1]

Table 1. Probability of capturing at least one viral transcript in human tissue-derived libraries. [2]

% Viral    5,000 clones    10,000 clones    20,000 clones    50,000 clones
0.001%     4.9%            9.5%             18.1%            39.3%
0.01%      39.3%           63.2%            86.5%            99.3%
0.02%      63.2%           86.5%            98.2%            >99.995%
0.03%      77.7%           95.5%            99.8%            >99.995%
0.04%      86.5%           98.2%            >99.995%         >99.995%
0.1%       99.3%           >99.995%         >99.995%         >99.995%
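
The relationship between viral abundance, library size and detection probability can be sketched with a simple calculation. The example below is only an illustration: it assumes that every clone is an independent draw and is viral with a fixed probability (a binomial model), which may not be the exact model used in the cited work, so computed values can differ slightly from the published table. The function names are hypothetical.

```python
import math


def prob_at_least_one(viral_fraction: float, n_clones: int) -> float:
    """Probability that a library of n_clones contains one or more viral clones.

    Assumes each clone is viral independently with probability viral_fraction.
    """
    return 1.0 - (1.0 - viral_fraction) ** n_clones


def clones_needed(viral_fraction: float, target: float) -> int:
    """Smallest library size for which prob_at_least_one reaches the target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - viral_fraction))


if __name__ == "__main__":
    library_sizes = (5_000, 10_000, 20_000, 50_000)
    for percent in (0.001, 0.01, 0.02, 0.03, 0.04, 0.1):
        # Convert a percentage such as 0.03% into a fraction such as 0.0003.
        row = [prob_at_least_one(percent / 100, n) for n in library_sizes]
        print(f"{percent}%", "  ".join(f"{p:.1%}" for p in row))
```

Under the same assumption, `clones_needed` gives the depth required to reach a chosen detection probability for a given viral fraction, which is one way to reason about why deep sequencing is recommended for rare transcripts.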

Stringent quality control is then applied to the raw sequences to minimize false-positive results. The initial quality screen uses several general parameters to exclude ambiguous sequences, leaving a dataset of high-fidelity (Hi-Fi) reads.
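
A quality screen of this kind might look like the following sketch. The specific parameters are not given in the text, so the three criteria used here (minimum read length, number of ambiguous base calls, and dominance of a single base as a crude low-complexity check) and their default thresholds are assumptions chosen purely for illustration.

```python
from collections import Counter


def is_hifi(read: str,
            min_length: int = 50,
            max_ambiguous: int = 0,
            max_single_base_fraction: float = 0.8) -> bool:
    """Return True if a read passes a simple, hypothetical quality screen."""
    seq = read.upper()
    # Reject empty reads and reads shorter than the minimum length.
    if not seq or len(seq) < min_length:
        return False
    counts = Counter(seq)
    # Count every character that is not an unambiguous base call (e.g. N).
    ambiguous = sum(n for base, n in counts.items() if base not in "ACGT")
    if ambiguous > max_ambiguous:
        return False
    # Reject low-complexity reads dominated by one base (e.g. poly-A runs).
    if max(counts.values()) / len(seq) > max_single_base_fraction:
        return False
    return True


def filter_reads(reads):
    """Keep only the reads that pass the screen."""
    return [read for read in reads if is_hifi(read)]
```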

BLAST to host genome

Using MEGABLAST, Hi-Fi reads are matched to sequences in annotated databases, and any positive matches are subtracted from the dataset. The minimum hit length for a positive match to human sequence is typically 30 consecutive identical bases, which equates to a BLAST score of 60. The remaining sequences are generally run through BLAST again with less stringent parameters that allow for slight mismatches (1 in 20 nucleotides). The vast majority of sequences (>99%) should be removed from the dataset at this stage.
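
The idea behind the first, stringent subtraction pass can be illustrated without BLAST. The sketch below is not MEGABLAST and does not reproduce its scoring; it substitutes a plain exact-match lookup in which a read is treated as host-derived if it shares at least one run of 30 identical consecutive bases with a host sequence on either strand. All function names are hypothetical, and an in-memory set of 30-mers is only practical for small toy references.

```python
def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence; non-ACGT characters are kept."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def build_kmer_index(host_sequences, k: int = 30) -> set:
    """Collect every k-mer that occurs in the host sequences."""
    index = set()
    for seq in host_sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index.add(seq[i:i + k])
    return index


def matches_host(read: str, index: set, k: int = 30) -> bool:
    """True if the read, on either strand, contains a k-mer found in the host."""
    for strand in (read.upper(), reverse_complement(read)):
        for i in range(len(strand) - k + 1):
            if strand[i:i + k] in index:
                return True
    return False


def subtract_host(reads, host_sequences, k: int = 30):
    """Return the reads that share no exact k-mer with the host."""
    index = build_kmer_index(host_sequences, k)
    return [read for read in reads if not matches_host(read, index, k)]
```

Reads shorter than 30 bases can never match under this rule and are therefore kept, and reads with even a single mismatch inside every 30-base window also survive. This is one reason a second, less stringent pass is useful.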

Subtracted sequences typically include:

Analysis of "non-host" candidates

Alignment to pathogen databases

After stringent rounds of subtraction, the remaining sequences are clustered into non-redundant contigs and aligned to known pathogen sequences using low-stringency parameters. Because pathogen genomes mutate quickly, nucleotide-to-nucleotide alignment (blastn) is usually uninformative: codon degeneracy allows mutations at certain bases without any change in the encoded amino acid residue. The preferred method is blastx, which translates each candidate sequence in silico in all six reading frames and matches the resulting protein sequences to the amino acid sequences of annotated proteins. This increases the likelihood of identifying a novel pathogen by matching it to a related strain or species. [5] Experimental extension of candidate sequences might also be used at this stage to maximize the chances of a positive match. [6]
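
The six-frame translation step and the effect of codon degeneracy can be shown with a short sketch. It uses the standard genetic code and hypothetical function names, and it covers only the translation, not the database search that blastx performs afterwards.

```python
from itertools import product

# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence; non-ACGT characters are kept."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate a DNA sequence codon by codon ('*' = stop, 'X' = unknown)."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Translate three offsets on the forward strand and three on the reverse."""
    frames = {}
    for name, strand in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{name}{offset + 1}"] = translate(strand[offset:])
    return frames


if __name__ == "__main__":
    # Two sequences that differ at five of six bases encode the same peptide.
    print(translate("CTGAGC"), translate("TTATCA"))
    print(six_frame_translations("ATGGCCTAA"))
```

The two short sequences in the example agree at only one nucleotide position yet translate to the same dipeptide, which is the situation in which a protein-level comparison still finds a match that a nucleotide-level comparison would miss.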

De novo assembly

In cases where alignment to known pathogens is uninformative or ambiguous, contigs of candidate sequence can be used as templates for primer walking in primary infected tissue to generate the complete pathogen genome sequence. [1] [5] As viral transcripts are exceedingly rare relative to tissue mRNA (10 transcripts in 1 million), [1] coverage is low, and the original candidate sequences alone are unlikely to be sufficient to assemble a transcriptome.

Validation of pathogen

Once a putative pathogen has been identified in the high-throughput sequencing data, it is imperative to validate the presence of the pathogen in infected patients using more sensitive techniques, such as:

Applications

The primary application of DTS is the identification of pathogenic viruses in cancer. [1] [4] It can also be used to identify viral pathogens in non-cancer-related disease. [5] Future clinical applications could include the routine use of DTS in individual patients. DTS could also be applied in agriculture to identify pathogens that affect output. Computational subtraction has already been used in a metagenomics study that associated infection by Israeli acute paralysis virus (IAPV) with colony collapse disorder in honey bees. [7]

Advantages

Disadvantages


References

  1. Feng H, Shuda M, Chang Y, Moore PS (Jan 2008). "Clonal integration of a polyomavirus in human Merkel cell carcinoma". Science. 319 (5866): 1096–1100. Bibcode:2008Sci...319.1096F. doi:10.1126/science.1152586. PMC 2740911. PMID 18202256.
  2. Weber G, Shendure J, Tanenbaum DM, Church GM, Meyerson M (Feb 2002). "Identification of foreign gene sequences by transcript filtering against the human genome". Nat Genet. 30 (2): 141–142. doi:10.1038/ng818. PMID 11788827. S2CID 21842679.
  3. Xu Y, Stange-Thomann N, Weber G, Bo R, Dodge S, David RG, Foley K, Beheshti J, Harris NL, Birren B, Lander ES, Meyerson M (Mar 2003). "Pathogen discovery from human tissue by sequence-based computational subtraction". Genomics. 81 (3): 329–335. doi:10.1016/S0888-7543(02)00043-5. PMID 12659816.
  4. Feng H, Taylor JL, Benos PV, Newton R, Waddell K, Lucas SB, Chang Y, Moore PS (Aug 2007). "Human Transcriptome Subtraction by Using Short Sequence Tags To Search for Tumor Viruses in Conjunctival Carcinoma". J Virol. 81 (20): 11332–11340. doi:10.1128/JVI.00875-07. PMC 2045575. PMID 17686852.
  5. Palacios G, Druce J, Du L, Tran T, Birch C, Briese T, Conlan S, Quan PL, Hui J, Marshall J, Simons JF, Egholm M, Paddock CD, Shieh WJ, Goldsmith CS, Zaki SR, Catton M, Lipkin WI (Mar 2008). "A new arenavirus in a cluster of fatal transplant-associated diseases". N Engl J Med. 358 (10): 991–998. CiteSeerX 10.1.1.453.2859. doi:10.1056/NEJMoa073785. PMID 18256387.
  6. Chang Y, Moore PS. "New Pathogen Discovery: Digital Transcriptome Subtraction". Archived from the original on 25 January 2010. Retrieved 1 March 2012.
  7. Cox-Foster DL, Conlan S, Holmes EC, Palacios G, Evans JD, Moran NA, Quan PL, Briese T, Hornig M, Geiser DM, Martinson V, vanEngelsdorp D, Kalkstein AL, Drysdale A, Hui J, Zhai J, Cui L, Hutchison SK, Simons JF, Egholm M, Pettis JS, Lipkin WI (Oct 2007). "A metagenomic survey of microbes in honey bee colony collapse disorder". Science. 318 (5848): 283–287. Bibcode:2007Sci...318..283C. doi:10.1126/science.1146498. PMID 17823314. S2CID 14013425.
  8. MacConaill L, Meyerson M (Apr 2008). "Adding pathogens by genomic subtraction". Nat Genet. 40 (4): 380–382. doi:10.1038/ng0408-380. PMID 18368124.