Chimera (molecular biology)

In molecular biology, and particularly in high-throughput DNA sequencing, a chimera is a single DNA sequence that arises when multiple transcripts or DNA sequences are joined. Chimeras are generally considered artifacts and are filtered out of the data during processing [1] to prevent spurious inferences of biological variation. [2] However, chimeras should not be confused with chimeric reads, which are generally used by structural variant callers to detect structural variation events [3] and do not always indicate the presence of a chimeric transcript or gene.

In a different context, the deliberate creation of artificial chimeras can also be a useful tool in molecular biology. For example, in protein engineering, "chimeragenesis" (forming chimeras between proteins that are encoded by homologous cDNAs) is one of the "two major techniques used to manipulate cDNA sequences". [4] For gene fusions that occur through natural processes, see chimeric genes and fusion genes.

Description

Transcript chimera

A chimera can occur as a single cDNA sequence originating from two transcripts. It is usually considered a contaminant in transcript and expressed sequence tag (EST) databases, hence the name EST chimera. [5] It is estimated that approximately 1% of all transcripts in the National Center for Biotechnology Information's UniGene database contain a "chimeric sequence". [6]

PCR chimera

A chimera can also be an artifact of PCR amplification. It occurs when the extension of an amplicon is aborted, and the aborted product functions as a primer in the next PCR cycle. The aborted product anneals to the wrong template and continues to extend, thereby synthesizing a single sequence sourced from two different templates. [7]
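The mechanism can be made concrete with a toy model. The sketch below is a minimal illustration, not a simulation of real PCR kinetics: it assumes two equal-length, pre-aligned templates, and the function name `form_pcr_chimera` and the example sequences are hypothetical.

```python
def form_pcr_chimera(template_a: str, template_b: str, abort_at: int) -> str:
    """Join an aborted extension product of template_a to the rest of template_b.

    Toy assumption: both templates are aligned amplicons of equal length,
    so the aborted product anneals to template_b at the same coordinates.
    """
    if len(template_a) != len(template_b):
        raise ValueError("toy model expects equal-length, aligned templates")
    if not 0 < abort_at < len(template_a):
        raise ValueError("abort_at must fall strictly inside the template")
    # Cycle n: extension along template_a stops early.
    aborted_product = template_a[:abort_at]
    # Cycle n+1: the aborted product primes on template_b and extension resumes.
    continued = template_b[abort_at:]
    return aborted_product + continued


# Hypothetical parent amplicons that share a conserved prefix.
parent_a = "ACGTACGTAAAACCCC"
parent_b = "ACGTACGTGGGGTTTT"
chimera = form_pcr_chimera(parent_a, parent_b, abort_at=12)
print(chimera)  # ACGTACGTAAAATTTT
```

The result matches neither parent, which is why such a sequence can look like a novel variant in downstream analysis.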

PCR chimeras are an important issue in metabarcoding, where DNA sequences from environmental samples are used to assess biodiversity. A chimera is a novel sequence that will most probably not match any known organism. Hence, it might be interpreted as a new species, thereby inflating the estimated diversity.

PCR chimeras also occur in DNA sequencing. In this case, the most common mechanism of chimera formation is incomplete extension during PCR, which yields partial strands that can act as primers in subsequent PCR cycles on similar but non-identical sequences. Extension from such hybrid priming events produces chimeric sequences. [1]

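Several computational methods have been devised to detect and remove chimeras, including chimera checking in QIIME, [9] the UCHIME algorithm, [10] the removeBimeraDenovo function, [11] Bellerophon, [12] CATCh [13] and DECIPHER. [14]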
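As a toy illustration of what reference-based chimera detection looks for, the sketch below asks whether a query is explained much better by two reference "parents" joined at a breakpoint than by any single reference. It is not the algorithm of any published tool: it assumes pre-aligned, equal-length sequences, and the function names and the `min_gain` threshold are hypothetical choices made for this example.

```python
from itertools import permutations


def mismatches(a: str, b: str) -> int:
    """Count positions at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def best_single_parent(query: str, refs: dict) -> tuple:
    """Return (mismatch count, name) of the closest single reference."""
    return min((mismatches(query, seq), name) for name, seq in refs.items())


def best_two_parent_model(query: str, refs: dict) -> tuple:
    """Return (mismatch count, left parent, right parent, breakpoint)."""
    if len(refs) < 2:
        raise ValueError("need at least two reference sequences")
    best = None
    # Try every ordered pair of parents and every internal breakpoint.
    for (name_a, seq_a), (name_b, seq_b) in permutations(refs.items(), 2):
        for bp in range(1, len(query)):
            cost = (mismatches(query[:bp], seq_a[:bp])
                    + mismatches(query[bp:], seq_b[bp:]))
            candidate = (cost, name_a, name_b, bp)
            if best is None or candidate < best:
                best = candidate
    return best


def looks_chimeric(query: str, refs: dict, min_gain: int = 3) -> bool:
    """Flag the query if a two-parent model removes at least min_gain mismatches."""
    single_cost, _ = best_single_parent(query, refs)
    two_cost = best_two_parent_model(query, refs)[0]
    return single_cost - two_cost >= min_gain


# Hypothetical aligned reference amplicons and a query built from both.
references = {"A": "ACGTACGTAAAACCCC", "B": "ACGTACGTGGGGTTTT"}
print(looks_chimeric("ACGTACGTAAAATTTT", references))  # True
print(looks_chimeric("ACGTACGTAAAACCCC", references))  # False
```

Real tools must additionally handle alignment, sequencing error, and abundance information, which this sketch deliberately ignores.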

Chimeric read

A read is a sequence of nucleic acids determined through high-throughput DNA or RNA sequencing, corresponding to a DNA or RNA fragment. A chimeric read, or split read, is a read whose subsections align to different positions in a reference genome. [15] Such reads are not always a sign of a PCR chimera and are often used to detect structural variation. [3]
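In the SAM format, chimeric alignments are typically marked with the supplementary-alignment flag bit (0x800) and an `SA:Z:` tag listing the other parts of the alignment. The sketch below shows one way to collect the names of such reads from plain SAM text using only the standard library. The helper name `split_read_names` and the example records are hypothetical, and a production pipeline would more likely rely on an established SAM/BAM parser.

```python
SUPPLEMENTARY = 0x800  # SAM FLAG bit for a supplementary alignment


def split_read_names(sam_lines):
    """Return the names of reads that have a chimeric (split) alignment."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        # Optional tags start at the 12th column.
        has_sa_tag = any(tag.startswith("SA:Z:") for tag in fields[11:])
        if flag & SUPPLEMENTARY or has_sa_tag:
            names.add(fields[0])
    return names


# Hypothetical records: read1 is split between chr1 and chr5, read2 is not.
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted",
    "\t".join(["read1", "0", "chr1", "100", "60", "50M50S", "*", "0", "0",
               "*", "*", "SA:Z:chr5,2000,+,50S50M,60,0;"]),
    "\t".join(["read1", "2048", "chr5", "2000", "60", "50H50M", "*", "0", "0",
               "*", "*", "SA:Z:chr1,100,+,50M50S,60,0;"]),
    "\t".join(["read2", "0", "chr1", "500", "60", "100M", "*", "0", "0",
               "*", "*"]),
]
print(split_read_names(example_sam))  # {'read1'}
```

Finding such a read only shows that its parts map to different places; deciding whether that reflects a structural variant or an artifact requires further evidence.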

Examples

References

  1. "Chimeras". www.drive5.com. Retrieved 2022-10-27.
  2. Edgar, Robert C. (2016-09-12). "UCHIME2: improved chimera prediction for amplicon sequencing". bioRxiv. Cold Spring Harbor Laboratory: 074252. doi:10.1101/074252. S2CID 88955007.
  3. Kosugi, Shunichi; Momozawa, Yukihide; Liu, Xiaoxi; Terao, Chikashi; Kubo, Michiaki; Kamatani, Yoichiro (December 2019). "Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing". Genome Biology. 20 (1): 117. doi:10.1186/s13059-019-1720-5. ISSN 1474-760X. PMC 6547561. PMID 31159850.
  4. Lajtha A, Reith ME (2007). Handbook of Neurochemistry and Molecular Neurobiology Neural Membranes and Transport. Boston, MA: Springer Science+Business Media, LLC. p. 485. ISBN 978-0-387-30347-5. p. 424
  5. Unneberg P, Claverie JM (February 2007). Hoheisel J (ed.). "Tentative mapping of transcription-induced interchromosomal interaction using chimeric EST and mRNA data". PLOS ONE. 2 (2): e254. Bibcode:2007PLoSO...2..254U. doi:10.1371/journal.pone.0000254. PMC 1804257. PMID 17330142.
  6. Nelson C. "EST Assembly for the Creation of Oligonucleotide Probe Targets" (PDF). Agilent Technologies. Archived from the original (PDF) on 23 February 2012. Retrieved May 12, 2009.
  7. Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, et al. (March 2011). "Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons". Genome Research. 21 (3): 494–504. doi:10.1101/gr.112730.110. PMC 3044863. PMID 21212162.
  8. Maidak BL, Olsen GJ, Larsen N, Overbeek R, McCaughey MJ, Woese CR (January 1996). "The Ribosomal Database Project (RDP)". Nucleic Acids Research. 24 (1): 82–85. doi:10.1093/nar/24.1.82. PMC 145599. PMID 8594608.
  9. "Chimera checking sequences with QIIME". Quantitative Insights Into Microbial Ecology (QIIME). Retrieved 2019-01-10.
  10. Edgar R. "UCHIME algorithm". drive5.com. Retrieved 2019-01-10.
  11. "removeBimeraDenovo function". R Documentation. www.rdocumentation.org. Retrieved 2019-01-10.
  12. Huber T, Faulkner G, Hugenholtz P (September 2004). "Bellerophon: a program to detect chimeric sequences in multiple sequence alignments". Bioinformatics. 20 (14): 2317–2319. doi:10.1093/bioinformatics/bth226. PMID 15073015.
  13. Mysara M, Saeys Y, Leys N, Raes J, Monsieurs P (March 2015). Wommack KE (ed.). "CATCh, an ensemble classifier for chimera detection in 16S rRNA sequencing studies". Applied and Environmental Microbiology. 81 (5): 1573–1584. Bibcode:2015ApEnM..81.1573M. doi:10.1128/AEM.02896-14. PMC 4325141. PMID 25527546.
  14. Wright ES, Yilmaz LS, Noguera DR (February 2012). "DECIPHER, a search-based approach to chimera identification for 16S rRNA sequences". Applied and Environmental Microbiology. 78 (3): 717–725. Bibcode:2012ApEnM..78..717W. doi:10.1128/AEM.06516-11. PMC 3264099. PMID 22101057.
  15. "SAM Format specifications" (PDF). Retrieved 2023-05-31.
  16. "Entrez Gene: CYP2C18 cytochrome P450, family 2, subfamily C, polypeptide 18". National Center for Biotechnology Information. Retrieved May 12, 2009.