SNV calling from NGS data is any of a range of methods for identifying the existence of single nucleotide variants (SNVs) from the results of next generation sequencing (NGS) experiments. These are computational techniques, in contrast to experimental methods based on known population-wide single nucleotide polymorphisms (see SNP genotyping). As NGS data become more abundant, these techniques are increasingly used for SNP genotyping, with a wide variety of algorithms designed for specific experimental designs and applications. [1] In addition to the usual application domain of SNP genotyping, these techniques have been successfully adapted to identify rare SNPs within a population, [2] as well as to detect somatic SNVs within an individual using multiple tissue samples. [3]
Most NGS based methods for SNV detection are designed to detect germline variations in the individual's genome. These are the mutations that an individual biologically inherits from their parents, and are the usual type of variants searched for when performing such analysis (except for certain specific applications where somatic mutations are sought). Very often, the variants searched for occur at some (possibly low) frequency throughout the population, in which case they may be referred to as single nucleotide polymorphisms (SNPs). Technically, the term SNP refers only to these kinds of variations; in practice, however, it is often used synonymously with SNV in the literature on variant calling. In addition, since the detection of germline SNVs requires determining the individual's genotype at each locus, the phrase "SNP genotyping" may also be used to refer to this process. However, this phrase may also refer to wet-lab experimental procedures for classifying genotypes at a set of known SNP locations.
The usual process of such techniques is based on filtering and aligning the sequenced reads to a reference genome, applying a variant calling algorithm to the aligned reads, and filtering the resulting variant calls. [1]
The usual output of these procedures is a VCF file.
In an ideal, error-free world with high read coverage, the task of variant calling from the results of an NGS data alignment would be simple: at each locus (position on the genome), the number of occurrences of each distinct nucleotide among the reads aligned at that position could be counted, and the true genotype would be obvious: either AA if all nucleotides match allele A, BB if they match allele B, or AB if there is a mixture. However, when working with real NGS data this sort of naive approach is not used, as it cannot account for the noise in the input data. [4] The nucleotide counts used for base calling contain errors and bias, both due to the sequenced reads themselves and due to the alignment process. This issue can be mitigated to some extent by sequencing to a greater depth of read coverage; however, this is often expensive, and many practical studies require making inferences on low coverage data. [1]
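To make the counting idea above concrete, the following is a minimal Python sketch of such a naive caller (the function name, parameters, and thresholds are illustrative, not taken from any published tool). It classifies a locus purely from the fraction of aligned reads supporting each allele, which is exactly the approach that breaks down in the presence of sequencing and alignment noise.

```python
from collections import Counter

def naive_genotype_call(aligned_bases, allele_a, allele_b, het_band=(0.2, 0.8)):
    """Illustrative naive caller: classify a locus as AA, AB, or BB purely
    from the fraction of aligned reads supporting the B allele.
    Real callers avoid this approach because it ignores noise in the data."""
    counts = Counter(aligned_bases)
    informative = counts[allele_a] + counts[allele_b]
    if informative == 0:
        return None  # no reads supporting either allele at this locus
    b_fraction = counts[allele_b] / informative
    if b_fraction < het_band[0]:
        return allele_a + allele_a   # homozygous for allele A
    if b_fraction > het_band[1]:
        return allele_b + allele_b   # homozygous for allele B
    return allele_a + allele_b       # heterozygous

# Ten reads covering one locus: the single 'T' could be a real variant or an error
print(naive_genotype_call("AAAATAAAAA", "A", "T"))  # -> 'AA'
```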
Probabilistic methods aim to overcome the above issue by producing robust estimates of the probabilities of each of the possible genotypes, taking into account noise as well as other available prior information that can be used to improve estimates. A genotype can then be predicted from these probabilities, often as the maximum a posteriori (MAP) estimate.
Probabilistic methods for variant calling are based on Bayes' theorem. In the context of variant calling, Bayes' theorem gives the probability of each genotype being the true genotype given the observed data, in terms of the prior probabilities of each possible genotype and the probability distribution of the data given each possible genotype. The formula is:

P(G_i \mid D) = \frac{P(D \mid G_i)\, P(G_i)}{P(D)} = \frac{P(D \mid G_i)\, P(G_i)}{\sum_j P(D \mid G_j)\, P(G_j)}

In the above equation, D denotes the observed read data at a given locus, G_i denotes the i-th possible genotype at that locus, and the sum in the denominator runs over all possible genotypes.
Given the above framework, different software solutions for detecting SNVs vary based on how they calculate the prior probabilities P(G_i), the error model used to model the data probabilities P(D | G_i), and the partitioning of the overall genotype into separate sub-genotypes, whose probabilities can be individually estimated in this framework. [5]
The calculation of prior probabilities depends on the available data from the genome being studied and the type of analysis being performed. For studies where good reference data containing frequencies of known mutations are available (for example, in studying human genome data), these known genotype frequencies in the population can be used to estimate priors. Given population-wide allele frequencies, prior genotype probabilities can be calculated at each locus according to the Hardy–Weinberg equilibrium. [6] In the absence of such data, constant priors can be used, independent of the locus. These can be set using heuristically chosen values, possibly informed by the kind of variations being sought by the study. Alternatively, supervised machine-learning procedures have been investigated that seek to learn optimal prior values for the individuals in a sample, using NGS data supplied from those individuals. [4]
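As an illustration of the Hardy–Weinberg approach to priors, the sketch below derives per-locus genotype priors from a population alternate-allele frequency (a hypothetical helper function, not code from any particular caller).

```python
def hardy_weinberg_priors(alt_allele_frequency):
    """Genotype priors at a locus under Hardy-Weinberg equilibrium, given the
    population frequency q of the alternate (B) allele:
    P(AA) = (1 - q)^2,  P(AB) = 2 q (1 - q),  P(BB) = q^2."""
    q = alt_allele_frequency
    p = 1.0 - q
    return {"AA": p * p, "AB": 2.0 * p * q, "BB": q * q}

# A known SNP with a 10% alternate allele frequency in the reference population
print(hardy_weinberg_priors(0.10))  # approximately {'AA': 0.81, 'AB': 0.18, 'BB': 0.01}
```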
The error model used in creating a probabilistic method for variant calling is the basis for calculating the P(D | G_i) term used in Bayes' theorem. If the data were assumed to be error-free, the distribution of observed nucleotide counts at each locus would follow a binomial distribution, with 100% of nucleotides matching the A or B allele respectively in the AA and BB cases, and a 50% chance of each nucleotide matching either A or B in the AB case. However, in the presence of noise in the read data this assumption is violated, and the P(D | G_i) values need to account for the possibility that erroneous nucleotides are present in the aligned reads at each locus.
A simple error model is to introduce a small error term into the data probability in the homozygous cases, allowing a small constant probability that nucleotides not matching the A allele are observed in the AA case, and, respectively, a small constant probability that nucleotides not matching the B allele are observed in the BB case. However, more sophisticated procedures are available that attempt to replicate more realistically the actual error patterns observed in real data when calculating the conditional data probabilities. For instance, estimates of read quality (measured as Phred quality scores) have been incorporated into these calculations, taking into account the expected error rate of each individual read at a locus. [7] Another technique that has successfully been incorporated into error models is base quality recalibration, where separate error rates are calculated – based on prior knowledge of error patterns – for each possible nucleotide substitution. Research shows that the possible nucleotide substitutions are not all equally likely to appear as errors in sequencing data, so base quality recalibration has been applied to improve error probability estimates. [6]
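The sketch below ties these pieces together: per-base error probabilities derived from Phred quality scores feed a simple genotype likelihood model, which is combined with priors (for example, the Hardy–Weinberg priors above) via Bayes' theorem to produce posterior genotype probabilities and a MAP call. It assumes reads are independent, considers only the two alleles at the locus, and uses the Phred-derived error probability directly as the chance of observing a mismatching base; the function names are hypothetical, and real callers use considerably more elaborate error models (including base quality recalibration as described above).

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to a base-error probability: e = 10^(-Q/10)."""
    return 10.0 ** (-q / 10.0)

def genotype_likelihoods(bases, quals, ref, alt):
    """P(D | G) for G in {AA, AB, BB}, treating aligned reads as independent
    and each base as correct with probability (1 - e) for its Phred score."""
    lk = {"AA": 1.0, "AB": 1.0, "BB": 1.0}
    for base, q in zip(bases, quals):
        e = phred_to_error_prob(q)
        p_if_ref = (1.0 - e) if base == ref else e  # P(observed base | true allele is ref)
        p_if_alt = (1.0 - e) if base == alt else e  # P(observed base | true allele is alt)
        lk["AA"] *= p_if_ref
        lk["BB"] *= p_if_alt
        lk["AB"] *= 0.5 * p_if_ref + 0.5 * p_if_alt  # allele drawn at random from a heterozygote
    return lk

def map_genotype(likelihoods, priors):
    """Apply Bayes' theorem: posterior proportional to P(D | G) * P(G); return the MAP genotype."""
    unnorm = {g: likelihoods[g] * priors[g] for g in priors}
    total = sum(unnorm.values())
    posteriors = {g: v / total for g, v in unnorm.items()}
    return max(posteriors, key=posteriors.get), posteriors

# Six reads at a locus, two of them supporting the alternate allele at high quality,
# combined with Hardy-Weinberg priors for a 10% alternate allele frequency
bases = ["A", "A", "T", "A", "T", "A"]
quals = [30, 35, 32, 28, 33, 30]
lk = genotype_likelihoods(bases, quals, ref="A", alt="T")
print(map_genotype(lk, {"AA": 0.81, "AB": 0.18, "BB": 0.01}))  # MAP call is 'AB'
```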
In the above discussion, it has been assumed that the genotype probabilities at each locus are calculated independently; that is, the entire genotype is partitioned into independent genotypes at each locus, whose probabilities are calculated independently. However, due to linkage disequilibrium the genotypes of nearby loci are in general not independent. As a result, partitioning the overall genotype instead into a sequence of overlapping haplotypes allows these correlations to be modelled, resulting in more precise probability estimates through the incorporation of population-wide haplotype frequencies in the prior. The use of haplotypes to improve variant detection accuracy has been applied successfully, for instance in the 1000 Genomes Project. [8]
As an alternative to probabilistic methods, heuristic methods exist for performing variant calling on NGS data. Instead of modelling the distribution of the observed data and using Bayesian statistics to calculate genotype probabilities, variant calls are made based on a variety of heuristic factors, such as minimum allele counts, read quality cut-offs, and bounds on read depth. Although they have been relatively unpopular in comparison to probabilistic methods, their use of bounds and cut-offs can in practice make them robust to outlying data that violate the assumptions of probabilistic models. [9]
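A minimal sketch of this heuristic style of calling is shown below; the specific thresholds are illustrative defaults, not values recommended by any particular tool.

```python
def heuristic_snv_call(ref_count, alt_count, mean_alt_quality,
                       min_depth=10, max_depth=500,
                       min_alt_count=4, min_alt_fraction=0.2, min_quality=20):
    """Call a candidate SNV only if simple cut-offs are all satisfied,
    rather than computing genotype probabilities."""
    depth = ref_count + alt_count
    if not (min_depth <= depth <= max_depth):
        return False  # too little evidence, or suspiciously deep coverage
    if alt_count < min_alt_count:
        return False  # too few reads supporting the alternate allele
    if alt_count / depth < min_alt_fraction:
        return False  # alternate allele fraction too low
    if mean_alt_quality < min_quality:
        return False  # supporting bases are of poor quality
    return True

print(heuristic_snv_call(ref_count=40, alt_count=12, mean_alt_quality=31))  # True
```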
An important part of the design of variant calling methods using NGS data is the DNA sequence used as a reference to which the NGS reads are aligned. In human genetics studies, high quality references are available from sources such as the HapMap project, [10] which can substantially improve the accuracy of the variant calls made by variant calling algorithms. Such references can also serve as a source of prior genotype probabilities for Bayesian analysis. However, in the absence of such a high quality reference, experimentally obtained reads can first be assembled in order to create a reference sequence for alignment. [1]
Various methods exist for filtering data in variant calling experiments, in order to remove sources of error/bias. This can involve the removal of suspicious reads before performing alignment and/or filtering of the list of variants returned by the variant calling algorithm.
Depending on the sequencing platform used, various biases may exist within the set of sequenced reads. For instance, strand bias can occur, where there is a highly unequal distribution of forward versus reverse read directions among the reads aligned in some neighborhood. Additionally, some reads may be duplicated unusually often (for instance, due to bias in PCR amplification). Such biases can result in dubious variant calls – for instance, if a fragment containing a PCR error at some locus is over-amplified due to PCR bias, that locus will have a high count of the false allele and may be called as an SNV – and so analysis pipelines frequently filter calls based on these biases. [1]
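As a simple illustration of this kind of post-call filtering, the sketch below flags a candidate SNV whose supporting reads come almost entirely from one strand. Real pipelines often use a statistical test (for example, a Fisher's exact test on reference/alternate counts per strand) rather than a fixed threshold; the function name and thresholds here are illustrative only.

```python
def has_strand_bias(alt_forward, alt_reverse, min_per_strand=1, max_imbalance=0.9):
    """Flag a candidate SNV if the alternate allele is supported almost
    exclusively by reads from a single strand."""
    total = alt_forward + alt_reverse
    if total == 0:
        return True  # no supporting reads at all: nothing to call
    if min(alt_forward, alt_reverse) < min_per_strand:
        return True  # one strand contributes (almost) no support
    return max(alt_forward, alt_reverse) / total > max_imbalance

# Alternate allele seen 12 times on the forward strand and never on the reverse strand
print(has_strand_bias(alt_forward=12, alt_reverse=0))  # True: likely an artefact
```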
In addition to methods that align reads from individual sample(s) to a reference genome in order to detect germline genetic variants, reads from multiple tissue samples within a single individual can be aligned and compared in order to detect somatic variants. These variants correspond to mutations that have occurred de novo within groups of somatic cells within an individual (that is, they are not present within the individual's germline cells). This form of analysis has been frequently applied to the study of cancer, where many studies are designed around investigating the profile of somatic mutations within cancerous tissues. Such investigations have resulted in diagnostic tools that have seen clinical application, and are used to improve scientific understanding of the disease, for instance by the discovery of new cancer-related genes, identification of involved gene regulatory networks and metabolic pathways, and by informing models of how tumors grow and evolve. [11]
Until recently, software tools for carrying out this form of analysis have been heavily underdeveloped, and were based on the same algorithms used to detect germline variations. Such procedures are not optimized for this task, because they do not adequately model the statistical correlation between the genotypes present in multiple tissue samples from the same individual. [3]
More recent investigations have resulted in the development of software tools especially optimized for the detection of somatic mutations from multiple tissue samples. Probabilistic techniques have been developed that pool allele counts from all tissue samples at each locus and, using statistical models for the likelihoods of the joint genotypes of all the tissues and for the distribution of allele counts given a genotype, calculate relatively robust probabilities of somatic mutations at each locus using all the available data. [3] [12] In addition, there has recently been some investigation into machine learning based techniques for performing this analysis. [13] [14] [15] [16]
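The following sketch conveys the core idea of such joint approaches in a deliberately simplified form: allele counts from a tumour sample and a matched normal sample are scored under two competing hypotheses using binomial likelihoods, and the difference in log-likelihood indicates how strongly the pooled data favour a somatic variant. All names, rates, and the two-hypothesis simplification are illustrative assumptions; published tools model many more joint genotypes and additional sources of error.

```python
import math

def log_binomial(alt, depth, f):
    """log P(alt | depth, expected alternate-allele fraction f) under a binomial model."""
    f = min(max(f, 1e-6), 1.0 - 1e-6)  # keep the logarithms finite
    return (math.lgamma(depth + 1) - math.lgamma(alt + 1) - math.lgamma(depth - alt + 1)
            + alt * math.log(f) + (depth - alt) * math.log(1.0 - f))

def somatic_log_odds(normal_alt, normal_depth, tumor_alt, tumor_depth,
                     error_rate=0.01, tumor_variant_fraction=0.3):
    """Compare two joint hypotheses at a locus:
      H0: both samples are homozygous reference (alternate reads are errors)
      H1: the normal is homozygous reference but the tumour carries a variant
    A positive return value means the pooled data favour a somatic call."""
    h0 = (log_binomial(normal_alt, normal_depth, error_rate)
          + log_binomial(tumor_alt, tumor_depth, error_rate))
    h1 = (log_binomial(normal_alt, normal_depth, error_rate)
          + log_binomial(tumor_alt, tumor_depth, tumor_variant_fraction))
    return h1 - h0

# No alternate reads in the normal sample, 15 of 60 reads alternate in the tumour
print(somatic_log_odds(normal_alt=0, normal_depth=45, tumor_alt=15, tumor_depth=60))
```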
In 2021, the Sequencing Quality Control Phase 2 Consortium [17] published a number of studies investigating the effects of sample preparation, sequencing library kits, sequencing platforms, and bioinformatics workflows on the accuracy of somatic SNV detection, [18] based on a pair of tumor-normal cell lines that the Consortium established as reference samples, data, and call sets. [19]
In genetics and bioinformatics, a single-nucleotide polymorphism is a germline substitution of a single nucleotide at a specific position in the genome that is present in a sufficiently large fraction of the population considered.
A haplotype is a group of alleles in an organism that are inherited together from a single parent.
A DNA segment is identical by state (IBS) in two or more individuals if they have identical nucleotide sequences in this segment. An IBS segment is identical by descent (IBD) in two or more individuals if they have inherited it from a common ancestor without recombination, that is, the segment has the same ancestral origin in these individuals. DNA segments that are IBD are IBS by definition, but segments that are not IBD can still be IBS due to the same mutations in different individuals or recombinations that do not alter the segment.
Genotyping is the process of determining differences in the genetic make-up (genotype) of an individual by examining the individual's DNA sequence using biological assays and comparing it to another individual's sequence or a reference sequence. It reveals the alleles an individual has inherited from their parents. Traditionally genotyping is the use of DNA sequences to define biological populations by use of molecular tools. It does not usually involve defining the genes of an individual.
A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. It is possible to identify genetic variation and association to phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.
SNP genotyping is the measurement of genetic variations of single nucleotide polymorphisms (SNPs) between members of a species. It is a form of genotyping, which is the measurement of more general genetic variation. SNPs are one of the most common types of genetic variation. An SNP is a single base pair mutation at a specific locus, usually consisting of two alleles. SNPs are found to be involved in the etiology of many human diseases and are becoming of particular interest in pharmacogenetics. Because SNPs are conserved during evolution, they have been proposed as markers for use in quantitative trait loci (QTL) analysis and in association studies in place of microsatellites. The use of SNPs is being extended in the HapMap project, which aims to provide the minimal set of SNPs needed to genotype the human genome. SNPs can also provide a genetic fingerprint for use in identity testing. The increased interest in SNPs is reflected in the rapid development of a diverse range of SNP genotyping methods.
In genomics, a genome-wide association study is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. GWA studies typically focus on associations between single-nucleotide polymorphisms (SNPs) and traits like major human diseases, but can equally be applied to any other genetic variants and any other organisms.
The 1000 Genomes Project, launched in January 2008, was an international research effort to establish by far the most detailed catalogue of human genetic variation. Scientists planned to sequence the genomes of at least one thousand anonymous participants from a number of different ethnic groups within the following three years, using newly developed technologies which were faster and less expensive. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature. In 2012, the sequencing of 1092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported results and the completion of the project and opportunities for future research.
The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only, it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.
Molecular Inversion Probe (MIP) belongs to the class of Capture by Circularization molecular techniques for performing genomic partitioning, a process through which one captures and enriches specific regions of the genome. Probes used in this technique are single stranded DNA molecules and, similar to other genomic partitioning techniques, contain sequences that are complementary to the target in the genome; these probes hybridize to and capture the genomic target. MIP stands unique from other genomic partitioning strategies in that MIP probes share the common design of two genomic target complementary segments separated by a linker region. With this design, when the probe hybridizes to the target, it undergoes an inversion in configuration and circularizes. Specifically, the two target complementary regions at the 5’ and 3’ ends of the probe become adjacent to one another while the internal linker region forms a free hanging loop. The technology has been used extensively in the HapMap project for large-scale SNP genotyping as well as for studying gene copy alterations and characteristics of specific genomic loci to identify biomarkers for different diseases such as cancer. Key strengths of the MIP technology include its high specificity to the target and its scalability for high-throughput, multiplexed analyses where tens of thousands of genomic loci are assayed simultaneously.
Cancer genome sequencing is the whole genome sequencing of a single, homogeneous or heterogeneous group of cancer cells. It is a biochemical laboratory method for the characterization and identification of the DNA or RNA sequences of cancer cell(s).
COLD-PCR is a modified polymerase chain reaction (PCR) protocol that enriches variant alleles from a mixture of wildtype and mutation-containing DNA. The ability to preferentially amplify and identify minority alleles and low-level somatic DNA mutations in the presence of excess wildtype alleles is useful for the detection of mutations. Detection of mutations is important in the case of early cancer detection from tissue biopsies and body fluids such as blood plasma or serum, assessment of residual disease after surgery or chemotherapy, disease staging and molecular profiling for prognosis or tailoring therapy to individual patients, and monitoring of therapy outcome and cancer remission or relapse. Common PCR will amplify both the major (wildtype) and minor (mutant) alleles with the same efficiency, occluding the ability to easily detect the presence of low-level mutations. The capacity to detect a mutation in a mixture of variant/wildtype DNA is valuable because this mixture of variant DNAs can occur when provided with a heterogeneous sample – as is often the case with cancer biopsies. Currently, traditional PCR is used in tandem with a number of different downstream assays for genotyping or the detection of somatic mutations. These can include the use of amplified DNA for RFLP analysis, MALDI-TOF genotyping, or direct sequencing for detection of mutations by Sanger sequencing or pyrosequencing. Replacing traditional PCR with COLD-PCR for these downstream assays will increase the reliability in detecting mutations from mixed samples, including tumors and body fluids.
Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps: the first step is to select only the subset of DNA that encodes proteins. These regions are known as exons—humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology.
Single nucleotide polymorphism annotation is the process of predicting the effect or function of an individual SNP using SNP annotation tools. In SNP annotation the biological information is extracted, collected and displayed in a clear form amenable to query. SNP functional annotation is typically performed based on the available information on nucleic acid and protein sequences.
In bioinformatics, a DNA read error occurs when a sequence assembler changes one DNA base for a different base. The reads from the sequence assembler can then be used to create a de Bruijn graph, which can be used in various ways to find errors.
Duplex sequencing is a library preparation and analysis method for next-generation sequencing (NGS) platforms that employs random tagging of double-stranded DNA to detect mutations with higher accuracy and lower error rates.
PyClone is a software tool that implements a hierarchical Bayesian statistical model to estimate cellular frequency patterns of mutations in a population of cancer cells using observed alternate allele frequencies, copy number, and loss of heterozygosity (LOH) information. PyClone outputs clusters of variants based on calculated cellular frequencies of mutations.
ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome.
Human somatic variations are somatic mutations both at early stages of development and in adult cells. These variations can lead either to pathogenic phenotypes or not, even if their function in healthy conditions is not completely clear yet.