Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome (known as the exome). [1] It consists of two steps. The first step is to select only the subset of DNA that encodes proteins. These regions are known as exons; humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology. [2]
The goal of this approach is to identify genetic variants that alter protein sequences, and to do this at a much lower cost than whole-genome sequencing. Since these variants can be responsible for both Mendelian and common polygenic diseases, such as Alzheimer's disease, whole exome sequencing has been applied both in academic research and as a clinical diagnostic.
Exome sequencing is especially effective in the study of rare Mendelian diseases, because it is an efficient way to identify the genetic variants in all of an individual's genes. These diseases are most often caused by very rare genetic variants that are present in only a tiny number of individuals; [3] by contrast, techniques such as SNP arrays can only detect shared genetic variants that are common to many individuals in the wider population. [4] Furthermore, because severe disease-causing variants are much more likely (though by no means certain) to lie in the protein-coding sequence, [5] [6] focusing on this 1% costs far less than whole genome sequencing but still detects a high yield of relevant variants.
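As a rough check of the figures quoted above, the arithmetic can be sketched as follows. The haploid genome size of about 3.1 billion base pairs is an assumption brought in for illustration (it is not stated in the text), and the function names are hypothetical.

```python
# Figures quoted in the text.
EXON_COUNT = 180_000        # approximate number of human exons
EXOME_BP = 30_000_000       # approximate exome size in base pairs

# Assumption, not from the text: haploid human genome of ~3.1 billion bp.
GENOME_BP = 3_100_000_000


def exome_fraction(exome_bp: int, genome_bp: int) -> float:
    """Fraction of the genome occupied by the exome."""
    return exome_bp / genome_bp


def mean_exon_length(exome_bp: int, exon_count: int) -> float:
    """Average number of base pairs per exon."""
    return exome_bp / exon_count


if __name__ == "__main__":
    print(f"Exome share of genome: {exome_fraction(EXOME_BP, GENOME_BP):.2%}")
    print(f"Mean exon length: {mean_exon_length(EXOME_BP, EXON_COUNT):.0f} bp")
```

Under that assumption the exome comes out at just under 1% of the genome, consistent with the "about 1%" figure, and the implied average exon is under 200 base pairs long.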
In the past, clinical genetic tests were chosen based on the clinical presentation of the patient (i.e. focused on one gene or a small number known to be associated with a particular syndrome), or surveyed only certain types of variation (e.g. comparative genomic hybridization), but they provided definitive genetic diagnoses in fewer than half of all patients. [7] Exome sequencing is now increasingly used to complement these other tests: both to find mutations in genes already known to cause disease and to identify novel genes by comparing exomes from patients with similar features.[ citation needed ]
Target-enrichment methods allow one to selectively capture genomic regions of interest from a DNA sample prior to sequencing. Several target-enrichment strategies have been developed since the original description of the direct genomic selection (DGS) method in 2005. [8]
Though many techniques have been described for targeted capture, only a few of these have been extended to capture entire exomes. [9] The first target enrichment strategy to be applied to whole exome sequencing was the array-based hybrid capture method in 2007, but in-solution capture has gained popularity in recent years.
Microarrays carry single-stranded oligonucleotides, fixed to the surface, whose sequences are taken from the human genome and tile the region of interest. Genomic DNA is sheared to form double-stranded fragments. The fragments undergo end-repair to produce blunt ends, and adaptors with universal priming sequences are added. These fragments are hybridized to the oligonucleotides on the microarray. Unhybridized fragments are washed away and the desired fragments are eluted. The fragments are then amplified using PCR. [10] [11]
Roche NimbleGen was first to take the original DGS technology [8] and adapt it for next-generation sequencing. They developed the Sequence Capture Human Exome 2.1M Array to capture ~180,000 coding exons. [12] This method is both time-saving and cost-effective compared to PCR based methods. The Agilent Capture Array and the comparative genomic hybridization array are other methods that can be used for hybrid capture of target sequences. Limitations in this technique include the need for expensive hardware as well as a relatively large amount of DNA. [13]
To capture genomic regions of interest using in-solution capture, a pool of custom oligonucleotides (probes) is synthesized and hybridized in solution to a fragmented genomic DNA sample. The probes (labeled with beads) selectively hybridize to the genomic regions of interest, after which the beads (now carrying the DNA fragments of interest) can be pulled down and washed to clear excess material. The beads are then removed and the genomic fragments can be sequenced, allowing selective DNA sequencing of the genomic regions of interest (e.g., exons).
This method was developed to improve on the array-based hybridization capture method. In in-solution capture, as opposed to array-based capture, there is an excess of probes targeting the regions of interest relative to the amount of template required. [13] The optimal target size is about 3.5 megabases, which yields excellent sequence coverage of the target regions. The preferred method depends on several factors, including the number of base pairs in the region of interest, the demands for reads on target, and the equipment available in house. [14]
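The notion of "reads on target" mentioned above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: a fragment counts as captured if it overlaps any target interval, whereas real capture efficiency depends on hybridization chemistry. All coordinates and function names are invented for illustration.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open interval [start, end) on one chromosome


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def capture(fragments: List[Interval], targets: List[Interval]) -> List[Interval]:
    """Keep fragments overlapping any target (idealized probe pull-down)."""
    return [f for f in fragments if any(overlaps(f, t) for t in targets)]


def on_target_fraction(fragments: List[Interval], targets: List[Interval]) -> float:
    """Share of all fragments that would be retained by the capture step."""
    if not fragments:
        return 0.0
    return len(capture(fragments, targets)) / len(fragments)


if __name__ == "__main__":
    # Hypothetical exon targets and sheared fragments.
    targets = [(100, 200), (500, 650)]
    fragments = [(0, 90), (80, 150), (190, 260), (300, 400), (640, 700), (650, 720)]
    print(capture(fragments, targets))
    print(on_target_fraction(fragments, targets))
```

In this invented example, three of the six fragments overlap a target and would be retained; the rest correspond to material washed away before sequencing.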
Many next-generation sequencing (NGS) platforms are available, postdating classical Sanger sequencing methodologies. They include the Roche 454 sequencer, the Life Technologies SOLiD and Ion Torrent systems, and Illumina's Genome Analyzer II (defunct) and subsequent MiSeq, HiSeq, and NovaSeq series instruments, all of which can be used for massively parallel exome sequencing. These 'short read' NGS systems are particularly well suited to analysing many relatively short stretches of DNA sequence, as found in human exons.
There are multiple technologies available that identify genetic variants. Each technology has advantages and disadvantages in terms of technical and financial factors. Two such technologies are microarrays and whole-genome sequencing.
Microarrays use hybridization probes to test the prevalence of known DNA sequences, thus they cannot be used to identify unexpected genetic changes. [13] In contrast, the high-throughput sequencing technologies used in exome sequencing directly provide the nucleotide sequences of DNA at the thousands of exonic loci tested. [15] Hence, WES addresses some of the present limitations of hybridization genotyping arrays.
Although exome sequencing is more expensive than hybridization-based technologies on a per-sample basis, its cost has been decreasing due to the falling cost and increased throughput of whole genome sequencing.[ citation needed ]
Exome sequencing can identify only those variants in the coding regions of genes that affect protein function. It cannot identify the structural and non-coding variants associated with disease, which can be found using other methods such as whole genome sequencing. [2] Although 99% of the human genome is not covered by exome sequencing, the approach allows the targeted portion of the genome to be sequenced in at least 20 times as many samples as whole genome sequencing. [2] Translating identified rare variants into the clinic depends on sample size and on the ability to interpret the results to provide a clinical diagnosis; with current knowledge in genetics, there are reports of exome sequencing being used to assist diagnosis. [12] The cost of exome sequencing is typically lower than that of whole genome sequencing. [16]
The statistical analysis of the large quantity of data generated from sequencing approaches is a challenge. Even by only sequencing the exomes of individuals, a large quantity of data and sequence information is generated which requires a significant amount of data analysis. Challenges associated with the analysis of this data include changes in programs used to align and assemble sequence reads. [13] Various sequencing technologies also have different error rates and generate various read-lengths which can pose challenges in comparing results from different sequencing platforms.
False positive and false negative findings are associated with genomic resequencing approaches and are critical issues. A few strategies have been developed to improve the quality of exome data such as:
Variants causing rare recessive disorders may be absent from public single nucleotide polymorphism (SNP) databases such as dbSNP. More common recessive phenotypes are more likely to have disease-causing variants reported in dbSNP. For example, the most common cystic fibrosis variant has an allele frequency of about 3% in most populations. Screening out such variants might erroneously exclude the corresponding genes from consideration. Genes for recessive disorders are usually easier to identify than those for dominant disorders because the genes are less likely to have more than one rare nonsynonymous variant. [2] Screening for common genetic variants relies on dbSNP, which may not have accurate information about allele variation. Using lists of common variation from exome- or genome-sequenced individuals within the study would be more reliable. A challenge in this approach is that as the number of exomes sequenced increases, the number of uncommon variants in dbSNP will also increase. It will be necessary to develop thresholds to define the common variants that are unlikely to be associated with a disease phenotype. [15]
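The difference between discarding every variant found in a database and applying a frequency threshold can be sketched as below. This is a minimal illustration, not a description of any real pipeline: the variant identifiers, the frequency table, and the 5% cutoff are all hypothetical, and choosing an appropriate threshold is exactly the open problem noted above.

```python
from typing import Dict, List


def filter_by_presence(variants: List[str], known_freqs: Dict[str, float]) -> List[str]:
    """Drop every variant that appears in the reference database at all."""
    return [v for v in variants if v not in known_freqs]


def filter_by_frequency(
    variants: List[str], known_freqs: Dict[str, float], max_af: float
) -> List[str]:
    """Drop only variants whose population allele frequency exceeds max_af.

    Variants absent from the database are treated as frequency 0.0.
    """
    return [v for v in variants if known_freqs.get(v, 0.0) <= max_af]


if __name__ == "__main__":
    # Hypothetical database: one recessive disease allele at 3%, one common SNP at 25%.
    known_freqs = {"GENE_A:v1": 0.03, "GENE_B:v2": 0.25}
    patient_variants = ["GENE_A:v1", "GENE_B:v2", "GENE_C:v3"]

    print(filter_by_presence(patient_variants, known_freqs))
    print(filter_by_frequency(patient_variants, known_freqs, max_af=0.05))
```

With these invented values, the presence-based filter discards the 3% disease allele along with the common SNP, whereas the threshold-based filter retains it as a candidate.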
Genetic heterogeneity and population ethnicity are also major limitations, as they may increase the number of false positive and false negative findings, making the identification of candidate genes more difficult. It is possible to reduce the stringency of the thresholds in the presence of heterogeneity and ethnicity; however, this will also reduce the power to detect variants. Using a genotype-first approach to identify candidate genes might also offer a way to overcome these limitations.
Unlike common variant analysis, the analysis of rare variants in whole-exome sequencing studies evaluates variant sets rather than single variants. [17] [18] Functional annotations predict the effect or function of rare variants and help prioritize rare functional variants. Incorporating these annotations can effectively boost the power of rare variant association analysis in whole genome sequencing studies. [19] Several methods and tools have been developed to perform functionally informed rare variant association analysis, incorporating functional annotations to increase power in whole exome sequencing studies. [20] [21]
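The idea of evaluating a set of rare variants together, weighted by functional annotation, can be sketched with a simple burden-style score. This is an illustrative simplification and not the specific methods cited above; the weights, genotypes, and function names are hypothetical, and a real analysis would add a statistical test and covariate adjustment.

```python
from typing import Dict, List

Genotype = Dict[str, int]  # variant id -> number of alternate alleles carried (0, 1 or 2)


def burden_score(genotype: Genotype, weights: Dict[str, float]) -> float:
    """Collapse one individual's rare variants in a gene into a weighted count."""
    return sum(weights.get(variant, 0.0) * count for variant, count in genotype.items())


def mean_burden_difference(
    cases: List[Genotype], controls: List[Genotype], weights: Dict[str, float]
) -> float:
    """Mean burden score in cases minus mean burden score in controls."""
    if not cases or not controls:
        raise ValueError("both groups must be non-empty")
    case_mean = sum(burden_score(g, weights) for g in cases) / len(cases)
    control_mean = sum(burden_score(g, weights) for g in controls) / len(controls)
    return case_mean - control_mean


if __name__ == "__main__":
    # Hypothetical annotation-derived weights: v1 predicted damaging, v3 predicted neutral.
    weights = {"v1": 1.0, "v2": 0.5, "v3": 0.0}
    cases = [{"v1": 1}, {"v2": 1}, {"v1": 1, "v3": 1}, {}]
    controls = [{"v3": 1}, {}, {"v2": 1}, {}]
    print(mean_burden_difference(cases, controls, weights))
```

Down-weighting variants predicted to be neutral keeps them from diluting the signal of the set, which is the intuition behind using functional annotations to gain power.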
New technologies in genomics have changed the way researchers approach both basic and translational research. With approaches such as exome sequencing, it is possible to significantly enhance the data generated from individual genomes which has put forth a series of questions on how to deal with the vast amount of information. Should the individuals in these studies be allowed to have access to their sequencing information? Should this information be shared with insurance companies? This data can lead to unexpected findings and complicate clinical utility and patient benefit. This area of genomics still remains a challenge and researchers are looking into how to address these questions. [15]
By using exome sequencing, fixed-cost studies can sequence samples to much higher depth than could be achieved with whole genome sequencing. This additional depth makes exome sequencing well suited to several applications that need reliable variant calls.
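The depth advantage at a fixed sequencing output can be sketched with simple arithmetic. The figures used here are assumptions for illustration only: a 3.1 billion base pair genome, a 30 million base pair exome, a total yield of 93 billion sequenced bases, and perfect on-target efficiency unless stated otherwise. The function names are hypothetical.

```python
def mean_depth(sequenced_bases: float, target_bp: int, on_target_fraction: float = 1.0) -> float:
    """Average coverage of the target given the total bases sequenced."""
    return sequenced_bases * on_target_fraction / target_bp


def samples_at_depth(
    total_bases: float, target_bp: int, depth: int, on_target_fraction: float = 1.0
) -> int:
    """Number of whole samples that fit into a fixed output at a given mean depth."""
    bases_per_sample = target_bp * depth
    return int(total_bases * on_target_fraction // bases_per_sample)


if __name__ == "__main__":
    GENOME_BP = 3_100_000_000   # assumed haploid genome size
    EXOME_BP = 30_000_000       # exome size quoted earlier in the article
    YIELD = 93_000_000_000      # hypothetical total sequencing output in bases

    print("Genomes at 30x:", samples_at_depth(YIELD, GENOME_BP, 30))
    print("Exomes at 30x:", samples_at_depth(YIELD, EXOME_BP, 30))
    print("Single-exome depth:", mean_depth(YIELD, EXOME_BP))
```

In practice the on-target fraction is below 1 because some off-target fragments survive capture, so the real gain is smaller than this idealized ratio suggests.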
Current association studies have focused on common variation across the genome, as it is the easiest to identify with current assays. However, disease-causing variants of large effect have been found to lie within exomes in candidate gene studies and, because of negative selection, occur at much lower allele frequencies and may remain untyped in current standard genotyping assays. Whole genome sequencing is a potential method to assay novel variants across the genome. However, in complex disorders (such as autism), a large number of genes are thought to be associated with disease risk. [1] [22] This heterogeneity of underlying risk means that very large sample sizes are required for gene discovery, and thus whole genome sequencing is not particularly cost-effective. The sample size issue is alleviated by novel advanced analytic methods, which can map disease genes effectively even though the mutations are rare at the variant level. [22] In addition, variants in coding regions have been much more extensively studied and their functional implications are much easier to derive, making the practical applications of variants within the targeted exome region more immediately accessible.
Exome sequencing in rare variant gene discovery remains a very active area of research, and there is growing evidence that a significant burden of risk is observed across sets of genes. For example, exome sequencing has identified rare variants in the KRT82 gene in the autoimmune disorder alopecia areata. [1]
In Mendelian disorders of large effect, findings thus far suggest one or a very small number of variants within coding genes underlie the entire condition. Because of the severity of these disorders, the few causal variants are presumed to be extremely rare or novel in the population, and would be missed by any standard genotyping assay. Exome sequencing provides high coverage variant calls across coding regions, which are needed to separate true variants from noise. A successful model of Mendelian gene discovery involves the discovery of de novo variants using trio sequencing, where parents and proband are genotyped.
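The trio design described above reduces, in its simplest form, to a set difference: a candidate de novo variant is one called in the proband but in neither parent. The sketch below assumes error-free variant calls and uses invented variant identifiers; real pipelines also check coverage and genotype quality in the parents before accepting a call.

```python
from typing import Set


def de_novo_candidates(proband: Set[str], mother: Set[str], father: Set[str]) -> Set[str]:
    """Variants seen in the proband that are absent from both parents."""
    return proband - mother - father


if __name__ == "__main__":
    # Hypothetical variant calls, written as chromosome:position:ref>alt.
    proband = {"1:1000:A>G", "2:2000:C>T", "7:7000:G>A"}
    mother = {"1:1000:A>G"}
    father = {"2:2000:C>T", "9:9000:T>C"}
    print(de_novo_candidates(proband, mother, father))
```

High coverage in all three family members matters here, because a variant missed in a parent through low depth would be wrongly reported as de novo.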
A study published in September 2009 described a proof-of-concept experiment to determine whether it was possible to identify causal genetic variants using exome sequencing. The authors sequenced four individuals with Freeman–Sheldon syndrome (FSS) (OMIM 193700), a rare autosomal dominant disorder known to be caused by a mutation in the gene MYH3. [2] Eight HapMap individuals were also sequenced so that common variants could be removed when searching for the causal gene for FSS. After exclusion of common variants, the authors were able to identify MYH3, which confirms that exome sequencing can be used to identify causal variants of rare disorders. [2] This was the first reported study to use exome sequencing as an approach to identify the causal gene for a rare Mendelian disorder.
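The filtering logic of this kind of study can be sketched as follows: discard variants also seen in unaffected controls, then keep only genes that carry a remaining variant in every affected individual. This is a simplified illustration rather than the authors' actual pipeline; the gene names, variant identifiers, and function name are hypothetical.

```python
from typing import List, Set, Tuple

Variant = Tuple[str, str]  # (gene name, variant id)


def shared_candidate_genes(affected: List[Set[Variant]], control_variants: Set[str]) -> Set[str]:
    """Genes with at least one non-control variant in every affected individual."""
    per_individual = [
        {gene for gene, variant_id in variants if variant_id not in control_variants}
        for variants in affected
    ]
    if not per_individual:
        return set()
    return set.intersection(*per_individual)


if __name__ == "__main__":
    # Hypothetical calls: each patient carries a different rare variant in GENE_X.
    affected = [
        {("GENE_X", "x1"), ("GENE_Y", "y1"), ("GENE_Z", "z1")},
        {("GENE_X", "x2"), ("GENE_Y", "y1")},
        {("GENE_X", "x1"), ("GENE_Z", "z2")},
    ]
    control_variants = {"y1", "z1"}  # variants also seen in unaffected controls
    print(shared_candidate_genes(affected, control_variants))
```

Intersecting at the gene level rather than the variant level allows for unrelated patients who carry different mutations in the same gene.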
Subsequently, another group reported successful clinical diagnosis of a suspected Bartter syndrome patient of Turkish origin. [12] Bartter syndrome is a renal salt-wasting disease. Exome sequencing revealed an unexpected well-conserved recessive mutation in a gene called SLC26A3 which is associated with congenital chloride diarrhea (CLD). This molecular diagnosis of CLD was confirmed by the referring clinician. This example provided proof of concept of the use of whole-exome sequencing as a clinical tool in evaluation of patients with undiagnosed genetic illnesses. This report is regarded as the first application of next generation sequencing technology for molecular diagnosis of a patient.
A second report described exome sequencing of individuals with a Mendelian disorder known as Miller syndrome (OMIM 263750), a rare disorder of autosomal recessive inheritance. Two siblings and two unrelated individuals with Miller syndrome were studied. The authors looked at variants with the potential to be pathogenic, such as non-synonymous mutations, splice acceptor and donor site variants, and short coding insertions or deletions. [3] Since Miller syndrome is a rare disorder, the causal variant was expected not to have been previously identified. Common single nucleotide polymorphisms (SNPs) from previous exome sequencing studies and public SNP databases were used to further exclude candidate genes. After exclusion of these genes, the authors found mutations in DHODH that were shared among individuals with Miller syndrome. Each individual with Miller syndrome was a compound heterozygote for the DHODH mutations, which were inherited: each parent of an affected individual was found to be a carrier. [3]
This was the first time exome sequencing was shown to identify a novel gene responsible for a rare Mendelian disease. The finding demonstrated that exome sequencing has the potential to locate causative genes in complex diseases, which had not previously been possible due to limitations in traditional methods. Targeted capture and massively parallel sequencing represent a cost-effective, reproducible and robust strategy with high sensitivity and specificity to detect variants causing protein-coding changes in individual human genomes.
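The compound heterozygote pattern reported for Miller syndrome can be sketched as a simple check: within one gene, the affected individual carries at least one heterozygous variant inherited only from the mother and at least one inherited only from the father. The data layout, identifiers, and function name below are hypothetical, and the sketch assumes accurate genotype calls for all three family members.

```python
from typing import Dict, Set


def compound_het_genes(
    proband_het: Dict[str, Set[str]], mother: Set[str], father: Set[str]
) -> Set[str]:
    """Genes where the proband has one maternal-only and one paternal-only variant.

    proband_het maps each gene to the proband's heterozygous variant ids in it;
    mother and father are the sets of variant ids each parent carries.
    """
    genes = set()
    for gene, variants in proband_het.items():
        maternal_only = {v for v in variants if v in mother and v not in father}
        paternal_only = {v for v in variants if v in father and v not in mother}
        if maternal_only and paternal_only:
            genes.add(gene)
    return genes


if __name__ == "__main__":
    # Hypothetical calls: GENE_D has one variant from each carrier parent.
    proband_het = {"GENE_D": {"d1", "d2"}, "GENE_E": {"e1", "e2"}}
    mother = {"d1", "e1", "e2"}
    father = {"d2"}
    print(compound_het_genes(proband_het, mother, father))
```

In this invented example only GENE_D qualifies, because both GENE_E variants come from the same parent and so could sit on a single inherited copy of the gene.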
Exome sequencing can be used to diagnose the genetic cause of disease in a patient. Identification of the underlying disease gene mutation(s) can have major implications for diagnostic and therapeutic approaches, can guide prediction of disease natural history, and makes it possible to test at-risk family members. [2] [3] [12] [23] [24] [25] There are many factors that make exome sequencing superior to single gene analysis including the ability to identify mutations in genes that were not tested due to an atypical clinical presentation [25] or the ability to identify clinical cases where mutations from different genes contribute to the different phenotypes in the same patient. [3]
Once a genetic cause of a disease has been diagnosed, this information may guide the selection of appropriate treatment. The first time this strategy was performed successfully in the clinic was in the treatment of an infant with inflammatory bowel disease. [24] [26] A number of conventional diagnostics had previously been used, but the results could not explain the infant's symptoms. Analysis of exome sequencing data identified a mutation in the XIAP gene. Knowledge of this gene's function guided the infant's treatment, leading to a bone marrow transplantation which cured the child of the disease. [24]
Researchers have used exome sequencing to identify the underlying mutation for a patient with Bartter syndrome and congenital chloride diarrhea. [12] Bilgüvar's group also used exome sequencing and identified the underlying mutation for a patient with severe brain malformations, stating "[These findings] highlight the use of whole exome sequencing to identify disease loci in settings in which traditional methods have proved challenging... Our results demonstrate that this technology will be particularly valuable for gene discovery in those conditions in which mapping has been confounded by locus heterogeneity and uncertainty about the boundaries of diagnostic classification, pointing to a bright future for its broad application to medicine". [23]
Researchers at the University of Cape Town, South Africa, used exome sequencing to identify a mutation in the CDH2 gene as the underlying cause of a genetic disorder known as arrhythmogenic right ventricular cardiomyopathy (ARVC), which increases the risk of heart disease and cardiac arrest.
Multiple companies have offered exome sequencing to consumers. Knome was the first company to offer exome sequencing services to consumers[ when? ], at a cost of several thousand dollars. [27] Later, 23andMe ran a pilot WES program that was announced in September 2011 and was discontinued in 2012. Consumers could obtain exome data at a cost of $999. The company provided raw data, and did not offer analysis. [27] [28] [29]
In November 2012, DNADTC, a division of Gene by Gene, started offering exomes at 80X coverage at an introductory price of $695. [30] According to the DNADTC website, the price is currently $895. In October 2013, BGI announced a promotion for personal whole exome sequencing at 50X coverage for $499. [31] In June 2016, Genos achieved an even lower price of $399 with a CLIA-certified 75X consumer exome sequenced from saliva. [32] [33] [34]
A 2018 review of 36 studies found the cost of exome sequencing to range from US$555 to US$5,169, with a diagnostic yield ranging from 3% to 79% depending on the patient group. [16]