Genotyping by sequencing

In the field of genetic sequencing, genotyping by sequencing (GBS) is a method for discovering single-nucleotide polymorphisms (SNPs) for use in genotyping studies, such as genome-wide association studies (GWAS). [1] GBS uses restriction enzymes to reduce genome complexity and to genotype multiple DNA samples. [2] After digestion, PCR is performed to enrich the fragment pool, and the GBS libraries are then sequenced using next-generation sequencing technologies, usually resulting in single-end reads of about 100 bp. [3] The method is relatively inexpensive and has been used in plant breeding. [2] Although GBS is similar in approach to restriction-site-associated DNA sequencing (RAD-seq), the two methods differ in some substantial ways. [4] [5] [6]
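The complexity reduction achieved by restriction digestion can be illustrated with a small in silico digest. The following Python sketch is only an illustration under stated assumptions: it assumes that ApeKI recognises the motif GCWGC (W = A or T) and cuts after the first base of the site, and the names `digest`, `sequenced_fraction`, `CUT_OFFSET` and the size window are hypothetical. It is not part of any published GBS pipeline.

```python
import re

# Assumption: ApeKI recognises GCWGC (W = A or T). The lookahead lets
# overlapping sites be found as well.
APEKI_SITE = re.compile(r"(?=GC[AT]GC)")
CUT_OFFSET = 1  # assumed cut position within the site (after the first G)


def digest(sequence, site=APEKI_SITE, cut_offset=CUT_OFFSET):
    """Return the fragments produced by cutting `sequence` at every site."""
    cuts = [m.start() + cut_offset for m in site.finditer(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def sequenced_fraction(sequence, min_len, max_len):
    """Fraction of bases lying in fragments inside a hypothetical size window."""
    kept = [f for f in digest(sequence) if min_len <= len(f) <= max_len]
    return sum(len(f) for f in kept) / len(sequence)


# Toy sequence with three recognition sites (two GCAGC, one GCTGC).
toy_genome = "AAAA" + "GCAGC" + "T" * 8 + "GCTGC" + "C" * 20 + "GCAGC" + "AA"
fragments = digest(toy_genome)
print([len(f) for f in fragments])
print(sequenced_fraction(toy_genome, min_len=10, max_len=20))
```

Only fragments inside the chosen size window contribute to the sequenced fraction, which is one simple way to picture how digestion restricts sequencing to a subset of the genome.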

Methods

GBS is a robust, simple, and affordable procedure for SNP discovery and mapping. Overall, the approach uses restriction enzymes (REs) to reduce genome complexity in high-diversity species with large genomes, enabling efficient, high-throughput, highly multiplexed sequencing. By using appropriate REs, repetitive regions of the genome can be avoided and lower-copy regions can be targeted, which reduces alignment problems in genetically highly diverse species. The method was first described by Elshire et al. (2011). [1] In summary, high-molecular-weight DNA is extracted and digested with a specific RE, chosen in advance according to how frequently it cuts [7] in the major repetitive fraction of the genome. ApeKI is the most commonly used RE. Barcoded adapters are then ligated to the sticky ends, and PCR amplification is performed. The library is then sequenced with next-generation sequencing technology, resulting in single-end reads of about 100 bp. Raw sequence data are filtered and aligned to a reference genome, usually with the Burrows–Wheeler Aligner (BWA) or Bowtie 2. The next step is to identify SNPs from the aligned tags and to score all discovered SNPs for various coverage, depth, and genotypic statistics. Once a large-scale, species-wide SNP production run has been completed, known SNPs can be called quickly in newly sequenced samples. [8]
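The scoring step described above can be sketched for a single biallelic site. The Python example below is a minimal illustration, not the method used by TASSEL or any other pipeline: it assumes that per-sample read counts for the reference and alternative alleles are already available, and it applies a simple read-fraction rule. The thresholds `MIN_DEPTH` and `HET_MIN_FRACTION` and the function names are hypothetical.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_DEPTH = 2            # fewest reads needed before a genotype is called
HET_MIN_FRACTION = 0.2   # minor-allele read share needed to call a heterozygote


def call_genotype(ref_reads, alt_reads):
    """Call a diploid genotype from allele read counts, or None if too shallow."""
    depth = ref_reads + alt_reads
    if depth < MIN_DEPTH:
        return None
    alt_fraction = alt_reads / depth
    if alt_fraction < HET_MIN_FRACTION:
        return "0/0"
    if alt_fraction > 1 - HET_MIN_FRACTION:
        return "1/1"
    return "0/1"


def site_statistics(counts):
    """Summarise one site; `counts` maps sample name -> (ref_reads, alt_reads)."""
    if not counts:
        raise ValueError("at least one sample is required")
    genotypes = {s: call_genotype(r, a) for s, (r, a) in counts.items()}
    called = [g for g in genotypes.values() if g is not None]
    depths = [r + a for r, a in counts.values()]
    stats = {
        "genotypes": genotypes,
        "mean_depth": sum(depths) / len(depths),
        "call_rate": len(called) / len(genotypes),
        "minor_allele_freq": None,
    }
    if called:
        # Each called genotype contributes two alleles; "1" marks the alternative.
        alt_freq = sum(g.count("1") for g in called) / (2 * len(called))
        stats["minor_allele_freq"] = min(alt_freq, 1 - alt_freq)
    return stats


counts = {"s1": (8, 0), "s2": (3, 4), "s3": (9, 0), "s4": (1, 0)}
stats = site_statistics(counts)
print(stats)
```

Sites would typically be kept or discarded by filtering on statistics such as these. The fixed thresholds here stand in for whatever statistical model a real pipeline applies.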

When initially developed, the GBS approach was tested and validated in recombinant inbred lines (RILs) from a high-resolution maize mapping population (IBM) and in doubled haploid (DH) barley lines from the Oregon Wolfe Barley (OWB) mapping population. Up to 96 RE (ApeKI)-digested DNA samples were pooled and processed simultaneously during GBS library construction, and the libraries were checked on a Genome Analyzer II (Illumina, Inc.). Overall, 25,185 biallelic tags were mapped in maize, while 24,186 sequence tags were mapped in barley. Barley GBS marker validation using a single DH line (OWB003) showed 99% agreement between the reference markers and the mapped GBS reads. Although barley lacked a complete genome sequence, GBS does not require a reference genome for sequence tag mapping; the reference can be developed during the process of genotyping the samples. In the absence of a reference genome, tags can also be treated as dominant markers for alternative genetic analyses. Beyond multiplexed GBS skimming, imputation of missing SNPs has the potential to further reduce GBS costs. GBS is a versatile and cost-effective procedure that allows the genomes of any species to be mined without prior knowledge of their genome structure. [1]
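The pooling of barcoded samples and the use of tags as dominant markers can be sketched together. In the Python example below, reads from a pooled library are assigned to samples by their leading barcode, and each distinct tag is then scored as present (1) or absent (0) in each sample, without any reference genome. The barcodes, sample names, reads and function names are all hypothetical, and exact string matching is a simplifying assumption: real reads contain sequencing errors, and an absent tag may reflect low coverage rather than a true polymorphism.

```python
from collections import defaultdict

# Hypothetical barcodes (must be prefix-free) mapped to hypothetical samples.
BARCODES = {"ACGT": "line_A", "TGCA": "line_B"}


def demultiplex(reads, barcodes=BARCODES):
    """Group barcode-trimmed tags by sample; unmatched reads are dropped."""
    tags_by_sample = defaultdict(set)
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                tags_by_sample[sample].add(read[len(barcode):])
                break
    return dict(tags_by_sample)


def presence_absence(tags_by_sample):
    """Score every tag as a dominant marker: 1 if seen in a sample, else 0."""
    samples = sorted(tags_by_sample)
    all_tags = sorted(set().union(*tags_by_sample.values()))
    return {
        tag: {s: int(tag in tags_by_sample[s]) for s in samples}
        for tag in all_tags
    }


def polymorphic_tags(matrix):
    """Tags present in some samples but not in all of them."""
    return [tag for tag, row in matrix.items() if 0 < sum(row.values()) < len(row)]


pooled_reads = ["ACGTCAGCAAA", "ACGTCTGCTTT", "TGCACAGCAAA", "GGGGCAGCAAA"]
matrix = presence_absence(demultiplex(pooled_reads))
print(matrix)
print(polymorphic_tags(matrix))
```

Only tags that differ in presence between samples are informative as dominant markers, which is why the last step filters for them.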

Related Research Articles

In molecular biology, restriction fragment length polymorphism (RFLP) is a technique that exploits variations in homologous DNA sequences, known as polymorphisms, to distinguish individuals, populations, or species or to pinpoint the locations of genes within a sequence. The term may refer to a polymorphism itself, as detected through the differing locations of restriction enzyme sites, or to a related laboratory technique by which such differences can be illustrated. In RFLP analysis, a DNA sample is digested into fragments by one or more restriction enzymes, and the resulting restriction fragments are then separated by gel electrophoresis according to their size.

DNA sequencer: A scientific instrument used to automate the DNA sequencing process

A DNA sequencer is a scientific instrument used to automate the DNA sequencing process. Given a sample of DNA, a DNA sequencer is used to determine the order of the four bases: G (guanine), C (cytosine), A (adenine) and T (thymine). This is then reported as a text string, called a read. Some DNA sequencers can be also considered optical instruments as they analyze light signals originating from fluorochromes attached to nucleotides.

Single-nucleotide polymorphism: Single nucleotide in genomic DNA at which different sequence alternatives exist

In genetics and bioinformatics, a single-nucleotide polymorphism is a germline substitution of a single nucleotide at a specific position in the genome that is present in a sufficiently large fraction of considered population.

Gene mapping: Process of locating specific genes

Gene mapping or genome mapping describes the methods used to identify the location of a gene on a chromosome and the distances between genes. Gene mapping can also describe the distances between different sites within a gene.

Ancestry-informative marker

In population genetics, an ancestry-informative marker (AIM) is a single-nucleotide polymorphism that exhibits substantially different frequencies between different populations. A set of many AIMs can be used to estimate the proportion of ancestry of an individual derived from each population.

Genotyping is the process of determining differences in the genetic make-up (genotype) of an individual by examining the individual's DNA sequence using biological assays and comparing it to another individual's sequence or a reference sequence. It reveals the alleles an individual has inherited from their parents. Traditionally genotyping is the use of DNA sequences to define biological populations by use of molecular tools. It does not usually involve defining the genes of an individual.

A molecular marker is a molecule, sampled from some source, that gives information about its source. For example, DNA is a molecular marker that gives information about the organism from which it was taken. For another example, some proteins can be molecular markers of Alzheimer's disease in a person from which they are taken. Molecular markers may be non-biological. Non-biological markers are often used in environmental studies.

A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. It is possible to identify genetic variation and association to phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.

SNP genotyping is the measurement of genetic variations of single nucleotide polymorphisms (SNPs) between members of a species. It is a form of genotyping, which is the measurement of more general genetic variation. SNPs are one of the most common types of genetic variation. An SNP is a single base pair mutation at a specific locus, usually consisting of two alleles. SNPs are found to be involved in the etiology of many human diseases and are becoming of particular interest in pharmacogenetics. Because SNPs are conserved during evolution, they have been proposed as markers for use in quantitative trait loci (QTL) analysis and in association studies in place of microsatellites. The use of SNPs is being extended in the HapMap project, which aims to provide the minimal set of SNPs needed to genotype the human genome. SNPs can also provide a genetic fingerprint for use in identity testing. The increase of interest in SNPs has been reflected by the furious development of a diverse range of SNP genotyping methods.

Population genomics is the large-scale comparison of DNA sequences of populations. Population genomics is a neologism that is associated with population genetics. Population genomics studies genome-wide effects to improve our understanding of microevolution so that we may learn the phylogenetic history and demography of a population.

dbSNP: Genetics database

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only, it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.

Diversity Arrays Technology (DArT) is a high-throughput genetic marker technique that can detect allelic variations and provide comprehensive genome coverage without any DNA sequence information, for genotyping and other genetic analyses. The general steps involve reducing the complexity of the genomic DNA with specific restriction enzymes, choosing diverse fragments to serve as representations of the parent genomes, amplifying them via polymerase chain reaction (PCR), and inserting the fragments into a vector so that they can be placed as probes within a microarray; fluorescent targets from a reference sequence are then allowed to hybridize with the probes and are put through an imaging system. The objective is to identify and quantify various forms of DNA polymorphism within the genomic DNA of sampled species.

In genetics, association mapping, also known as "linkage disequilibrium mapping", is a method of mapping quantitative trait loci (QTLs) that takes advantage of historic linkage disequilibrium to link phenotypes to genotypes, uncovering genetic associations.

Nested association mapping (NAM) is a technique designed by the labs of Edward Buckler, James Holland, and Michael McMullen for identifying and dissecting the genetic architecture of complex traits in corn. Nested association mapping is a specific technique that cannot be performed outside of a specifically designed population such as the Maize NAM population.

Molecular Inversion Probe (MIP) belongs to the class of Capture by Circularization molecular techniques for performing genomic partitioning, a process through which one captures and enriches specific regions of the genome. Probes used in this technique are single stranded DNA molecules and, similar to other genomic partitioning techniques, contain sequences that are complementary to the target in the genome; these probes hybridize to and capture the genomic target. MIP stands unique from other genomic partitioning strategies in that MIP probes share the common design of two genomic target complementary segments separated by a linker region. With this design, when the probe hybridizes to the target, it undergoes an inversion in configuration and circularizes. Specifically, the two target complementary regions at the 5’ and 3’ ends of the probe become adjacent to one another while the internal linker region forms a free hanging loop. The technology has been used extensively in the HapMap project for large-scale SNP genotyping as well as for studying gene copy alterations and characteristics of specific genomic loci to identify biomarkers for different diseases such as cancer. Key strengths of the MIP technology include its high specificity to the target and its scalability for high-throughput, multiplexed analyses where tens of thousands of genomic loci are assayed simultaneously.

Exome sequencing: Sequencing of all the exons of a genome

Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps: the first step is to select only the subset of DNA that encodes proteins. These regions are known as exons—humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology.

Restriction site associated DNA markers: Type of genetic marker

Restriction site associated DNA (RAD) markers are a type of genetic marker which are useful for association mapping, QTL-mapping, population genetics, ecological genetics and evolutionary genetics. The use of RAD markers for genetic mapping is often called RAD mapping. An important aspect of RAD markers and mapping is the process of isolating RAD tags, which are the DNA sequences that immediately flank each instance of a particular restriction site of a restriction enzyme throughout the genome. Once RAD tags have been isolated, they can be used to identify and genotype DNA sequence polymorphisms mainly in form of single nucleotide polymorphisms (SNPs). Polymorphisms that are identified and genotyped by isolating and analyzing RAD tags are referred to as RAD markers. Although genotyping by sequencing presents an approach similar to the RAD-seq method, they differ in some substantial ways.

Disease gene identification is a process by which scientists identify the mutant genotypes responsible for an inherited genetic disorder. Mutations in these genes can include single nucleotide substitutions, single nucleotide additions/deletions, deletion of the entire gene, and other genetic abnormalities.

Imputation in genetics refers to the statistical inference of unobserved genotypes. It is achieved by using known haplotypes in a population, for instance from the HapMap or the 1000 Genomes Project in humans. This makes it possible to test for association between a trait of interest and genetic variants that were not typed experimentally but whose genotypes have been statistically inferred ("imputed"). Genotype imputation is usually performed on SNPs, the most common kind of genetic variation.

Edward Buckler: Plant geneticist

Edward S. Buckler is a plant geneticist with the USDA Agricultural Research Service and holds an adjunct appointment at Cornell University. His work focuses on both quantitative and statistical genetics in maize as well as other crops such as cassava. He originated the concept of Nested association mapping and created the first population designed for this type of quantitative genetic analysis. Buckler was elected an American Association for the Advancement of Science Fellow in 2012. In 2014, he was elected to the National Academy of Sciences. In 2017, he received the NAS prize in Food and Agricultural Science for his work using natural genetic diversity to develop varieties of maize with fifteen times more vitamin A than existing varieties.

References

  1. Elshire, Robert J.; Glaubitz, Jeffrey C.; Sun, Qi; Poland, Jesse A.; Kawamoto, Ken; Buckler, Edward S.; Mitchell, Sharon E. (2011-05-04). "A Robust, Simple Genotyping-by-Sequencing (GBS) Approach for High Diversity Species". PLOS ONE. 6 (5): e19379. Bibcode:2011PLoSO...619379E. doi:10.1371/journal.pone.0019379. ISSN 1932-6203. PMC 3087801. PMID 21573248.
  2. He, Jiangfeng; Zhao, Xiaoqing; Laroche, André; Lu, Zhen-Xiang; Liu, HongKui; Li, Ziqin (2014-01-01). "Genotyping-by-sequencing (GBS), an ultimate marker-assisted selection (MAS) tool to accelerate plant breeding". Frontiers in Plant Science. 5: 484. doi:10.3389/fpls.2014.00484. PMC 4179701. PMID 25324846.
  3. Liu, Hui; Bayer, Micha; Druka, Arnis; Russell, Joanne R.; Hackett, Christine A.; Poland, Jesse; Ramsay, Luke; Hedley, Pete E.; Waugh, Robbie (2014-01-01). "An evaluation of genotyping by sequencing (GBS) to map the Breviaristatum-e (ari-e) locus in cultivated barley". BMC Genomics. 15: 104. doi:10.1186/1471-2164-15-104. ISSN 1471-2164. PMC 3922333. PMID 24498911.
  4. Davey, John W.; Hohenlohe, Paul A.; Etter, Paul D.; Boone, Jason Q.; Catchen, Julian M.; Blaxter, Mark L. (2011-07-01). "Genome-wide genetic marker discovery and genotyping using next-generation sequencing". Nature Reviews Genetics. 12 (7): 499–510. doi:10.1038/nrg3012. ISSN 1471-0056. PMID 21681211. S2CID 15080731.
  5. Campbell, Erin O.; Brunet, Byran M.T.; Dupuis, Julian R.; Sperling, Felix A.H. (2018). "Would an RRS by any other name sound as RAD?". Methods in Ecology and Evolution. 9 (9): 1920–1927. doi:10.1111/2041-210X.13038.
  6. Vaux, Felix; Dutoit, Ludovic; Fraser, Ceridwen I.; Waters, Jonathan M. (2022). "Genotyping-by-sequencing for biogeography". Journal of Biogeography. 50 (2): 262–281. doi:10.1111/jbi.14516.
  7. Heffelfinger, Christopher; Fragoso, Christopher A.; Moreno, Maria A.; Overton, John D.; Mottinger, John P.; Zhao, Hongyu; Tohme, Joe; Dellaporta, Stephen L. (2014). "Flexible and scalable genotyping-by-sequencing strategies for population studies". BMC Genomics. 15 (1): 979. doi:10.1186/1471-2164-15-979. PMC 4253001. PMID 25406744.
  8. "Tassel 5 GBS v2 Pipeline". Tassel 5 Source. Retrieved 20 May 2016.