Tag SNP

A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. It is possible to identify genetic variation and association to phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.

Introduction

Linkage Disequilibrium

Within a family, linkage occurs when two genetic markers (points on a chromosome) remain linked on a chromosome rather than being broken apart by recombination events during meiosis, shown as red lines. In a population, contiguous stretches of founder chromosomes from the initial generation are sequentially reduced in size by recombination events. Over time, a pair of markers or points on a chromosome in the population move from linkage disequilibrium to linkage equilibrium, as recombination events eventually occur between every possible point on the chromosome.

Two loci are said to be in linkage equilibrium (LE) if their alleles are inherited independently. If the alleles at those loci are non-randomly associated, the loci are said to be in linkage disequilibrium (LD). LD is most commonly caused by physical linkage of genes: when two loci lie on the same chromosome, they can be in high LD, depending on the distance between them and the likelihood of recombination between them. However, LD can also arise from functional interactions, where even genes on different chromosomes jointly confer an evolutionarily selected phenotype or affect the viability of potential offspring.
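As a rough illustration of how the non-random association between two biallelic loci is commonly quantified, the sketch below computes the disequilibrium coefficient D (the frequency of a two-locus haplotype minus the product of its allele frequencies) and the squared correlation r2 from phased haplotypes. The function name `ld_stats`, the 0/1 allele coding and the toy data are assumptions made for illustration only.

```python
def ld_stats(haplotypes, i, j):
    """Return (D, r2) for biallelic sites i and j; alleles are coded 0/1."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n   # frequency of allele 1 at site i
    p_b = sum(h[j] for h in haplotypes) / n   # frequency of allele 1 at site j
    # frequency of haplotypes carrying allele 1 at both sites
    p_ab = sum(h[i] == 1 and h[j] == 1 for h in haplotypes) / n
    d = p_ab - p_a * p_b                      # zero under linkage equilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0  # undefined for monomorphic sites; 0 by convention here
    return d, r2


# Hypothetical phased haplotypes: sites 0 and 1 always travel together,
# site 2 is independent of both.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
d_01, r2_01 = ld_stats(toy_haplotypes, 0, 1)
d_02, r2_02 = ld_stats(toy_haplotypes, 0, 2)
```

In this toy data, sites 0 and 1 are in complete LD while sites 0 and 2 are in linkage equilibrium.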

In families, LD is highest because few recombination events have occurred (few meioses separate relatives). This is especially true between inbred lines. In populations, LD exists because of selection, because of physical closeness of genes that results in low recombination rates between them, or because of recent crossing or migration. On a population level, processes that influence linkage disequilibrium include genetic linkage, epistatic natural selection, rate of recombination, mutation, genetic drift, the mating system, genetic hitchhiking and gene flow. [2]

When a group of SNPs is inherited together because of high LD, the SNPs carry largely redundant information. Selecting a tag SNP as a representative of such a group reduces the redundancy when analyzing parts of the genome associated with traits or diseases. [3] The regions of the genome in high LD that harbor a specific set of SNPs that are inherited together are also known as haplotypes. Therefore, tag SNPs are representative of all SNPs within a haplotype.

Haplotypes

The selection of tag SNPs depends on the haplotypes present in the genome. Most sequencing technologies provide genotypes rather than haplotypes, i.e. they report which bases are present but give no phase information (which specific chromosome each base lies on). [4] Haplotypes can be determined through molecular methods (allele-specific PCR, somatic cell hybrids). These methods distinguish which allele is present on which chromosome by separating the chromosomes before genotyping. They can be very time-consuming and expensive, so statistical inference methods have been developed as a less expensive, automated option. These statistical-inference software packages use parsimony, maximum likelihood, and Bayesian algorithms to determine haplotypes. A disadvantage of statistical inference is that a proportion of the inferred haplotypes could be wrong. [5]
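The missing phase information can be made concrete with a small sketch. Assuming genotypes are coded as the number of copies (0, 1 or 2) of one allele at each SNP, the hypothetical function below enumerates every pair of haplotypes that is consistent with a single unphased genotype; statistical phasing methods have to choose among these candidates.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple of 0, 1 or 2 = copies of the coded allele at each SNP.
    """
    het_sites = [k for k, g in enumerate(genotype) if g == 1]
    pairs = set()
    for phase in product((0, 1), repeat=len(het_sites)):
        # Homozygous sites are unambiguous: 0 -> allele 0, 2 -> allele 1.
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        # Heterozygous sites: one chromosome gets the allele, the other its complement.
        for site, allele in zip(het_sites, phase):
            h1[site] = allele
            h2[site] = 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


# Two heterozygous sites and one homozygous site: two possible phasings.
ambiguous = compatible_haplotype_pairs((1, 1, 2))
# A fully homozygous genotype has exactly one resolution.
unambiguous = compatible_haplotype_pairs((0, 2, 0))
```

With k heterozygous sites (k at least 1) there are 2^(k-1) candidate pairs, which is why the ambiguity grows quickly with the number of SNPs.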

Population differences

When haplotypes are used for genome-wide association studies, it is important to note the population being studied, because different populations often have different patterns of LD. One example is the difference between African-descended populations and European- and Asian-descended populations. Since humans originated in Africa and spread into Europe and then the Asian and American continents, African populations are the most genetically diverse and have smaller regions of LD, while European- and Asian-descended populations have larger regions of LD due to the founder effect. When LD patterns differ between populations, SNPs can become dissociated from each other because the haplotype blocks change. This means that tag SNPs, as representatives of the haplotype blocks, are specific to populations, and population differences should be taken into account when performing association studies. [6]

Application

LD plot of SNPs with top-ranked bayes factors in CHB of 1000 Genome Phase I. The colors indicate the strength of pairwise LD according to r2 metrics. The SNPs marked with asterisks represent independent strong associations. Tag SNPs are shadowed in pink.

GWAS

Almost every trait has both genetic and environmental influences. Heritability is the proportion of phenotypic variance that is attributable to genetic variation. Association studies are used to determine the genetic influence on phenotypic presentation. Although mostly used for mapping diseases to genomic areas, they can also be used to map the heritability of any phenotype, such as height or eye color.

Genome-wide association studies (GWAS) use single-nucleotide polymorphisms (SNPs) to identify genetic associations with clinical conditions and phenotypic traits. [8] They are hypothesis-free and use a whole-genome approach to investigate traits by comparing a large group of individuals who express a phenotype with a large group of individuals who do not. The ultimate goal of GWAS is to determine genetic risk factors that can be used to predict who is at risk for a disease, to understand the biological underpinnings of disease susceptibility, and to create new prevention and treatment strategies. [1] The National Human Genome Research Institute and the European Bioinformatics Institute publish the GWAS Catalog, a catalog of published genome-wide association studies that highlights statistically significant associations between hundreds of SNPs and a broad range of phenotypes. [9]

Two Affymetrix chips

Due to the large number of possible SNP variants (more than 149 million as of June 2015 [10] [11] ), it is still very expensive to sequence all SNPs. That is why GWAS use customizable arrays (SNP chips) to genotype only a subset of the variants, those identified as tag SNPs. Most GWAS use products from the two primary genotyping platforms. The Affymetrix platform prints DNA probes on a glass or silicon chip; the probes hybridize to specific alleles in the sample DNA. The Illumina platform uses bead-based technology with longer DNA sequences, which gives better specificity. [1] Both platforms are able to genotype more than a million tag SNPs using either pre-made or custom DNA oligos.

Genome-wide studies are predicated on the common disease-common variant (CD/CV) hypothesis, which states that common disorders are influenced by common genetic variation. Under this hypothesis, the effect sizes (penetrance) of common variants must be smaller than those of variants found in rare disorders. That means that a common SNP can explain only a small portion of the variance due to genetic factors, and that common diseases are influenced by multiple common alleles of small effect size. Another hypothesis is that common diseases are caused by rare variants that are synthetically linked to common variants. In that case the signal produced by GWAS is an indirect (synthetic) association between a common variant and one or more rare causal variants in linkage disequilibrium with it. It is important to recognize that this phenomenon is possible when selecting a group of tag SNPs. When a disease is found to be associated with a haplotype, some SNPs in that haplotype will have only a synthetic association with the disease. Pinpointing the causal SNPs requires greater resolution in the selection of haplotype blocks. Since whole-genome sequencing technologies are rapidly changing and becoming less expensive, it is likely that they will replace current genotyping technologies and provide the resolution needed to pinpoint causal variants.

HapMap

Because whole-genome sequencing of individuals is still cost-prohibitive, the International HapMap Project was set up with the goal of mapping the human genome to haplotype groupings (haplotype blocks) that describe common patterns of human genetic variation. By mapping the entire genome to haplotypes, tag SNPs can be identified to represent the haplotype blocks examined by genetic studies. An important factor to consider when planning a genetic study is the frequency of specific alleles and the risk they incur. These factors can vary between populations, so the HapMap Project used a variety of sequencing techniques to discover and catalog SNPs from different sets of populations. Initially the project sequenced individuals from the Yoruba population of African origin (YRI), residents of Utah with western European ancestry (CEU), unrelated individuals from Tokyo, Japan (JPT) and unrelated Han Chinese individuals from Beijing, China (CHB). Its datasets were later expanded to include other populations (11 groups). [1]

Selection and evaluation

Steps for tag SNP selection

Selection of maximally informative tag SNPs is an NP-complete problem. However, algorithms can be devised to provide an approximate solution within a margin of error. [12] Each tag SNP selection algorithm is defined by the following steps:

  1. Define the area to search - the algorithm will attempt to locate tag SNPs in the neighborhood N(t) of a target SNP t
  2. Define a metric to assess the quality of tagging - the metric needs to measure how well a target SNP t can be predicted from a set of its neighbors N(t), i.e. how well a tag SNP, as a representative of the SNPs in a neighborhood N(t), can predict the target SNP t. It can be defined as the probability that the target SNP t has different values for a pair of haplotypes i and j, given that the value of the SNP s also differs between the same haplotypes. The informativeness of the metric can be represented in terms of graph theory, where every SNP s is represented as a graph Gs whose nodes are haplotypes. Gs has an edge between the nodes (i,j) if and only if the values of s differ between the haplotypes Hi and Hj. [12]
  3. Derive the algorithm to find representative SNPs - the goal of the algorithm is to find a minimal subset of tag SNPs that is maximally informative about every other target SNP
  4. Validate the algorithm
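The graph view in step 2 can be sketched in a few lines. The code below builds the edge set of Gs for a SNP s and scores how informative s is about a target t as the fraction of edges of Gt that also appear in Gs. This is one possible reading of the metric described above, not a reference implementation; the function names and toy haplotypes are hypothetical.

```python
from itertools import combinations


def edges(haplotypes, s):
    """Edge set of G_s: haplotype pairs (i, j) whose alleles differ at SNP s."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if haplotypes[i][s] != haplotypes[j][s]}


def informativeness(haplotypes, s, t):
    """Fraction of haplotype pairs separated by t that are also separated by s."""
    e_t = edges(haplotypes, t)
    if not e_t:
        return 1.0  # t is monomorphic, so there is nothing to predict
    return len(edges(haplotypes, s) & e_t) / len(e_t)


# Hypothetical data: SNPs 0 and 1 are perfectly correlated, SNP 2 is independent.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
score_correlated = informativeness(toy_haplotypes, 0, 1)
score_independent = informativeness(toy_haplotypes, 2, 0)
```

Under this definition a perfectly correlated SNP scores 1.0, while in this balanced toy example an independent SNP scores 0.5, so the measure is not zero-based.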

Feature selection

Methods for selecting features fall into two categories: filter methods and wrapper methods. Filter algorithms are general preprocessing algorithms that do not assume the use of a specific classification method. Wrapper algorithms, in contrast, “wrap” the feature selection around a specific classifier and select a subset of features based on the classifier's accuracy using cross-validation. [13]

The feature selection method suitable for selecting tag SNPs must have the following characteristics:

Selection algorithms

Several algorithms have been proposed for selecting tag SNPs. The first approach was based on a measure of the goodness of SNP sets and searched for SNP subsets that are small but attain a high value of the defined measure. Examining every SNP subset to find good ones is computationally feasible only for small data sets.
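A minimal sketch of such an exhaustive search follows. It assumes, purely for illustration, that a subset is "good" when it still tells all distinct haplotypes apart; the original approach may use a different goodness measure. Subsets are tried in order of increasing size, so the number of candidates grows exponentially with the number of SNPs.

```python
from itertools import combinations


def smallest_distinguishing_subset(haplotypes):
    """Brute-force search for a smallest SNP subset that separates all distinct haplotypes."""
    distinct = set(haplotypes)
    n_snps = len(haplotypes[0])
    for size in range(n_snps + 1):
        for subset in combinations(range(n_snps), size):
            # Project every haplotype onto the candidate subset of SNPs.
            projected = {tuple(h[s] for s in subset) for h in distinct}
            if len(projected) == len(distinct):
                return subset
    return tuple(range(n_snps))  # not reached: the full set always separates distinct haplotypes


# Hypothetical haplotypes over three SNPs; SNP 1 duplicates SNP 0.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
best_subset = smallest_distinguishing_subset(toy_haplotypes)
```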

Another approach uses principal component analysis (PCA) to find subsets of SNPs that capture the majority of the data variance. A sliding-window method is employed to apply PCA repeatedly to short chromosomal regions. This reduces the data produced and does not require exponential search time. However, applying the PCA method to large chromosomal data sets is not feasible because of its computational complexity. [13]

The most commonly used approach, the block-based method, exploits the linkage disequilibrium observed within haplotype blocks. [12] Several algorithms have been devised to partition chromosomal regions into haplotype blocks, based on haplotype diversity, LD, the four-gamete test or information complexity, and tag SNPs are then selected from the SNPs that belong to each block. The main presumption of this approach is that the SNPs are biallelic. [14] The main drawback is that the definition of blocks is not always straightforward: even though there are several criteria for forming haplotype blocks, there is no consensus on which to use. In addition, selecting tag SNPs on the basis of local correlations ignores inter-block correlations. [12]
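Of the block criteria listed above, the four-gamete test is the simplest to illustrate: for two biallelic SNPs, observing all four allele combinations (00, 01, 10, 11) among the haplotypes suggests a historical recombination between them. The sketch below uses a simple greedy left-to-right partition, which is an assumption for illustration and not a specific published block-finding algorithm.

```python
def passes_four_gamete_test(haplotypes, i, j):
    """True if fewer than four distinct gametes are observed at sites i and j."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def four_gamete_blocks(haplotypes):
    """Greedily split consecutive SNPs into blocks in which no pair shows all four gametes.

    Assumes at least one SNP per haplotype.
    """
    n_snps = len(haplotypes[0])
    blocks, current = [], [0]
    for s in range(1, n_snps):
        if all(passes_four_gamete_test(haplotypes, p, s) for p in current):
            current.append(s)        # s is compatible with every SNP in the open block
        else:
            blocks.append(current)   # evidence of recombination: close the block
            current = [s]
    blocks.append(current)
    return blocks


# Hypothetical haplotypes: SNPs 0 and 1 form a block, SNP 2 recombines with both.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
blocks = four_gamete_blocks(toy_haplotypes)
```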

Unlike the block-based approach, a block-free approach does not rely on the block structure. SNP frequencies and recombination rates are known to vary across the genome, and some studies have reported LD distances much longer than the reported maximum block sizes. Setting a strict border for the neighborhood is therefore not desirable, and the block-free approach looks for tag SNPs globally. Several algorithms do this. In one algorithm, the non-tagging SNPs are represented as Boolean functions of tag SNPs, and set theory techniques are used to reduce the search space. Another algorithm searches for subsets of markers that can come from non-consecutive blocks. Due to the marker neighborhood, the search space is reduced. [13]

Optimizations

With the number of individuals genotyped and number of SNPs in databases growing, tag SNP selection takes too much time to compute. In order to improve the efficiency of the tag SNP selection method, the algorithm first ignores the SNPs being biallelic, and then compresses the length (SNP number) of the haplotype matrix by grouping the SNP sites with the same information. The SNP sites that partition the haplotypes into the same group are called redundant sites. The SNP sites which contain distinct information within a block are called non-redundant sites (NRS). In order to further compress the haplotype matrix, the algorithm needs to find the tag SNPs such that all haplotypes of the matrix can be distinguished. By using the idea of joint partition, an efficient tag SNPs selection algorithm is provided. [14]
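The compression step can be sketched as follows: two SNP sites are treated as redundant when they split the haplotypes into the same groups, regardless of which allele label each group carries. The helper names are hypothetical, and this is a simplified illustration rather than the algorithm of the cited paper.

```python
def partition_signature(column):
    """Relabel alleles by order of first appearance so equal partitions get equal signatures."""
    labels = {}
    return tuple(labels.setdefault(allele, len(labels)) for allele in column)


def group_redundant_sites(haplotypes):
    """Group SNP sites that induce the same partition of the haplotypes."""
    groups = {}
    for s in range(len(haplotypes[0])):
        column = [h[s] for h in haplotypes]
        groups.setdefault(partition_signature(column), []).append(s)
    return list(groups.values())


# Hypothetical matrix: sites 0 and 1 carry complementary alleles but the same
# partition, so one of them is redundant; site 2 is a non-redundant site.
toy_haplotypes = [(0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1)]
site_groups = group_redundant_sites(toy_haplotypes)
```

Keeping one representative per group shortens the haplotype matrix without losing any ability to distinguish haplotypes.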

Validation of the accuracy of the algorithm

Depending on how the tag SNPs are selected, different prediction methods have been used during the cross-validation process. In one approach, a machine learning method was employed to predict the left-out haplotype. Another approach predicted the alleles of a non-tagging SNP n from the tag SNPs that had the highest correlation coefficient with n. If a single highly correlated tag SNP t is found, the alleles are assigned so that their frequencies agree with the allele frequencies of t. When multiple tagging SNPs have the same (high) correlation coefficient with n, the common allele of n has the advantage. In this case the prediction method agrees well with the selection method, which uses PCA on the matrix of correlation coefficients between SNPs. [13]

There are other ways to assess the accuracy of a tag SNP selection method. The accuracy can be evaluated by the quality measure R², which measures the association between the true number of haplotype copies, defined over the full set of SNPs, and the predicted number of haplotype copies, where the prediction is based on the subset of tagging SNPs. This measure assumes diploid data and explicit inference of haplotypes from genotypes. [13]

Another assessment method, due to Clayton, is based on a measure of the diversity of haplotypes. The diversity is defined as the total number of differences over all pairwise comparisons between haplotypes. The difference between a pair of haplotypes is the sum of the differences over all the SNPs. Clayton's diversity measure can be used to define how well a set of tag SNPs differentiates between haplotypes. This measure is suitable only for haplotype blocks with limited haplotype diversity, and it is not clear how to use it for large data sets consisting of multiple haplotype blocks. [13]
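The diversity definition translates directly into code. In the sketch below, `diversity` follows the definition above; `residual_diversity`, which sums the diversity left among haplotypes that a tag set cannot tell apart, is an assumed way of turning the measure into a score for a tag set and is labeled as such.

```python
from itertools import combinations


def diversity(haplotypes):
    """Total number of allele differences over all pairs of haplotypes."""
    return sum(sum(a != b for a, b in zip(h1, h2))
               for h1, h2 in combinations(haplotypes, 2))


def residual_diversity(haplotypes, tags):
    """Diversity remaining within groups of haplotypes that share the same tag alleles.

    Assumption for illustration: a good tag set leaves little residual diversity.
    """
    groups = {}
    for h in haplotypes:
        groups.setdefault(tuple(h[s] for s in tags), []).append(h)
    return sum(diversity(g) for g in groups.values())


# Hypothetical haplotypes over three SNPs.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
total = diversity(toy_haplotypes)
left_with_one_tag = residual_diversity(toy_haplotypes, (0,))
left_with_two_tags = residual_diversity(toy_haplotypes, (0, 2))
```

Here the single tag SNP 0 leaves 2 of the 12 pairwise differences unexplained, while tagging SNPs 0 and 2 leaves none.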

Some recent works evaluate tag SNP selection algorithms based on how well the tagging SNPs can be used to predict non-tagging SNPs. The prediction accuracy is determined using cross-validation, such as leave-one-out or hold-out. In leave-one-out cross-validation, for each sequence in the data set, the algorithm is run on the rest of the data set to select a minimum set of tagging SNPs. [13]
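A leave-one-out loop of this kind might look like the sketch below. The selection algorithm is passed in as a function, and the predictor (copy the non-tag alleles of the training haplotype that best matches the held-out tag alleles) is a deliberately simple stand-in chosen for illustration, not the method of any cited work.

```python
def predict_from_nearest(train, tags, observed):
    """Return the training haplotype that agrees with `observed` at the most tag SNPs."""
    return max(train, key=lambda h: sum(h[s] == observed[s] for s in tags))


def leave_one_out_accuracy(haplotypes, select_tags):
    """Mean fraction of non-tag alleles predicted correctly when each haplotype is held out."""
    scores = []
    for k, held_out in enumerate(haplotypes):
        train = haplotypes[:k] + haplotypes[k + 1:]
        tags = select_tags(train)            # tags are chosen without seeing the held-out haplotype
        non_tags = [s for s in range(len(held_out)) if s not in tags]
        if not non_tags:
            continue                         # nothing left to predict
        guess = predict_from_nearest(train, tags, held_out)
        scores.append(sum(guess[s] == held_out[s] for s in non_tags) / len(non_tags))
    return sum(scores) / len(scores) if scores else 1.0


# Hypothetical data in which SNP 0 determines SNPs 1 and 2.
toy_haplotypes = [(0, 0, 1), (0, 0, 1), (1, 1, 0), (1, 1, 0)]
accuracy = leave_one_out_accuracy(toy_haplotypes, lambda train: [0])
```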

Tools

Tagger

Tagger is a web tool for evaluating and selecting tag SNPs from genotype data such as that of the International HapMap Project. It uses pairwise methods and multimarker haplotype approaches. Users can upload genotype data in HapMap or pedigree format, and the linkage disequilibrium patterns will be calculated. Tagger options allow the user to specify chromosomal landmarks, which indicate regions of interest in the genome for picking tag SNPs. The program then produces a list of tag SNPs and their statistical test values, as well as a coverage report. It was developed by Paul de Bakker in the labs of David Altshuler and Mark Daly at the Center for Human Genetic Research of Massachusetts General Hospital and Harvard Medical School, at the Broad Institute. [15]

CLUSTAG and WCLUSTAG

The freeware programs CLUSTAG and WCLUSTAG use clustering and set-cover algorithms to obtain a set of tag SNPs that can represent all the known SNPs in a chromosomal region. The programs are implemented in Java and run on Windows as well as in Unix environments. They were developed by Sio-Iong Ao et al. at The University of Hong Kong. [16] [17]
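The set-cover idea can be illustrated with a generic greedy heuristic: repeatedly pick the SNP that represents the most not-yet-covered SNPs, where one SNP "represents" another if their pairwise r2 reaches a threshold. This is not CLUSTAG's or WCLUSTAG's actual implementation; the function names, the 0/1 allele coding and the threshold of 0.8 are assumptions for illustration.

```python
def r2(haplotypes, i, j):
    """Squared correlation between biallelic sites i and j (alleles coded 0/1)."""
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n
    p_b = sum(h[j] for h in haplotypes) / n
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return (p_ab - p_a * p_b) ** 2 / denom if denom > 0 else 0.0


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Greedy set cover: choose tag SNPs until every SNP is represented by some tag."""
    n_snps = len(haplotypes[0])
    # covers[s] = SNPs that s could stand in for (always includes s itself)
    covers = {s: {t for t in range(n_snps)
                  if t == s or r2(haplotypes, s, t) >= threshold}
              for s in range(n_snps)}
    uncovered, tags = set(range(n_snps)), []
    while uncovered:
        # Pick the SNP covering the most uncovered SNPs; ties go to the lowest index.
        best = max(range(n_snps), key=lambda s: (len(covers[s] & uncovered), -s))
        tags.append(best)
        uncovered -= covers[best]
    return tags


# Hypothetical haplotypes: SNP 0 can represent SNP 1, SNP 2 needs its own tag.
toy_haplotypes = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
tags = greedy_tag_snps(toy_haplotypes)
```

The greedy heuristic gives an approximate rather than a guaranteed minimum set of tags.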

References

  1. Bush, William S.; Moore, Jason H.; Lewitter, Fran; Kann, Maricel (27 December 2012). "Chapter 11: Genome-Wide Association Studies". PLOS Computational Biology. 8 (12): e1002822. Bibcode:2012PLSCB...8E2822B. doi:10.1371/journal.pcbi.1002822. PMC 3531285. PMID 23300413.
  2. van der Werf, Julius. "Basics of Linkage and Gene Mapping" (PDF). Retrieved 30 April 2014.
  3. Lewontin, R.C. (1988). "On measures of gametic disequilibrium". Genetics. 120 (3): 849–852. doi:10.1093/genetics/120.3.849. PMC 1203562. PMID 3224810.
  4. Halperin, E.; Kimmel, G.; Shamir, R. (16 June 2005). "Tag SNP selection in genotype data for maximizing SNP prediction accuracy". Bioinformatics. 21 (Suppl 1): i195–i203. doi:10.1093/bioinformatics/bti1021. PMID 15961458.
  5. Crawford, Dana C.; Nickerson, Deborah A. (2005). "Definition and Clinical Importance of Haplotypes". Annual Review of Medicine. 56 (1): 303–320. doi:10.1146/annurev.med.56.082103.104540. PMID 15660514.
  6. Teo, YY; Sim, X (Apr 2010). "Patterns of linkage disequilibrium in different populations: implications and opportunities for lipid-associated loci identified from genome-wide association studies". Current Opinion in Lipidology. 21 (2): 104–15. doi:10.1097/MOL.0b013e3283369e5b. PMID 20125009. S2CID 21217250.
  7. Shou, Weihua; Wang, Dazhi; Zhang, Kaiyue; Wang, Beilan; Wang, Zhimin; Shi, Jinxiu; Huang, Wei; Huang, Qingyang (26 September 2012). "Gene-Wide Characterization of Common Quantitative Trait Loci for ABCB1 mRNA Expression in Normal Liver Tissues in the Chinese Population". PLOS ONE. 7 (9): e46295. Bibcode:2012PLoSO...746295S. doi:10.1371/journal.pone.0046295. PMC 3458811. PMID 23050008.
  8. Welter, D.; MacArthur, J.; Morales, J.; Burdett, T.; Hall, P.; Junkins, H.; Klemm, A.; Flicek, P.; Manolio, T.; Hindorff, L.; Parkinson, H. (6 December 2013). "The NHGRI GWAS Catalog, a curated resource of SNP-trait associations". Nucleic Acids Research. 42 (D1): D1001–D1006. doi:10.1093/nar/gkt1229. PMC 3965119. PMID 24316577.
  9. Witte, John S.; Hoffmann, Thomas J. (2011). "Polygenic Modeling of Genome-Wide Association Studies: An Application to Prostate and Breast Cancer". OMICS: A Journal of Integrative Biology. 15 (6): 393–398. doi:10.1089/omi.2010.0090. PMC 3125548. PMID 21348634.
  10. dbSNP Data Statistics. National Center for Biotechnology Information (US). 2005.
  11. "dbSNP Summary".
  12. Tarvo, Alex. "Tutorial on haplotype tagging" (PDF). Retrieved 1 May 2014.
  13. Phuong, TM; Lin, Z; Altman, RB (Apr 2006). "Choosing SNPs using feature selection". Journal of Bioinformatics and Computational Biology. 4 (2): 241–57. CiteSeerX 10.1.1.128.1909. doi:10.1109/csb.2005.22. PMID 16819782. S2CID 821959.
  14. Chen, WP; Hung, CL; Tsai, SJ; Lin, YL (2014). "Novel and efficient tag SNPs selection algorithms". Bio-Medical Materials and Engineering. 24 (1): 1383–9. doi:10.3233/BME-130942. PMID 24212035.
  15. "Tagger". Retrieved 1 May 2014.
  16. "CLUSTAG". Retrieved 9 March 2024.
  17. "WCLUSTAG". Retrieved 9 March 2024.