# Haplotype

A haplotype (haploid genotype) is a group of alleles in an organism that are inherited together from a single parent. [1] [2] The term also has several other uses. First, it can mean a collection of specific alleles (that is, specific DNA sequences) in a cluster of tightly linked genes on a chromosome that are likely to be inherited together, that is, likely to be conserved as a sequence through many generations of reproduction. [3] [4] Second, it can mean a set of linked single-nucleotide polymorphism (SNP) alleles that tend to occur together (that is, that are statistically associated). Identifying these statistical associations and a few alleles of a specific haplotype sequence is thought to facilitate identifying all other polymorphic sites nearby on the chromosome. Such information is critical for investigating the genetics of common diseases; in humans, these associations have been investigated by the International HapMap Project. [5] [6] Third, many human genetic testing companies use the term to refer to an individual collection of specific mutations within a given genetic segment (see short tandem repeat mutation).

The genotype is the part of the genetic makeup of a cell, and therefore of any individual, which determines one of its characteristics (phenotype). The term was coined by the Danish botanist, plant physiologist and geneticist Wilhelm Johannsen in 1903.

An allele is a variant form of a given gene. Sometimes, the presence of different alleles of the same gene can result in different observable phenotypic traits, such as different pigmentation. A notable example of this trait of color variation is Gregor Mendel's discovery that the white and purple flower colors in pea plants were the result of "pure line" traits which could be used as a control for future experiments. However, most genetic variations result in little or no observable variation.

In biology, an organism is any individual entity that exhibits the properties of life. It is a synonym for "life form".

The term 'haplogroup' refers to the SNP/unique-event polymorphism (UEP) mutations that represent the clade to which a collection of particular human haplotypes belong. (Clade here refers to a set of haplotypes sharing a common ancestor.) [7] A haplogroup is a group of similar haplotypes that share a common ancestor with a single-nucleotide polymorphism mutation. [8] [9] Mitochondrial DNA passes along a maternal lineage that can date back thousands of years. [8]

A haplotype is a group of alleles in an organism that are inherited together from a single parent, and a haplogroup is a group of similar haplotypes that share a common ancestor with a single-nucleotide polymorphism mutation. More specifically, a haplogroup is a combination of alleles at different chromosomal regions that are closely linked and tend to be inherited together. As a haplogroup consists of similar haplotypes, it is usually possible to predict a haplogroup from haplotypes. Haplogroups pertain to a single line of descent. As such, membership of a haplogroup, by any individual, relies on a relatively small proportion of the genetic material possessed by that individual.

In genetic genealogy, a unique-event polymorphism (UEP) is a genetic marker that corresponds to a mutation that is likely to occur so infrequently that it is believed overwhelmingly probable that all the individuals who share the marker, worldwide, will have inherited it from the same common ancestor, and the same single mutation event.

A clade, also known as monophyletic group, is a group of organisms that consists of a common ancestor and all its lineal descendants, and represents a single "branch" on the "tree of life".

## Haplotype resolution

An organism's genotype may not define its haplotype uniquely. For example, consider a diploid organism and two bi-allelic loci (such as SNPs) on the same chromosome. Assume the first locus has alleles A or T and the second locus G or C. Both loci, then, have three possible genotypes: (AA, AT, and TT) and (GG, GC, and CC), respectively. For a given individual, there are nine possible two-locus genotype configurations, each corresponding to one or more pairs of haplotypes (shown in the Punnett square below). For individuals who are homozygous at one or both loci, the haplotypes are unambiguous: the two copies of an allele at a homozygous locus are interchangeable (labelling them T1 and T2, the assignments T1T2 and T2T1 give the same result), so only one pair of haplotypes is possible. For individuals heterozygous at both loci, the gametic phase is ambiguous: the genotype alone does not reveal which pair of haplotypes the individual carries (for example, AG with TC versus AC with TG).

A locus in genetics is a fixed position on a chromosome, like the position of a gene or a marker. Each chromosome carries many genes; the human haploid genome is estimated to contain 19,000–20,000 protein-coding genes, spread over the 23 different chromosomes. A variant of the DNA sequence located at a given locus is called an allele. The ordered list of loci known for a particular genome is called a gene map. Gene mapping is the process of determining the locus for a particular biological trait.

A single-nucleotide polymorphism, often abbreviated to SNP, is a substitution of a single nucleotide that occurs at a specific position in the genome, where each variation is present to some appreciable degree within a population.

The Punnett square is a square diagram that is used to predict the genotypes of a particular cross or breeding experiment. It is named after Reginald C. Punnett, who devised the approach. The diagram is used by biologists to determine the probability of an offspring having a particular genotype. The Punnett square is a tabular summary of possible combinations of maternal alleles with paternal alleles. These tables can be used to examine the genotypical outcome probabilities of the offspring of a single trait (allele), or when crossing multiple traits from the parents. The Punnett Square is a visual representation of Mendelian inheritance. It is important to understand the terms "heterozygous", "homozygous", "double heterozygote", "dominant allele" and "recessive allele" when using the Punnett square method. For multiple traits, using the "forked-line method" is typically much easier than the Punnett square. Phenotypes may be predicted with at least better-than-chance accuracy using a Punnett square, but the phenotype that may appear in the presence of a given genotype can in some instances be influenced by many other factors, as when polygenic inheritance and/or epigenetics are at work.

| | AA | AT | TT |
|---|---|---|---|
| **GG** | AG AG | AG TG | TG TG |
| **GC** | AG AC | AG TC or AC TG | TG TC |
| **CC** | AC AC | AC TC | TC TC |
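The phase ambiguity described in this section can be made concrete by enumerating every pair of haplotypes consistent with an unphased genotype. The following is a minimal sketch, not a standard tool: the function name `compatible_haplotype_pairs` and the encoding of genotypes as two-character strings are assumptions made for illustration.

```python
from itertools import product


def compatible_haplotype_pairs(genotypes):
    """Return the set of unordered haplotype pairs consistent with
    unphased genotypes.

    `genotypes` holds one two-character string per locus, for example
    ["AT", "GC"] (hypothetical encoding chosen for this sketch).
    """
    pairs = set()
    # At each locus, either allele may sit on the first chromosome copy;
    # the other allele then sits on the second copy.
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        # Sort so that (hap1, hap2) and (hap2, hap1) count as one pair.
        pairs.add(tuple(sorted((hap1, hap2))))
    return pairs


double_het = compatible_haplotype_pairs(["AT", "GC"])
single_het = compatible_haplotype_pairs(["AA", "GC"])
```

For the double heterozygote `["AT", "GC"]` the function returns two candidate pairs (AG with TC, and AC with TG), whereas any genotype that is homozygous at one or both loci yields a single pair.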

The only unequivocal method of resolving phase ambiguity is by sequencing. However, it is possible to estimate the probability of a particular haplotype when phase is ambiguous using a sample of individuals.

DNA sequencing is the process of determining the nucleic acid sequence – the order of nucleotides in DNA. It includes any method or technology that is used to determine the order of the four bases: adenine, guanine, cytosine, and thymine. The advent of rapid DNA sequencing methods has greatly accelerated biological and medical research and discovery.

Given the genotypes for a number of individuals, the haplotypes can be inferred by haplotype resolution or haplotype phasing techniques. These methods work by applying the observation that certain haplotypes are common in certain genomic regions. Therefore, given a set of possible haplotype resolutions, these methods choose those that use fewer different haplotypes overall. The specifics of these methods vary - some are based on combinatorial approaches (e.g., parsimony), whereas others use likelihood functions based on different models and assumptions such as the Hardy-Weinberg principle, the coalescent theory model, or perfect phylogeny. The parameters in these models are then estimated using algorithms such as the expectation-maximization algorithm (EM), Markov chain Monte Carlo (MCMC), or hidden Markov models (HMM).
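As an illustration of the likelihood-based approach, the sketch below estimates haplotype frequencies from unphased genotypes with a basic expectation-maximization loop under the Hardy-Weinberg assumption. It is a toy version written for this article, not the method of any particular software package; the function names, the two-character genotype encoding, and the fixed iteration count are all assumptions.

```python
from itertools import product


def compatible_pairs(genotypes):
    """All unordered haplotype pairs consistent with one individual's
    unphased genotypes (one two-character string per locus)."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)


def em_haplotype_frequencies(sample, n_iter=100):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg
    proportions. `sample` is a list of individuals, each a list of
    per-locus genotype strings."""
    resolutions = [compatible_pairs(ind) for ind in sample]
    haps = sorted({h for res in resolutions for pair in res for h in pair})
    # Start from uniform frequencies over all candidate haplotypes.
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for res in resolutions:
            # E-step: probability of each resolution under current
            # frequencies (factor 2 for two distinct haplotypes).
            weights = [freq[a] * freq[b] * (1 if a == b else 2)
                       for a, b in res]
            total = sum(weights)
            for (a, b), w in zip(res, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected counts divided by the number of chromosomes.
        n_chromosomes = 2 * len(sample)
        freq = {h: c / n_chromosomes for h, c in counts.items()}
    return freq


sample = ([["AA", "GG"]] * 4) + ([["TT", "CC"]] * 4) + ([["AT", "GC"]] * 2)
estimates = em_haplotype_frequencies(sample)
```

In this made-up sample the unambiguous individuals carry only AG or TC, so the estimate concentrates on those two haplotypes and the two double heterozygotes are resolved as AG with TC rather than AC with TG. This mirrors the preference, noted above, for resolutions that reuse haplotypes already common in the sample.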

Occam's razor is the problem-solving principle that essentially states that "simpler solutions are more likely to be correct than complex ones." When presented with competing hypotheses to solve a problem, one should select the solution with the fewest assumptions. The idea is attributed to English Franciscan friar William of Ockham, a scholastic philosopher and theologian.

Coalescent theory is a model of how gene variants sampled from a population may have originated from a common ancestor. In the simplest case, coalescent theory assumes no recombination, no natural selection, and no gene flow or population structure, meaning that each variant is equally likely to have been passed from one generation to the next. The model looks backward in time, merging alleles into a single ancestral copy according to a random process in coalescence events. Under this model, the expected time between successive coalescence events increases almost exponentially back in time. Variance in the model comes from both the random passing of alleles from one generation to the next, and the random occurrence of mutations in these alleles.

In statistics, Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain a sample of the desired distribution by observing the chain after a number of steps. The more steps there are, the more closely the distribution of the sample matches the actual desired distribution.

Microfluidic whole genome haplotyping is a technique for the physical separation of individual chromosomes from a metaphase cell followed by direct resolution of the haplotype for each allele.

Metaphase is a stage of mitosis in the eukaryotic cell cycle in which chromosomes are at their second-most condensed and coiled stage. These chromosomes, carrying genetic information, align in the equator of the cell before being separated into each of the two daughter cells. Metaphase accounts for approximately 4% of the cell cycle's duration. Preceded by events in prometaphase and followed by anaphase, microtubules formed in prophase have already found and attached themselves to kinetochores in metaphase.

## Y-DNA haplotypes from genealogical DNA tests

Unlike other chromosomes, Y chromosomes generally do not come in pairs. Every human male (except those with XYY syndrome) has only one copy of that chromosome. This means that there is no chance variation in which copy is inherited and (for most of the chromosome) no shuffling between copies by recombination; so, unlike autosomal haplotypes, there is effectively no randomisation of the Y-chromosome haplotype between generations. A human male should largely share the same Y chromosome as his father, give or take a few mutations; thus Y chromosomes tend to pass largely intact from father to son, with a small but accumulating number of mutations that can serve to differentiate male lineages. In particular, the Y-DNA represented as the numbered results of a Y-DNA genealogical DNA test should match, except for mutations.

### UEP results (SNP results)

Unique-event polymorphisms (UEPs) such as SNPs represent haplogroups. STRs represent haplotypes. The results that comprise the full Y-DNA haplotype from the Y chromosome DNA test can be divided into two parts: the results for UEPs, sometimes loosely called the SNP results as most UEPs are single-nucleotide polymorphisms, and the results for microsatellite short tandem repeat sequences (Y-STRs).

The UEP results represent the inheritance of events that are believed to have happened only once in all of human history. These can be used to identify the individual's Y-DNA haplogroup, that is, his place in the "family tree" of the whole of humanity. Different Y-DNA haplogroups identify genetic populations that are often distinctly associated with particular geographic regions; their appearance in more recent populations located in different regions represents the migrations tens of thousands of years ago of the direct patrilineal ancestors of current individuals.

### Y-STR haplotypes

Genetic results also include the Y-STR haplotype, the set of results from the Y-STR markers tested.

Unlike the UEPs, the Y-STRs mutate much more easily, which allows them to be used to distinguish recent genealogy. But it also means that, rather than the population of descendants of a genetic event all sharing the same result, the Y-STR haplotypes are likely to have spread apart, to form a cluster of more or less similar results. Typically, this cluster will have a definite most probable center, the modal haplotype (presumably similar to the haplotype of the original founding event), and also a haplotype diversity — the degree to which it has become spread out. The further in the past the defining event occurred, and the more that subsequent population growth occurred early, the greater the haplotype diversity will be for a particular number of descendants. However, if the haplotype diversity is smaller for a particular number of descendants, this may indicate a more recent common ancestor, or a recent population expansion.
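The idea of a cluster with a most probable centre can be sketched in a few lines. The example below takes the modal haplotype to be the most frequent repeat count at each marker and measures spread with a simple sum of absolute repeat-count differences. Both choices, the function names, and the repeat counts are assumptions made for illustration (the values are invented, not real test results); DYS393, DYS390 and DYS19 are used only as familiar Y-STR marker labels.

```python
from collections import Counter


def modal_haplotype(haplotypes):
    """Most frequent repeat count at each marker across a set of
    Y-STR haplotypes (each a dict mapping marker name to repeat count)."""
    markers = haplotypes[0].keys()
    return {
        m: Counter(h[m] for h in haplotypes).most_common(1)[0][0]
        for m in markers
    }


def step_distance(a, b):
    """Sum of absolute repeat-count differences over all markers.
    This is one simple distance measure, assumed here for illustration."""
    return sum(abs(a[m] - b[m]) for m in a)


# Invented repeat counts for four hypothetical individuals.
cluster = [
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 24, "DYS19": 14},
    {"DYS393": 13, "DYS390": 25, "DYS19": 14},
    {"DYS393": 12, "DYS390": 24, "DYS19": 15},
]

modal = modal_haplotype(cluster)
distances = [step_distance(h, modal) for h in cluster]
```

Here the modal haplotype is 13, 24, 14, and the listed individuals lie 0, 0, 1 and 2 steps away from it. A wider spread of such distances corresponds to the greater haplotype diversity expected for older founding events.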

Unlike for UEPs, two individuals with a similar Y-STR haplotype do not necessarily share a similar ancestry, because Y-STR events are not unique. Instead, the clusters of Y-STR haplotype results inherited from different events and different histories tend to overlap.

In most cases, a long time has passed since the haplogroups' defining events, so typically the cluster of Y-STR haplotype results associated with descendants of that event has become rather broad. These results will tend to significantly overlap the (similarly broad) clusters of Y-STR haplotypes associated with other haplogroups. This makes it impossible for researchers to predict with absolute certainty to which Y-DNA haplogroup a Y-STR haplotype would point. If the UEPs are not tested, the Y-STRs may be used only to predict probabilities for haplogroup ancestry, but not certainties.

A similar scenario exists in trying to evaluate whether shared surnames indicate shared genetic ancestry. A cluster of similar Y-STR haplotypes may indicate a shared common ancestor, with an identifiable modal haplotype, but only if the cluster is sufficiently distinct from what may have happened by chance from different individuals who historically adopted the same name independently. Many names were adopted from common occupations, for instance, or were associated with habitation of particular sites. More extensive haplotype typing is needed to establish genetic genealogy. Commercial DNA-testing companies now offer their customers testing of larger sets of markers to improve definition of their genetic ancestry. The number of markers tested has increased from 12 during the early years to 111 more recently.

Establishing plausible relatedness between different surnames data-mined from a database is significantly more difficult. The researcher must establish that the very nearest member of the population in question, chosen purposely from the population for that reason, would be unlikely to match by accident. This is more than establishing that a randomly selected member of the population is unlikely to have such a close match by accident. Because of the difficulty, establishing relatedness between different surnames as in such a scenario is likely to be impossible, except in special cases where there is specific information to drastically limit the size of the population of candidates under consideration.

## Diversity

Haplotype diversity is a measure of the uniqueness of a particular haplotype in a given population. The haplotype diversity (H) is computed as: [10]
$$H = \frac{N}{N-1}\left(1 - \sum_{i} x_{i}^{2}\right)$$
where $x_{i}$ is the (relative) haplotype frequency of each haplotype in the sample and $N$ is the sample size. Haplotype diversity is given for each sample.
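The formula translates directly into code. The following is a minimal sketch, assuming the sample is given as a list of haplotype labels with one entry per sampled chromosome; the function name `haplotype_diversity` is chosen here for illustration.

```python
from collections import Counter


def haplotype_diversity(haplotypes):
    """Haplotype diversity H = N / (N - 1) * (1 - sum_i x_i^2), where
    x_i is the relative frequency of haplotype i and N is the sample size."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two sampled haplotypes are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1.0 - sum_sq)


h_mixed = haplotype_diversity(["AG", "AG", "TC", "TC"])
h_uniform = haplotype_diversity(["AG", "AG", "AG", "AG"])
h_distinct = haplotype_diversity(["AG", "AC", "TG", "TC"])
```

A sample in which every chromosome carries the same haplotype gives H = 0, and a sample in which every haplotype is different gives H = 1; the first example, with two haplotypes at equal frequency in a sample of four, gives H = 2/3.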

## Software

• FAMHAP [11]: software for single-marker analysis and, in particular, joint analysis of unphased genotype data from tightly linked markers (haplotype analysis).
• Fugue: EM-based haplotype estimation and association tests in unrelated and nuclear families.
• HPlus [12]: a software package for imputation and testing of haplotypes in association studies, using a modified method that incorporates the expectation-maximization algorithm and a Bayesian method known as progressive ligation.
• HaploBlockFinder: a software package for analyses of haplotype block structure.
• Haploscribe [13]: reconstruction of whole-chromosome haplotypes based on all genotyped positions in a nuclear family, including rare variants.
• Haploview [14]: visualisation of linkage disequilibrium, haplotype estimation and haplotype tagging.
• HelixTree: haplotype analysis software offering Haplotype Trend Regression (HTR), haplotypic association tests, and haplotype frequency estimation using both the expectation-maximization (EM) algorithm and the composite haplotype method (CHM).
• PHASE: software for haplotype reconstruction and recombination rate estimation from population data.
• SHAPEIT [15]: SHAPEIT2 is a program for haplotype estimation of SNP genotypes in large cohorts across whole chromosomes.
• SNPHAP: EM-based software for estimating haplotype frequencies from unphased genotypes.
• WHAP [16]: haplotype-based association analysis.

## Related Research Articles

In population genetics, linkage disequilibrium is the non-random association of alleles at different loci in a given population. Loci are said to be in linkage disequilibrium when the frequency of association of their different alleles is higher or lower than what would be expected if the loci were independent and associated randomly.

The International HapMap Project was an organization that aimed to develop a haplotype map (HapMap) of the human genome, to describe the common patterns of human genetic variation. HapMap is used to find genetic variants affecting health, disease and responses to drugs and environmental factors. The information produced by the project is made freely available for research.

Genetic genealogy is the use of DNA testing in combination with traditional genealogical methods to infer relationships between individuals and find ancestors. Genetic genealogy involves the use of genealogical DNA testing to determine the level and type of the genetic relationship between individuals. This application of genetics became popular with family historians in the 21st century, as tests became affordable. The tests have been promoted by amateur groups, such as surname study groups, or regional genealogical groups, as well as research projects such as the genographic project. As of 2019, 26 million people had been tested. As this field has developed, the aims of practitioners broadened, with many seeking knowledge of their ancestry beyond the recent centuries for which traditional pedigrees can be constructed.

Genetic association is when one or more genotypes within a population co-occur with a phenotypic trait more often than would be expected by chance occurrence.

Haplogroup G (M201) is a human Y-chromosome haplogroup. It is one of two branches of Haplogroup GHIJK, the other being Haplogroup HIJK.

In genetic genealogy and human genetics, Y DNA haplogroup J-M267, also commonly known as haplogroup J1, is a subclade (branch) of Y-DNA haplogroup J-P209 along with its sibling clade Y DNA haplogroup J-M172.

A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. It is possible to identify genetic variation and association to phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.

Haplogroup T-M184, also known as Haplogroup T is a human Y-chromosome DNA haplogroup. The UEP that defines this clade is the SNP known as M184. Other SNPs – M272, PAGES129, L810, L455, L452, and L445 – are considered to be phylogenetically equivalent to M184. As a primary branch of haplogroup LT, the basal, undivergent haplogroup T* currently has the alternate phylogenetic name of K1b and is a sibling of haplogroup L*. It has two primary branches: T1 (T-L206) and T2 (T-PH110).

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic variation within and across different species developed and hosted by the National Center for Biotechnology Information (NCBI) in collaboration with the National Human Genome Research Institute (NHGRI). Although the name of the database implies a collection of one class of polymorphisms only, it in fact contains a range of molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs), (3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms (MNPs), (5) heterozygous sequences, and (6) named variants. The dbSNP accepts apparently neutral polymorphisms, polymorphisms corresponding to known phenotypes, and regions of no variation. It was created in September 1998 to supplement GenBank, NCBI’s collection of publicly available nucleic acid and protein sequences.

Zygosity is the degree of similarity of the alleles for a trait in an organism.

In human genetics, Haplogroup G-M285, also known as Haplogroup G1, is a Y-chromosome haplogroup. Haplogroup G1 is a primary subclade of haplogroup G.

In human genetics, Haplogroup G-M406 is a Y-chromosome haplogroup. G-M406 is a branch of Haplogroup G Y-DNA (M201). More specifically, in descending order, G-M406 is also a subbranch of G2 (P287), G2a (P15) and finally G2a2b (L30/S126). Haplogroup G-M406 seems most common in Turkey and Greece. Secondary concentrations of G-M406 are found in the northern and eastern Mediterranean, and it is found in very small numbers in more inland areas of Europe, the Middle East, and the southern Caucasus Mountains area.

The Y Chromosome Haplotype Reference Database (YHRD) is an open access, annotated collection of population samples typed for Y chromosomal sequence variants. Two important objectives are pursued: (1) the generation of reliable frequency estimates for Y-STR haplotypes and Y-SNP haplotypes to be used in the quantitative assessment of matches in forensic and kinship cases and (2) the characterization of male lineages to draw conclusions about the origins and history of human populations. Since its creation in 1999 it has been curated by Lutz Roewer and Sascha Willuweit at the Institute of Legal Medicine and Forensic Sciences, Charité - Universitätsmedizin Berlin. The database is endorsed by the International Society for Forensic Genetics (ISFG). By January 2019, 269,383 9-STR locus haplotypes, among them 209,111 17-STR locus haplotypes, 52,312 23-STR locus haplotypes, 45,892 27-STR locus haplotypes and 23,710 Y SNP profiles sampled in 135 countries, had been directly submitted by forensic institutions and universities from 72 countries. In geographic terms, 40.6 % of the YHRD samples stem from Asia, 27.0 % from Europe, 16.5 % from North America, 11.8 % from Latin America, 3.0 % from Africa and 1.0 % from Oceania/Australia. The 1,262 individual sampling projects are described in more than 550 peer-reviewed publications.

The relationship of the Mayas to other indigenous peoples of the Americas has been assessed using traditional genetic markers. Mayas inhabited several parts of Mexico and Central America, including Chiapas, the northern lowlands of the Yucatán Peninsula, the southern lowlands and highlands of Guatemala, Belize, and parts of western El Salvador and Honduras. Genetic studies of the Maya people are reported to show higher levels of variation when compared to other groups.

As with all modern European nations, a large degree of 'biological continuity' exists between the Bosniaks and their ancient predecessors with Bosniak Y chromosomal lineages testifying to predominantly Paleolithic European ancestry. A majority (>67%) of Bosniaks belong to one of the three major European Y-DNA haplogroups: I2 (43.50%), R1a (15.3%) and R1b (3.5%), while a minority belongs to less frequently occurring haplogroups E-V13 (12.90%) and J2 (8.7%), along with other more rare lineages.

## References

1. Cox, C. Barry; Moore, Peter D.; Ladle, Richard (2016). Biogeography: An Ecological and Evolutionary Approach. Wiley-Blackwell. p. 106. ISBN 978-1-118-96858-1.
2. Editorial Board (2012). Concise Dictionary of Science. V&S Publishers. p. 137. ISBN 9381588643.
3. BiologyPages/H/Haplotypes.html Kimball's Biology Pages (Creative Commons Attribution 3.0)
4. "haplotype / haplotypes | Learn Science at Scitable". www.nature.com.
5. The International HapMap Consortium (2003). "The International HapMap Project". Nature. 426 (6968): 789–796. doi:10.1038/nature02168. hdl:2027.42/62838. PMID 14685227.
6. The International HapMap Consortium (2005). "A haplotype map of the human genome". Nature. 437 (7063): 1299–1320. doi:10.1038/nature04226. PMID 16255080.
7. "Facts & Genes. Volume 7, Issue 3". Archived from the original on May 9, 2008.
8. Arora, Devender; Singh, Ajeet; Sharma, Vikrant; Bhaduria, Harvendra Singh; Patel, Ram Bahadur (2015). "Hgs Db: Haplogroups Database to understand migration and molecular risk assessment". Bioinformation. 11 (6): 272–5. doi:10.6026/97320630011272. PMID 26229286.
9. International Society of Genetic Genealogy (2015). Genetics Glossary.
10. Nei, Masatoshi; Tajima, Fumio (1981). "DNA polymorphism detectable by restriction endonucleases". Genetics. 97: 145.
11. Becker T.; Knapp M. (2004). "Maximum-likelihood estimation of haplotype frequencies in nuclear families". Genetic Epidemiology. 27 (1): 21–32. doi:10.1002/gepi.10323. PMID 15185400.
12. Li S.S.; Khalid N.; Carlson C.; Zhao L.P. (2003). "Estimating haplotype frequencies and standard errors for multiple single nucleotide polymorphisms". Biostatistics. 4 (4): 513–522. doi:10.1093/biostatistics/4.4.513. PMID 14557108.
13. Roach J.C.; Glusman G.; Hubley R.; Montsaroff S.Z.; Holloway A.K.; Mauldin D.E.; Srivastava D.; Garg V.; Pollard K.S.; Galas D.J.; Hood L.; Smit A.F.A. (2011). "Chromosomal Haplotypes by Genetic Phasing of Human Families". American Journal of Human Genetics. 89 (3): 382–397. doi:10.1016/j.ajhg.2011.07.023. PMID 21855840.
14. Barrett J.C.; Fry B.; Maller J.; Daly M.J. (2005). "Haploview: analysis and visualization of LD and haplotype maps". Bioinformatics. 21 (2): 263–265. doi:10.1093/bioinformatics/bth457. PMID 15297300.
15. Delaneau O.; Zagury J.F.; Marchini J. (2013). "Improved whole chromosome phasing for disease and population genetic studies". Nature Methods. 10 (1): 5–6. doi:10.1038/nmeth.2307. PMID 23269371.
16. Purcell S.; Daly M.J.; Sham P.C. (2007). "WHAP: haplotype-based association analysis". Bioinformatics. 23 (2): 255–256. doi:10.1093/bioinformatics/btl580. PMID 17118959.