Structural variation in the human genome is operationally defined as genomic alterations, varying between individuals, that involve DNA segments larger than 1 kilobase (kb) and may be either microscopic or submicroscopic. [1] This definition distinguishes structural variants from smaller variants of less than 1 kb, such as short deletions, insertions, and single nucleotide variants.
Humans have a complex genome that has been shaped and modified over time by evolution. About 99.9% of the DNA sequence in the human genome is conserved between individuals from all over the world, but some variation does exist. [1] Single nucleotide polymorphisms (SNPs) are considered to be the largest contributor to genetic variation in humans since they are so abundant and easily detectable. [2] It is estimated that there are at least 10 million SNPs within the human population, but there are also many other types of genetic variants, and they occur at dramatically different scales. [1] The variation between genomes in the human population ranges from single nucleotide polymorphisms to dramatic alterations in the human karyotype. [3]
Human genetic variation is responsible for the phenotypic differences between individuals in the human population. There are different types of genetic variation, and they are studied extensively in order to better understand their significance. These studies have led to discoveries associating genetic variants with certain phenotypes and with disease. Before DNA sequencing technologies, variation was studied and observed exclusively at a microscopic scale. At this scale, the only observations made were differences in chromosome number and chromosome structure. Variants of about 3 Mb or larger are considered microscopic structural variants. [1] Variants at this scale are large enough to be visualized using a microscope and include aneuploidies, heteromorphisms, and chromosomal rearrangements. [1] The introduction of DNA sequencing opened the door to finding smaller and far more numerous sequence variations, including SNPs and minisatellites, as well as small inversions, duplications, insertions, and deletions under 1 kb in size. [1] The Human Genome Project successfully sequenced the human genome, providing a reference genome against which genetic variation can be compared. With improving sequencing technologies and the reference genome, more and more variants were found that were larger than 1 kb but smaller than microscopic variants. These variants, ranging from about 1 kb to 3 Mb in size, are considered submicroscopic structural variants. [1] These more recently discovered structural variants are thought to play a very significant role in phenotypic diversity and disease susceptibility.
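The size classes described above can be sketched as a small lookup. This is only an illustration: the function name is hypothetical, the cut-offs are the approximate ones given in the text (1 kb and about 3 Mb), and treating a variant of exactly 1 kb as structural is an assumption, since the text only says "larger than 1 kb".

```python
KB = 1_000
MB = 1_000_000


def classify_by_size(length_bp: int) -> str:
    """Return the size class for a variant of the given length in base pairs.

    Thresholds follow the approximate operational definitions in the text;
    the boundary handling at exactly 1 kb and 3 Mb is an assumption.
    """
    if length_bp <= 0:
        raise ValueError("length_bp must be a positive number of base pairs")
    if length_bp < KB:
        return "small variant"
    if length_bp < 3 * MB:
        return "submicroscopic structural variant"
    return "microscopic structural variant"


# A 400 kb inversion and a 1.5 Mb duplication both fall in the submicroscopic class.
for size in (1, 400 * KB, 1_500_000, 5 * MB):
    print(size, classify_by_size(size))
```

Because the boundaries are described as approximate ("about 3 Mb"), any real classification near a threshold would need more care than this sketch takes.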
Structural variation is an important type of human genetic variation that contributes to phenotypic diversity. [2] Microscopic and submicroscopic structural variants include deletions, duplications, and large copy number variants, as well as insertions, inversions, and translocations. [1] These types of structural variant are quite distinct from each other. A translocation is a chromosomal rearrangement, at the inter- or intra-chromosomal level, in which a section of a chromosome changes position with no change in the total DNA content. [1] A section of DNA that is larger than 1 kb and occurs in two or more copies per haploid genome, with the different copies sharing greater than 90% sequence identity, is considered a segmental duplication or low-copy repeat. [1] These are only a few of the types of structural variants known to exist in the human genome. A table visualizing these and other forms of structural variants is shown in Figure 1.
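The three criteria for a segmental duplication given above (length, copy count, and sequence identity) can be written as a single check. This is a minimal sketch; the function and parameter names are hypothetical, and identity is assumed to be supplied as a fraction between 0 and 1.

```python
def is_segmental_duplication(
    length_bp: int,
    copies_per_haploid_genome: int,
    sequence_identity: float,
) -> bool:
    """Apply the operational definition from the text.

    A segment qualifies when it is larger than 1 kb, occurs in two or more
    copies per haploid genome, and the copies share more than 90% identity.
    """
    return (
        length_bp > 1_000
        and copies_per_haploid_genome >= 2
        and sequence_identity > 0.90
    )


print(is_segmental_duplication(25_000, 2, 0.97))  # all three criteria met
print(is_segmental_duplication(25_000, 1, 0.97))  # single copy, so not a duplication
print(is_segmental_duplication(500, 3, 0.99))     # too short
```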
An inversion is a section of DNA on a chromosome that is reversed in its orientation in comparison to the reference genome. [1] Many studies have sought to identify inversions because they have been found to play a large role in many diseases. One study found that 40% of haemophilia A patients carried an inversion in the factor VIII gene spanning a 400 kb region. [4] The inversion breakpoint was found to lie near a segmental duplication, a pattern observed in many other inversion events. [4]
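To make the definition concrete, the sketch below applies an inversion to a toy reference string. When a segment is reversed in orientation, it reads as the reverse complement of the original when viewed on the reference strand. The function names and the nine-base sequence are illustrative assumptions, not data from any of the cited studies.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def apply_inversion(reference: str, start: int, end: int) -> str:
    """Invert reference[start:end] (0-based, end-exclusive) in place.

    The total length is unchanged; only the orientation of the segment flips.
    """
    if not 0 <= start < end <= len(reference):
        raise ValueError("start and end must delimit a non-empty segment")
    inverted = reverse_complement(reference[start:end])
    return reference[:start] + inverted + reference[end:]


toy_reference = "AAACCGTTT"
inverted = apply_inversion(toy_reference, 3, 6)
print(inverted)
# Applying the same inversion again restores the original sequence.
print(apply_inversion(inverted, 3, 6) == toy_reference)
```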
It is difficult to completely understand how each structural variant is created. It was previously known that repeated sequences on a chromosome increase the probability of non-allelic homologous recombination. [5] These repeated sequences can cause deletions, duplications, inversions, and inverted duplication chromosomes. The products of this mechanism are depicted in Figure 2. A study of the olfactory receptor gene clusters asked whether rearrangements of 8p were associated with the inverted repeated sequences. The researchers observed that the chromosomal rearrangements were in fact caused by homologous recombination in the 8p-reps. They therefore concluded that the olfactory receptor gene clusters serve as the substrate for intrachromosomal rearrangements. [5] This discovery revealed the role that inverted duplicates play in the formation of structural variants. Understanding the mechanisms by which structural variants are produced is important for understanding how these types of genetic variants develop.
Copy-number variants are defined as sections of DNA larger than 1 kb that are present at a variable copy number in comparison to the reference genome. [1] This definition is broad and includes deletions, duplications, and large copy number variants. If a copy number variant is present in 1% or more of the population, it is also considered a copy-number polymorphism. [1] One study of global variation in copy number examined the characteristics of copy number variants in the human genome. Copy number variation was known to be important, but at that time it had not been fully characterized. Human genome variation is very diverse, including inversions, duplications, SNPs, and other forms. The researchers surveyed the genomes of 270 individuals from a variety of populations for copy number variants with technologies such as SNP arrays. [6] Their results showed that many copy number variants had specific patterns of linkage disequilibrium, which revealed copy number variation in all of the different populations. [6] The study concluded that 12% of the genome lay in copy number variable regions (CNVRs). These regions were found to cover more of the DNA in each genome than single nucleotide polymorphisms. [6] This was a remarkable discovery, since single nucleotide polymorphisms were known to be the most numerous variants in the human genome. In terms of the amount of sequence affected, however, these types of structural variants were found to have a larger presence in the human genome.
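The 1% frequency rule that separates a copy-number polymorphism from a rarer copy-number variant can be sketched as follows. The function name is hypothetical, and the carrier counts in the usage lines are invented purely for illustration; only the cohort size of 270 echoes the survey described above.

```python
def is_copy_number_polymorphism(carriers: int, population_size: int) -> bool:
    """Return True when a variant is present in 1% or more of the population.

    Integer arithmetic is used so that the exact 1% boundary is handled
    without floating-point rounding.
    """
    if population_size <= 0 or not 0 <= carriers <= population_size:
        raise ValueError("carriers must lie between 0 and population_size")
    return 100 * carriers >= population_size


# Hypothetical counts in a cohort of 270 individuals.
print(is_copy_number_polymorphism(3, 270))  # about 1.1% of individuals
print(is_copy_number_polymorphism(2, 270))  # about 0.7% of individuals
```

A sample of a few hundred individuals gives only a rough estimate of population frequency, so a variant near the threshold could fall on either side in a larger cohort.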
Copy number variants continued to be studied, and several studies revealed the depth of their presence and their significance. One study examined how copy number variants are organized and which types of duplication they represent. Copy number variation was known to play a large role in many human diseases, but large-scale studies of these duplications had not yet been done. The researchers sequenced 130 breakpoints from 112 individuals that contained 119 known CNVs, using whole genome sequencing as well as next generation sequencing. [7] They found that tandem duplications comprised 83% of the CNVs, while 8.4% were triplications, 4.2% were adjacent duplications, 2.5% were insertional translocations, and 1.7% were other complex rearrangements. [7] Tandem duplications were therefore the most common type of copy number variant in this population. More work was needed on the mechanisms by which structural variants form. Another study focused on the mechanisms underlying rare pathogenic copy number variants. Copy number variation was known to be important in genome structural variation and to contribute to human genetic disease, but the mechanisms behind most rare pathogenic copy number variants were not known. The researchers sequenced the breakpoint regions of many rare pathogenic copy number variants, in what was the largest and most in-depth analysis of such variants. They found that genomic architectural features were associated with about 81% of breakpoints. [8] They concluded that rare pathogenic tandem duplications and microdeletions do not arise in the human genome by chance. Instead, they arise from many different genomic architectural features. [8] In other words, certain architectural features of the genome make the formation of particular rare pathogenic structural variants physically possible and more probable.
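As a quick arithmetic check on the breakdown of CNV types reported above, the percentages can be converted back into approximate counts. This sketch assumes the percentages are taken over the 119 CNVs mentioned in the text; the per-category counts it prints are inferred from rounding and are not figures reported by the study.

```python
# Percentages as quoted in the text; the denominator of 119 is an assumption.
REPORTED_PERCENT = {
    "tandem duplication": 83.0,
    "triplication": 8.4,
    "adjacent duplication": 4.2,
    "insertional translocation": 2.5,
    "other complex rearrangement": 1.7,
}
TOTAL_CNVS = 119


def implied_counts(percentages: dict, total: int) -> dict:
    """Round each percentage of the total to the nearest whole count."""
    return {name: round(pct / 100 * total) for name, pct in percentages.items()}


counts = implied_counts(REPORTED_PERCENT, TOTAL_CNVS)
for name, count in counts.items():
    print(f"{name}: {count}")
print("sum of implied counts:", sum(counts.values()))
print("sum of percentages:", round(sum(REPORTED_PERCENT.values()), 1))
```

The implied counts add up to the assumed total of 119, and the percentages sum to 99.8 rather than 100, which is consistent with each figure having been rounded independently.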
Structural variation can be seen as an avenue of genome modification for adaptation by evolution. One study examined ancestral diet and the evolution of human amylase gene copy number. The consumption of starch became a major component of the human diet with the development of agricultural societies. Amylase is the enzyme that breaks down starch, and the copy number of its gene varies. [9] These observations led to the question of whether differences in starch consumption between populations created natural selection pressure on the amylase gene. The researchers measured amylase protein expression in saliva from different populations and compared it to amylase gene copy number in the respective genomes. [9] They then compared the starch consumption of different populations to their amylase gene copy number. They found more amylase protein in saliva from people with a higher amylase copy number, and they also found an association between populations with high-starch diets and a larger amylase gene copy number. [9] These results suggest that structural variation has been involved in human evolution, with amylase copy number increasing over time in populations with high starch consumption.
The 1000 Genomes Project produced whole genome sequence data from many populations, providing a resource for comparison and future studies. One study took advantage of this resource to examine differences in structural variation between genomes using whole genome sequence data. It was known that human diseases are affected by duplications and deletions and that copy number analysis is common, but multiallelic copy number variants (mCNVs) were not as well studied. The researchers obtained their data from the 1000 Genomes Project and analyzed 849 sequenced genomes from a variety of populations in order to find large mCNVs. [10] From their analysis, they found that mCNVs account for most of the genetic variation in gene dosage, compared to other structural variants, and that this dosage diversity generates variation in gene expression. [10] The study underlined the significance of structural variants, especially mCNVs, for gene dosage, which in turn leads to variable gene expression and phenotypic diversity in the human population.
Several structural variants have been observed in the human genome that do not lead to any obvious phenotypic effects. [1] Some, however, affect gene dosage, which can lead to genetic diseases or distinct phenotypes. Structural variants can affect gene expression directly, as with copy-number variants, or indirectly through position effects. [1] These effects can have significant implications for susceptibility to disease. The first gene dosage effect to be observed, in an autosomal dominant disease arising from an inherited DNA rearrangement, was in Charcot-Marie-Tooth (CMT) disease. Most of the associations found with CMT involved a 1.5 Mb tandem duplication in 17p11.2-p12 at the PMP22 gene. [11] The proposed mechanism for the structural variation is shown in Figure 2. When an individual has three copies of the normal gene, the disease phenotype results. [11] If the individual has only one copy of the PMP22 gene, on the other hand, the result is a clinically different disorder, hereditary neuropathy with liability to pressure palsies. [11] These differences in gene dosage produce very different disease phenotypes, which reveals the significant role that structural variation plays in phenotype and susceptibility to disease.
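The PMP22 dosage relationship described above can be summarised as a simple lookup from copy number to the phenotype named in the text. This is an illustrative sketch only, not a diagnostic rule: the function name is hypothetical, and the label for two copies is an assumption, since the text describes only the one-copy and three-copy cases.

```python
def pmp22_dosage_phenotype(copies: int) -> str:
    """Map PMP22 copy number to the phenotype described in the text."""
    if copies == 1:
        return "hereditary neuropathy with liability to pressure palsies"
    if copies == 2:
        # Assumed label: the text does not describe the two-copy case explicitly.
        return "typical gene dosage"
    if copies == 3:
        return "Charcot-Marie-Tooth disease"
    return "not described in the text"


for n in (1, 2, 3):
    print(n, "->", pmp22_dosage_phenotype(n))
```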
Structural variation studies became increasingly popular with the discovery of the possible roles and effects of structural variants in the human genome. Copy number variation is a very important type of structural variation and has been studied extensively. A study on the influence of the CCL3L1 gene on HIV-1/AIDS susceptibility tested whether the copy number of the CCL3L1 gene had any effect on an individual's susceptibility to HIV-1/AIDS. The researchers measured CCL3L1 copy number in individuals from several populations and compared it to their risk of acquiring HIV. They found an association between CCL3L1 copy number and susceptibility to HIV and AIDS: individuals who were more prone to HIV had a low copy number of CCL3L1. [12] This association suggests that differences in copy number may play a significant role in HIV susceptibility. Another study, focused on the pathogenesis of human obesity, tested whether structural variation of the NPY4R gene was significant in obesity. Studies had previously shown that the 10q11.22 CNV, among several other copy number variants, was associated with obesity. The CNV analysis revealed that CNV loss at 10q11.22 involving the NPY4R gene was much more frequent in the patient population. [13] The control population, on the other hand, had more CNV gain in the same region. This led the researchers to conclude that the NPY4R gene plays an important role in the pathogenesis of obesity through its copy number variation. [13] Studies of copy number variation and other structural variants have brought new insights into the significant roles that structural variants play in the human genome.
The factors that contribute to the development of schizophrenia have been studied extensively. A recent study examined the mechanism and genes responsible for schizophrenia development. It had previously been shown that variation at an MHC locus was associated with the development of schizophrenia. This study found that the association is caused partly by the complement component 4 (C4) genes, implying that allelic variants of the C4 genes contribute to the development of schizophrenia. [14] Linkage disequilibrium helped the researchers identify which C4 structural variant an individual carried by looking at the SNP haplotypes. The SNP haplotypes and the C4 alleles were in linkage disequilibrium, meaning that they segregated together. [14] A single C4 structural variant was associated with many different SNP haplotypes, but each SNP haplotype was associated with only one C4 structural variant. [14] This relationship allowed the researchers to determine the C4 structural variant easily by looking at the SNP haplotype. The results showed that the structural variants of C4 express the C4A protein at different levels and that higher C4A expression was associated with higher rates of schizophrenia development. [14] Different structural variant alleles of the same gene were thus shown to produce different phenotypes and different susceptibility to disease. These studies exhibit the breadth of the involvement and significance of structural variation in the human genome. Its importance is demonstrated by its contribution to phenotypic diversity and disease susceptibility.
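The many-to-one relationship between SNP haplotypes and C4 structural variants described above is what makes the lookup possible: each haplotype points to exactly one structural variant, even though one structural variant may sit on several haplotypes. The sketch below illustrates that logic with invented labels (`hap_1`, `variant_X`, and so on); none of the names or pairings come from the cited study.

```python
def build_haplotype_lookup(observations):
    """Build a haplotype -> structural variant table from (haplotype, variant) pairs.

    Raises ValueError if a haplotype is seen with two different variants,
    because the lookup is only valid when each haplotype tags one variant.
    """
    lookup = {}
    for haplotype, variant in observations:
        previous = lookup.setdefault(haplotype, variant)
        if previous != variant:
            raise ValueError(f"{haplotype} tags both {previous} and {variant}")
    return lookup


def haplotypes_per_variant(lookup):
    """Group haplotypes by the structural variant they tag."""
    grouped = {}
    for haplotype, variant in lookup.items():
        grouped.setdefault(variant, []).append(haplotype)
    return grouped


# Hypothetical observations: two haplotypes carry variant_X, one carries variant_Y.
OBSERVATIONS = [
    ("hap_1", "variant_X"),
    ("hap_2", "variant_X"),
    ("hap_3", "variant_Y"),
]
LOOKUP = build_haplotype_lookup(OBSERVATIONS)
print(LOOKUP["hap_2"])
print(haplotypes_per_variant(LOOKUP))
```

Raising an error on a conflicting observation is a design choice for the sketch; it makes explicit that the inference breaks down if a haplotype does not identify a single structural variant.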
Many studies have been conducted to better understand human genome structural variation. There have been great advances in the research, but its significance is still not fully understood. Several questions remain unanswered and call for further study. Current studies usually target "unique" areas of the genome and are not able to detect the phenotypic effect of structural variants in highly repetitive, duplicated, and complex genomic areas. [15] These regions are very difficult to study with current genomic technology, but this may change with future developments in sequencing. To better understand the phenotypic effect of structural variants, large databases of the genotypes and phenotypes of individuals must be created so that accurate associations can be made. Large projects such as Deciphering Developmental Disorders, UK10K, and the International Standards for Cytogenomic Arrays Consortium have already paved the way by creating databases that make these studies easier for researchers to pursue. [15]
In addition, there has been growth and development in technology to create induced pluripotent stem cells with specific diseases. This introduces appropriate model systems to recreate disease causing structural variants such as translocations, duplications, and inversions. [15] The future advancement in technologies and large database efforts will help lead the way to better quality studies and a much better understanding of human genome structural variation.
The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that do not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.
In genetics and bioinformatics, a single-nucleotide polymorphism is a germline substitution of a single nucleotide at a specific position in the genome that is present in a sufficiently large fraction of the population considered.
Genetic architecture is the underlying genetic basis of a phenotypic trait and its variational properties. Phenotypic variation for quantitative traits is, at the most basic level, the result of the segregation of alleles at quantitative trait loci (QTL). Environmental factors and other external influences can also play a role in phenotypic variation. Genetic architecture is a broad term that can be described for any given individual based on information regarding gene and allele number, the distribution of allelic and mutational effects, and patterns of pleiotropy, dominance, and epistasis.
Copy number variation (CNV) is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals. Copy number variation is a type of structural variation: specifically, it is a type of duplication or deletion event that affects a considerable number of base pairs. Approximately two-thirds of the entire human genome may be composed of repeats and 4.8–9.5% of the human genome can be classified as copy number variations. In mammals, copy number variations play an important role in generating necessary variation in the population as well as disease phenotype.
Human genetic variation is the genetic differences in and among populations. There may be multiple variants of any given gene in the human population (alleles), a situation called polymorphism.
In molecular biology, a SNP array is a type of DNA microarray which is used to detect polymorphisms within a population. A single nucleotide polymorphism (SNP), a variation at a single site in DNA, is the most frequent type of variation in the genome. Around 335 million SNPs have been identified in the human genome, 15 million of which are present at frequencies of 1% or higher across different populations worldwide.
A tag SNP is a representative single nucleotide polymorphism (SNP) in a region of the genome with high linkage disequilibrium that represents a group of SNPs called a haplotype. It is possible to identify genetic variation and association to phenotypes without genotyping every SNP in a chromosomal region. This reduces the expense and time of mapping genome areas associated with disease, since it eliminates the need to study every individual SNP. Tag SNPs are useful in whole-genome SNP association studies in which hundreds of thousands of SNPs across the entire genome are genotyped.
Complement component 4 (C4), in humans, is a protein involved in the intricate complement system, originating from the human leukocyte antigen (HLA) system. Together with the numerous other components, it serves a number of critical functions in immunity, tolerance, and autoimmunity. Furthermore, it is a crucial factor in connecting the recognition pathways of the overall system, instigated by antibody-antigen (Ab-Ag) complexes, to the other effector proteins of the innate immune response. For example, a severely dysfunctional complement system can lead to fatal diseases and infections. Complex variations of it can also lead to schizophrenia. The C4 protein was once thought to derive from a simple two-locus allelic model. That model has been replaced by a more sophisticated multimodular RCCX gene complex model, in which long and short forms of the C4A or C4B genes usually sit in tandem RCCX cassettes that vary in copy number. This copy number variation somewhat parallels variation in the levels of the respective proteins within a population, along with CYP21 in some cases, depending on the number of cassettes and whether they contain the functional gene rather than pseudogenes or fragments. Originally defined in the context of the Chido/Rodgers blood group system, the C4A-C4B genetic model is under investigation for its possible role in schizophrenia risk and development.
Chemokine ligand 3-like 1, also known as CCL3L1, is a protein which in humans is encoded by the CCL3L1 gene.
The 1000 Genomes Project, which ran from January 2008 to 2015, was an international research effort to establish the most detailed catalogue of human genetic variation at the time. Scientists planned to sequence the genomes of at least one thousand anonymous healthy participants from a number of different ethnic groups within the following three years, using newly developed technologies. In 2010, the project finished its pilot phase, which was described in detail in a publication in the journal Nature. In 2012, the sequencing of 1092 genomes was announced in a Nature publication. In 2015, two papers in Nature reported results, the completion of the project, and opportunities for future research.
The Center for Applied Genomics is a research center at the Children's Hospital of Philadelphia that focuses on genomics research and the utilization of basic research findings in the development of new medical treatments.
Non-allelic homologous recombination (NAHR) is a form of homologous recombination that occurs between two lengths of DNA that have high sequence similarity, but are not alleles.
1q21.1 duplication syndrome or 1q21.1 (recurrent) microduplication is a rare aberration of chromosome 1.
Stephen Wayne "Steve" Scherer is a Canadian scientist who currently serves as the Chief of Research at The Hospital for Sick Children (SickKids) and distinguished University Professor at the University of Toronto. He obtained his PhD at the University of Toronto under Professor Lap-chee Tsui. Together they founded Canada's first human genome centre, the Centre for Applied Genomics (TCAG). He is a Senior Fellow of Massey College at the University of Toronto. In 2014, he was named an esteemed Clarivate Citation laureate in Physiology or Medicine for the “Discovery of large-scale gene copy number variation and its association with specific diseases.”
Genomic structural variation is the variation in structure of an organism's chromosome. It consists of many kinds of variation in the genome of one species, and usually includes microscopic and submicroscopic types, such as deletions, duplications, copy-number variants, insertions, inversions and translocations. Originally, structural variation was defined as affecting a sequence length of about 1 kb to 3 Mb, which is larger than SNPs and smaller than chromosome abnormalities. However, the operational range of structural variants has widened to include events larger than 50 bp. The definition of structural variation does not imply anything about frequency or phenotypic effects. Many structural variants are associated with genetic diseases; however, many are not. Recent research indicates that SVs are more difficult to detect than SNPs. Approximately 13% of the human genome is defined as structurally variant in the normal population, and there are at least 240 genes that exist as homozygous deletion polymorphisms in human populations, suggesting these genes are dispensable in humans. Rapidly accumulating evidence indicates that structural variations can comprise millions of nucleotides of heterogeneity within every genome, and are likely to make an important contribution to human diversity and disease susceptibility.
1q21.1 copy number variations (CNVs) are rare aberrations of human chromosome 1.
End-sequence profiling (ESP) is a method based on sequence-tagged connectors developed to facilitate de novo genome sequencing to identify high-resolution copy number and structural aberrations such as inversions and translocations.
Segmental duplications are blocks of DNA ranging from 1 to 400 kb in length which recur at multiple sites within the genome, sharing greater than 90% similarity. Multiple studies have found a correlation between the location of segmental duplications and regions of chromosomal instability. This correlation suggests that they may be mediators of some genomic disorders. Segmental duplications are shown to be flanked on both sides by large homologous repeats, which exposes the region to recurrent rearrangement by nonallelic homologous recombination, leading to either deletion, duplication, or inversion of the original sequence.
ANNOVAR is a bioinformatics software tool for the interpretation and prioritization of single nucleotide variants (SNVs), insertions, deletions, and copy number variants (CNVs) of a given genome.
Human somatic variations are somatic mutations that arise both at early stages of development and in adult cells. These variations may or may not lead to pathogenic phenotypes, and their function in healthy conditions is not yet completely clear.