Cosegregation

Cosegregation is the transmission to the next generation of two or more genes located close together on the same chromosome. Their closeness means that they are genetically linked. [1] The term may also refer to an estimate of the interaction probability between any number of loci.

A. Nucleus, B. Nuclear Profile - Thin slice of Nucleus, C. Loci - Parts of a target gene found within the Nuclear Profile

Interaction probability is determined using specified parts of a target gene (loci) and a group of nuclear profiles (NPs). [2] The accompanying figure illustrates how a slice (NP) is taken from the nucleus and how loci are searched for within the NP. Combined with other mathematical models (SLICE [3] and normalized linkage disequilibrium), cosegregation assists in rendering 3-D visualizations as one step in the larger process of genome architecture mapping (GAM). These renderings help determine genomic density and radial position.

Articles Using Co-segregation Methodologies
Title | Description
Complex multi-enhancer contacts captured by Genome Architecture Mapping (GAM). [3] | Co-segregation between a pair of loci was used in this study to quantify normalized linkage disequilibrium.
A simple method for cosegregation analysis to evaluate the pathogenicity of unclassified variants; BRCA1 and BRCA2 as an example. [4] | Using co-segregation analysis along with a multifactorial approach gave highly conclusive results when attempting to classify unclassified variants.
Considerations in assessing germline variant pathogenicity using co-segregation analysis. [5] | This article found that Bayes factor co-segregation analysis, combined with a strong penetrance model, is more accurate than meiosis counting.

History

Within genome architecture mapping (GAM), cosegregation is used to identify the compaction and adjacency of genomic windows. In a 2017 study, cosegregation was used as part of GAM to understand the gene-expression-specific contacts that organize the genome in mammalian nuclei. [3] The study produced complex 3D structures that displayed interactions in particular regions of chromatin contact, and it showed that GAM is a useful addition to the genome biologist's toolkit, expanding the ability to finely dissect 3D chromatin structures in specific cell types and valuable human samples. A study in 2021 "discovered extensive 'melting' of long genes when they are highly expressed and/or have high chromatin accessibility. The contacts most specific of neuron subtypes contain genes associated with specialized processes, such as addiction and synaptic plasticity, which harbour putative binding sites for neuronal transcription factors within accessible chromatin regions." [6] Both of these studies used mice as models because of their anatomical, physiological, and genetic similarity to humans. [7]

Some of the earliest known studies to use cosegregation date back to the early 1980s. Around this time, scientists were conducting experiments on vegetative organisms to see whether there were unique sequences of chloroplast DNA. The experiment tracked the chloroplast gene in each generation by clustering the genes in nucleoids to reduce the number of segregating units. This work was done in the Zoology Department at Duke University, [8] where Karen P. VanWinkle-Swift used pedigree diagrams to show how traits and sequences were passed down from parent to child.

Usage

Cosegregation is best suited for cases where multiple factors' interactions are under consideration. It can show how different factors are linked and highlight their interactions and connections. For example, if a genetic disorder was identified as related to a certain gene, but is not always present when that gene is, then a cosegregation analysis could help identify other genes that interact with the suspect gene more often than normal. This could lead researchers to discover the combination of genes that manifest the genetic disorder. Cosegregation is being actively used in medical fields like cancer research. It can highlight the strongest connections between genes in cases where cancer develops. This is useful because there often isn't a single gene causing cancer. Rather, cancer can be caused by a multitude of gene combinations. Cosegregation helps to show links between genes that could be forming these combinations. [3]

Examples of using cosegregation

One application of cosegregation is finding the normalized linkage disequilibrium (NL) between two loci. The input is a 2D dataset in which each row is a genomic window and each column is a nuclear profile (NP); an entry is "1" if the window was detected in that NP and "0" otherwise. From this data, the NL is found using the base linkage disequilibrium ($D$) and its theoretical maximum ($D_{max}$). The number of NPs in which loci (genomic windows) $A$ and $B$ are detected is used to find the detection frequencies $f_A$ and $f_B$ and the co-segregation $f_{AB}$, which is the fraction of NPs in which both windows are detected. After the NL is found for a pair of loci, it is placed into another dataset to be visualized and analyzed to determine how interconnected a locus is. This example was executed using Python, both for computing the NL and for visualizing the data and results. Using the NL, further analysis can place the windows into "communities". To showcase this, the accompanying graph shows the community of one of the windows with the highest centrality, where centrality is the average of the window's NLs.
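The centrality step can be sketched in Python. This is a minimal illustration that assumes the NL values are already available as a square, symmetric matrix; the function name `average_nl_centrality`, the toy values, and the choice to leave the diagonal out of the average are assumptions made for this sketch, not details taken from the original example.

```python
def average_nl_centrality(nl_matrix):
    """Return, for each window, the mean NL to every other window.

    nl_matrix is a square list of lists; entry [i][j] is the NL
    between window i and window j. The diagonal is ignored
    (an assumption of this sketch).
    """
    n = len(nl_matrix)
    centrality = []
    for i in range(n):
        others = [nl_matrix[i][j] for j in range(n) if j != i]
        centrality.append(sum(others) / len(others) if others else 0.0)
    return centrality


# Hypothetical NL values for three genomic windows (illustrative only).
nl = [
    [1.0, 0.5, 0.25],
    [0.5, 1.0, 0.75],
    [0.25, 0.75, 1.0],
]
scores = average_nl_centrality(nl)
# Index of the window with the highest average NL.
most_central = max(range(len(scores)), key=scores.__getitem__)
```

A community for the most central window could then be formed, for instance, from the windows whose NL to it exceeds some chosen cut-off; that cut-off is a design decision the source does not specify.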

Communities for a specific locus, identified using centrality.
A sample of the 2D dataset that was used for the application of the cosegregation example.
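Formulas for the example above
Calculation | Formula [3]
Detection frequency | $f_A = n_A / N$ or $f_B = n_B / N$, where $n_A$ and $n_B$ are the numbers of NPs in which windows $A$ and $B$ are detected and $N$ is the total number of NPs
Linkage | $D = f_{AB} - f_A f_B$, where $f_{AB}$ is the co-segregation of $A$ and $B$
Linkage maximum ($D_{max}$) | $D_{max} = \min(f_A f_B, (1 - f_A)(1 - f_B))$ if $D < 0$; $D_{max} = \min(f_B (1 - f_A), f_A (1 - f_B))$ if $D > 0$
Normalized linkage (NL) | $NL = D / D_{max}$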
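The calculations for the example can be sketched in Python. This is a minimal illustration rather than the code used for the example: it assumes each window is given as a list of 0/1 values (one per NP), it uses the common definitions $D = f_{AB} - f_A f_B$ and $NL = D / D_{max}$, and the function name `normalized_linkage` and the toy rows are hypothetical. Returning 0.0 when $D_{max}$ is zero is also an assumption of this sketch.

```python
def normalized_linkage(row_a, row_b):
    """Normalized linkage for two windows given as 0/1 rows (one entry per NP)."""
    if not row_a or len(row_a) != len(row_b):
        raise ValueError("rows must be non-empty and of equal length")
    n = len(row_a)
    f_a = sum(row_a) / n                      # detection frequency of A
    f_b = sum(row_b) / n                      # detection frequency of B
    # Co-segregation: fraction of NPs in which both windows are detected.
    f_ab = sum(1 for a, b in zip(row_a, row_b) if a and b) / n
    d = f_ab - f_a * f_b                      # linkage
    if d < 0:
        d_max = min(f_a * f_b, (1 - f_a) * (1 - f_b))
    else:
        d_max = min(f_b * (1 - f_a), f_a * (1 - f_b))
    # Assumption of this sketch: report 0.0 when D_max is zero.
    return d / d_max if d_max > 0 else 0.0


# Hypothetical rows for two windows across eight NPs.
window_a = [1, 1, 1, 1, 0, 0, 0, 0]
window_b = [1, 1, 1, 0, 0, 0, 0, 1]
nl_ab = normalized_linkage(window_a, window_b)
```

With this sign convention the result ranges from -1 (the windows never appear together) to 1 (they always appear together).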

Formula

Pseudo-code showcasing the implementation of co-segregation in data science.
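Formula for finding co-segregation, given a GAM table showing whether each locus is present in each slice (nuclear profile)
Formula [3] | Variables
$f_{AB} = n_{AB} / N$ | $n_{AB}$ is the number of nuclear profiles (NPs) in which genomic windows $A$ and $B$ are both detected, $N$ is the total number of NPs, and $f_{AB}$ is the co-segregation frequency of $A$ and $B$

The detection frequency of a single window is found in the same way: $f_A = n_A / N$ or $f_B = n_B / N$, where $n_A$ and $n_B$ are the total numbers of NPs in which window $A$ or window $B$ is detected.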

This formula can be easily programmed into code as seen in the pseudo-code in the figure to the right. The code was written to satisfy the Example described above.
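As a sketch of how the co-segregation formula might look in Python, the block below counts the NPs in which both windows are detected and divides by the total number of NPs. The table layout (one row per window, one 0/1 column per NP) follows the example described earlier; the function names and the toy table are assumptions made for illustration.

```python
def cosegregation(row_a, row_b):
    """Fraction of NPs in which both windows are detected."""
    if not row_a or len(row_a) != len(row_b):
        raise ValueError("rows must be non-empty and of equal length")
    both = sum(1 for a, b in zip(row_a, row_b) if a and b)
    return both / len(row_a)


def cosegregation_matrix(table):
    """Pairwise co-segregation for every pair of windows in the table."""
    return [[cosegregation(ra, rb) for rb in table] for ra in table]


# Hypothetical GAM table: rows are windows, columns are NPs.
table = [
    [1, 1, 0, 1],  # window 0
    [1, 0, 0, 1],  # window 1
    [0, 1, 1, 0],  # window 2
]
matrix = cosegregation_matrix(table)
```

The diagonal of the resulting matrix holds each window's own detection frequency, since a window always co-segregates with itself.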

Advantages

Given a large dataset of nuclear profiles, cosegregation scales easily because its mathematical formulas are simple. The larger the dataset, the more accurate the resulting estimates will be. As depicted in the figure below, each NP added to the dataset adds only a fixed amount of work, so the computation time grows linearly with the number of NPs.

How adding more NPs to dataset affects cosegregation equation.
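One way to see the linear scaling is to keep running counts, so that each new NP costs a constant amount of work. The class below is a hypothetical sketch for a single pair of windows; the name `PairCounter` and its methods are not from the source.

```python
class PairCounter:
    """Running counts for one pair of genomic windows (A and B)."""

    def __init__(self):
        self.n = 0      # total NPs seen so far
        self.n_a = 0    # NPs in which A was detected
        self.n_b = 0    # NPs in which B was detected
        self.n_ab = 0   # NPs in which both were detected

    def add_profile(self, has_a, has_b):
        """Update the counts with one new NP; constant work per NP."""
        self.n += 1
        self.n_a += 1 if has_a else 0
        self.n_b += 1 if has_b else 0
        self.n_ab += 1 if (has_a and has_b) else 0

    def cosegregation(self):
        """Fraction of NPs seen so far that contain both windows."""
        return self.n_ab / self.n if self.n else 0.0


counter = PairCounter()
for has_a, has_b in [(1, 1), (1, 0), (0, 0), (1, 1)]:
    counter.add_profile(has_a, has_b)
before = counter.cosegregation()

# Adding one more NP updates the estimate without revisiting earlier NPs.
counter.add_profile(1, 1)
after = counter.cosegregation()
```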

Cosegregation not only scales well with dataset size; it can also take as many loci of focus as are required to determine the interaction probability. Because each additional locus adds a single computation to the equation, the time complexity is linear in the number of loci. The picture below shows how the number of loci affects the detection frequency equation.

Adding loci affects the cosegregation equation in a linear time complexity.
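The extension to more than two loci can be sketched as follows. The function below is an illustrative assumption about how the joint detection frequency could be computed from 0/1 rows: each additional locus adds one more check per NP, which is where the linear growth comes from. The name `joint_detection_frequency` and the sample rows are hypothetical.

```python
def joint_detection_frequency(rows):
    """Fraction of NPs in which every supplied window is detected.

    rows is a list of 0/1 lists, one list per locus, with one entry
    per NP. Work grows linearly with both the number of loci and
    the number of NPs.
    """
    if not rows or not rows[0]:
        raise ValueError("at least one non-empty row is required")
    n = len(rows[0])
    if any(len(row) != n for row in rows):
        raise ValueError("all rows must have the same length")
    # zip(*rows) walks the table NP by NP (column by column).
    all_present = sum(1 for column in zip(*rows) if all(column))
    return all_present / n


# Hypothetical detection rows for three loci across four NPs.
locus_a = [1, 1, 0, 1]
locus_b = [1, 1, 1, 0]
locus_c = [1, 0, 1, 1]
pair = joint_detection_frequency([locus_a, locus_b])
triple = joint_detection_frequency([locus_a, locus_b, locus_c])
```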

Finally, the numerical value that results can assist in drawing multiple conclusions including radial position, compaction, and the most influential contacts.

Limitations

This co-segregation heat map of genomic windows has not been normalized; the pattern is much less clear and the data are less meaningful than in the normalized version.
This co-segregation heat map of genomic windows has been normalized; the pattern is much clearer and the data can be interpreted more easily and accurately.

Effective cosegregation analysis depends largely on having a strong supporting dataset, because even small inaccuracies can be compounded by cosegregation. A thorough understanding of the material is also necessary, as cosegregation only provides connections between data points; the interpretation of those connections must be done through another method. For example, locus cosegregation can give a score for genes that commonly interact with each other, but no matter how strong those scores are, the results of quantitative cosegregation can seem to support a correlated, an anti-correlated, or an independent relationship. It is important to be aware of this and to follow up cosegregation analysis with another form of analysis, such as normalized linkage disequilibrium, to correct for the compounding effect cosegregation can have on negligible variations in the detection frequency of the data.
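A small numerical sketch shows why a raw co-segregation value is ambiguous on its own. The same co-segregation of 0.25 is paired with three different sets of detection frequencies, and the linkage $D = f_{AB} - f_A f_B$ comes out zero, positive, or negative. The frequencies are hypothetical values chosen for illustration, and the function name `linkage` is an assumption of this sketch.

```python
def linkage(f_ab, f_a, f_b):
    """Linkage D: observed co-segregation minus the value expected
    if the two windows were detected independently."""
    return f_ab - f_a * f_b


same_cosegregation = 0.25

# Identical co-segregation, different detection frequencies.
cases = {
    "independent": linkage(same_cosegregation, 0.5, 0.5),
    "correlated": linkage(same_cosegregation, 0.25, 0.25),
    "anti-correlated": linkage(same_cosegregation, 0.5, 0.75),
}
```

Only after the detection frequencies are taken into account does the sign of the relationship become visible, which is the motivation for following up with a normalized measure.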

Example Cancer Patient Data Set.png

For example, imagine a simple form of cancer that is triggered by a small number of genes. Here we examine a suspect gene and three other genes that are suspected to be involved in the process. The chart shows a hypothetical data set of 10 people, their cancer status, and whether they possess each of the four genes of interest. Looking at the chart, there is a clear connection between the suspect gene and Gene A. There is also a less obvious interaction between the suspect gene and Gene C that only takes place when Gene B is absent. Co-segregation could well have a hard time detecting that relationship. Gene B is commonly present with Gene A, and that combination does result in cancer. In a real data set with hundreds or even thousands of genes being examined, one could erroneously conclude that Gene B contributes to the cancer when in reality it does not and can actually prevent it.

Another limitation of this technique is that many mapping tools measure not only specific physical interactions between genes but also random contacts. Random contacts are much more common between genes separated by a smaller linear genomic distance, which can lead to inflated co-segregation scores. GAM has helped to resolve this issue because, in GAM, the detection of a genomic window is independent of any interactions with other regions. This allows an expected interaction value to be calculated; combining it with the co-segregation results filters out the noise of random contacts and provides a cleaner result. [3]

Visualizations

Matrices

A matrix is a rectangular array of numbers (entries) that can be combined using standard mathematical operations. In the case of co-segregation, graph theory is used to see whether a variable shares an edge with another variable in a network of nodes. Graph theory is the mathematical study of pairwise relations between objects, represented as nodes (vertices) that are connected to other nodes by edges.

Cosegregation to adjacency.png

The image above depicts the conversion from a cosegregation matrix to an adjacency matrix. This is one use of a matrix in genome architecture mapping, where scientists use cryosectioning to find colocalization between DNA regions, genomes, and/or alleles. In that example, cosegregation describes how data points are linked to each other in terms of the distance between specific windows in a genome. The values in the cosegregation matrix were found using the formula above: comparing windows A and B, the formula finds the intersection of the nuclear profiles in which the respective windows are detected. The genomic windows are the nodes, and the adjacency matrix is the matrix depiction of the edges connecting the nodes.
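The conversion can be sketched in Python. The source does not state the rule used to decide when two windows share an edge, so this sketch assumes a simple one: an edge exists when the cosegregation value exceeds a threshold, which defaults to the mean of the off-diagonal values. The function name `to_adjacency`, the default threshold, and the toy matrix are all assumptions.

```python
def to_adjacency(coseg, threshold=None):
    """Convert a square cosegregation matrix into a 0/1 adjacency matrix.

    An edge (1) is placed between two different windows when their
    cosegregation is strictly greater than the threshold. If no
    threshold is given, the mean of the off-diagonal values is used
    (an assumption of this sketch).
    """
    n = len(coseg)
    if threshold is None:
        off_diagonal = [coseg[i][j] for i in range(n) for j in range(n) if i != j]
        threshold = sum(off_diagonal) / len(off_diagonal) if off_diagonal else 0.0
    return [
        [1 if i != j and coseg[i][j] > threshold else 0 for j in range(n)]
        for i in range(n)
    ]


# Hypothetical cosegregation values for three genomic windows.
coseg = [
    [0.75, 0.5, 0.25],
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.5],
]
adjacency = to_adjacency(coseg)            # threshold = mean off-diagonal value
loose = to_adjacency(coseg, threshold=0.0)  # any shared NP counts as an edge
```

The choice of threshold controls how dense the resulting graph is; a normalized measure could be thresholded in the same way.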

Heat maps

A heat map is a visual representation of an m × n matrix that can show different phenomena on a two-dimensional scale. Heat maps use a range of color intensities based on the values and scale of the data. In Python, heat maps can be created using libraries such as plotly.express. With co-segregation, heat maps are used to visualize a matrix that contains values of either 1 or 0, showing the commonalities between two or more variables. "The primary benefit of using heat maps is that they make otherwise dull or impenetrable data understandable. Many people understand heat maps intuitively, without even needing to be told that those warmer colors indicate a denser focus of interactions." [9]

The Limitations section shows two heat maps (repeated below for easy viewing) depicting the difference between normalized and non-normalized data. Comparing the two helps a researcher identify patterns based on the intensity of the color gradients as well as the clustering of data points. Cosegregation results can take different forms, and visualizing them as heat maps can help researchers understand which genomic regions are connected, much as matrices do.

Non-normalized Heatmap.png Normalized Heatmap.png

One limitation of heat maps is that some software does not allow specific points on the graph to be located, especially if there are many variables. Libraries such as plotly.express can create interactive heat maps in which the user can hover over a point on the graph and read its exact value. Another limitation is that heat maps do not represent real-time data. Because heat maps work by aggregating data over time, they do not show recent changes in behavior as clearly as the more dominant patterns already present. [9]

References

  1. "Cosegregation". cancer.gov. Retrieved 4 May 2023.
  2. Wrighton, Katharine H. (May 2017). "Zooming in on nuclear organization". Nature Reviews Molecular Cell Biology. 18 (5): 275. doi:10.1038/nrm.2017.28. PMID 28327555. S2CID 3453730.
  3. Beagrie, Robert A.; Scialdone, Antonio; Schueler, Markus; Kraemer, Dorothee C. A.; Chotalia, Mita; Xie, Sheila Q.; Barbieri, Mariano; de Santiago, Inês; Lavitas, Liron-Mark; Branco, Miguel R.; Fraser, James; Dostie, Josée; Game, Laurence; Dillon, Niall; Edwards, Paul A. W.; Nicodemi, Mario; Pombo, Ana (March 2017). "Complex multi-enhancer contacts captured by genome architecture mapping". Nature. 543 (7646): 519–524. Bibcode:2017Natur.543..519B. doi:10.1038/nature21411. PMC 5366070. PMID 28273065.
  4. Mohammadi, Leila; Vreeswijk, Maaike P; Oldenburg, Rogier; van den Ouweland, Ans; Oosterwijk, Jan C; van der Hout, Annemarie H; Hoogerbrugge, Nicoline; Ligtenberg, Marjolijn; Ausems, Margreet G; van der Luijt, Rob B; Dommering, Charlotte J; Gille, Johan J; Verhoef, Senno; Hogervorst, Frans B; van Os, Theo A; Gómez García, Encarna; Blok, Marinus J; Wijnen, Juul T; Helmer, Quinta; Devilee, Peter; van Asperen, Christi J; van Houwelingen, Hans C (29 June 2009). "A simple method for co-segregation analysis to evaluate the pathogenicity of unclassified variants; BRCA1 and BRCA2 as an example". BMC Cancer. 9: 211. doi:10.1186/1471-2407-9-211. PMC 2714556. PMID 19563646.
  5. Belman, Sophie; Parsons, Michael T.; Spurdle, Amanda B.; Goldgar, David E.; Feng, Bing-Jian (December 2020). "Considerations in assessing germline variant pathogenicity using cosegregation analysis". Genetics in Medicine. 22 (12): 2052–2059. doi:10.1038/s41436-020-0920-4. PMID 32773770. S2CID 221084291.
  6. Winick-Ng, Warren; Kukalev, Alexander; Harabula, Izabela; Zea-Redondo, Luna; Szabó, Dominik; Meijer, Mandy; Serebreni, Leonid; Zhang, Yingnan; Bianco, Simona; Chiariello, Andrea M.; Irastorza-Azcarate, Ibai; Thieme, Christoph J.; Sparks, Thomas M.; Carvalho, Sílvia; Fiorillo, Luca; Musella, Francesco; Irani, Ehsan; Torlai Triglia, Elena; Kolodziejczyk, Aleksandra A.; Abentung, Andreas; Apostolova, Galina; Paul, Eleanor J.; Franke, Vedran; Kempfer, Rieke; Akalin, Altuna; Teichmann, Sarah A.; Dechant, Georg; Ungless, Mark A.; Nicodemi, Mario; Welch, Lonnie; Castelo-Branco, Gonçalo; Pombo, Ana (November 2021). "Cell-type specialization is encoded by specific chromatin topologies". Nature. 599 (7886): 684–691. Bibcode:2021Natur.599..684W. doi:10.1038/s41586-021-04081-2. PMC 8612935. PMID 34789882.
  7. Bryda, Elizabeth C (May 2013). "The Mighty Mouse: the impact of rodents on advances in biomedical research". Missouri Medicine. 110 (3): 207–211. PMC 3987984. PMID 23829104.
  8. VanWinkle-Swift, Karen P. (February 1980). "A model for the rapid vegetative segregation of multiple chloroplast genomes in Chlamydomonas: Assumptions and predictions of the model". Current Genetics. 1 (2): 113–125. doi:10.1007/BF00446957. PMID 24190835. S2CID 19184456.
  9. "Heat Maps: Types & Benefits".