Hi-C is a high-throughput genomic and epigenomic technique to capture chromatin conformation (3C). [1] In general, Hi-C is considered as a derivative of a series of chromosome conformation capture technologies, including but not limited to 3C (chromosome conformation capture), 4C (chromosome conformation capture-on-chip/circular chromosome conformation capture), and 5C (chromosome conformation capture carbon copy). [1] [2] [3] [4] Hi-C comprehensively detects genome-wide chromatin interactions in the cell nucleus by combining 3C and next-generation sequencing (NGS) approaches and has been considered as a qualitative leap in C-technology (chromosome conformation capture-based technologies) development and the beginning of 3D genomics. [2] [3] [4]
Similar to the classic 3C technique, Hi-C measures the frequency (as an average over a cell population) at which two DNA fragments physically associate in 3D space, linking chromosomal structure directly to the genomic sequence. [4] The general procedure of Hi-C involves first crosslinking chromatin material using formaldehyde. [3] [4] Then, the chromatin is solubilized and fragmented, and interacting loci are re-ligated together to create a genomic library of chimeric DNA molecules. [4] The relative abundance of these chimeras, or ligation products, is correlated to the probability that the respective chromatin fragments interact in 3D space across the cell population. [4] While 3C focuses on the analysis of a set of predetermined genomic loci to offer “one-versus-some” investigations of the conformation of the chromosome regions of interest, Hi-C enables “all-versus-all” interaction profiling by labeling all fragmented chromatin with a biotinylated nucleotide before ligation. [3] [4] As a result, biotin-marked ligation junctions can be purified more efficiently by streptavidin-coated magnetic beads, and chromatin interaction data can be obtained by direct sequencing of the Hi-C library. [3] [4]
Analyses of Hi-C data not only reveal the overall genomic structure of mammalian chromosomes, but also offer insights into the biophysical properties of chromatin as well as more specific, long-range contacts between distant genomic elements (e.g. between genes and regulatory elements), [4] [5] [6] including how these change over time in response to stimuli. [7] In recent years, Hi-C has found its application in a wide variety of biological fields, including cell growth and division, transcription regulation, fate determination, development, autoimmune disease, and genome evolution. [7] [5] [6] By combining Hi-C data with other datasets such as genome-wide maps of chromatin modifications and gene expression profiles, the functional roles of chromatin conformation in genome regulation and stability can also be delineated. [4]
At its inception, Hi-C was a low-resolution, high-noise technology that was only capable of describing chromatin interaction regions within a bin size of 1 million base pairs (Mb). [1] The Hi-C library also required several days to construct, [4] [8] and the datasets themselves were low in both output and reproducibility. [9] Nevertheless, Hi-C data offered new insights into chromatin conformation as well as nuclear and genomic architecture, and these prospects motivated scientists to refine the technique over the past decade.
Between 2012 and 2015, several modifications were made to the Hi-C protocol, using 4-cutter digestion [10] or greater sequencing depth to obtain higher resolution. [8] [9] [11] The use of restriction endonucleases that cut more frequently, or of DNaseI and micrococcal nuclease, also significantly increased the resolution of the method. [12] More recently (2017), Belaghzal et al. described a Hi-C 2.0 protocol that was able to achieve kilobase (kb) resolution. [12] The key adaptation to the base protocol was the removal of the SDS solubilization step after digestion to preserve nuclear structure and prevent random ligation between fragmented chromatin by ligation within the intact nuclei, which formed the basis of in situ Hi-C. [12] In 2021, Hi-C 3.0 was described by Lafontaine et al., with higher resolution achieved by enhancing crosslinking with formaldehyde followed by disuccinimidyl glutarate (DSG). [13] While formaldehyde captures the amino and imino groups of both proteins and DNA, the NHS-esters in DSG react with primary amines on proteins and can capture amine-amine interactions. [13] These updates to the base protocol allowed scientists to look at more detailed conformational structures such as chromosomal compartments and topologically associating domains (TADs), as well as high-resolution conformational features such as DNA loops. [12] [13]
To date, a variety of derivatives of Hi-C have already emerged, including in situ Hi-C, low Hi-C, SAFE Hi-C, and Micro-C, with distinctive features related to different aspects of standard Hi-C, but the basic principle has remained the same.
The outline of the classical Hi-C workflow is as follows: cells are cross-linked with formaldehyde; chromatin is digested with a restriction enzyme that generates a 5’ overhang; the 5’ overhang is filled with biotinylated bases and the resulting blunt-ended DNA is ligated. [1] The ligation products, with biotin at the junction, are selected for using streptavidin and further processed to prepare a library ready for subsequent sequencing efforts. [1]
The number of pairwise interactions that Hi-C can capture across the genome is immense, so it is important to analyze an appropriately large sample in order to capture unique interactions that may only be observed in a minority of the cell population. [4] To obtain a high-complexity library of ligation products that will ensure high resolution and depth of data, a sample of 20–25 million cells is required as input for Hi-C. [3] [4] Primary human samples, which may be available only in smaller cell numbers, could be used for standard Hi-C library preparation with as few as 1–5 million cells. [4] However, using such a low input of cells may be associated with low library complexity, which results in a high percentage of duplicate reads during library preparation. [4]
Standard Hi-C gives data on pairwise interactions at the resolution of 1 to 10 Mb, requires high sequencing depth and the protocol takes around 7 days to complete. [3] [4] [14]
Cell and nuclear membranes are highly permeable to formaldehyde. [4] [15] [16] Formaldehyde cross-linking is frequently employed for the detection and quantification of DNA-protein and protein-protein interactions. [15] Of interest in the context of Hi-C, and all 3C-based methods, is the ability of formaldehyde to capture cis chromosomal interactions between distal segments of chromatin. [1] [4] [15] [16] It does so by forming covalent links between spatially adjacent chromatin segments. Formaldehyde can react with macromolecules in two steps: first it reacts with a nucleophilic group on a DNA base for example, and forms a methylol adduct, which is then converted to a Schiff base. [15] In the second step, the Schiff base, which can decompose rapidly, forms a methylene bridge with another functional group on another molecule. [15] It can also make this methylene bridge with a small molecule in solution such as glycine, which is used in excess to quench formaldehyde in Hi-C. [1] [4] [15] [16] Quenchers can typically exert an effect on formaldehyde from outside the cell. [15] A key feature of this two-step formaldehyde crosslinking reaction is that all the reactions are reversible, which is vital for chromatin capture. [1] [4] [15] [16]
Crosslinking is a pivotal step of the chromatin capture workflow as the functional readout of the technique is the frequency at which two genomic regions are crosslinked to each other. [4] Thus, the standardization of this step is important and for that, one must consider potential sources of variation. [4] Presence of serum, which contains a high concentration of protein, in culture media can decrease the effective concentration of formaldehyde available for chromatin crosslinking, by sequestering it in the culture media. [4] Therefore, in cases where serum is used in culture, it should be removed for the crosslinking step. [4] The nature of cells, i.e., whether they are suspension or adherent, is also a pertinent consideration for the crosslinking step. [4] Adherent cells bind to surfaces with the help of molecular mechanisms of cytoskeletons. [4] It has been shown that there is a link between cytoskeleton-maintained nuclear and cellular morphology which, if altered, may negatively impact global nuclear organization. [4] Adherent cells therefore, should be crosslinked while still attached to their culture surface. [4]
Cells are lysed on ice with cold hypotonic buffer containing sodium chloride, Tris-HCl at pH 8.0, and non-ionic detergent IGEPAL CA-630, supplemented with protease inhibitors. [4] [16] The protease inhibitors and incubation on ice help preserve the integrity of crosslinked chromatin complexes from endogenous proteases. [4] [16] The lysis step helps to release the nucleic material from the cells. [1] [4] [16]
Following cell lysis, chromatin is solubilized with dilute SDS in order to remove proteins that have not been crosslinked and to open chromatin and make it more accessible for subsequent restriction endonuclease-mediated digestion. [4] If the incubation with SDS exceeds the recommended 10 minutes, the formaldehyde crosslinks can be reversed and so the incubation with SDS must be immediately followed by an incubation on ice. [4] A non-ionic detergent called Triton X-100 is used to quench SDS in order to prevent enzyme denaturation in the next step. [4]
Any restriction enzyme that generates a 5’ overhang, such as HindIII, can be used to digest the now-accessible chromatin overnight. [4] [16] This 5’ overhang provides the template required by the Klenow fragment of DNA Polymerase I to add a biotinylated deoxynucleotide (biotin-dCTP or biotin-dATP) to the digested ends of chromatin. [4] [16] This step allows for the selection of Hi-C ligation products for library preparation. [4] [16]
A dilution ligation is performed on DNA fragments that are still crosslinked to one another in order to favor the intramolecular ligation of fragments within the same chromatin complex over ligation events between fragments from different complexes. [4] [16] Because the sticky ends have been filled in with biotin-labeled bases, this ligation step occurs between blunt-ended DNA fragments, and the reaction is allowed to proceed for up to 4 hours to make up for its inherent inefficiency. [16] As a result of proximity ligation, the terminal HindIII sites are lost and an NheI site is generated. [1]
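The loss of the HindIII site and the appearance of an NheI site at the junction can be checked with a short string-level sketch. It is only an illustration under simplifying assumptions: sequences are treated as top-strand strings, only one ligation orientation is modeled, and the helper names (`digest_and_fill`, `ligate`) and the two toy loci are hypothetical.

```python
HINDIII_SITE = "AAGCTT"  # cut as A^AGCTT, leaving a 4-base 5' overhang (AGCT)
NHEI_SITE = "GCTAGC"     # recognition sequence of NheI


def digest_and_fill(seq, site=HINDIII_SITE, cut_offset=1, overhang=4):
    """Cut at every site and return the fragments after end fill-in.

    Filling in the 5' overhang copies the overhang bases onto both new ends,
    so each cut duplicates the 4 overhang bases across the two fragments.
    """
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        # Left fragment: top strand extended across the overhang by fill-in.
        fragments.append(seq[start:cut + overhang])
        # Right fragment: its top strand already begins with the overhang.
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments


def ligate(left, right):
    """Blunt-end ligation of two filled-in fragments (one orientation only)."""
    return left + right


# Two toy loci, each containing one HindIII site (hypothetical sequences).
locus_a = "CCCCAAGCTTGGGG"
locus_b = "TTTTAAGCTTCCAA"
a_left, _ = digest_and_fill(locus_a)
_, b_right = digest_and_fill(locus_b)
chimera = ligate(a_left, b_right)

print(chimera)                   # junction reads AAGCTAGCTT
print(NHEI_SITE in chimera)      # the junction now contains an NheI site
print(HINDIII_SITE in chimera)   # and no longer contains a HindIII site
```

The same reasoning applies to any enzyme that leaves a 5’ overhang: fill-in duplicates the overhang bases, so the ligated junction differs from the original recognition site.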
The biotin-labeled ligation products can be purified using phenol-chloroform DNA extraction. [4] [16] [17] To remove any fragments with biotin-labeled ends that have not been ligated, T4 DNA Polymerase with 3’ to 5’ exonuclease activity is used to remove nucleotides from the ends of such fragments. [4] [16] [18] This step ensures that none of these unligated fragments are selected for library preparation. [4] [16] The reaction is stopped with EDTA and the DNA is purified once again using phenol-chloroform DNA extraction. [4] [16]
The ideal size of DNA fragments for the sequencing library depends on the sequencing platform that will be used. [4] [16] DNA can first be sheared to fragments around 300–500 bp long using sonication. [4] [16] [17] Fragments of this size are suitable for high-throughput sequencing. [4] [16] [17] Following sonication, fragments can be size selected using AMPure XP beads from Beckman Coulter to obtain ligation products with a size distribution between 150 and 300 bp. [4] [17] This is the optimal fragment size window for HiSeq cluster formation. [4] [17]
DNA shearing causes asymmetric DNA breaks and must be repaired before biotin pulldown and sequencing adaptor ligation. [4] [16] This is achieved by using a combination of enzymes that fill in 5’ overhangs, and add 5’ phosphate groups and adenylate to the 3’ ends of fragments to allow for ligation of sequencing adaptors. [4] [16]
Using an excess of streptavidin beads, such as Dynabeads MyOne Streptavidin C1, biotinylated Hi-C ligation products can be pulled down and enriched. [4] [16] Ligation of the Illumina paired-end adapters is performed while the DNA fragments are bound to the streptavidin beads. [4] [16] [17] Adsorption to the beads increases the efficiency of the ligation of these DNA fragments to the adaptors, as it decreases their mobility. [4] [16] [17]
After the ligation of the adaptors is complete, PCR amplification of the library is performed. [4] [16] Over-amplification at the PCR step can introduce a high number of duplicates in a low-complexity Hi-C ligation product sample. [4] [16] This results in very few interactions being captured, and it often occurs because the input sample contained a low number of cells. [4] [16] It is important to titrate the number of cycles required to obtain at least 50 ng of Hi-C library DNA for sequencing. [4] [16] The fewer the cycles, the better, so that PCR artifacts (such as off-target amplicons and non-specific products) are avoided. [4] [16] The ideal range is 9–15 PCR cycles, and it is preferable to pool multiple PCR reactions to obtain enough DNA for sequencing rather than to increase the number of cycles in a single reaction. [4] [16] The PCR products are purified again using AMPure beads to remove primer dimers and then quantified before being sequenced. [4] [16] Regions of chromatin that interact with each other are then identified by paired-end sequencing of the biotinylated, ligated products. [4] [16]
Any platform that can allow for the ligated fragments to be sequenced across the NheI junction (Roche 454) or by paired-end or mate-paired reads (Illumina GA and HiSeq platforms) would be suitable for Hi-C. [4] Before high-throughput sequencing, the quality of the library should be verified using Sanger sequencing, wherein the long sequencing read will read through the biotin junction. [4] Thirty-six or 50 bp reads are sufficient to identify most chromatin interacting pairs using Illumina paired-end sequencing. [4] Since the average size of fragments in the library is 250 bp, 50bp paired-end reads have been found to be optimum for Hi-C library sequencing. [4]
There are several well-documented critical points throughout the workflow of Hi-C sample preparation. [4] [16] DNA at various stages can be run on 0.8% agarose gels to assay the size distribution of fragments. [4] [16] This is particularly important after the shearing or size selection steps. [4] [16] DNA degradation can also be monitored on gels, where it appears as smears of low-molecular-weight products. [4] [16] Degradation can occur because of insufficient protease inhibitors during lysis, endogenous nuclease activity, or thermal degradation caused by inadequate cooling on ice. [4] [16] 3C PCR reactions can be performed to test for the formation of proximity ligation products. [4] [16]
Standard Hi-C has a high input cell number cost, requires deep sequencing, generates low-resolution data, and suffers from formation of redundant molecules that contribute to low complexity libraries when cell numbers are low. [4] [16] [17] To combat these issues in order to be able to apply this technique in contexts where cell number is a limiting factor, for example, with primary human cell work, several Hi-C variants have been developed since the first conceptualization of Hi-C. [3]
The four main classes into which Hi-C variants fall are: dilution ligation, in situ ligation, single cell, and low noise improvement systems. [3] Standard Hi-C is a type of dilution ligation; other dilution ligation methods include DNase Hi-C and Capture Hi-C. [3] In contrast to standard and Capture Hi-C, DNase Hi-C requires only 2–5 million cells as input, uses DNaseI for chromatin fragmentation and employs an in-gel dilution proximity ligation. [3] [19] [20] The use of DNaseI has been shown to greatly improve the efficiency and resolution of Hi-C. [3] [19] Capture Hi-C is a genome-wide assaying technique to look at chromatin interactions of specific loci using a hybridization-based capture of targeted genomic regions. [20] It was first developed by Mifsud et al. to map long-range promoter contacts in human cells by generating a biotinylated RNA bait library that targeted 21,841 promoter regions. [20] These variants, in addition to others (described below), represent modifications to the foundational technique of standard Hi-C and alleviate one or more limitations of the original method.
In situ Hi-C combines standard Hi-C with nuclear ligation assay, i.e., proximity ligation performed in intact nuclei. [14] [21] The protocol is similar to standard Hi-C in terms of the basic workflow outline but differs in other ways. [14] In situ Hi-C requires 2 to 5 million cells compared to the ideal 20 to 25 million required for standard Hi-C and it requires only 3 days to complete the protocol versus 7 days for standard Hi-C. [14] Furthermore, proximity ligation does not take place in solution like in standard Hi-C, decreasing the frequency of random, biologically irrelevant contacts and ligations, as indicated by the lower frequency of mitochondrial and nuclear DNA contacts in captured biotinylated DNA. [14] This is achieved by leaving the nuclei intact for the ligation step. [14] Cells are still lysed with a buffer containing Tris-HCl at pH 8.0, sodium chloride, and the detergent IGEPAL CA630 before ligation, but instead of homogenization of the cell lysate, cell nuclei are pelleted after initial lysis to degrade the cell membrane. [14] After proximity ligation is complete, cell nuclei are incubated for at least 1.5 hours at 68 degrees Celsius to permeabilize the nuclear membrane and release its nuclear contents. [14]
The resolution that can be achieved with in situ Hi-C can be as fine as 950 to 1000 bp, compared to the 1 to 10 Mb resolution of standard Hi-C and the 100 kb resolution of DNase Hi-C. [3] [4] [14] [19] While standard Hi-C makes use of a 6-bp cutter such as HindIII for the restriction digest step, in situ Hi-C uses a 4-bp cutter such as MboI or its isoschizomer DpnII (which is not sensitive to CpG methylation) to increase efficiency and resolution, as the restriction sites of MboI and DpnII occur more frequently in the genome. [3] [4] [14] Data from in situ Hi-C replicates are consistent and highly reproducible, with very little background noise and clear chromatin interactions. [3] [14] It is, however, possible that some of the captured interactions are not genuine intermolecular interactions: because the nucleus is densely packed with protein and DNA, proximity ligation in intact nuclei may pull down confounding contacts that form only as a consequence of nuclear packing rather than representing specific chromosomal interactions with functional impact. [3] [14] In situ Hi-C also requires an extremely high sequencing depth of around 5 billion paired-end reads per sample to achieve the resolution of data described by Rao et al. [3] [14] [22] Several techniques that have adapted the concept of in situ Hi-C exist, including Sis Hi-C, OCEAN-C and in situ capture Hi-C. [3] Described below are two of the most prominent in situ Hi-C based techniques. [3]
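The effect of recognition-site length on cutting frequency can be illustrated with a rough calculation. Assuming, purely for illustration, a genome in which the four bases are equally frequent and independent, a site of length n is expected about once every 4^n bp, i.e. roughly every 256 bp for a 4-bp cutter and every 4096 bp for a 6-bp cutter. The sketch below compares this expectation with site counts in a simulated random sequence; the function names are hypothetical, and real genomes deviate from this model (for example through biased base composition).

```python
import random


def expected_spacing(site_length):
    """Expected distance (bp) between sites under a uniform random model."""
    return 4 ** site_length


def count_sites(seq, site):
    """Count (possibly overlapping) occurrences of a recognition site."""
    count, pos = 0, seq.find(site)
    while pos != -1:
        count += 1
        pos = seq.find(site, pos + 1)
    return count


# Simulated sequence: an assumption for illustration, not a real genome.
random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(200_000))

four_cutter = count_sites(genome, "GATC")    # MboI / DpnII recognition site
six_cutter = count_sites(genome, "AAGCTT")   # HindIII recognition site

print("expected spacing:", expected_spacing(4), "bp vs", expected_spacing(6), "bp")
print("observed sites:", four_cutter, "vs", six_cutter)
```

Under this model the 4-bp cutter produces about 16 times as many fragments, which is why it supports finer binning for the same genome.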
Low-C is an in situ Hi-C protocol adapted for use on low cell numbers, which is particularly useful in contexts where cell number is a limiting factor, for example, in primary human cell culture. [23] This method makes minor changes, including to the volumes and concentrations used and to the timing and order of certain experimental steps, to allow for the generation of high-quality Hi-C libraries from as few as 1000 cells. [23] Despite the potential of generating usable and high-resolution data with as few as 1000 cells, Diaz et al. still recommend using at least 1 to 2 million cells if feasible, or if not, a minimum of 500,000 cells. [23] Library quality was first assessed on the Illumina MiSeq (2x84 bp paired-end reads) platform and, once it passed quality control criteria (including low PCR duplicates), the library was sequenced on Illumina NextSeq (2x80 bp paired-end). [23] Overall, this technique circumvents the issue of requiring a high cell number input for Hi-C and the high sequencing depth required to obtain high-resolution data, but can only achieve resolutions of up to 5 kb and may not always be reproducible due to the variable nature of the sample sizes used and the data generated from them. [23]
SAFE Hi-C, or simplified, fast, and economically efficient Hi-C, generates sufficient ligated fragments without amplification for high-throughput sequencing. [17] In situ Hi-C data that has been published indicates that amplification (at the PCR step for library preparation) introduces distance-dependent amplification bias, which results in a higher noise to signal ratio against genomic distance. [17] SAFE Hi-C was successfully used to generate an amplification-free, in situ Hi-C ligation library from as low as 250 thousand K562 cells. [17] Ligation fragments are anywhere between 200 and 500 bp long, with an average at about 370 bp. [17] All ligation product libraries were sequenced using the Illumina HiSeq platform (2x150 bp paired-end reads). [17] Although SAFE Hi-C can be used for a cell input as low as 250 thousand, Niu et al. recommend using 1 to 2 million cells. [17] Samples produce enough ligates to be sequenced on one-fourth of a lane. [17] SAFE Hi-C has been demonstrated to increase library complexity due to the removal of PCR duplicates which lower the overall percentage of unique paired reads. [17] Overall, SAFE Hi-C preserves the integrity of chromosomal interactions while also reducing the need to have high sequencing depth and saving overall cost and labor. [17]
Micro-C is a version of Hi-C that includes a micrococcal nuclease (MNase) digestion step to look at interactions between pairs of nucleosomes, thus enabling resolution of sub-genomic TAD structures at the 1 to 100 nucleosome scale. [24] [25] It was first developed for use in yeast and was shown to conserve the structural data obtained from a standard Hi-C but with greater signal-to-noise ratio. [24] [25] When used with human embryonic stem cells and fibroblasts, 2.6 to 4.5 billion uniquely mapped reads were obtained per sample. [24] [25] Hsieh et al. analyzed 2.64 billion reads from mouse embryonic stem cells and demonstrated that there was increased power for detecting short-range interactions. [24] [25] [26]
Hi-C has also been adapted for use with single cells but these techniques require high levels of expertise to perform and are plagued with issues such as low data quality, coverage, and resolution. [3]
The chimeric DNA ligation products generated by Hi-C represent pairwise chromatin interactions or physical 3D contacts within the nucleus, [1] [2] [3] [4] and can be analyzed by a variety of downstream approaches. Briefly, deep sequencing data is used to build unbiased genome-wide chromatin interaction maps. [3] [4] [27] [28] [29] [30] Then several different methods can be employed to analyze these maps to identify chromosomal structural patterns and their biological interpretations. Many of these data analysis approaches also apply to 3C-sequencing or other equivalent data.
Hi-C data produced by deep sequencing is in the form of a traditional FASTQ file, and the reads can be aligned to the genome of interest using sequence alignment software (e.g. Bowtie, [31] bwa, [9] [32] etc.). [27] [28] Because Hi-C ligation products may span hundreds of megabases and may bridge loci on different chromosomes, [3] [4] [27] [28] Hi-C read alignment is often chimeric in the sense that different parts of a read may be aligned to loci distant apart, possibly in different orientations. Long-read aligners (e.g. minimap2 [33] ) often support chimeric alignment and can be directly applied to long-read Hi-C data. Short-read Hi-C alignment is more challenging.
Notably, Hi-C generates ligation junctions of varying sizes, but the exact position of the ligation site is not measured. [3] [4] [27] To circumvent this problem, iterative mapping [27] is used, which avoids having to locate the junction site before splitting the reads into two and mapping them separately to identify the interaction pairs. The idea behind iterative mapping is to map as short a sequence as possible to ensure unique identification of interaction pairs before reaching the junction site. [27] [28] Accordingly, the first 25 bp from the 5’ end of each read are mapped to the genome, and reads that do not map uniquely to a single locus are extended by an additional 5 bp and then re-mapped. [27] This process is repeated until all reads map uniquely, or until the reads have been extended to their full length. [27] [28] Only paired-end reads with each side uniquely mapped to a single genomic locus are kept. [28] All other paired-end reads are discarded.
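The extend-and-remap loop can be sketched as follows. This is a toy illustration rather than any published pipeline: exact string matching on a single strand stands in for a real aligner, and the function names and parameters (`iterative_map`, `map_pair`, `start_len`, `step`) are hypothetical.

```python
def find_all(reference, query):
    """Return every start position of an exact match of query in reference."""
    hits, pos = [], reference.find(query)
    while pos != -1:
        hits.append(pos)
        pos = reference.find(query, pos + 1)
    return hits


def iterative_map(read, reference, start_len=25, step=5):
    """Map the shortest 5' prefix of read that matches reference uniquely.

    Returns (position, prefix_length), or None if the read cannot be placed.
    """
    length = min(start_len, len(read))
    while True:
        hits = find_all(reference, read[:length])
        if len(hits) == 1:
            return hits[0], length
        if not hits or length == len(read):
            return None  # no match at all, or still ambiguous at full length
        length = min(length + step, len(read))  # extend and try again


def map_pair(read1, read2, reference, **kwargs):
    """Keep a read pair only if both sides map uniquely."""
    side1 = iterative_map(read1, reference, **kwargs)
    side2 = iterative_map(read2, reference, **kwargs)
    return (side1, side2) if side1 and side2 else None


# Toy reference with a repeated prefix (AAAACC occurs twice).
reference = "AAAACCGGTTTTAAAACCTT"
# Short lengths are used here only so the toy example stays readable.
print(iterative_map("AAAACCGG", reference, start_len=4, step=2))
print(map_pair("AAAACCGG", "TTTTAAAA", reference, start_len=4, step=2))
```

In the first call the 4-base and 6-base prefixes are ambiguous, and the read becomes unique only once it is extended to 8 bases, which mirrors the 25 bp plus 5 bp scheme described above at a smaller scale.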
Several variations of read mapping techniques are implemented in many bioinformatics pipelines, such as ICE, [34] HiC-Pro, [35] HIPPIE, [36] HiCUP, [37] and TADbit, [38] to map two portions of a paired end read separately, in the case that the two portions match distinct genomic positions, thus addressing the challenge where reads span the ligation junctions. [28]
With increased read length, more recent pipelines (e.g. Juicer [39] and the 4D Nucleome Data Portal [40] ) often align short Hi-C reads with an alignment algorithm capable of chimeric alignment, such as bwa-mem, [41] chromap [42] and dragmap. This procedure runs the aligner only once and is simpler than iterative mapping.
The mapped reads are then each assigned a single genomic alignment location according to their 5’ mapped position in the genome. [27] Each read of a pair is assigned to a single restriction fragment, and should therefore fall in close proximity to a restriction site and less than the maximum molecule length away from it. [27] [28] Reads mapped more than the maximum molecule length away from the closest restriction site are the result of physical breakage of the chromatin or non-canonical nuclease activity. [27] Because these reads also carry information on chromatin interactions, they are not discarded, but appropriate filtering must take place after assigning genomic locations to remove technical noise from the dataset. [27] [28] [29] [30]
Depending on whether the read pair falls within the same or different restriction fragments, different filtering criteria are applied. If the paired reads map to the same restriction fragment, they likely represent un-ligated dangling ends or circularized fragments that are uninformative, and are therefore removed from the dataset. [27] [28] These reads could also represent PCR artifacts, undigested chromatin fragments, or simply, reads with low alignment quality. [8] [28] Whatever their origin, reads mapped to the same fragment are considered “spurious signals” [28] and are typically discarded before downstream processing.
The remaining paired reads mapped to distinct restriction fragments are also filtered to discard identical/redundant PCR products, and this is achieved by removing reads sharing the exact same sequence or 5’ alignment positions. [27] Additional levels of filtering could also be applied to fit the experimental purpose. For example, potential undigested restriction sites could be specifically filtered out, rather than passively identified, by removing reads mapped to the same chromosomal strand with a small distance (user-defined, experience-based) in between. [27]
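The two fragment-level filters described above (removing same-fragment pairs, then removing PCR duplicates by 5’ alignment position) can be sketched as a single pass over the mapped pairs. The `ReadPair` record and `filter_pairs` function are hypothetical names, and fragment identifiers are assumed to be per-chromosome indices assigned in an earlier step.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """One mapped read pair (illustrative record layout)."""
    chrom1: str
    pos1: int      # 5' alignment position of read 1
    strand1: str
    frag1: int     # restriction fragment index on chrom1
    chrom2: str
    pos2: int
    strand2: str
    frag2: int


def filter_pairs(pairs):
    """Drop same-fragment pairs, then drop duplicates by 5' positions."""
    seen = set()
    kept = []
    for p in pairs:
        # Same-fragment pairs: dangling ends, self-circles and similar.
        if p.chrom1 == p.chrom2 and p.frag1 == p.frag2:
            continue
        # Pairs with identical 5' positions and strands: likely PCR duplicates.
        key = (p.chrom1, p.pos1, p.strand1, p.chrom2, p.pos2, p.strand2)
        if key in seen:
            continue
        seen.add(key)
        kept.append(p)
    return kept


pairs = [
    ReadPair("chr1", 100, "+", 0, "chr1", 350, "-", 0),     # same fragment
    ReadPair("chr1", 100, "+", 0, "chr1", 9000, "-", 3),    # valid
    ReadPair("chr1", 100, "+", 0, "chr1", 9000, "-", 3),    # duplicate
    ReadPair("chr1", 5000, "-", 2, "chr2", 700, "+", 0),    # valid (trans)
]
print(len(filter_pairs(pairs)))
```

Further filters, such as the user-defined distance cutoff for undigested sites, would be added as extra conditions in the same loop.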
Based on their midpoint coordinates, Hi-C restriction fragments are binned into fixed genomic intervals, with bin sizes ranging from 40 kb to 1 Mb. [27] The rationale behind this approach is that by reducing the complexity of the data and lowering the number of candidate genome-wide interactions per bin, genomic bins allow for the construction of more robust and less noisy signals, in the form of contact frequencies, at the expense of resolution (though restriction fragment length still remains the ultimate physical limit to Hi-C resolution). [27] [28] Bin to bin interactions are aggregated by simply taking the sum, although more focused and informative methods have also been developed over the years to further enhance the signal. [27] One such method described by Rao et al. aims to push the limit of bin size to smaller and smaller bins, eventually having > 80% of bins covered by 1000 reads each, which significantly increased the resolution of the final analysis results. [14]
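A minimal sketch of the binning step, assuming a single chromosome and contacts already reduced to pairs of fragment midpoints, might look as follows. The function name `bin_contacts` and the example coordinates are hypothetical, and a plain list-of-lists stands in for the sparse or array formats used by real tools.

```python
def bin_contacts(contacts, chrom_length, bin_size):
    """Aggregate fragment-midpoint contacts into a symmetric bin matrix.

    contacts: iterable of (midpoint1, midpoint2) coordinates in bp.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for mid1, mid2 in contacts:
        i, j = mid1 // bin_size, mid2 // bin_size
        matrix[i][j] += 1          # bin-to-bin counts are simply summed
        if i != j:
            matrix[j][i] += 1      # mirror so the matrix stays symmetric
    return matrix


contacts = [(10_000, 950_000), (20_000, 960_000), (30_000, 35_000)]
matrix = bin_contacts(contacts, chrom_length=1_000_000, bin_size=100_000)
print(len(matrix), matrix[0][9], matrix[0][0])
```

Choosing a larger `bin_size` pools more contacts into each cell, which reduces noise at the cost of resolution, as described above.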
Bin-level filtering, just like fragment-level filtering, also takes place to shed experimental artifacts from the obtained data. Bins with high noise and low signals are removed as they typically represent highly repetitive genomic contents around the telomeres and centromeres. [27] This is done by comparing the individual bin sums to the sum of all bins and removing the bottom 1% of bins, or by using the variance as a measure of noise. [27] Low-coverage bins, or bins three standard deviations below the center of a log-normal distribution (which fits the total number of contacts per genomic bin), are removed using the MAD-max (maximum allowed median absolute deviation) filter. [43] [44] After binning, Hi-C data will be stored in a symmetrical matrix format. [27] [28] [29] [30]
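The simplest of the bin-level filters mentioned above, removing the lowest-coverage fraction of bins, can be sketched as below. The function names are hypothetical, the matrix is a plain list-of-lists, and the MAD-max filter is not implemented here. Masking a bin by zeroing its row and column is one possible convention; other tools mark such bins as missing instead.

```python
import math


def lowest_coverage_bins(matrix, fraction=0.01):
    """Return indices of the bins with the lowest total contact counts."""
    coverage = [sum(row) for row in matrix]
    n_remove = math.ceil(len(coverage) * fraction)
    order = sorted(range(len(coverage)), key=lambda i: coverage[i])
    return sorted(order[:n_remove])


def mask_bins(matrix, bad_bins):
    """Zero out the rows and columns of the filtered bins."""
    bad = set(bad_bins)
    return [
        [0 if (i in bad or j in bad) else value for j, value in enumerate(row)]
        for i, row in enumerate(matrix)
    ]


# Toy 4-bin matrix in which bin 2 has almost no signal.
matrix = [
    [10, 5, 0, 5],
    [5, 10, 1, 4],
    [0, 1, 0, 0],
    [5, 4, 0, 10],
]
# A large fraction is used only because the toy matrix has four bins.
bad = lowest_coverage_bins(matrix, fraction=0.25)
print(bad)
print(mask_bins(matrix, bad))
```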
More recently, many approaches have been proposed to predetermine the optimal bin size for different Hi-C experiments. Li et al. in 2018 described deDoc, a method where bin size is selected as the one at which the structural entropy of the Hi-C matrix reaches a stable minimum. [45] QuASAR, on the other hand, offers a bit more quality assessment, and compares replicate scores of the samples (given that replicates are indeed included for the experimental purpose) to find the maximum usable resolution. [46] Some publications [8] [47] also tried to score interaction frequencies at the single-fragment level, where a higher coverage can be achieved even with a lower number of reads. HiCPlus, [48] a tool developed by Zhang et al. in 2018, is able to impute Hi-C matrices similar to the original ones using only 1/16 of the original reads. [48]
Balancing refers to the process of bias correction of the obtained Hi-C data, and can be either explicit or implicit. [27] [28] Explicit balancing methods require the explicit definitions of biases known to be associated with Hi-C reads (or any high-throughput sequencing technique in general) including the read mappability, GC content, as well as individual fragment length. [27] [28] A correction factor is first computed for each of the considered biases, followed by each of their combination, and then applied to the read counts per genomic bin. [27] [28]
However, some biases can come from an unknown origin, in which case an implicit balancing approach is used instead. Implicit balancing relies on the assumption that each genomic locus should have “equal visibility”, which suggests that the interaction signal at each genomic locus in the Hi-C data should add up to the same total amount. [28] One approach called iterative correction uses the Sinkhorn–Knopp balancing algorithm [49] and attempts to balance the symmetrical matrix using the aforementioned assumption (by equalizing the sum of each and every row and column in the matrix). [27] [28] [49] The algorithm iteratively alternates between two steps: 1) dividing each row by its mean, and 2) dividing each column by its mean; this procedure is guaranteed to converge and leaves no obviously high rows or columns in the interaction matrix. [27] [49] Other computational methods also exist to normalize the biases inherent to Hi-C data, including sequential component normalization (SCN), [50] the Knight-Ruiz matrix-balancing approach, [14] [51] and iterative correction and eigenvector decomposition (ICE) normalization. [34] In the end, both the explicit and the implicit bias correction methods yield comparable results. [27]
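The "equal visibility" idea can be sketched in a few lines. Note that this is a simplified symmetric variant, not the alternating row/column scheme described above: each iteration divides entry (i, j) by the product of the relative coverages of bins i and j, which keeps the matrix symmetric. The function name, defaults and toy data are assumptions made for illustration.

```python
def iterative_correction(matrix, n_iter=200, tol=1e-10):
    """Balance a symmetric contact matrix so all non-empty rows sum equally.

    Returns (balanced_matrix, bias), where bias[i] is the accumulated
    per-bin correction factor.
    """
    n = len(matrix)
    m = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(n_iter):
        sums = [sum(row) for row in m]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break  # empty matrix: nothing to balance
        mean = sum(nonzero) / len(nonzero)
        # Relative coverage of each bin; empty bins are left untouched.
        factors = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                m[i][j] /= factors[i] * factors[j]
        bias = [b * f for b, f in zip(bias, factors)]
        if max(abs(f - 1.0) for f in factors) < tol:
            break  # all rows already have (nearly) the same sum
    return m, bias


# Toy example: a symmetric signal distorted by per-bin multiplicative biases.
signal = [[4, 2, 1], [2, 4, 2], [1, 2, 4]]
true_bias = [1.0, 2.0, 0.5]
observed = [
    [signal[i][j] * true_bias[i] * true_bias[j] for j in range(3)]
    for i in range(3)
]
balanced, bias = iterative_correction(observed)
print([round(sum(row), 6) for row in balanced])
```

After balancing, every row (and, by symmetry, every column) has the same sum, which is the property the implicit methods aim for. The recovered matrix need not equal the undistorted signal, because that signal did not itself have equal row sums.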
With a binned, genome-wide interaction matrix, common interaction patterns observed in mammalian genomes can be identified and interpreted biologically, while rarer patterns, such as circular chromosomes and centromere clustering, may require additional, specially tailored methods to be identified.
Cis/trans interactions are one of the two strongest interaction patterns observed in Hi-C maps. [27] They are not locus-specific, and thus are considered a genome-level pattern. [27] Typically, a higher interaction frequency is observed, on average, for pairs of loci residing on the same chromosome (in cis) than for pairs of loci residing on different chromosomes (in trans). [27] In Hi-C interaction matrices, cis/trans interactions appear as square blocks centered along the diagonal, with each block corresponding to an individual chromosome. [27] Because this pattern is relatively consistent across different species and cell types, it can be used to assess the quality of the data. A noisier experiment, due to random background ligation or any unknown factor, will result in a lower cis to trans interaction ratio (as the noise is expected to affect both cis and trans interactions to a similar extent), and high-quality experiments typically have a cis/trans interaction ratio between 40 and 60 for the human genome. [27]
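As a rough illustration of this quality check, the sketch below tallies cis and trans contacts from a binned, symmetric matrix. The function name `cis_trans_counts` and the per-bin chromosome labels are assumptions made for the example. How the two totals are combined into a single figure (a cis/trans ratio or a percentage of cis contacts) differs between reports, so the function returns both totals and leaves that choice to the reader.

```python
import numpy as np

def cis_trans_counts(matrix, chrom_of_bin):
    """Return (cis, trans) contact totals for a symmetric binned matrix.

    chrom_of_bin gives the chromosome label of each bin (hypothetical input).
    """
    m = np.asarray(matrix, dtype=float)
    chrom = np.asarray(chrom_of_bin)
    same_chrom = chrom[:, None] == chrom[None, :]
    # The matrix is symmetric, so count each bin pair once
    # (upper triangle, including within-bin contacts on the diagonal).
    upper = np.triu(np.ones(m.shape, dtype=bool))
    cis = m[same_chrom & upper].sum()
    trans = m[~same_chrom & upper].sum()
    return cis, trans
```

With the two totals in hand, `100 * cis / (cis + trans)` gives the percentage of cis contacts and `cis / trans` gives the plain ratio.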
Distance-dependent interaction frequency refers to the decay of interaction frequencies with genomic distance on a genome level, and represents the second of the two strongest Hi-C interaction patterns. [27] Because the interaction frequency between cis-interacting loci decreases as the distance between them grows, a gradual decrease of interaction frequency can be observed moving away from the diagonal in the interaction matrix. [27]
Various polymer models [52] [53] exist to statistically characterize the properties of loci pairs separated by a given distance, but discrete binning and fitting continuous functions are two common ways to analyze the distance-dependent interaction frequencies between datapoints. [27] First, interaction frequencies can be binned based on their genomic distance; then a continuous function is fitted to the data using the average of each bin. [27] The resulting decay function is plotted on a log-log plot so that a straight line can be used to represent the power-law decays predicted by polymer models. [52] [53] However, a simple polymer model is often not sufficient to fully represent the distance-dependent interaction frequencies, and more complicated decay functions may result. This can affect the reproducibility of the data, because the Hi-C matrix also contains locus-specific rather than genome-wide patterns, which polymer models do not take into consideration. [27] [52] [53]
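The binning and fitting steps can be sketched as follows. This is a simplified illustration in which every diagonal of a single-chromosome matrix is treated as one distance bin and a straight line is fitted in log-log space. The function names and the `bin_size` parameter are assumptions made for the example, and published analyses often use logarithmically spaced distance bins instead.

```python
import numpy as np

def distance_decay(matrix, bin_size):
    """Average contact frequency at each genomic separation.

    Each diagonal k of the cis matrix holds all bin pairs separated by
    k bins, so its mean is the average frequency at distance k * bin_size.
    """
    m = np.asarray(matrix, dtype=float)
    offsets = np.arange(1, m.shape[0])
    mean_contacts = np.array([np.diagonal(m, k).mean() for k in offsets])
    return offsets * bin_size, mean_contacts

def fit_power_law(distances, mean_contacts):
    """Fit log10(contacts) = slope * log10(distance) + intercept."""
    keep = mean_contacts > 0  # the logarithm is undefined for empty bins
    slope, intercept = np.polyfit(
        np.log10(distances[keep]), np.log10(mean_contacts[keep]), 1
    )
    return slope, intercept
```

The fitted slope is the power-law exponent; a matrix whose contacts fall off exactly as the inverse of the distance yields a slope of -1.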
The strongest locus-specific pattern found in Hi-C maps is chromatin compartments, [1] which take the shape of a plaid or “checker-board”-like pattern on the interaction matrix, with alternating blocks that range between 1 and 10 Mb in size in the human genome, which makes them easy to extract even in experiments with very low sampling. [27] [28] [30] This pattern can be found at both high and low frequencies. Because chromosomes consist of two types of genomic regions that alternate along the length of individual chromosomes, the interaction frequencies between two regions of the same type and the interaction frequencies between two regions of different types can be quite different. [27] [28]
The definition of the active (A) and inactive (B) chromatin compartments is based on principal component analysis, first established by Lieberman-Aiden et al. in 2009. [1] [27] [28] [30] Their approach calculated the correlation matrix of the observed vs. expected signal ratio (with the expected signal obtained from a distance-normalized contact matrix), and used the sign of the first eigenvector to denote positive and negative parts of the resulting plot as A and B compartments, respectively. [1] [27] [28] [30] Many genomic studies have indicated that chromatin compartments are correlated with chromatin states, such as gene density, DNA accessibility, GC content, replication timing, and histone marks. [1] [27] [28] [30] Therefore, type A compartments are more specifically defined to represent the gene-dense regions of euchromatin, while type B compartments represent heterochromatic regions with less gene activity. [27] [28] [30] Overall, chromatin compartments offer insights into the general organization principles of the genome of interest.
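The observed/expected, correlation, and eigenvector steps can be sketched for a single chromosome as below. This is a minimal illustration under stated assumptions, not a reproduction of the 2009 pipeline: the function name is hypothetical, the expected signal is taken as the mean of each diagonal, and every bin is assumed to have coverage (a row with constant observed/expected values would produce NaN correlations). The sign of an eigenvector is arbitrary, so deciding which sign corresponds to A is left to the caller, for instance by comparison with a gene density or GC content track.

```python
import numpy as np

def compartment_eigenvector(cis_matrix):
    """First eigenvector of the correlation matrix of observed/expected.

    Hypothetical helper: expected contacts are the mean of each diagonal.
    """
    m = np.asarray(cis_matrix, dtype=float)
    n = m.shape[0]
    # Expected signal: average contact frequency at each genomic separation.
    expected = np.zeros_like(m)
    for k in range(n):
        mean_k = np.diagonal(m, k).mean()
        idx = np.arange(n - k)
        expected[idx, idx + k] = mean_k
        expected[idx + k, idx] = mean_k
    with np.errstate(divide="ignore", invalid="ignore"):
        obs_over_exp = np.where(expected > 0, m / expected, 0.0)
    # Correlate the interaction profiles (rows) of all pairs of bins.
    corr = np.corrcoef(obs_over_exp)
    # eigh returns eigenvalues in ascending order, so the last column
    # is the eigenvector with the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(corr)
    return eigvecs[:, -1]
```

Bins with positive and negative entries in the returned vector form the two groups; which group is labelled A is an external decision, as noted above.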
A growing number of bioinformatics tools capable of performing compartment calling have been developed over the past decade, including HOMER, [54] HiTC R, [35] and CscoreTool. [55] Although each has its own differences from, and optimizations of, the original 2009 approach, their base protocols still rely on principal component analysis.
TADs are sub-Mb structures that may harbor gene-regulatory features, such as local promoter-enhancer interactions. [27] More generally, TADs are considered an emergent property of underlying biological mechanisms; under this view, TADs arise from loop extrusion, compartmentalization, or other dynamic genomic processes rather than being a static structural feature of the genome. [56] Thus, TADs represent regulatory microenvironments and usually show up on a Hi-C map as blocks of highly self-interacting regions in which interaction frequencies within the region are significantly higher than interaction frequencies between two adjacent regions. [27] [28] [30] In Hi-C interaction matrices, TADs are square blocks of elevated interaction frequencies centred along the diagonal. [27] However, this is an oversimplified description, and identifying the actual pattern requires much more statistical processing and estimation.
One approach to identify TADs was described by Dixon et al., [9] who first calculated (within some genomic range) the difference between the average upstream interactions and the average downstream interactions of each bin in the matrix. [9] This difference was then transformed into a chi-squared-like statistic called the directionality index, and a hidden Markov model was applied to this index so that any sharp change in its value defines the boundaries of TADs. [9] [27] Alternatively, one could simply take the ratio between average upstream and downstream interactions to define TAD boundaries, as did Naumova et al. [57]
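A minimal sketch of the directionality index is given below. With A the upstream contacts of a bin, B the downstream contacts, and E = (A + B) / 2, the index is sign(B - A) * ((A - E)^2 / E + (B - E)^2 / E). The function name and the `window` parameter (counted in bins) are assumptions made for the example. Contacts are summed over the window, windows are truncated at the chromosome ends, and the hidden Markov model step is not included.

```python
import numpy as np

def directionality_index(matrix, window):
    """Directionality index per bin for a single-chromosome matrix.

    window is the number of bins considered on each side (hypothetical
    parameter); it is truncated at the chromosome ends.
    """
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    di = np.zeros(n)
    for i in range(n):
        upstream = m[i, max(0, i - window):i].sum()      # A
        downstream = m[i, i + 1:i + 1 + window].sum()    # B
        expected = (upstream + downstream) / 2.0         # E
        if expected == 0 or upstream == downstream:
            continue  # no contacts or no bias: index stays 0
        sign = np.sign(downstream - upstream)
        di[i] = sign * (
            (upstream - expected) ** 2 + (downstream - expected) ** 2
        ) / expected
    return di
```

A switch from strongly negative to strongly positive values between neighbouring bins marks a candidate boundary: the bin before it interacts mostly upstream, and the bin after it mostly downstream.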
Another approach is to calculate the average interaction frequencies crossing over each bin, again within some predetermined genomic range. [27] [28] [58] The resulting value is referred to as the insulation score and can be thought of as the average of a square sliding along the diagonal of the matrix (Crane et al.). [58] This value is expected to be lower at TAD boundaries; thus, one can use standard statistical techniques to find local minima (boundaries), and define regions between consecutive boundaries to be TADs. [27] [28] [58]
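The sliding-square idea can be sketched as follows. This is an illustrative simplification rather than the published procedure: the function names and the `window` parameter (counted in bins) are assumptions, bins too close to the chromosome ends are left undefined, and boundaries are taken as plain local minima without any smoothing or strength threshold.

```python
import numpy as np

def insulation_score(matrix, window):
    """Mean contact frequency in a square sliding along the diagonal.

    For bin i the square spans the `window` bins upstream of i (rows)
    and the `window` bins downstream of i (columns).
    """
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    score = np.full(n, np.nan)  # undefined near the chromosome ends
    for i in range(window, n - window):
        score[i] = m[i - window:i, i + 1:i + 1 + window].mean()
    return score

def find_boundaries(score):
    """Indices of local minima of the insulation score."""
    minima = []
    for i in range(1, len(score) - 1):
        if np.isnan(score[i - 1:i + 2]).any():
            continue
        # Strict on the left so a flat valley is reported only once.
        if score[i] < score[i - 1] and score[i] <= score[i + 1]:
            minima.append(i)
    return minima
```

Regions between consecutive indices returned by `find_boundaries` would then be treated as candidate TADs.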
However, as is increasingly recognized today, TADs represent a hierarchical series of structures that cannot be fully characterized by the one-dimensional scores given by the previous methods. [28] The increased resolution available in newer datasets makes it possible to address TADs explicitly with multiscale analysis approaches. As first introduced by Armatus, [59] resolution-specific domains can be identified and a consensus set of domains conserved across resolutions can be calculated, [28] [59] which transforms the problem of TAD calling into the optimization of scoring functions based on local interaction densities. [59] Variations of this approach with different objective functions, such as Lavaburst, [60] MrTADFinder, [61] 3DNetMod, [62] and Matryoshka, [63] have also been developed to achieve better computing performance on higher-resolution datasets.
Biologically, regulatory interactions usually occur at a much smaller scale than TADs, and two genomic elements can activate or inhibit the expression of a gene within as small a distance as 1 kb. [27] Therefore, point interactions are important in interpreting Hi-C maps, and are expected to appear as local enrichments in contact probability. [27] [28] However, current methodologies for the identification of point interactions are all implicit in nature, in that they do not specify what a point interaction should look like. [27] [28] Instead, point interactions are identified as outliers with higher interaction frequencies than expected within the Hi-C matrix, given that the background model consists only of the strongest signals such as the distance-decay functions. [27] [28] The background model can be estimated and constructed using both local signal distributions and global approaches (i.e. chromosome-wide/genome-wide). [28] Many of the aforementioned bioinformatics packages incorporate algorithms to identify point interactions. In short, the significance of each individual pairwise interaction is calculated, and significantly high outliers are corrected for multiple testing before they are recognized as truly informative point interactions. [27] It is helpful to complement identified point interactions with additional evidence, such as analysis of enrichment scores and biological replicates, to indicate that these interactions are indeed of biological significance. [27]
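The outlier-calling logic can be illustrated with the sketch below. It makes several assumptions that are not taken from the text: the background model is a global one consisting only of the distance-decay expectation (the mean of each diagonal), counts are modelled as Poisson-distributed, and multiple testing is handled with the Benjamini-Hochberg procedure. All function names and parameters are hypothetical, and real tools use more elaborate background models.

```python
import math
import numpy as np

def poisson_upper_tail(count, expected):
    """P(X >= count) for X ~ Poisson(expected), as 1 - P(X < count)."""
    if expected <= 0 or count <= 0:
        return 1.0  # nothing to test; treat as not significant
    below = sum(
        math.exp(-expected + x * math.log(expected) - math.lgamma(x + 1))
        for x in range(int(count))
    )
    return min(1.0, max(0.0, 1.0 - below))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values)."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value.
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(n)
    q[order] = np.minimum(ranked, 1.0)
    return q

def call_point_interactions(matrix, min_offset=2, fdr=0.05):
    """Bin pairs enriched over the distance-decay background.

    min_offset skips the diagonals closest to the main diagonal;
    both parameters are illustrative defaults.
    """
    m = np.asarray(matrix, dtype=float)
    n = m.shape[0]
    pairs, pvals = [], []
    for k in range(min_offset, n):
        diag = np.diagonal(m, k)
        expected = diag.mean()  # background at this genomic separation
        for i, count in enumerate(diag):
            pairs.append((i, i + k))
            pvals.append(poisson_upper_tail(count, expected))
    qvals = benjamini_hochberg(pvals)
    return [pair for pair, q in zip(pairs, qvals) if q < fdr]
```

Because the expectation here is estimated from the same diagonal that contains the candidate, a strong interaction slightly inflates its own background, which makes this sketch conservative.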
Hi-C can reveal chromatin conformation changes during cell division. In interphase, chromatin is generally loosely packed and dynamic, so that transcription regulation and other regulatory activities can take place. [64] Upon entering mitosis and cell division, chromatin becomes compactly folded into dense cylindrical chromosomes. [64] Within the past five years, the development of single-cell Hi-C has enabled the depiction of the entire 3D structural landscape of chromatin and chromosomes throughout the cell cycle, and many studies have found that the identified genomic domains remain unchanged during interphase and are erased by silencing mechanisms when the cell enters mitosis. [65] [66] When mitotic division is completed and the cell re-enters interphase, chromatin 3D structures are observed to be re-established, and transcription regulation is restored. [65]
It has been suspected that the differentiation of embryonic stem cells (ESCs) and induced pluripotent stem cells (iPSCs) into various mature cell lineages is accompanied by global changes in chromosomal structures and consequently interaction dynamics to allow for the regulation of transcriptional activation/silencing. [3] Standard Hi-C can be used to investigate this research question.
In 2015, Dixon et al. [11] applied standard Hi-C to capture global 3D dynamics in human ESCs during their differentiation into multiple cell lineages. Due to the ability of Hi-C to depict dynamic interactions in differentiation-related TADs, the researchers discovered increases in the number of DHS sites, CTCF binding ability, active histone modifications, and target gene expression within these TADs of interest, and found significant participation of major pluripotency factors such as OCT4, NANOG, and SOX2 in the interaction network during somatic cell reprogramming. [11] Since then, Hi-C has been recognized as one of the standard methods to probe for transcriptional regulatory activities, and has confirmed that chromosome architecture is closely related to cell fate. [11] [67]
Mammalian somatic growth and development starts with the fertilization of an oocyte by a sperm, followed by the zygote stage, the 2-cell, 4-cell, and 8-cell stages, the blastocyst stage, and finally the embryo stage. [68] Hi-C has made it possible to explore the comprehensive genomic architecture during growth and development, as both sis-Hi-C [69] and in situ Hi-C [70] have reported that TADs and genomic A and B compartments are not obviously present and appear to be less well-structured in oocyte cells. [69] [70] These structural features of the chromatin are established only gradually after fertilization, progressing from weaker signals to cleaner and more frequent interactions as developmental stages progress. [69] [70]
As data on 3D genome structures have become more prevalent in recent years, Hi-C has begun to be used as a means to track evolutionary structural features and changes. Genomic single nucleotide polymorphisms (SNPs) and TADs are typically conserved across species, [71] along with the CTCF factor in chromatin domain evolution. [72] Other factors, however, have been revealed by Hi-C techniques to undergo structural evolution in 3D architecture. These include codon usage frequency similarity (CUFS), [73] paralog gene co-regulation, [74] and spatially co-evolving orthologous modules (SCOMs). [75] For large-scale domain evolution, chromosomal translocations, syntenic regions, and genomic rearrangement regions were all relatively conserved. [2] [67] [72] [76] [77] These findings imply that Hi-C technologies are capable of providing an alternative point of view on the eukaryotic tree of life. [3]
Several studies have used Hi-C to describe chromatin architecture in different cancers and its impact on disease pathogenesis. Kloetgen et al. used in situ Hi-C to study T cell acute lymphoblastic leukemia (T-ALL) and found a TAD fusion event that removed a CTCF insulation site, allowing the promoter of the oncogene MYC to interact directly with a distal super enhancer. [78] Fang et al. have also shown, using in situ Hi-C, that T-ALL-specific gains or losses of chromatin insulation alter the strength of the TAD architecture of the genome. [79] Low-C has been used to map the chromatin structure of primary B cells of a diffuse large B-cell lymphoma patient, revealing high chromosome structural variation between the patient's B cells and healthy B cells. [23] Overall, the application of Hi-C and its variants in cancer research provides unique insight into the molecular underpinnings of the driving factors of cell abnormality. [23] [78] [79] It can help explain biological phenomena (such as high MYC expression in T-ALL) and aid drug development to target mechanisms unique to cancerous cells. [23] [78] [79]