End-sequence profiling (ESP), sometimes called paired-end mapping (PEM), is a method based on sequence-tagged connectors. Originally developed to facilitate de novo genome sequencing, it is used to identify copy number changes and structural aberrations such as inversions and translocations at high resolution.
Briefly, the target genomic DNA is isolated and partially digested with restriction enzymes into large fragments. Following size fractionation, the fragments are cloned into plasmids to construct artificial chromosomes such as bacterial artificial chromosomes (BACs), whose insert ends are then sequenced and compared to the reference genome. Differences between the constructed chromosomes and the reference genome, including orientation and length variations, indicate copy number and structural aberrations.
Before structural aberrations and copy number variation (CNV) in the target genome are analyzed with ESP, the genome is usually amplified and preserved by constructing artificial chromosomes. The classic approach is the bacterial artificial chromosome (BAC): the target DNA is randomly digested and inserted into plasmids, which are then transformed into and cloned in bacteria. [1] The inserted fragments are 150–350 kb in size. [2] Another commonly used vector is the fosmid. BACs and fosmids differ in the size of the DNA insert: fosmids hold only about 40 kb, [3] which allows more accurate breakpoint determination.
End-sequence profiling (ESP) can be used to detect structural variations such as insertions, deletions, and chromosomal rearrangements. Compared with other methods that look at chromosomal abnormalities, ESP is particularly useful for identifying copy-neutral abnormalities such as inversions and translocations, which are not apparent when looking at copy number variation alone. [4] [5] Both ends of each inserted fragment in the BAC library are sequenced on a sequencing platform. Variations are then detected by mapping the sequenced reads onto a reference genome.
Inversions and translocations are relatively easy to detect from an invalid pair of sequenced ends. For instance, a translocation is indicated when the paired ends map to different chromosomes in the reference genome. [4] [5] An inversion is indicated by an abnormal orientation of the reads, where both ends of the insert map to the plus strand or both map to the minus strand.
In the case of an insertion or a deletion, the paired ends map with an orientation consistent with the reference genome, but the reads are discordant in apparent size. The apparent size (l) is the distance between the BAC end sequences as mapped in the reference genome. If a BAC carries an insert of a given length, a concordant mapping shows a fragment of the same length in the reference genome. If the paired ends map closer together than expected, an insertion is suspected in the sampled DNA. An apparent size of l < μ − 3σ can be used as a cut-off to detect an insertion, where μ is the mean insert length and σ is its standard deviation. [5] [6] In the case of a deletion, the paired ends map further apart in the reference genome than expected (l > μ + 3σ). [6]
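The detection rules described above can be summarized in a minimal Python sketch. It is an illustration only, not the implementation of any published tool: the `EndPair` record, the `classify_pair` function, and the default of three standard deviations are hypothetical names and choices made for this example. It assumes that each clone end has already been mapped to a chromosome, a position, and a strand, and it takes μ + 3σ as the deletion cut-off. More complex signatures (for example everted pairs or repeats) are not handled.

```python
from dataclasses import dataclass


@dataclass
class EndPair:
    """Mapped positions of the two sequenced ends of one clone (hypothetical record)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair: EndPair, mu: float, sigma: float, k: float = 3.0) -> str:
    """Classify one end pair using the rules described in the text.

    mu and sigma are the mean and standard deviation of the library's
    insert length; k is the number of standard deviations used as cut-off.
    """
    # Ends on different chromosomes suggest a translocation.
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    # A valid pair maps to opposite strands; same-strand ends suggest an inversion.
    if pair.strand1 == pair.strand2:
        return "inversion"
    # Apparent size: distance between the mapped ends in the reference genome.
    apparent_size = abs(pair.pos2 - pair.pos1)
    if apparent_size < mu - k * sigma:
        return "insertion"  # ends map closer together than expected
    if apparent_size > mu + k * sigma:
        return "deletion"   # ends map further apart than expected
    return "concordant"


if __name__ == "__main__":
    # Assumed library: mean insert of 150 kb with a 10 kb standard deviation.
    example = EndPair("chr1", 1_000, "+", "chr1", 201_000, "-")
    print(classify_pair(example, mu=150_000, sigma=10_000))
```

In practice, a single discordant clone is weak evidence; calls are usually supported by clusters of clones that agree on the same rearrangement.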
In some cases, discordant reads can also indicate a CNV, for example in repeated sequences. For larger CNVs, the density of mapped reads varies with the copy number: an increase in copy number is reflected by more reads mapping to the same region of the reference genome.
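The read-density idea can be illustrated with a small Python sketch. This is a simplified example under stated assumptions rather than a published method: the function name `copy_number_estimates`, the fixed-size bins, the use of the median bin count as the baseline, and the assumed baseline of two copies are all choices made for illustration. Real analyses also correct for factors such as mappability and GC content, which are ignored here.

```python
from collections import Counter
from statistics import median


def copy_number_estimates(positions, bin_size, n_bins, baseline_copies=2):
    """Estimate a relative copy number per bin from mapped end positions.

    positions: mapped coordinates of end reads on one chromosome.
    The median bin count is assumed to represent `baseline_copies` copies.
    """
    # Count how many reads fall into each fixed-size bin.
    counts = Counter(pos // bin_size for pos in positions)
    per_bin = [counts.get(i, 0) for i in range(n_bins)]
    baseline = median(per_bin)
    if baseline == 0:
        raise ValueError("too few reads to define a baseline read density")
    # Read density scales with copy number relative to the baseline.
    return [baseline_copies * count / baseline for count in per_bin]


if __name__ == "__main__":
    # Toy data: five bins of 100 bp holding 4, 4, 8, 4, and 2 reads.
    reads = (
        [0, 10, 20, 30]
        + [100, 110, 120, 130]
        + list(range(200, 280, 10))
        + [300, 310, 320, 330]
        + [400, 410]
    )
    print(copy_number_estimates(reads, bin_size=100, n_bins=5))
```

In this toy data set the third bin holds twice the baseline number of reads and the fifth holds half, so they stand out as a candidate gain and loss respectively.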
ESP was first developed and published in 2003 by Dr. Collins and his colleagues at the University of California, San Francisco. Their study revealed the chromosome rearrangements and CNV of MCF7 human cancer cells at a resolution of 150 kb, which was much finer than that of both CGH and spectral karyotyping at the time. [5] In 2007, Dr. Snyder and his group improved the resolution of ESP to 3 kb by sequencing both ends of 3-kb DNA fragments without BAC construction. Their approach can identify deletions, inversions, and insertions with an average breakpoint resolution of 644 bp, which is close to the resolution of polymerase chain reaction (PCR). [7]
Various bioinformatics tools can be used to analyze end-sequence profiling data. Common ones include BreakDancer, PEMer, VariationHunter, CommonLAW, GASV, and Spanner. [8]
ESP can be used to map structural variation at high resolution in diseased tissue. The technique is mainly used on tumor samples from different cancer types. Accurate identification of copy-neutral chromosomal abnormalities is particularly important because translocations can lead to the fusion proteins, chimeric proteins, or misregulated proteins seen in tumors. The technique can also be used in evolutionary studies to identify large structural variations between different populations. [9] Similar methods are being developed for various applications. For example, a barcoded Illumina paired-end sequencing (BIPES) approach was used to assess microbial diversity by sequencing the 16S V6 tag. [10]
The resolution of structural variation detection by ESP has been increased to a level similar to that of PCR, and it can be further improved by selecting more evenly sized DNA fragments. ESP can be applied with or without constructing artificial chromosomes. With BACs, precious samples can be immortalized and preserved, which is particularly important for small quantities of sample that are planned for extensive analyses. Furthermore, BACs carrying rearranged DNA fragments can be directly transfected in vitro or in vivo to analyze the function of these rearrangements. However, BAC construction is still expensive and labor-intensive, so researchers should choose carefully which strategy a particular project needs. Because ESP only looks at short paired-end sequences, it has the advantage of providing useful genome-wide information without the need for large-scale sequencing. Approximately 100–200 tumors can be profiled at a resolution better than 150 kb with a sequencing effort comparable to that of sequencing an entire genome.
In genetics, shotgun sequencing is a method used for sequencing random DNA strands. It is named by analogy with the rapidly expanding, quasi-random shot grouping of a shotgun.
A bacterial artificial chromosome (BAC) is a DNA construct, based on a functional fertility plasmid, used for transforming and cloning in bacteria, usually E. coli. F-plasmids play a crucial role because they contain partition genes that promote the even distribution of plasmids after bacterial cell division. The bacterial artificial chromosome's usual insert size is 150–350 kbp. A similar cloning vector called a PAC has also been produced from the DNA of P1 bacteriophage.
In genetics, a deletion is a mutation in which a part of a chromosome or a sequence of DNA is left out during DNA replication. Any number of nucleotides can be deleted, from a single base to an entire piece of chromosome. Some chromosomes have fragile spots where breaks occur, which result in the deletion of a part of the chromosome. The breaks can be induced by heat, viruses, radiation, or chemical reactions. When a chromosome breaks, if a part of it is deleted or lost, the missing piece of chromosome is referred to as a deletion or a deficiency.
A contig is a set of overlapping DNA segments that together represent a consensus region of DNA. In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads); in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly. Contigs can thus refer both to overlapping DNA sequences and to overlapping physical segments (fragments) contained in clones depending on the context.
A cloning vector is a small piece of DNA that can be stably maintained in an organism, and into which a foreign DNA fragment can be inserted for cloning purposes. The cloning vector may be DNA taken from a virus, the cell of a higher organism, or it may be the plasmid of a bacterium. The vector contains features that allow for the convenient insertion of a DNA fragment into the vector or its removal from the vector, for example through the presence of restriction sites. The vector and the foreign DNA may be treated with a restriction enzyme that cuts the DNA, and DNA fragments thus generated contain either blunt ends or overhangs known as sticky ends, and vector DNA and foreign DNA with compatible ends can then be joined by molecular ligation. After a DNA fragment has been cloned into a cloning vector, it may be further subcloned into another vector designed for more specific use.
Yeast artificial chromosomes (YACs) are genetically engineered chromosomes derived from the DNA of the yeast Saccharomyces cerevisiae, which is then ligated into a bacterial plasmid. By inserting large fragments of DNA, from 100–1000 kb, the inserted sequences can be cloned and physically mapped using a process called chromosome walking. This is the process that was initially used for the Human Genome Project; however, due to stability issues, YACs were abandoned in favor of bacterial artificial chromosomes.
Comparative genomic hybridization (CGH) is a molecular cytogenetic method for analysing copy number variations (CNVs) relative to ploidy level in the DNA of a test sample compared to a reference sample, without the need for culturing cells. The aim of this technique is to quickly and efficiently compare two genomic DNA samples arising from two sources, which are most often closely related, because it is suspected that they contain differences in terms of either gains or losses of either whole chromosomes or subchromosomal regions (a portion of a whole chromosome). This technique was originally developed for the evaluation of the differences between the chromosomal complements of solid tumor and normal tissue, and has an improved resolution of 5–10 megabases compared to the more traditional cytogenetic analysis techniques of Giemsa banding and fluorescence in situ hybridization (FISH), which are limited by the resolution of the microscope utilized.
Copy number variation (CNV) is a phenomenon in which sections of the genome are repeated and the number of repeats in the genome varies between individuals. Copy number variation is a type of structural variation: specifically, it is a type of duplication or deletion event that affects a considerable number of base pairs. Approximately two-thirds of the entire human genome may be composed of repeats and 4.8–9.5% of the human genome can be classified as copy number variations. In mammals, copy number variations play an important role in generating necessary variation in the population as well as disease phenotype.
A genomic library is a collection of overlapping DNA fragments that together make up the total genomic DNA of a single organism. The DNA is stored in a population of identical vectors, each containing a different insert of DNA. In order to construct a genomic library, the organism's DNA is extracted from cells and then digested with a restriction enzyme to cut the DNA into fragments of a specific size. The fragments are then inserted into the vector using DNA ligase. Next, the vector DNA can be taken up by a host organism - commonly a population of Escherichia coli or yeast - with each cell containing only one vector molecule. Using a host cell to carry the vector allows for easy amplification and retrieval of specific clones from the library for analysis.
Fosmids are similar to cosmids but are based on the bacterial F-plasmid. The cloning vector is limited, as a host can only contain one fosmid molecule. Fosmids can hold DNA inserts of up to 40 kb in size; often the source of the insert is random genomic DNA. A fosmid library is prepared by extracting the genomic DNA from the target organism and cloning it into the fosmid vector. The ligation mix is then packaged into phage particles and the DNA is transfected into the bacterial host. Bacterial clones propagate the fosmid library. The low copy number offers higher stability than vectors with relatively higher copy numbers, including cosmids. Fosmids may be useful for constructing stable libraries from complex genomes. Fosmids have high structural stability and have been found to maintain human DNA effectively even after 100 generations of bacterial growth. Fosmid clones were used to help assess the accuracy of the Public Human Genome Sequence.
In the fields of bioinformatics and computational biology, genome survey sequences (GSS) are nucleotide sequences similar to expressed sequence tags (ESTs), the only difference being that most of them are genomic in origin rather than derived from mRNA.
Paired-end tags (PET) are the short sequences at the 5' and 3' ends of a DNA fragment which are unique enough that they (theoretically) exist together only once in a genome, therefore making the sequence of the DNA in between them available upon search or upon further sequencing. Paired-end tags (PET) exist in PET libraries with the intervening DNA absent, that is, a PET "represents" a larger fragment of genomic or cDNA by consisting of a short 5' linker sequence, a short 5' sequence tag, a short 3' sequence tag, and a short 3' linker sequence. It was shown conceptually that 13 base pairs are sufficient to map tags uniquely. However, longer sequences are more practical for mapping reads uniquely. The endonucleases used to produce PETs give longer tags but sequences of 50–100 base pairs would be optimal for both mapping and cost efficiency. After extracting the PETs from many DNA fragments, they are linked (concatenated) together for efficient sequencing. On average, 20–30 tags could be sequenced with the Sanger method, which has a longer read length. Since the tag sequences are short, individual PETs are well suited for next-generation sequencing that has short read lengths and higher throughput. The main advantages of PET sequencing are its reduced cost by sequencing only short fragments, detection of structural variants in the genome, and increased specificity when aligning back to the genome compared to single tags, which involves only one end of the DNA fragment.
Genomic structural variation is the variation in structure of an organism's chromosome. It consists of many kinds of variation in the genome of one species, and usually includes microscopic and submicroscopic types, such as deletions, duplications, copy-number variants, insertions, inversions and translocations. Originally, a structural variation was defined as affecting a sequence length of about 1 kb to 3 Mb, which is larger than SNPs and smaller than chromosome abnormalities. However, the operational range of structural variants has widened to include events larger than 50 bp. The definition of structural variation does not imply anything about frequency or phenotypical effects. Many structural variants are associated with genetic diseases, but many are not. Recent research indicates that SVs are more difficult to detect than SNPs. Approximately 13% of the human genome is defined as structurally variant in the normal population, and there are at least 240 genes that exist as homozygous deletion polymorphisms in human populations, suggesting these genes are dispensable in humans. Rapidly accumulating evidence indicates that structural variations can comprise millions of nucleotides of heterogeneity within every genome, and are likely to make an important contribution to human diversity and disease susceptibility.
1q21.1 copy number variations (CNVs) are rare aberrations of human chromosome 1.
Jumping libraries or junction-fragment libraries are collections of genomic DNA fragments generated by chromosome jumping. These libraries allow the analysis of large areas of the genome and overcome distance limitations in common cloning techniques. A jumping library clone is composed of two stretches of DNA that are usually located many kilobases away from each other. The stretch of DNA located between these two "ends" is deleted by a series of biochemical manipulations carried out at the start of this cloning technique.
Single-cell DNA template strand sequencing, or Strand-seq, is a technique for the selective sequencing of a daughter cell's parental template strands. This technique offers a wide variety of applications, including the identification of sister chromatid exchanges in the parental cell prior to segregation, the assessment of non-random segregation of sister chromatids, the identification of misoriented contigs in genome assemblies, de novo genome assembly of both haplotypes in diploid organisms including humans, whole-chromosome haplotyping, and the identification of germline and somatic genomic structural variation, the latter of which can be detected robustly even in single cells.
Structural variation in the human genome is operationally defined as genomic alterations, varying between individuals, that involve DNA segments larger than 1 kilobase (kb) and can be either microscopic or submicroscopic. This definition distinguishes them from smaller variants of less than 1 kb, such as short deletions, insertions, and single nucleotide variants.
Physical mapping is a technique used in molecular biology to find the order of, and physical distance between, DNA base pairs using DNA markers. It is one of the gene mapping techniques that can determine the sequence of DNA base pairs with high accuracy. Genetic mapping, another approach to gene mapping, can provide the markers needed for physical mapping. However, because genetic mapping deduces relative gene positions from recombination frequencies, it is less accurate than physical mapping.
Human somatic variations are somatic mutations both at early stages of development and in adult cells. These variations can lead either to pathogenic phenotypes or not, even if their function in healthy conditions is not completely clear yet.