Contig

Last updated April 01, 2022

A contig (from contiguous) is a set of overlapping DNA segments that together represent a consensus region of DNA.^[1] In bottom-up sequencing projects, a contig refers to overlapping sequence data (reads);^[2] in top-down sequencing projects, contig refers to the overlapping clones that form a physical map of the genome that is used to guide sequencing and assembly.^[3] Contigs can thus refer both to overlapping DNA sequences and to overlapping physical segments (fragments) contained in clones depending on the context.

Original definition of contig

In 1980, Staden ^[4] wrote: In order to make it easier to talk about our data gained by the shotgun method of sequencing we have invented the word "contig". A contig is a set of gel readings that are related to one another by overlap of their sequences. All gel readings belong to one and only one contig, and each contig contains at least one gel reading. The gel readings in a contig can be summed to form a contiguous consensus sequence and the length of this sequence is the length of the contig.
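Staden's definition amounts to a partition of the gel readings: any two readings connected by a chain of overlaps fall in the same contig, and a reading with no overlaps forms a contig of its own. As a minimal sketch of that idea, assuming the overlapping pairs have already been detected by some other step, the grouping can be written with a union-find structure. The function name and the reading labels are hypothetical and are not taken from Staden's software.

```python
def group_into_contigs(readings, overlaps):
    """Partition readings into contigs, given pairs of readings known to overlap.

    readings: iterable of hashable reading identifiers (hypothetical labels).
    overlaps: iterable of (reading, reading) pairs that overlap.
    Returns a list of lists; every reading appears in exactly one list.
    """
    parent = {r: r for r in readings}

    def find(r):
        # Follow parent links to the representative, shortening the path as we go.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for a, b in overlaps:
        parent[find(a)] = find(b)  # join the two groups

    groups = {}
    for r in parent:
        groups.setdefault(find(r), []).append(r)
    return list(groups.values())
```

With readings g1 to g5 and overlaps (g1, g2) and (g2, g3), this yields one contig of three readings and two single-reading contigs, matching the statement that each contig contains at least one gel reading.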

Sequence contigs

A sequence contig is a continuous (not contiguous) sequence resulting from the reassembly of the small DNA fragments generated by bottom-up sequencing strategies. This meaning of contig is consistent with the original definition by Rodger Staden (1979).^[5] The bottom-up DNA sequencing strategy involves shearing genomic DNA into many small fragments ("bottom"), sequencing these fragments, reassembling them back into contigs and eventually the entire genome ("up"). Because current technology allows for the direct sequencing of only relatively short DNA fragments (300–1000 nucleotides), genomic DNA must be fragmented into small pieces prior to sequencing.^[6] In bottom-up sequencing projects, amplified DNA is sheared randomly into fragments appropriately sized for sequencing. The subsequent sequence reads, which are the data that contain the sequences of the small fragments, are put into a database. The assembly software ^[6] then searches this database for pairs of overlapping reads. Assembling the reads from such a pair (including, of course, only one copy of the identical sequence) produces a longer contiguous read (contig) of sequenced DNA. By repeating this process many times, at first with the initial short pairs of reads but then using increasingly longer pairs that are the result of previous assembly, the DNA sequence of an entire chromosome can be determined.
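The pairwise merging described above can be illustrated with a short greedy sketch. It is a simplification under stated assumptions rather than the method used by any particular assembler: reads are error-free strings on the same strand, overlaps must match exactly, and reads wholly contained inside another read are not handled. The function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 if no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap_length(a, b, min_overlap)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        # Keep only one copy of the shared sequence when joining the pair.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
```

For the reads ATGCGT, CGTACC and ACCTTG, the sketch produces the single contig ATGCGTACCTTG; a read that overlaps nothing is returned unchanged as its own contig.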

Figure (PET contig scaffold): Overlapping reads from paired-end sequencing form contigs; contigs and gaps of known length form scaffolds.

Today, it is common to use paired-end sequencing technology where both ends of consistently sized longer DNA fragments are sequenced. Here, a contig still refers to any contiguous stretch of sequence data created by read overlap. Because the fragments are of known length, the distance between the two end reads from each fragment is known.^[7] This gives additional information about the orientation of contigs constructed from these reads and allows for their assembly into scaffolds in a process called scaffolding.
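Because the fragment length is known, the size of the gap between two contigs linked by a read pair follows from simple arithmetic: the fragment length minus the part of the fragment lying on each contig. The sketch below illustrates this under several assumptions that are not from the source: coordinates are 0-based, both contigs are already oriented as they appear in the scaffold, the first read maps to the left contig and its mate to the right contig, and all names are hypothetical.

```python
from statistics import mean


def gap_from_pair(insert_size, contig_a_length, read_a_start, read_b_end):
    """Gap implied by one read pair spanning contig A (left) and contig B (right).

    read_a_start: position on contig A where the fragment begins.
    read_b_end:   position on contig B where the fragment ends (exclusive).
    """
    bases_on_a = contig_a_length - read_a_start  # fragment portion on contig A
    bases_on_b = read_b_end                      # fragment portion on contig B
    return insert_size - bases_on_a - bases_on_b


def estimate_gap(insert_size, contig_a_length, pairs):
    """Average the gap implied by several (read_a_start, read_b_end) pairs."""
    gaps = [gap_from_pair(insert_size, contig_a_length, a, b) for a, b in pairs]
    return mean(gaps)
```

For a 500-base fragment that starts 200 bases before the end of contig A and ends 150 bases into contig B, the implied gap is 500 - 200 - 150 = 150 bases. A negative value would suggest that the two contigs overlap rather than being separated by a gap.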

Scaffolds consist of contigs separated by gaps of known length. The new constraints placed on the orientation of the contigs allow for the placement of highly repeated sequences in the genome. If one end read has a repetitive sequence, its placement is known as long as its mate pair is located within a contig.^[7] The remaining gaps between the contigs in the scaffolds can then be sequenced by a variety of methods, including PCR amplification followed by sequencing (for smaller gaps) and BAC cloning methods followed by sequencing (for larger gaps).^[2]
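A scaffold can be modelled as an ordered list of contigs with one gap length between each neighbouring pair. The following is a minimal sketch of such a structure, not a format defined by the source; the class and method names are hypothetical. It writes each gap as a run of N, the usual symbol for an unknown base.

```python
from dataclasses import dataclass


@dataclass
class Scaffold:
    contigs: list[str]  # contig sequences in scaffold order
    gaps: list[int]     # gaps[i] lies between contigs[i] and contigs[i + 1]

    def __post_init__(self):
        if len(self.gaps) != max(len(self.contigs) - 1, 0):
            raise ValueError("need exactly one gap between each pair of contigs")
        if any(g < 0 for g in self.gaps):
            raise ValueError("gap lengths must be non-negative")

    def to_sequence(self) -> str:
        """Join the contigs, filling each gap with 'N' placeholders."""
        parts = []
        for i, contig in enumerate(self.contigs):
            parts.append(contig)
            if i < len(self.gaps):
                parts.append("N" * self.gaps[i])
        return "".join(parts)

    def unresolved_bases(self) -> int:
        """Total number of bases still to be sequenced during gap closure."""
        return sum(self.gaps)
```

For example, `Scaffold(["ACGT", "TTAA"], [3]).to_sequence()` gives `ACGTNNNTTAA`, and `unresolved_bases()` reports the 3 bases that gap closure would still have to determine.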

BAC contigs

Contig can also refer to the overlapping clones that form a physical map of a chromosome when the top-down or hierarchical sequencing strategy is used.^[1] In this sequencing method, a low-resolution map is made prior to sequencing in order to provide a framework to guide the later assembly of the sequence reads of the genome. This map identifies the relative positions and overlap of the clones used for sequencing. Sets of overlapping clones that form a contiguous stretch of DNA are called contigs; the smallest set of clones forming a contig that covers the entire chromosome constitutes the tiling path that is used for sequencing. Once a tiling path has been selected, its component BACs are sheared into smaller fragments and sequenced. Contigs therefore provide the framework for hierarchical sequencing.^[3]
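Choosing a tiling path can be viewed as an interval-covering problem: given the mapped extent of each clone, pick the fewest clones whose extents together cover the region. The sketch below uses the standard greedy strategy for that problem and is only an illustration. It assumes that each clone already has approximate start and end coordinates on the physical map, which in practice come from the mapping step; the function name and the tuple layout are hypothetical.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones covering [region_start, region_end).

    clones: list of (name, start, end) tuples with map coordinates.
    Raises ValueError if the clones leave part of the region uncovered.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])  # by start position
    path = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Among clones starting inside the covered part, take the one
        # that reaches furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best)
        covered_to = best[2]
    return path
```

Given five overlapping clones A (0 to 120), B (50 to 150), C (100 to 260), D (140 to 300) and E (250 to 400), the sketch selects A, C and E to cover positions 0 to 400. When the clones do not cover the region, the error marks where a gap in the map begins.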

The assembly of a contig map involves several steps. First, DNA is sheared into large (50–200 kb) pieces, which are cloned into BACs or PACs to form a BAC library. Since these clones should cover the entire genome or chromosome, it is theoretically possible to assemble a contig of BACs that covers the entire chromosome.^[1] In practice, however, gaps often remain, and a scaffold consisting of contigs and gaps that covers the map region is often the first result.^[1] The gaps between contigs can be closed by various methods outlined below.

Construction of BAC contigs

BAC contigs are constructed by aligning BAC regions of known overlap via a variety of methods. One common strategy is to use sequence-tagged site (STS) content mapping to detect unique DNA sites in common between BACs. The degree of overlap is roughly estimated by the number of STS markers in common between two clones, with more markers in common signifying a greater overlap.^[2] Because this strategy provides only a very rough estimate of overlap, restriction digest fragment analysis, which provides a more precise measurement of clone overlap, is often used.^[2] In this strategy, clones are treated with one or two restriction enzymes and the resulting fragments are separated by gel electrophoresis. If two clones overlap, they will likely have restriction sites in common, and will thus share several fragments.^[3] Because the number of fragments in common and the lengths of these fragments are known (the length is judged by comparison to a size standard), the degree of overlap can be deduced to a high degree of precision.
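The restriction fingerprint comparison can be sketched as matching fragment sizes between two clones and summing the sizes of the matched fragments. This is a simplified illustration, not the scoring used by any specific fingerprinting program: it assumes that fragment sizes are measured in base pairs, that two fragments count as the same band when their sizes differ by at most a relative `tolerance`, and that chance matches between unrelated fragments can be ignored. All names and the 2% default are hypothetical.

```python
def shared_fragments(frags_a, frags_b, tolerance=0.02):
    """Pair up fragments of similar size from two clone digests.

    Each fragment is used at most once. Returns a list of (size_a, size_b).
    """
    remaining = sorted(frags_b)
    shared = []
    for size in sorted(frags_a):
        for k, other in enumerate(remaining):
            if abs(size - other) <= tolerance * max(size, other):
                shared.append((size, other))
                del remaining[k]  # do not match this fragment again
                break
    return shared


def estimated_overlap_bp(frags_a, frags_b, tolerance=0.02):
    """Estimate clone overlap as the summed size of the shared fragments."""
    pairs = shared_fragments(frags_a, frags_b, tolerance)
    return sum((a + b) / 2 for a, b in pairs)
```

For digests of 1200, 3400, 5100, 800 and 2600 bp against 3400, 5150, 800, 9700 and 4300 bp, three fragments match (800, 3400, and 5100 with 5150), giving an estimated overlap of about 9.3 kb. The same comparison applied to STS content would reduce to counting the markers two clones have in common.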

Gaps between contigs

Gaps often remain after initial BAC contig construction. These gaps occur if the bacterial artificial chromosome (BAC) library screened has low complexity, meaning it does not contain a high number of STS or restriction sites, or if certain regions were less stable in cloning hosts and thus underrepresented in the library.^[1] If gaps between contigs remain after STS landmark mapping and restriction fingerprinting have been performed, the sequencing of contig ends can be used to close these gaps. This end-sequencing strategy essentially creates a novel STS with which to screen the other contigs. Alternatively, the end sequence of a contig can be used to design a primer for primer walking across the gap.^[2]

References

[1] Gregory, S. (2005). "Contig Assembly". Encyclopedia of Life Sciences.
[2] Gibson, Greg; Muse, Spencer V. (2009). A Primer of Genome Science (3rd ed.). Sinauer Associates. p. 84. ISBN 978-0-878-93236-8.
[3] Dear, P. H. (2005). "Genome Mapping". Encyclopedia of Life Sciences. doi:10.1038/npg.els.0005353.
[4] Staden, R. (1980). "A new computer method for the storage and manipulation of DNA gel reading data". Nucleic Acids Research. 8 (16): 3673–3694. doi:10.1093/nar/8.16.3673. PMC 324183. PMID 7433103.
[5] Staden, R. (1979). "A strategy of DNA sequencing employing computer programs". Nucleic Acids Research. 6 (7): 2601–2610. doi:10.1093/nar/6.7.2601. PMC 327874. PMID 461197.
[6] Dunham, I. (2005). "Genome Sequencing". Encyclopedia of Life Sciences.
[7] Fullwood, M. J.; Wei, C.; Liu, E. T.; et al. (2009). "Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses". Genome Research. 19 (4): 521–532. doi:10.1101/gr.074906.107. PMC 3807531. PMID 19339662.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
