Scaffolding (bioinformatics)

This is an example of a scaffold.

Scaffolding is a technique used in bioinformatics. It is defined as follows: [1]

Link together a non-contiguous series of genomic sequences into a scaffold, consisting of sequences separated by gaps of known length. The sequences that are linked are typically contiguous sequences corresponding to read overlaps.

When creating a draft genome, individual reads of DNA are first assembled into contigs, which, by the nature of their assembly, have gaps between them. The next step is to bridge the gaps between these contigs to create a scaffold. [2] This can be done using either optical mapping or mate-pair sequencing. [3]
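
The definition above can be made concrete with a small Python sketch. It is illustrative only: the class and function names (`Contig`, `Gap`, `render_scaffold`, `estimate_gap`) are hypothetical and do not come from any scaffolding tool. It assumes that gaps are written as runs of `N`, and that a gap length can be estimated from mate pairs by subtracting the parts of each fragment that lie inside the two contigs from the library's insert size.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Tuple, Union


@dataclass
class Contig:
    name: str
    sequence: str
    reverse: bool = False  # True if the contig is placed on the reverse strand


@dataclass
class Gap:
    length: int  # estimated number of unknown bases between two contigs


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def render_scaffold(parts: List[Union[Contig, Gap]]) -> str:
    """Concatenate oriented contigs, writing each gap as a run of 'N'."""
    pieces = []
    for part in parts:
        if isinstance(part, Gap):
            pieces.append("N" * part.length)
        elif part.reverse:
            pieces.append(reverse_complement(part.sequence))
        else:
            pieces.append(part.sequence)
    return "".join(pieces)


def estimate_gap(insert_size: int, pairs: List[Tuple[int, int]]) -> int:
    """Estimate a gap length from mate pairs that span it.

    Each pair is (bases from the start of read 1 to the end of the left
    contig, bases from the start of the right contig to the end of read 2).
    A negative result would suggest that the contigs overlap.
    """
    return round(median(insert_size - left - right for left, right in pairs))


# Made-up example data, not taken from any real assembly.
scaffold = [
    Contig("contig_1", "ACGTAC"),
    Gap(4),
    Contig("contig_2", "GGATC", reverse=True),
]
print(render_scaffold(scaffold))
print(estimate_gap(3000, [(1200, 1500), (1000, 1720), (1400, 1310)]))
```

Taking the median over all spanning pairs is one simple way to dampen the effect of the spread in real insert sizes; production scaffolders use more careful statistical models.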

Assembly software

The sequencing of the Haemophilus influenzae genome marked the advent of scaffolding. That project generated a total of 140 contigs, which were oriented and linked using paired-end reads. The success of this strategy prompted The Institute for Genomic Research to develop the scaffolding program Grouper for its other sequencing projects. Until 2001, Grouper was the only stand-alone scaffolding software. [4] After the Human Genome Project and Celera proved that it was possible to create a large draft genome, several similar programs were created. Bambus, created in 2003, was a rewrite of the original Grouper software that gave researchers the ability to adjust scaffolding parameters. [4] It also allowed the optional use of other linking data, such as contig order in a reference genome.

Algorithms used by assembly software are very diverse and can be classified as either iterative marker ordering or graph-based. Graph-based applications can order and orient over 10,000 markers, whereas iterative marker applications can handle a maximum of 3,000. [5] Algorithms can be further classified as greedy or non-greedy, and as conservative or non-conservative. Bambus uses a greedy algorithm, so called because it joins together the contigs with the most links first. The algorithm used by Bambus 2 removes repetitive contigs before orienting and ordering the remainder into scaffolds. SSPACE also uses a greedy algorithm, which begins building its first scaffold with the longest contig provided by the sequence data. SSPACE is the most commonly cited assembly tool in biology publications, likely because it is rated as significantly more intuitive to install and run than other assemblers. [6]
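
The "most links first" idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the algorithm implemented by Bambus or SSPACE: it ignores contig orientation and gap sizes, assumes each link is summarized as a hypothetical `(contig_a, contig_b, link_count)` tuple, and accepts a join only if both contigs still have a free end and the join would not close a cycle.

```python
from typing import Dict, List, Tuple


def greedy_scaffold(
    contigs: List[str], links: List[Tuple[str, str, int]]
) -> List[List[str]]:
    """Order contigs into linear chains, accepting the best-supported links first."""
    parent = {c: c for c in contigs}  # union-find forest to detect cycles
    neighbours: Dict[str, List[str]] = {c: [] for c in contigs}

    def find(c: str) -> str:
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    # Greedy step: consider links from most to least supported.
    for a, b, _count in sorted(links, key=lambda link: link[2], reverse=True):
        if len(neighbours[a]) >= 2 or len(neighbours[b]) >= 2:
            continue  # one of the contigs already has a neighbour on both sides
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # both contigs are already in the same chain
        parent[root_a] = root_b
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Walk each chain from one of its ends to list contigs in order.
    scaffolds, seen = [], set()
    for start in contigs:
        if start in seen or len(neighbours[start]) == 2:
            continue  # only start from chain ends or unlinked contigs
        path, prev, cur = [], None, start
        while cur is not None:
            path.append(cur)
            seen.add(cur)
            onward = [n for n in neighbours[cur] if n != prev]
            prev, cur = cur, (onward[0] if onward else None)
        scaffolds.append(path)
    return scaffolds


# Made-up link counts for five hypothetical contigs.
example_links = [("A", "B", 12), ("B", "C", 9), ("A", "C", 3), ("B", "D", 2)]
print(greedy_scaffold(["A", "B", "C", "D", "E"], example_links))
```

In this example the weakly supported A–C and B–D links are rejected because they conflict with joins that were accepted earlier, which is the characteristic behaviour (and the main weakness) of a greedy strategy.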

In recent years, new kinds of assemblers have emerged that can integrate linkage data from multiple types of linkage maps. ALLMAPS is the first of such programs and is capable of combining data from genetic maps, created using SNPs or recombination data, with physical maps such as optical or synteny maps. [7]

Some assemblers, such as ABySS and SOAPdenovo, contain gap-filling algorithms which, although they do not create any new scaffolds, decrease the gap length between the contigs of individual scaffolds. A standalone program, GapFiller, is capable of closing a larger number of gaps while using less memory than the gap-filling algorithms contained within assembly programs. [8]

Utturkar et al. investigated the utility of several different assembly software packages in combination with hybrid sequence data. They concluded that the ALLPATHS-LG and SPAdes algorithms were superior to other assemblers in terms of the number of, maximum length of, and N50 length of contigs and scaffolds. [9]
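
Because N50 is one of the metrics used in such comparisons, a minimal Python sketch of its usual definition may help: the N50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name `n50` and the example lengths are made up for illustration.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:  # at least half of all bases are now covered
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths before scaffolding, and the result of
# joining the three longest contigs into a single scaffold.
contig_lengths = [80, 70, 50, 40, 30, 20]
scaffold_lengths = [80 + 70 + 50, 40, 30, 20]
print(n50(contig_lengths), n50(scaffold_lengths))
```

The comparison shows why scaffolding raises N50: joining contigs concentrates more of the total length in fewer, longer sequences. For simplicity, the joined scaffold length here ignores any gap bases.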

Scaffolding and next generation sequencing

Most high-throughput, next-generation sequencing platforms produce shorter reads than Sanger sequencing. These new platforms are able to generate large quantities of data in short periods of time, but until methods were developed for de novo assembly of large genomes from short-read sequences, Sanger sequencing remained the standard method of creating a reference genome. [10] Although Illumina platforms are now able to generate mate-pair reads with average lengths of 150 bp, they were originally only able to generate reads of 35 bp or less, which caused many in the scientific community to doubt that a reliable reference genome could ever be constructed with short-read technology. The increased difficulty of contig and scaffold assembly associated with the new technologies has created a demand for powerful new computer programs and algorithms capable of making sense of the data. [11]

One strategy that incorporates high-throughput next-generation sequencing is hybrid sequencing, wherein several sequencing technologies are used at different levels of coverage so that they can complement each other with their respective strengths. The release of the SMRT platform from Pacific Biosciences marked the beginning of single-molecule sequencing and long-read technology. It has been shown that 80-100X coverage with SMRT technology, which generates reads with an average length of 5,456 bp, is usually sufficient to create a finished de novo assembly for prokaryotic organisms. When the funds for that level of coverage are not available to a researcher, they might decide to use a hybrid approach.
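
To give a sense of scale for the coverage figures above, the following Python sketch applies the standard relationship coverage = (number of reads × mean read length) / genome size. The 5 Mb genome size is an assumption chosen only for illustration; the read length is the average quoted in the text, and the function name `reads_needed` is hypothetical.

```python
def reads_needed(genome_size_bp: int, coverage: int, mean_read_length_bp: int) -> int:
    """Smallest read count N such that N * read length / genome size >= coverage."""
    total_bases = genome_size_bp * coverage
    return -(-total_bases // mean_read_length_bp)  # ceiling division on integers


GENOME_SIZE_BP = 5_000_000  # assumed size of a hypothetical prokaryotic genome
MEAN_READ_LENGTH_BP = 5456  # average SMRT read length quoted in the text

for fold in (80, 100):
    print(f"{fold}X coverage: {reads_needed(GENOME_SIZE_BP, fold, MEAN_READ_LENGTH_BP)} reads")
```

Under these assumptions, tens of thousands of long reads are required, which illustrates why a researcher with a limited budget might combine a smaller amount of long-read data with cheaper short-read data.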

Goldberg et al. evaluated the effectiveness of combining high throughput pyrosequencing with traditional Sanger sequencing. They were able to greatly increase N50 contig length and decrease gap length, and even to close one microbial genome with this approach. [12]

Optical mapping

It has been shown that integrating linkage maps can aid de novo assemblies by supplying long-range, chromosome-scale recombination data, without which assemblies can be subject to macro-ordering errors. Optical mapping is the process of immobilizing DNA on a slide and digesting it with restriction enzymes. The fragment ends are then fluorescently tagged and stitched back together. For the last two decades, optical mapping has been prohibitively expensive, but recent advances in technology have reduced the cost significantly. [5] [13]
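
One way an optical map can inform scaffolding is by comparing the restriction fragment sizes predicted from a contig with the fragment sizes observed in the map. The Python sketch below is an assumption-laden illustration rather than a description of any particular tool: the function names, the example sequence, the fragment sizes, and the 10% tolerance are all made up, and the enzyme is taken to be EcoRI, which recognizes GAATTC and cuts after the first G.

```python
from typing import List, Optional


def digest_lengths(sequence: str, site: str, cut_offset: int = 0) -> List[int]:
    """Fragment lengths from an in silico digest of a linear sequence."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)  # cut position within the recognition site
        pos = sequence.find(site, pos + 1)
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def find_in_map(
    contig_fragments: List[int], optical_map: List[int], tolerance: float = 0.1
) -> Optional[int]:
    """Index in optical_map where the contig's interior fragments first match.

    The first and last contig fragments are skipped because they are
    truncated by the contig boundaries rather than by restriction sites.
    """
    inner = contig_fragments[1:-1]
    if not inner:
        return None  # too few restriction sites to place the contig
    for i in range(len(optical_map) - len(inner) + 1):
        window = optical_map[i:i + len(inner)]
        if all(abs(seen - expected) <= tolerance * expected
               for seen, expected in zip(window, inner)):
            return i
    return None


# Toy digest with EcoRI (G^AATTC, so the cut falls 1 base into the site).
print(digest_lengths("AAAGAATTCTTTTTGAATTCGG", "GAATTC", cut_offset=1))

# Made-up fragment sizes in base pairs.
contig_pattern = [1200, 5000, 7400, 3100, 800]
observed_map = [9800, 2100, 5200, 7100, 3000, 12000]
print(find_in_map(contig_pattern, observed_map))
```

Contigs placed at different positions on the same optical map can then be ordered relative to one another, and the map distance between them gives an estimate of the gap size.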

References

  1. "EDAM Ontology of Bioinformatics Operations and Data Formats".
  2. Waterston, Robert (2002). "On the Sequencing of the Human Genome". Proceedings of the National Academy of Sciences of the United States of America. 99 (6): 3712–3716. Bibcode:2002PNAS...99.3712W. doi:10.1073/pnas.042692499. PMC 122589. PMID 11880605.
  3. Flot, Jean-François; Marie-Nelly, Hervé; Koszul, Romain (2015-10-07). "Contact genomics: scaffolding and phasing (meta)genomes using chromosome 3D physical signatures". FEBS Letters. 589 (20 Pt A): 2966–2974. doi:10.1016/j.febslet.2015.04.034. ISSN 1873-3468. PMID 25935414.
  4. Pop, Mihai; Kosack, Daniel S.; Salzberg, Steven L. (2004-01-01). "Hierarchical Scaffolding With Bambus". Genome Research. 14 (1): 149–159. doi:10.1101/gr.1536204. ISSN 1088-9051. PMC 314292. PMID 14707177.
  5. Fierst, JL (2015). "Using linkage maps to correct and scaffold de novo genome assemblies: methods, challenges, and computational tools". Frontiers in Genetics. http://journal.frontiersin.org/article/10.3389/fgene.2015.00220/full. Accessed 7 Apr 2017.
  6. Hunt, M; Newbold, C; Berriman, M; Otto, TD (2014). "A comprehensive evaluation of assembly scaffolding tools". Genome Biology. 15 (3): R42. doi:10.1186/gb-2014-15-3-r42. PMC 4053845. PMID 24581555.
  7. Tang, H; Zhang, X; Miao, C; et al. (2015). "ALLMAPS: robust scaffold ordering based on multiple maps". Genome Biology. 16 (1): 3. doi:10.1186/s13059-014-0573-1. PMC 4305236. PMID 25583564.
  8. Boetzer, M; Pirovano, W (2012). "Toward almost closed genomes with GapFiller". Genome Biology. 13 (6): R56. doi:10.1186/gb-2012-13-6-r56. PMC 3446322. PMID 22731987.
  9. Utturkar, SM; Klingeman, DM; Land, ML; et al. (2014). "Evaluation and validation of de novo and hybrid assembly techniques to derive high-quality genome sequences". Bioinformatics. 30 (19): 2709–2716. doi:10.1093/bioinformatics/btu391. PMC 4173024. PMID 24930142.
  10. Li, Ruiqiang; Zhu, Hongmei; Ruan, Jue; Qian, Wubin; Fang, Xiaodong; Shi, Zhongbin; Li, Yingrui; Li, Shengting; Shan, Gao (2010). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20 (2): 265–272. doi:10.1101/gr.097261.109. ISSN 1088-9051. PMC 2813482. PMID 20019144.
  11. Pareek, Chandra Shekhar; Smoczynski, Rafal; Tretyn, Andrzej (2011). "Sequencing technologies and genome sequencing". Journal of Applied Genetics. 52 (4): 413–435. doi:10.1007/s13353-011-0057-x. ISSN 1234-1983. PMC 3189340. PMID 21698376.
  12. Goldberg, Susanne M. D.; et al. (2006). "A Sanger/Pyrosequencing Hybrid Approach for the Generation of High-Quality Draft Assemblies of Marine Microbial Genomes". Proceedings of the National Academy of Sciences of the United States of America. 103 (30): 11240–11245. Bibcode:2006PNAS..10311240G. doi:10.1073/pnas.0604351103. JSTOR 30049789. PMC 1544072. PMID 16840556.
  13. Chaisson, Mark; Wilson, Richard; Eichler, Evan (7 October 2015). "Genetic variation and the de novo assembly of human genomes". Nature Reviews Genetics. 16 (11): 627–640. doi:10.1038/nrg3933. PMC 4745987. PMID 26442640.
