A plant genome assembly represents the complete genomic sequence of a plant species, assembled into chromosomes and organellar genomes from DNA (deoxyribonucleic acid) fragments obtained with different types of sequencing technology.
Plant genomes vary in structure and complexity, from small genomes such as those of green algae (15 Mbp) [1] to very large and complex genomes that typically have much higher ploidy, higher rates of heterozygosity and more repetitive elements than species from other kingdoms. [2] One of the most complex plant genome assemblies available is that of loblolly pine (22 Gbp). [3] Because of this complexity, plant genome sequences cannot be assembled back into chromosomes using only the short reads provided by next-generation sequencing (NGS) technologies, [4] [5] and therefore most available plant genome assemblies that used NGS alone are highly fragmented, contain large numbers of contigs, and leave genome regions unfinished. Highly repetitive sequences, often longer than 10 kbp, are the main challenge in plants. [6] [7] Most of the chromosomal sequences are produced by the activity of mobile genetic elements (MGEs) in the plant genomes. [8] MGEs are divided into two classes: class I, or retrotransposons, and class II, or DNA transposons. In plants, long terminal repeat (LTR) retrotransposons are predominant and constitute from 15% [9] to 90% of the genome. [10] Polyploidy is another challenge in assembling a plant genome, and it is estimated that ≈80% of plants are polyploids. [11]
The first complete plant genome assembly, that of Arabidopsis thaliana, was finished in 2000, [12] making it the third multicellular eukaryotic genome published, after C. elegans [13] and D. melanogaster. [14] Unlike many other plants (e.g. Malus), Arabidopsis has convenient traits, such as a small nuclear genome (135 Mbp) and a short generation time (8 weeks from seed to seed). The genome has five chromosomes and is approximately 4% of the size of the human genome. The genome was sequenced and annotated by the Arabidopsis Genome Initiative (AGI).
The initiative to sequence the genome of rice (Oryza sativa) [15] began in September 1997, when scientists from many nations agreed to an international collaboration to sequence the rice genome, forming the "International Rice Genome Sequencing Project" (IRGSP). At an estimated size of between 400 and 430 Mb, approximately four times larger than the A. thaliana genome, rice has the smallest of the major cereal crop genomes. [15]
Between 2000 and 2008, a total of 10 plant genomes were published, while in 2012 alone 13 plant genomes were published. Since then the number has increased steadily, and more than 400 plant genomes are now available in the NCBI genome database, of which 72 have been re-annotated [NCBI].
EnsemblPlants [16] is part of the Ensembl Genomes database and contains resources for a limited number of sequenced plant species (45 as of October 2017). It mainly provides genome sequences, gene models, functional annotations and polymorphic loci. For some of the plant species, additional information is provided, including population structure, individual genotypes, linkage, and phenotype data.
Gramene [17] is an online web database resource for plant comparative genomics and pathway analysis based on Ensembl technology.
Plant Genome DataBase Japan [18] (PGDBj) is a website that contains information related to genomes of model and crop plants from databases. It has three main components: ortholog db, DNA marker and linkage map db, and plant resource db, where multiple plant resources accumulated by different institutes are integrated. The aim is "to provide a platform, enabling comparative searches of different resources" (pgdbj.jp).
PlantsDB [19] is a resource for analysing and storing genetic and genomic information from various plants, and offers tools to query these data and to perform comparative analysis with the help of in-house tools.
PLAZA [20] [21] is another online resource for comparative genomics that integrates plant sequence data and comparative genomic methods, and performs evolutionary analysis within the green plant lineage (Viridiplantae).
The Arabidopsis Information Resource (TAIR) [22] maintains a web database of the "model higher plant Arabidopsis thaliana".
In general, different strategies are used for sequencing and assembling large and complex genomes such as those of plants, depending on the technologies available when the project started.
Clone-by-clone sequencing strategies are based on the construction of a map for each chromosome before the sequencing, and rely on libraries made from large-insert clones. The most common type of large-insert clone is the bacterial artificial chromosome (BAC).
In the BAC approach, the genome is first split into smaller pieces whose locations are recorded. The pieces of DNA are then inserted into BAC clones, which are multiplied by introducing them into fast-growing bacterial cells. These pieces are further fragmented into overlapping smaller pieces that are placed into a vector and then sequenced. The small pieces are then assembled into contigs by overlapping them. Finally, using the map from the first step, the contigs are assembled back into the chromosomes.
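The overlap step described above can be illustrated with a minimal greedy sketch. It is not the method used by any particular assembler: the function names `overlap_len` and `greedy_assemble` and the `min_len` threshold are assumptions made for illustration, and the sketch ignores sequencing errors, reverse complements and repeats, which real assemblers must handle.

```python
def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the remaining sequences (contigs) once no pair overlaps
    by at least `min_len` bases.
    """
    seqs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    k = overlap_len(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return seqs
        # Join the two sequences, writing the shared bases only once.
        merged = seqs[best_i] + seqs[best_j][best_k:]
        seqs = [s for idx, s in enumerate(seqs) if idx not in (best_i, best_j)]
        seqs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGCGTAC", "GTACGTTA", "GTTAGC"]))
```

Because the greedy choice can join fragments through a repeated sequence rather than through their true neighbours, working within a single BAC-sized unit keeps the number of competing overlaps small.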
The first complete plant genome assembly (also the first plant genome published) that used this type of technique was that of Arabidopsis thaliana, in 2000. [12] Different large-insert libraries, including BACs, P1 artificial chromosomes (PACs), yeast artificial chromosomes (YACs) and transformation-competent artificial chromosomes (TACs), were combined to assemble the genome. Physical maps were constructed from restriction-fragment fingerprints of the clones, by comparing the patterns and by hybridization or polymerase chain reaction (PCR). The physical maps were integrated with genetic maps to identify contig positions and orientations. End sequences from 47,788 BAC clones were used to extend contigs from anchored BACs and to select a minimum tiling path. A total of 1,569 clones in the minimum tiling path were selected and sequenced. Direct PCR products were used to clone the remaining gaps, and YACs allowed the characterization of telomere sequences. The resulting sequenced regions covered 115.4 Mb of the 125 Mb predicted genome size and contained a total of 25,498 protein-coding genes.
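Selecting a minimum tiling path can be viewed as an interval-covering problem: choose the fewest mapped clones whose extents cover the region. The following is a simplified sketch of that idea, not the procedure used by the Arabidopsis Genome Initiative; the `Clone` type, the coordinates and the function name are hypothetical, and a real selection would also require a minimum overlap between neighbouring clones.

```python
from typing import NamedTuple


class Clone(NamedTuple):
    name: str
    start: int  # map coordinate of the left end
    end: int    # map coordinate of the right end (exclusive)


def minimum_tiling_path(clones: list[Clone], region_start: int,
                        region_end: int) -> list[Clone]:
    """Greedy interval cover: at each step take the clone that starts
    within the covered part and reaches furthest to the right."""
    ordered = sorted(clones, key=lambda c: c.start)
    path: list[Clone] = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        # Scan every clone that starts at or before the covered position.
        while i < len(ordered) and ordered[i].start <= covered:
            if best is None or ordered[i].end > best.end:
                best = ordered[i]
            i += 1
        if best is None or best.end <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best)
        covered = best.end
    return path


if __name__ == "__main__":
    library = [Clone("A", 0, 100), Clone("B", 50, 180), Clone("C", 90, 200),
               Clone("D", 150, 300), Clone("E", 190, 320)]
    print([c.name for c in minimum_tiling_path(library, 0, 300)])
```

A gap in the map surfaces as a `ValueError`, which loosely corresponds to the regions that had to be closed by other means, such as the direct PCR products mentioned above.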
To sequence and assemble the genome of Oryza sativa (japonica), [15] the same strategy was used. For Oryza sativa a total of 3,401 mapped clones in a minimum tiling path were selected from the physical map and assembled.
Maize (Zea mays), one of the most important crops in the world, is the last plant genome project based primarily on the Sanger BAC-by-BAC strategy. [23] The maize genome, at 2.3 Gb in 10 chromosomes, [23] is significantly larger than those of rice and Arabidopsis. [23] To assemble it, a set of 16,848 minimally overlapping BAC clones derived from a combination of physical and genetic maps was selected and sequenced. The maize assembly was supplemented with external data obtained from cDNA, from methyl-filtered DNA libraries (libraries that exploit the fact that bases in genic sequences tend to be less heavily methylated than those in non-genic regions) and from high-C0t techniques.
The Sanger clone-by-clone strategy has the advantage of working in small units, which reduces complexity and computational requirements and minimizes problems associated with the misassembly of highly repetitive DNA; it is therefore an attractive solution for assembling plant genomes and other complex eukaryotic genomes. The main disadvantages of this method are the cost and the resources required. The cost of the first plant genome assemblies was estimated at between 70 million dollars [24] and 200 million dollars per assembly. [25]
In the whole-genome shotgun (WGS) strategy, the fragments are sequenced in no particular order. The DNA is randomly sheared, and the cloned fragments are sequenced and assembled using computational methods. This approach reduces the cost and time associated with the construction of maps and relies instead on computational resources.
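The effect of random shearing can be illustrated with a small simulation. This is a sketch under simplifying assumptions (uniformly random read positions, fixed read length, no sequencing errors), and the function name and parameter values are hypothetical. Under these assumptions the fraction of the genome covered by at least one read is expected to be roughly 1 − e^(−c) for mean coverage c, apart from edge effects.

```python
import random


def simulate_shotgun(genome_len: int, read_len: int, n_reads: int,
                     seed: int = 0) -> tuple[float, float]:
    """Place reads at uniformly random positions and return
    (mean depth, fraction of positions covered by at least one read)."""
    rng = random.Random(seed)
    depth = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            depth[pos] += 1
    mean_depth = sum(depth) / genome_len
    covered = sum(1 for d in depth if d > 0) / genome_len
    return mean_depth, covered


if __name__ == "__main__":
    for n_reads in (100, 500, 1000):
        mean_depth, covered = simulate_shotgun(10_000, 100, n_reads)
        print(f"{mean_depth:.1f}x coverage -> {covered:.3f} of genome covered")
```

Even at several-fold coverage some positions remain unsampled, which is one reason shotgun projects sequence the genome many times over.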
A considerable number of important plant genomes, such as grapevine (Vitis vinifera), [26] papaya (Carica papaya), [27] and cottonwood (Populus trichocarpa), [28] were sequenced and assembled with the Sanger WGS strategy.
The draft genome of grapevine [26] is the fourth genome published for a flowering plant and the first for a fruit crop. The sequences of the genome were obtained from different types of libraries, including plasmids, fosmids and BACs. All the data were generated by paired-end sequencing of cloned inserts using Sanger technology on ABI 3730xl sequencers. To assemble the reads, Arachne (2002), [29] software designed to analyze reads obtained from both ends of plasmid clones, was used. In total, 6.2 million paired-end tag reads were produced. The software produced 20,784 contigs that were combined into 3,830 supercontigs, with an N50 value of 64 kb. The supercontigs had a total size of 498 Mb.
The supercontigs were anchored along the genome first by joining them together using paired BAC end sequences. The resulting ultracontigs and the remaining supercontigs were then aligned along the genetic map of the genome. Later improvements of this strategy enabled the sequencing of Brachypodium distachyon, [30] Sorghum bicolor [31] and soybean. [32]
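The N50 statistic quoted for the assembly is the length L such that sequences of length L or longer contain at least half of the total assembled bases. A minimal sketch of that definition follows; the function name and the example lengths are illustrative and are not taken from the grapevine data.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a list of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6]))
```

A higher N50 indicates a more contiguous assembly, but it says nothing about whether the sequences were joined correctly.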
Because of its relatively low cost in comparison with previous methods, most recent plant genomes have been sequenced and assembled using data from next-generation sequencing (NGS) technology. In general, NGS data are used in combination with Sanger sequencing or with long reads obtained from third-generation sequencing. The genome of the cucumber (Cucumis sativus) [33] was one of the plant genomes that combined Illumina NGS reads with Sanger sequences. A total of 72.2-fold genome coverage of high-quality base pairs was generated, of which Sanger reads provided 3.9-fold coverage and Illumina GA reads 68.3-fold coverage. From these data, two assemblies were produced according to the sequencing technology. The resulting contigs were compared with each other, giving a total assembled genome length of 243.5 Mb. This is about 30% smaller than the genome size estimated by flow cytometry of isolated nuclei stained with propidium iodide (367 Mb). A genetic map was constructed to anchor the assembled genome, and 72.8% of the assembled sequences were successfully anchored onto the seven chromosomes. Another plant genome that combined NGS with Sanger sequencing was that of Theobroma cacao (2010), [34] an economically important tropical fruit tree crop and the primary source of cocoa. The genome was sequenced by a consortium, the "International Cocoa Genome Sequencing consortium" (ICGS), which produced a total of 17.6 million 454 single-end reads, 8.8 million 454 paired-end reads, 398.0 million Illumina paired-end reads and about 88,000 Sanger BAC reads. First, the genome assembly software Newbler was used to produce an assembly of 25,912 contigs and 4,792 scaffolds from the Roche/454 and Sanger raw data. This assembly had a total length of 326.9 Mb, which represents 76% of the estimated genome size.
The Illumina reads were used to complement the 454 assembly by aligning the short reads to the cocoa genome assembly using the SOAP software. A similar strategy combining NGS reads and Sanger sequencing was used for other important plant species, such as the first published apple genome (Malus domestica), [35] cotton (Gossypium raimondii), [36] the draft genome of sweet orange (Citrus sinensis) [37] and the domesticated tomato (Solanum lycopersicum) genome. [38]
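The coverage and size figures quoted above for cucumber and cocoa can be checked with simple arithmetic. The helper names below are hypothetical, the input numbers are those given in the text, and the implied cocoa genome size is a back-calculation from the stated 76%, not a figure reported by the source.

```python
def fraction_assembled(assembled_mb: float, estimated_mb: float) -> float:
    """Share of the estimated genome size represented in the assembly."""
    return assembled_mb / estimated_mb


def implied_genome_size(assembled_mb: float, fraction: float) -> float:
    """Estimated genome size implied by an assembly length and its share."""
    return assembled_mb / fraction


# Cucumber: Sanger and Illumina fold coverage add up to the total.
cucumber_coverage = 3.9 + 68.3

# Cucumber: assembled length versus the flow-cytometry estimate.
cucumber_shortfall = 1 - fraction_assembled(243.5, 367.0)

# Cocoa: 326.9 Mb is stated to be 76% of the estimated genome size.
cocoa_estimated_mb = implied_genome_size(326.9, 0.76)

if __name__ == "__main__":
    print(f"cucumber total coverage: {cucumber_coverage:.1f}x")
    print(f"cucumber assembly shortfall: {cucumber_shortfall:.1%}")
    print(f"implied cocoa genome size: {cocoa_estimated_mb:.0f} Mb")
```

The cucumber shortfall computed this way is about one third, in line with the rounded figure of about 30% given in the text.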
With the emergence of third-generation sequencing (TGS), some of the limitations of previous methods of sequencing and assembling plant genomes have started to be addressed. This technology is characterized by the parallel sequencing of single molecules of DNA, which results in sequences up to 54 kbp in length (PacBio RS II). [39] In general, long reads from TGS have relatively high error rates (≈10% on average) [40] and therefore repeated sequencing of the same DNA fragments is required. The price of such technology is still quite high, and it is therefore generally used in combination with short reads from NGS. One of the first plant genomes to use long reads from TGS (Pacific Biosciences) in combination with short reads from NGS was that of spinach, [41] with a genome size estimated at 989 Mb. For this, 60× coverage of the genome was generated, with 20% of the reads longer than 20 kb. Data were assembled using PacBio's hierarchical genome assembly process (HGAP), [42] and the long-read assembly showed a 63-fold improvement in contig size over an Illumina-only assembly. Another recently published plant genome that used long reads in combination with short reads is the improved assembly of the apple genome. [43] In this project a hybrid approach was used, combining data types from different sequencing technologies: PacBio RS II reads, Illumina paired-end (PE) reads and Illumina mate-pair (MP) reads. As a first step, an assembly of the Illumina paired-end reads was performed using the well-known de novo assembly software SOAPdenovo. [44] Then the contigs obtained in the first step and the long reads from PacBio were combined using the hybrid assembly pipeline DBG2OLC. [45] The assembly was then polished with the Illumina paired-end reads by mapping them to the contigs using BWA-MEM. [46] Finally, the assembly was scaffolded by mapping the mate-pair reads to the corrected contigs.
BioNano [47] optical maps with a total length of 649.7 Mb were then used in the hybrid assembly pipeline together with the scaffolds obtained in the previous step. The resulting scaffolds were anchored to a genetic map constructed from 15,417 single-nucleotide polymorphism (SNP) markers. RNA sequencing (RNA-seq) data were used to better characterize the number and diversity of the genes identified. The resulting genome has a size of 643.2 Mb, closer to the estimated genome size than the previously published assembly, [35] and a smaller number of protein-coding genes.
The use of long reads in plant genome assemblies has become more popular because it reduces the number of scaffolds and increases the quality of the genome, improving assembly and coverage in regions that are not clearly resolved by NGS assemblies.
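The reason error-prone long reads benefit from repeated sequencing of the same fragment can be sketched with a simple probability model. This is an illustrative simplification, not a description of how PacBio consensus calling or HGAP works: it assumes that each pass is wrong at a given position independently with the same probability, and that the consensus base is chosen by majority vote. The function name is hypothetical.

```python
from math import comb


def majority_error(n_passes: int, per_read_error: float) -> float:
    """Probability that more than half of `n_passes` independent passes
    are wrong at one position (binomial tail), for an odd `n_passes`."""
    if n_passes < 1 or n_passes % 2 == 0:
        raise ValueError("n_passes must be a positive odd number")
    needed = n_passes // 2 + 1  # wrong passes required to outvote the rest
    return sum(
        comb(n_passes, k)
        * per_read_error ** k
        * (1 - per_read_error) ** (n_passes - k)
        for k in range(needed, n_passes + 1)
    )


if __name__ == "__main__":
    for n in (1, 3, 5, 9):
        print(n, f"{majority_error(n, 0.10):.6f}")
```

Under these assumptions the per-position consensus error falls quickly as passes are added, which is consistent with generating deep long-read coverage and then polishing with short reads.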