Plant genome assembly

A plant genome assembly represents the complete genomic sequence of a plant species, assembled into chromosomes and organellar genomes from DNA (deoxyribonucleic acid) fragments obtained with different types of sequencing technology.

Structure

Plant genomes vary widely in structure and complexity, from small genomes such as those of green algae (15 Mbp) [1] to very large and complex genomes that typically have much higher ploidy, higher rates of heterozygosity and more repetitive elements than species from other kingdoms. [2] One of the most complex plant genome assemblies available is that of loblolly pine (22 Gbp). [3] Because of this complexity, plant genome sequences cannot be assembled back into chromosomes using only the short reads provided by next-generation sequencing (NGS) technologies, [4] [5] and therefore most available plant genome assemblies that used NGS alone are highly fragmented, contain large numbers of contigs, and leave genome regions unfinished. Highly repetitive sequences, often longer than 10 kbp, are the main challenge in plants. [6] [7] Most of the chromosomal sequence is produced by the activity of mobile genetic elements (MGEs) in plant genomes. [8] MGEs are divided into two classes: class I, or retrotransposons, and class II, or DNA transposons. In plants, long terminal repeat (LTR) retrotransposons are predominant and constitute from 15% [9] to 90% of the genome. [10] Polyploidy is another challenge in assembling a plant genome: an estimated ≈80% of plants are polyploid. [11]

Assemblies

The first complete plant genome assembly, that of Arabidopsis thaliana, was finished in 2000, [12] making it the third multicellular eukaryotic genome published, after C. elegans [13] and D. melanogaster. [14] Unlike many other plants (e.g. Malus), Arabidopsis has convenient traits, such as a small nuclear genome (135 Mbp) and a short generation time (8 weeks from seed to seed). The genome has five chromosomes and is approximately 4% of the size of the human genome. It was sequenced and annotated by the Arabidopsis Genome Initiative (AGI).

The initiative to sequence the genome of rice (Oryza sativa) [15] began in September 1997, when scientists from many nations agreed to an international collaboration, forming the International Rice Genome Sequencing Project (IRGSP). At an estimated size of between 400 and 430 Mb, approximately four times larger than that of A. thaliana, the rice genome is the smallest of the major cereal crop genomes. [15]

Between 2000 and 2008, a total of 10 plant genomes were published, while in 2012 alone 13 plant genomes were published. Since then the number has increased steadily, and more than 400 plant genomes are now available in the NCBI genome database, of which 72 have been re-annotated [NCBI].

Databases

EnsemblPlants [16] is part of the Ensembl Genomes database and contains resources for a limited number of sequenced plant species (45 as of October 2017). It mainly provides genome sequences, gene models, functional annotations and polymorphic loci. For some plant species, additional information is provided, including population structure, individual genotypes, linkage, and phenotype data.

Gramene [17] is an online resource for plant comparative genomics and pathway analysis, based on Ensembl technology.

Plant Genome DataBase Japan [18] (PGDBj) is a website that contains information related to genomes of model and crop plants from databases. It has three main components: ortholog db, DNA marker and linkage map db, and plant resource db, where multiple plant resources accumulated by different institutes are integrated. The aim is "to provide a platform, enabling comparative searches of different resources" (pgdbj.jp).

PlantsDB [19] is a resource for analysing and storing genetic and genomic information from various plants, and offers tools to query these data and to perform comparative analysis with the help of in-house tools.

PLAZA [20] [21] is another online resource for comparative genomics that integrates plant sequence data and comparative genomic methods, and performs evolutionary analysis within the green plant lineage (Viridiplantae).

The Arabidopsis Information Resource (TAIR) [22] maintains a web database of the "model higher plant Arabidopsis thaliana".

Assembly strategies

In general, different strategies are used for sequencing and assembling large and complex genomes such as those of plants, depending on the technologies available when the project started.

Sanger clone-by-clone

Clone-by-clone sequencing strategies are based on the construction of a map for each chromosome before the sequencing, and rely on libraries made from large-insert clones. The most common type of large-insert clone is the bacterial artificial chromosome (BAC).

In the BAC approach, the genome is first split into large pieces whose locations are recorded. These pieces of DNA are inserted into BAC vectors and multiplied by introducing them into fast-growing bacterial cells. Each cloned piece is then fragmented into smaller overlapping pieces, which are placed into a vector and sequenced. The small pieces are assembled into contigs by overlapping them. Finally, using the map from the first step, the contigs are assembled back into chromosomes.
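The step in which small sequenced pieces are merged into contigs by their overlaps can be illustrated with a toy example. The following is a minimal sketch, not the method used by any particular assembler: it assumes error-free fragments from a single strand, and the function names `overlap` and `greedy_assemble` are hypothetical.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two sequences sharing the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGCGT", "CGTACG", "ACGTTA"]))
```

Real fragments contain sequencing errors, come from both strands and include repeats, which is why a purely greedy merge of this kind is only an illustration of the idea.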

The first complete plant genome assembly (and the first plant genome published) to use this technique was that of Arabidopsis thaliana, in 2000. [12] Different large-insert libraries, including BACs, P1 artificial chromosomes (PACs), yeast artificial chromosomes (YACs) and transformation-competent artificial chromosomes (TACs), were combined to assemble the genome. The physical maps were constructed from restriction fragment fingerprints of the clones, by comparing the fingerprint patterns and by hybridization or polymerase chain reaction (PCR). The physical maps were integrated with genetic maps to identify contig positions and orientations. End sequences from 47,788 BAC clones were used to extend contigs from anchored BACs and to select a minimum tiling path. A total of 1,569 clones in the minimum tiling path were selected and sequenced. Direct PCR products were used to clone the remaining gaps, and YACs allowed the characterization of telomere sequences. The sequenced regions covered 115.4 Mb of the 125 Mb predicted genome size and contained a total of 25,498 protein-coding genes.
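Selecting a minimum tiling path can be pictured as an interval-covering problem. The sketch below is illustrative only: it assumes that each clone already has known start and end coordinates on the physical map (in practice these were inferred from fingerprints and BAC end sequences), and the function name and clone names are hypothetical.

```python
def minimum_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones that covers [region_start, region_end).

    `clones` is a list of (name, start, end) tuples. The classic greedy
    rule is used: among clones starting at or before the covered position,
    take the one that extends furthest to the right.
    """
    ordered = sorted(clones, key=lambda c: c[1])
    path = []
    covered = region_start
    i = 0
    while covered < region_end:
        best = None
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError(f"gap in clone coverage at position {covered}")
        path.append(best[0])
        covered = best[2]
    return path


clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 150),
          ("D", 170, 300), ("E", 160, 250)]
print(minimum_tiling_path(clones, 0, 300))  # ['A', 'B', 'D']
```

Clones C and E are skipped because the regions they cover are already spanned by B and D, which is the sense in which the selected clones overlap minimally.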

To sequence and assemble the genome of Oryza sativa (japonica), [15] the same strategy was used. For Oryza sativa a total of 3,401 mapped clones in a minimum tiling path were selected from the physical map and assembled.

Maize (Zea mays), one of the most important crops in the world, was the last plant genome project based primarily on the Sanger BAC-by-BAC strategy. [23] At 2.3 Gb across 10 chromosomes, [23] the maize genome is significantly larger than those of rice and Arabidopsis. [23] To assemble it, a set of 16,848 minimally overlapping BAC clones derived from a combination of physical and genetic maps was selected and sequenced. The assembly also drew on external data: cDNA, sequences from methyl-filtered DNA libraries (which exploit the observation that bases in genic sequences tend to be less heavily methylated than those in non-genic regions), and sequences obtained with high-C0t techniques.

The Sanger clone-by-clone strategy has the advantage of working in small units, which reduces complexity and computational requirements and minimizes problems associated with the misassembly of highly repetitive DNA; it is therefore an attractive solution for assembling plant genomes and other complex eukaryotic genomes. The main disadvantages of this method are the cost and the resources required. The cost of the first plant genome assemblies was estimated at between 70 million dollars [24] and 200 million dollars per assembly. [25]

Sanger whole-genome shotgun (WGS)

In WGS sequencing, the fragments are sequenced in no particular order. The DNA is randomly sheared, and the cloned fragments are sequenced and assembled using computational methods. This approach reduces the cost and time associated with constructing maps and relies instead on computational resources.
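The random, unordered nature of shotgun reads can be mimicked with a short simulation. This is a simplified sketch under stated assumptions: reads have a fixed length, contain no errors and are drawn uniformly from one strand. The function names `shotgun_reads` and `fold_coverage` are hypothetical.

```python
import random


def shotgun_reads(genome: str, read_len: int, n_reads: int,
                  seed: int = 0) -> list[str]:
    """Draw `n_reads` fixed-length reads from random positions in `genome`."""
    if read_len > len(genome):
        raise ValueError("read length exceeds genome length")
    rng = random.Random(seed)  # seeded so the example is reproducible
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


def fold_coverage(read_len: int, n_reads: int, genome_len: int) -> float:
    """Average number of times each base is sequenced."""
    return read_len * n_reads / genome_len


toy_genome = "ACGT" * 25  # 100 bp toy sequence
reads = shotgun_reads(toy_genome, read_len=20, n_reads=30)
print(len(reads), fold_coverage(20, 30, len(toy_genome)))  # 30 6.0
```

Because the reads carry no positional information, the original order has to be recovered computationally from their overlaps.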

A considerable number of important plant genomes, such as grapevine (Vitis vinifera), [26] papaya (Carica papaya), [27] and cottonwood (Populus trichocarpa), [28] were sequenced and assembled with the Sanger WGS strategy.

The draft genome of grapevine [26] was the fourth genome published for a flowering plant and the first for a fruit crop. The sequences were obtained from different types of libraries, such as plasmids, fosmids and BACs. All the data were generated by paired-end sequencing of cloned inserts using Sanger technology on ABI 3730xl sequencers. The reads were assembled with Arachne (2002), [29] software designed to analyze reads obtained from both ends of plasmid clones. In total, 6.2 million paired-end tag reads were produced. The software produced 20,784 contigs that were combined into 3,830 supercontigs with an N50 value of 64 kb. The supercontigs had a total size of 498 Mb.
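The N50 value quoted for an assembly is the length of the sequence at which, going from longest to shortest, the running total first reaches half of the total assembly size. A minimal sketch of that calculation follows; the function name `n50` and the example lengths are illustrative and are not taken from the grapevine data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


# Total is 290, so half is 145; 80 + 70 = 150 is the first sum to reach it.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```

A higher N50 indicates that more of the assembly sits in long sequences, which is why it is commonly reported alongside the contig count and total size.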

The supercontigs were anchored along the genome by first joining them together using paired BAC end sequences. The resulting ultracontigs and the remaining supercontigs were then aligned along the genetic map of the genome. Later improvements of this strategy enabled the sequencing of Brachypodium distachyon, [30] Sorghum bicolor [31] and soybean. [32]

Next-generation sequencing

Because it is relatively cheap compared with previous methods, most recent plant genomes have been sequenced and assembled using data from next-generation sequencing (NGS) technology. In general, NGS data are used in combination with Sanger sequencing or with long reads obtained from third-generation sequencing. The genome of the cucumber (Cucumis sativus) [33] was one of the plant genomes that used NGS Illumina reads in combination with Sanger sequences. In total, high-quality base pairs amounting to 72.2-fold genome coverage were generated, of which 3.9-fold coverage came from Sanger sequencing and 68.3-fold coverage from Illumina GA reads. Two assemblies were produced, one per sequencing technology, and the resulting contigs were compared with each other, giving a total assembled genome length of 243.5 Mb. This is about 30% smaller than the genome size estimated by flow cytometry of isolated nuclei stained with propidium iodide (367 Mb). A genetic map was constructed to anchor the assembled genome, and 72.8% of the assembled sequences were successfully anchored onto the seven chromosomes.

Another plant genome that combined NGS with Sanger sequencing was that of Theobroma cacao (2010), [34] an economically important tropical fruit tree crop and the primary source of cocoa. The genome was sequenced by the International Cocoa Genome Sequencing consortium (ICGS), which produced a total of 17.6 million 454 single-end reads, 8.8 million 454 paired-end reads, 398.0 million Illumina paired-end reads and about 88,000 Sanger BAC reads. First, the genome assembly software Newbler was used to produce an assembly of 25,912 contigs and 4,792 scaffolds from the Roche/454 and Sanger raw data. This assembly had a total length of 326.9 Mb, which represents 76% of the estimated genome size. The Illumina reads were then used to complement the 454 assembly, by aligning the short reads to the cocoa genome assembly using the SOAP software.

A similar strategy combining NGS reads and Sanger sequencing was used for other important plant species, such as the first published apple genome (Malus domestica), [35] cotton (Gossypium raimondii), [36] the draft genome of sweet orange (Citrus sinensis) [37] and the genome of the domesticated tomato (Solanum lycopersicum). [38]
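The coverage and size figures quoted for cucumber and cacao are related by simple arithmetic, which can be checked with a short script. This is only a worked calculation on the numbers given in the text; the function name `percent_of_estimate` is hypothetical.

```python
def percent_of_estimate(assembled_mb: float, estimated_mb: float) -> float:
    """Assembled length as a percentage of the estimated genome size."""
    return 100 * assembled_mb / estimated_mb


# Cucumber: Sanger and Illumina coverage add up to the total fold coverage.
cucumber_total_fold = 3.9 + 68.3
# Cucumber: 243.5 Mb assembled against 367 Mb estimated by flow cytometry.
cucumber_pct = percent_of_estimate(243.5, 367)
# Cacao: 326.9 Mb is stated to be 76% of the estimate, which implies
# an estimated genome size of 326.9 / 0.76 Mb.
cacao_estimated_mb = 326.9 / 0.76

print(round(cucumber_total_fold, 1))  # 72.2
print(round(cucumber_pct, 1))         # 66.3
print(round(cacao_estimated_mb))      # 430
```

For cucumber, the assembly therefore covers roughly two thirds of the flow-cytometry estimate, in line with the "about 30% smaller" figure quoted above.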

Third-generation

With the emergence of third-generation sequencing (TGS), some of the limitations of previous methods of sequencing and assembling plant genomes have started to be addressed. This technology is characterized by the parallel sequencing of single molecules of DNA, which results in sequences up to 54 kbp in length (PacBio RS II). [39] In general, long reads from TGS have relatively high error rates (≈10% on average), [40] and therefore repeated sequencing of the same DNA fragments is required. The price of this technology is still quite high, so it is generally used in combination with short reads from NGS.

One of the first plant genomes to use long reads from TGS (Pacific Biosciences) in combination with short reads from NGS was that of spinach, [41] with a genome size estimated at 989 Mb. For this project, 60× coverage of the genome was generated, with 20% of the reads longer than 20 kb. Data were assembled using PacBio's hierarchical genome assembly process (HGAP), [42] and the long-read assembly showed a 63-fold improvement in contig size over an Illumina-only assembly.

Another recently published plant genome that used long reads in combination with short reads is the improved assembly of the apple genome. [43] This project used a hybrid approach combining data from several sequencing technologies: PacBio RS II, Illumina paired-end (PE) reads and Illumina mate-pair (MP) reads. As a first step, an assembly was produced from the Illumina paired-end reads using the well-known de novo assembly software SOAPdenovo. [44] Then, using the hybrid assembly pipeline DBG2OLC, [45] the contigs obtained in the first step were combined with the long reads from PacBio. The assembly was then polished with the Illumina paired-end reads by mapping them to the contigs using BWA-MEM. [46] The assembly was scaffolded by mapping the mate-pair reads onto the corrected contigs. Next, BioNano [47] optical maps with a total length of 649.7 Mb were used in the hybrid assembly pipeline together with the scaffolds obtained from the previous step. The resulting scaffolds were anchored to a genetic map constructed from 15,417 single-nucleotide polymorphism (SNP) markers. To better understand the number and diversity of the genes identified, RNA sequencing (RNA-seq) data were used. The resulting genome has a size of 643.2 Mb, closer to the estimated genome size than the previously published assembly, [35] and a smaller number of protein-coding genes.
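The reason repeated sequencing of the same fragment compensates for a high per-read error rate can be shown with a majority-vote consensus. The sketch below is deliberately simplified and is not how HGAP or any polishing tool works: it assumes the reads are already aligned to the same length and contain only substitution errors, whereas real long-read errors are largely insertions and deletions. The function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(aligned_reads: list[str]) -> str:
    """Majority base at each position of equally long, pre-aligned reads."""
    if not aligned_reads:
        raise ValueError("no reads given")
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("reads must all have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in column
        for column in zip(*aligned_reads)
    )


# Each read carries at most one error, but at a different position.
reads = ["ACGTAC", "ACGAAC", "TCGTAC"]
print(consensus(reads))  # ACGTAC
```

As long as errors fall at different positions in different reads, the majority base at each position recovers the underlying sequence, and the more reads cover a position, the more reliable the vote becomes.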

The use of long reads in plant genome assemblies has become more popular because they reduce the number of scaffolds and increase the quality of the genome by improving assembly and coverage in regions that are not clearly resolved by NGS assemblies.

References

  1. Moreau H, Verhelst B, Couloux A, Derelle E, Rombauts S, Grimsley N, et al. (August 2012). "Gene functionalities and genome structure in Bathycoccus prasinos reflect cellular specializations at the base of the green lineage". Genome Biology. 13 (8): R74. doi: 10.1186/gb-2012-13-8-r74 . PMC   3491373 . PMID   22925495.
  2. Gregory TR (January 2005). "The C-value enigma in plants and animals: a review of parallels and an appeal for partnership". Annals of Botany. 95 (1): 133–146. doi:10.1093/aob/mci009. PMC   4246714 . PMID   15596463.
  3. Zimin A, Stevens KA, Crepeau MW, Holtz-Morris A, Koriabine M, Marçais G, et al. (March 2014). "Sequencing and assembly of the 22-gb loblolly pine genome". Genetics. 196 (3): 875–890. doi:10.1534/genetics.113.159715. PMC   3948813 . PMID   24653210.
  4. Deschamps S, Campbell MA (2010-04-01). "Utilization of next-generation sequencing platforms in plant genomics and genetic variant discovery". Molecular Breeding. 25 (4): 553–570. doi:10.1007/s11032-009-9357-9. S2CID   29239452.
  5. Shendure J, Ji H (October 2008). "Next-generation DNA sequencing". Nature Biotechnology. 26 (10): 1135–1145. doi:10.1038/nbt1486. PMID   18846087. S2CID   6384349.
  6. Treangen TJ, Salzberg SL (November 2011). "Repetitive DNA and next-generation sequencing: computational challenges and solutions". Nature Reviews. Genetics. 13 (1): 36–46. doi:10.1038/nrg3117. PMC   3324860 . PMID   22124482.
  7. Harrison GE, Heslop-Harrison JS (February 1995). "Centromeric repetitive DNA sequences in the genus Brassica". Theoretical and Applied Genetics. 90 (2): 157–165. doi:10.1007/BF00222197. PMID   24173886. S2CID   20591213.
  8. Lanciano S, Carpentier MC, Llauro C, Jobet E, Robakowska-Hyzorek D, Lasserre E, et al. (February 2017). "Sequencing the extrachromosomal circular mobilome reveals retrotransposon activity in plants". PLOS Genetics. 13 (2): e1006630. doi: 10.1371/journal.pgen.1006630 . PMC   5338827 . PMID   28212378.
  9. Michael TP, VanBuren R (April 2015). "Progress, challenges and the future of crop genomes". Current Opinion in Plant Biology. 24: 71–81. Bibcode:2015COPB...24...71M. doi:10.1016/j.pbi.2015.02.002. PMID   25703261.
  10. Flavell RB, Gale MD, O'dell M, Murphy G, Moore G, Lucas H (1993). "Molecular organization of genes and repeats in the large cereal genomes and implications for the isolation of genes by chromosome walking". Chromosomes Today. Dordrecht: Springer. pp. 199–213. doi:10.1007/978-94-011-1510-0_16. ISBN   9789401046602.
  11. Meyers LA, Levin DA (June 2006). "On the abundance of polyploids in flowering plants". Evolution; International Journal of Organic Evolution. 60 (6): 1198–1206. doi:10.1554/05-629.1. PMID   16892970. S2CID   198156503.
  12. 1 2 The Arabidopsis Genome Initiative (December 2000). "Analysis of the genome sequence of the flowering plant Arabidopsis thaliana". Nature. 408 (6814): 796–815. Bibcode:2000Natur.408..796T. doi: 10.1038/35048692 . PMID   11130711.
  13. The C. elegans Sequencing Consortium (December 1998). "Genome sequence of the nematode C. elegans: a platform for investigating biology". Science. 282 (5396): 2012–2018. Bibcode:1998Sci...282.2012.. doi:10.1126/science.282.5396.2012. JSTOR   2897605. PMID   9851916.
  14. Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, et al. (March 2000). "The genome sequence of Drosophila melanogaster". Science. 287 (5461): 2185–2195. Bibcode:2000Sci...287.2185.. CiteSeerX   10.1.1.549.8639 . doi:10.1126/science.287.5461.2185. PMID   10731132.
  15. 1 2 3 Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, et al. (April 2002). "A draft sequence of the rice genome (Oryza sativa L. ssp. japonica)". Science. 296 (5565): 92–100. Bibcode:2002Sci...296...92G. doi:10.1126/science.1068275. PMID   11935018. S2CID   2960202.
  16. Bolser D, Staines DM, Pritchard E, Kersey P (2016). "Ensembl Plants: Integrating Tools for Visualizing, Mining, and Analyzing Plant Genomics Data". Plant Bioinformatics. Methods in Molecular Biology. Vol. 1374. Humana Press, New York, NY. pp. 115–140. doi:10.1007/978-1-4939-3167-5_6. ISBN   9781493931668. PMID   26519403.
  17. Gupta P, Naithani S, Tello-Ruiz MK, Chougule K, D'Eustachio P, Fabregat A, et al. (November 2016). "Gramene Database: Navigating Plant Comparative Genomics Resources". Current Plant Biology. 7–8: 10–15. Bibcode:2016CPBio...7...10G. doi:10.1016/j.cpb.2016.12.005. PMC   5509230 . PMID   28713666.
  18. Nakaya A, Ichihara H, Asamizu E, Shirasawa S, Nakamura Y, Tabata S, Hirakawa H (2017). "Plant Genome DataBase Japan (PGDBJ)". Plant Genomics Databases. Methods in Molecular Biology. Vol. 1533. New York, NY: Humana Press. pp. 45–77. doi:10.1007/978-1-4939-6658-5_3. ISBN   9781493966561. PMID   27987164.
  19. Spannagl M, Nussbaumer T, Bader K, Gundlach H, Mayer KF (2017). "PGSB/MIPS PlantsDB Database Framework for the Integration and Analysis of Plant Genome Data". Plant Genomics Databases. Methods in Molecular Biology. Vol. 1533. New York, NY: Humana Press. pp. 33–44. doi:10.1007/978-1-4939-6658-5_2. ISBN   9781493966561. PMID   27987163.
  20. Vandepoele K (2017). "A Guide to the PLAZA 3.0 Plant Comparative Genomic Database". Plant Genomics Databases. Methods in Molecular Biology. Vol. 1533. Humana Press, New York, NY. pp. 183–200. doi:10.1007/978-1-4939-6658-5_10. ISBN   9781493966561. PMID   27987171.
  21. Van Bel M, Silvestri F, Weitz EM, Kreft L, Botzki A, Coppens F, Vandepoele K (January 2022). "PLAZA 5.0: extending the scope and power of comparative and functional genomics in plants". Nucleic Acids Research. 50 (D1): D1468 –D1474. doi:10.1093/nar/gkab1024. PMC   8728282 . PMID   34747486.
  22. Reiser L, Berardini TZ, Li D, Muller R, Strait EM, Li Q, et al. (2016-01-01). "Sustainable funding for biocuration: The Arabidopsis Information Resource (TAIR) as a case study of a subscription-based funding model". Database. 2016: baw018. doi:10.1093/database/baw018. PMC   4795935 . PMID   26989150.
  23. 1 2 3 Schnable PS, Ware D, Fulton RS, Stein JC, Wei F, Pasternak S, et al. (November 2009). "The B73 maize genome: complexity, diversity, and dynamics". Science. 326 (5956): 1112–1115. Bibcode:2009Sci...326.1112S. doi:10.1126/science.1178534. PMID   19965430. S2CID   21433160.
  24. Feuillet C, Leach JE, Rogers J, Schnable PS, Eversole K (February 2011). "Crop genome sequencing: lessons and rationales". Trends in Plant Science. 16 (2): 77–88. Bibcode:2011TPS....16...77F. doi:10.1016/j.tplants.2010.10.005. PMID 21081278.
  25. Saegusa A (April 1999). "US firm's bid to sequence rice genome causes stir in Japan". Nature. 398 (6728): 545. Bibcode:1999Natur.398..545S. doi:10.1038/19123. PMID 10217128.
  26. Jaillon O, Aury JM, Noel B, Policriti A, Clepet C, Casagrande A, et al. (September 2007). "The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla". Nature. 449 (7161): 463–467. Bibcode:2007Natur.449..463J. doi:10.1038/nature06148. hdl:11577/2430527. PMID 17721507.
  27. Ming R, Hou S, Feng Y, Yu Q, Dionne-Laporte A, Saw JH, et al. (April 2008). "The draft genome of the transgenic tropical fruit tree papaya (Carica papaya Linnaeus)". Nature. 452 (7190): 991–996. Bibcode:2008Natur.452..991M. doi:10.1038/nature06856. PMC 2836516. PMID 18432245.
  28. Tuskan GA, Difazio S, Jansson S, Bohlmann J, Grigoriev I, Hellsten U, et al. (September 2006). "The genome of black cottonwood, Populus trichocarpa (Torr. & Gray)". Science (Submitted manuscript). 313 (5793): 1596–1604. Bibcode:2006Sci...313.1596T. doi:10.1126/science.1128691. PMID 16973872. S2CID 7717980. Archived from the original on 2023-05-29. Retrieved 2023-06-19.
  29. Swan KA, Curtis DE, McKusick KB, Voinov AV, Mapa FA, Cancilla MR (July 2002). "High-throughput gene mapping in Caenorhabditis elegans". Genome Research. 12 (7): 1100–1105. doi:10.1101/gr.208902. PMC 186621. PMID 12097347.
  30. The International Brachypodium Initiative; et al. (February 2010). "Genome sequencing and analysis of the model grass Brachypodium distachyon". Nature. 463 (7282): 763–768. Bibcode:2010Natur.463..763T. doi:10.1038/nature08747. PMID 20148030.
  31. Paterson AH, Bowers JE, Bruggmann R, Dubchak I, Grimwood J, Gundlach H, et al. (January 2009). "The Sorghum bicolor genome and the diversification of grasses". Nature. 457 (7229): 551–556. Bibcode:2009Natur.457..551P. doi:10.1038/nature07723. PMID 19189423.
  32. Schmutz J, Cannon SB, Schlueter J, Ma J, Mitros T, Nelson W, et al. (January 2010). "Genome sequence of the palaeopolyploid soybean". Nature. 463 (7278): 178–183. Bibcode:2010Natur.463..178S. doi:10.1038/nature08670. PMID 20075913. S2CID 4372224.
  33. Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, et al. (December 2009). "The genome of the cucumber, Cucumis sativus L". Nature Genetics. 41 (12): 1275–1281. doi:10.1038/ng.475. PMID 19881527.
  34. Argout X, Salse J, Aury JM, Guiltinan MJ, Droc G, Gouzy J, et al. (February 2011). "The genome of Theobroma cacao". Nature Genetics. 43 (2): 101–108. doi:10.1038/ng.736. PMID 21186351. S2CID 4685532.
  35. Velasco R, Zharkikh A, Affourtit J, Dhingra A, Cestaro A, Kalyanaraman A, et al. (October 2010). "The genome of the domesticated apple (Malus × domestica Borkh.)". Nature Genetics. 42 (10): 833–839. doi:10.1038/ng.654. PMID 20802477.
  36. Wang K, Wang Z, Li F, Ye W, Wang J, Song G, et al. (October 2012). "The draft genome of a diploid cotton Gossypium raimondii". Nature Genetics. 44 (10): 1098–1103. doi:10.1038/ng.2371. PMID 22922876. S2CID 38495587.
  37. Xu Q, Chen LL, Ruan X, Chen D, Zhu A, Chen C, et al. (January 2013). "The draft genome of sweet orange (Citrus sinensis)". Nature Genetics. 45 (1): 59–66. doi:10.1038/ng.2472. PMID 23179022.
  38. Tomato Genome Consortium (May 2012). "The tomato genome sequence provides insights into fleshy fruit evolution". Nature. 485 (7400): 635–641. Bibcode:2012Natur.485..635T. doi:10.1038/nature11119. PMC 3378239. PMID 22660326.
  39. Bleidorn C (2015). "Third generation sequencing: technology and its potential impact on evolutionary biodiversity research". Systematics and Biodiversity. 14 (1): 1. Bibcode:2016SyBio..14....1B. doi:10.1080/14772000.2015.1099575.
  40. Lee H, Gurtowski J, Yoo S, Nattestad M, Marcus S, Goodwin S, McCombie WR, Schatz M (2016-04-13). "Third-generation sequencing and the future of genomics". bioRxiv: 048603. doi:10.1101/048603.
  41. van Deynze A (2015). "Using spinach to compare technologies for whole genome assemblies". Plant & Animal Genomics XXIII Conference.
  42. Chin CS, Alexander DH, Marks P, Klammer AA, Drake J, Heiner C, et al. (June 2013). "Nonhybrid, finished microbial genome assemblies from long-read SMRT sequencing data". Nature Methods. 10 (6): 563–569. doi:10.1038/nmeth.2474. PMID 23644548. S2CID 205421576.
  43. Daccord N, Celton JM, Linsmith G, Becker C, Choisne N, Schijlen E, et al. (July 2017). "High-quality de novo assembly of the apple genome and methylome dynamics of early fruit development". Nature Genetics. 49 (7): 1099–1106. doi:10.1038/ng.3886. hdl:10449/42064. PMID 28581499. S2CID 24690391.
  44. Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. (December 2012). "SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler". GigaScience. 1 (1): 18. doi:10.1186/2047-217X-1-18. PMC 3626529. PMID 23587118. S2CID 2681931.
  45. Ye C, Hill CM, Wu S, Ruan J, Ma ZS (August 2016). "DBG2OLC: Efficient Assembly of Large Genomes Using Long Erroneous Reads of the Third Generation Sequencing Technologies". Scientific Reports. 6 (1): 31900. Bibcode:2016NatSR...631900Y. doi:10.1038/srep31900. PMC 5004134. PMID 27573208.
  46. Li H (2013). "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM". arXiv:1303.3997 [q-bio.GN].
  47. "Bionano: Transforming the Way the World Sees the Genome". bionanogenomics.