Genome project

Last updated
When printed, the human genome sequence fills around 100 huge books of close print The Genome sequence when printed fills a huge book of close print.JPG
When printed, the human genome sequence fills around 100 huge books of close print

Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a virus) and to annotate protein-coding genes and other important genome-encoded features. [1] The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome. For the human species, whose genome includes 22 pairs of autosomes and 2 sex chromosomes, a complete genome sequence will involve 46 separate chromosome sequences.

Contents

The Human Genome Project is a well known example of a genome project. [2]

Genome assembly

Genome assembly refers to the process of taking a large number of short DNA sequences and reassembling them to create a representation of the original chromosomes from which the DNA originated. In a shotgun sequencing project, all the DNA from a source (usually a single organism, anything from a bacterium to a mammal) is first fractured into millions of small pieces. These pieces are then "read" by automated sequencing machines. A genome assembly algorithm works by taking all the pieces and aligning them to one another, and detecting all places where two of the short sequences, or reads, overlap. These overlapping reads can be merged, and the process continues.

Genome assembly is a very difficult computational problem, made more difficult because many genomes contain large numbers of identical sequences, known as repeats. These repeats can be thousands of nucleotides long, and occur different locations, especially in the large genomes of plants and animals.

The resulting (draft) genome sequence is produced by combining the information sequenced contigs and then employing linking information to create scaffolds. Scaffolds are positioned along the physical map of the chromosomes creating a "golden path".

Assembly software

Originally, most large-scale DNA sequencing centers developed their own software for assembling the sequences that they produced. However, this has changed as the software has grown more complex and as the number of sequencing centers has increased. An example of such assembler Short Oligonucleotide Analysis Package developed by BGI for de novo assembly of human-sized genomes, alignment, SNP detection, resequencing, indel finding, and structural variation analysis. [3] [4] [5]

Genome annotation

Since the 1980s, molecular biology and bioinformatics have created the need for DNA annotation. DNA annotation or genome annotation is the process of identifying attaching biological information to sequences , and particularly in identifying the locations of genes and determining what those genes do.

Time of completion

When sequencing a genome, there are usually regions that are difficult to sequence (often regions with highly repetitive DNA). Thus, 'completed' genome sequences are rarely ever complete, and terms such as 'working draft' or 'essentially complete' have been used to more accurately describe the status of such genome projects. Even when every base pair of a genome sequence has been determined, there are still likely to be errors present because DNA sequencing is not a completely accurate process. It could also be argued that a complete genome project should include the sequences of mitochondria and (for plants) chloroplasts as these organelles have their own genomes.

It is often reported that the goal of sequencing a genome is to obtain information about the complete set of genes in that particular genome sequence. The proportion of a genome that encodes for genes may be very small (particularly in eukaryotes such as humans, where coding DNA may only account for a few percent of the entire sequence). However, it is not always possible (or desirable) to only sequence the coding regions separately. Also, as scientists understand more about the role of this noncoding DNA (often referred to as junk DNA), it will become more important to have a complete genome sequence as a background to understanding the genetics and biology of any given organism.

In many ways genome projects do not confine themselves to only determining a DNA sequence of an organism. Such projects may also include gene prediction to find out where the genes are in a genome, and what those genes do. There may also be related projects to sequence ESTs or mRNAs to help find out where the genes actually are.

Historical and technological perspectives

Historically, when sequencing eukaryotic genomes (such as the worm Caenorhabditis elegans ) it was common to first map the genome to provide a series of landmarks across the genome. Rather than sequence a chromosome in one go, it would be sequenced piece by piece (with the prior knowledge of approximately where that piece is located on the larger chromosome). Changes in technology and in particular improvements to the processing power of computers, means that genomes can now be 'shotgun sequenced' in one go (there are caveats to this approach though when compared to the traditional approach).

Improvements in DNA sequencing technology has meant that the cost of sequencing a new genome sequence has steadily fallen (in terms of cost per base pair) and newer technology has also meant that genomes can be sequenced far more quickly.

When research agencies decide what new genomes to sequence, the emphasis has been on species which are either high importance as model organism or have a relevance to human health (e.g. pathogenic bacteria or vectors of disease such as mosquitos) or species which have commercial importance (e.g. livestock and crop plants). Secondary emphasis is placed on species whose genomes will help answer important questions in molecular evolution (e.g. the common chimpanzee).

In the future, it is likely that it will become even cheaper and quicker to sequence a genome. This will allow for complete genome sequences to be determined from many different individuals of the same species. For humans, this will allow us to better understand aspects of human genetic diversity.

Examples

L1 Dominette 01449, the Hereford who serves as the subject of the Bovine Genome Project Hereford67-300.jpg
L1 Dominette 01449, the Hereford who serves as the subject of the Bovine Genome Project
The Giant Sequoia genome sequence was extracted from a single fertilized seed harvested from a 1,360-year-old tree in Sequoia/Kings Canyon National Park. Giant sequoias in Giant Sequoia National Monument.jpg
The Giant Sequoia genome sequence was extracted from a single fertilized seed harvested from a 1,360-year-old tree in Sequoia/Kings Canyon National Park.

Many organisms have genome projects that have either been completed or will be completed shortly, including:

See also

Related Research Articles

<span class="mw-page-title-main">Genome</span> All genetic material of an organism

In the fields of molecular biology and genetics, a genome is all the genetic information of an organism. It consists of nucleotide sequences of DNA. The nuclear genome includes protein-coding genes and non-coding genes, other functional regions of the genome such as regulatory sequences, and often a substantial fraction of 'junk' DNA with no evident function. Almost all eukaryotes have mitochondria and a small mitochondrial genome. Algae and plants also contain chloroplasts with a chloroplast genome.

<span class="mw-page-title-main">Human genome</span> Complete set of nucleic acid sequences for humans

The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly-repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.

<span class="mw-page-title-main">Genomics</span> Discipline in genetics

Genomics is an interdisciplinary field of biology focusing on the structure, function, evolution, mapping, and editing of genomes. A genome is an organism's complete set of DNA, including all of its genes as well as its hierarchical, three-dimensional structural configuration. In contrast to genetics, which refers to the study of individual genes and their roles in inheritance, genomics aims at the collective characterization and quantification of all of an organism's genes, their interrelations and influence on the organism. Genes may direct the production of proteins with the assistance of enzymes and messenger molecules. In turn, proteins make up body structures such as organs and tissues as well as control chemical reactions and carry signals between cells. Genomics also involves the sequencing and analysis of genomes through uses of high throughput DNA sequencing and bioinformatics to assemble and analyze the function and structure of entire genomes. Advances in genomics have triggered a revolution in discovery-based research and systems biology to facilitate understanding of even the most complex biological systems such as the brain.

<span class="mw-page-title-main">BGI Group</span> Chinese genome sequencing company

BGI Group, formerly Beijing Genomics Institute, is a Chinese genomics company with headquarters in Yantian District, Shenzhen. The company was originally formed in 1999 as a genetics research center to participate in the Human Genome Project. It also sequences the genomes of other animals, plants and microorganisms.

A genomic library is a collection of the total genomic DNA from a single organism. The DNA is stored in a population of identical vectors, each containing a different insert of DNA. In order to construct a genomic library, the organism's DNA is extracted from cells and then digested with a restriction enzyme to cut the DNA into fragments of a specific size. The fragments are then inserted into the vector using DNA ligase. Next, the vector DNA can be taken up by a host organism - commonly a population of Escherichia coli or yeast - with each cell containing only one vector molecule. Using a host cell to carry the vector allows for easy amplification and retrieval of specific clones from the library for analysis.

<span class="mw-page-title-main">Human Genome Project</span> Human genome sequencing programme

The Human Genome Project (HGP) was an international scientific research project with the goal of determining the base pairs that make up human DNA, and of identifying, mapping and sequencing all of the genes of the human genome from both a physical and a functional standpoint. It started in 1990 and was completed in 2003. It remains the world's largest collaborative biological project. Planning for the project started after it was adopted in 1984 by the US government, and it officially launched in 1990. It was declared complete on April 14, 2003, and included about 92% of the genome. Level "complete genome" was achieved in May 2021, with a remaining only 0.3% bases covered by potential issues. The final gapless assembly was finished in January 2022.

<span class="mw-page-title-main">GFER</span> Protein-coding gene in the species Homo sapiens

Growth factor, augmenter of liver regeneration , also known as GFER, or Hepatopoietin is a protein which in humans is encoded by the GFER gene. This gene is also known as essential for respiration and vegatative growth, augmenter of liver regeneration, and growth factor of Erv1-like/Hepatic regenerative stimulation substance.

<span class="mw-page-title-main">UBAP1</span>

Ubiquitin-associated protein 1 is a protein that in humans is encoded by the UBAP1 gene.

<span class="mw-page-title-main">ARID2</span> Protein-coding gene in the species Homo sapiens

AT-rich interactive domain-containing protein 2 (ARID2) is a protein that in humans is encoded by the ARID2 gene.

<span class="mw-page-title-main">Mitochondrial ribosomal protein L32</span>

39S ribosomal protein L32, mitochondrial is a protein that in humans is encoded by the MRPL32 gene.

<span class="mw-page-title-main">Whole genome sequencing</span> Determining nearly the entirety of the DNA sequence of an organisms genome at a single time.

Whole genome sequencing (WGS), also known as full genome sequencing, complete genome sequencing, or entire genome sequencing, is the process of determining the entirety, or nearly the entirety, of the DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast.

Complete Genomics is a life sciences company that has developed and commercialized a DNA sequencing platform for human genome sequencing and analysis. This solution combines the company's proprietary human genome sequencing technology with its informatics and data management software to provide finished variant reports and assemblies at Complete Genomics’ own commercial genome center in Mountain View, California.

SOAP is a suite of bioinformatics software tools from the BGI Bioinformatics department enabling the assembly, alignment, and analysis of next generation DNA sequencing data. It is particularly suited to short read sequencing data.

<span class="mw-page-title-main">Reference genome</span>

A reference genome is a digital nucleic acid sequence database, assembled by scientists as a representative example of the set of genes in one idealized individual organism of a species. As they are assembled from the sequencing of DNA from a number of individual donors, reference genomes do not accurately represent the set of genes of any single individual organism. Instead a reference provides a haploid mosaic of different DNA sequences from each donor. For example, the most recent human reference genome is derived from >60 genomic clone libraries. There are reference genomes for multiple species of viruses, bacteria, fungus, plants, and animals. Reference genomes are typically used as a guide on which new genomes are built, enabling them to be assembled much more quickly and cheaply than the initial Human Genome Project. Reference genomes can be accessed online at several locations, using dedicated browsers such as Ensembl or UCSC Genome Browser.

<span class="mw-page-title-main">SLC52A3</span>

Solute carrier family 52, member 3, formerly known as chromosome 20 open reading frame 54 and riboflavin transporter 2, is a protein that in humans is encoded by the SLC52A3 gene.

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

A plant genome assembly represents the complete genomic sequence of a plant species, which is assembled into chromosomes and other organelles by using DNA fragments that are obtained from different types of sequencing technology.

The Vertebrate Genomes Project (VGP) is a project which aims to generate high-quality, complete reference genomes of all 66,000 vertebrate species. It is an international cooperation project with members from more than 50 separate institutions and was launched in February 2017.

References

  1. Pevsner, Jonathan (2009). Bioinformatics and functional genomics (2nd ed.). Hoboken, N.J: Wiley-Blackwell. ISBN   9780470085851.
  2. "Potential Benefits of Human Genome Project Research". Department of Energy, Human Genome Project Information. 2009-10-09. Archived from the original on 2013-07-08. Retrieved 2010-06-18.
  3. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J (February 2010). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20 (2): 265–272. doi:10.1101/gr.097261.109. ISSN   1549-5469. PMC   2813482 . PMID   20019144.
  4. 1 2 Rasmussen M, Li Y, Lindgreen S, Pedersen JS, Albrechtsen A, Moltke I, Metspalu M, Metspalu E, Kivisild T, Gupta R, Bertalan M, Nielsen K, Gilbert MT, Wang Y, Raghavan M, Campos PF, Kamp HM, Wilson AS, Gledhill A, Tridico S, Bunce M, Lorenzen ED, Binladen J, Guo X, Zhao J, Zhang X, Zhang H, Li Z, Chen M, Orlando L, Kristiansen K, Bak M, Tommerup N, Bendixen C, Pierre TL, Grønnow B, Meldgaard M, Andreasen C, Fedorova SA, Osipova LP, Higham TF, Ramsey CB, Hansen TV, Nielsen FC, Crawford MH, Brunak S, Sicheritz-Pontén T, Villems R, Nielsen R, Krogh A, Wang J, Willerslev E (2010-02-11). "Ancient human genome sequence of an extinct Palaeo-Eskimo". Nature. 463 (7282): 757–762. Bibcode:2010Natur.463..757R. doi:10.1038/nature08835. ISSN   1476-4687. PMC   3951495 . PMID   20148029.
  5. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, Li G, Yang Z, Zhang G, Yang B, Yu C, Liang F, Li W, Li S, Li D, Ni P, Ruan J, Li Q, Zhu H, Liu D, Lu Z, Li N, Guo G, Zhang J, Ye J, Fang L, Hao Q, Chen Q, Liang Y, Su Y, San A, Ping C, Yang S, Chen F, Li L, Zhou K, Zheng H, Ren Y, Yang L, Gao Y, Yang G, Li Z, Feng X, Kristiansen K, Wong GK, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J (2008-11-06). "The diploid genome sequence of an Asian individual". Nature. 456 (7218): 60–65. Bibcode:2008Natur.456...60W. doi:10.1038/nature07484. ISSN   0028-0836. PMC   2716080 . PMID   18987735.
  6. Ghosh, Pallab (23 April 2015). "Mammoth genome sequence completed". BBC News.
  7. Yates, Diana (2009-04-23). "What makes a cow a cow? Genome sequence sheds light on ruminant evolution" (Press Release). EurekAlert!. Retrieved 2012-12-22.
  8. Elsik, C. G.; Elsik, R. L.; Tellam, K. C.; Worley, R. A.; Gibbs, D. M.; Muzny, G. M.; Weinstock, D. L.; Adelson, E. E.; Eichler, L.; Elnitski, R.; Guigó, D. L.; Hamernik, S. M.; Kappes, H. A.; Lewin, D. J.; Lynn, F. W.; Nicholas, A.; Reymond, M.; Rijnkels, L. C.; Skow, E. M.; Zdobnov, L.; Schook, J.; Womack, T.; Alioto, S. E.; Antonarakis, A.; Astashyn, C. E.; Chapple, H. -C.; Chen, J.; Chrast, F.; Câmara, O.; et al. (2009). "The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution". Science. 324 (5926): 522–528. Bibcode:2009Sci...324..522A. doi:10.1126/science.1169588. PMC   2943200 . PMID   19390049.
  9. "2007 Release: Horse Genome Assembled". National Human Genome Research Institute (NHGRI). Retrieved 19 April 2018.
  10. Scott, Alison D; Zimin, Aleksey V; Puiu, Daniela; Workman, Rachael; Britton, Monica; Zaman, Sumaira; Caballero, Madison; Read, Andrew C; Bogdanove, Adam J; Burns, Emily; Wegrzyn, Jill; Timp, Winston; Salzberg, Steven L; Neale, David B (November 1, 2020). "A Reference Genome Sequence for Giant Sequoia". G3: Genes, Genomes, Genetics. 10 (11): 3907–3919. doi:10.1534/g3.120.401612. PMC   7642918 . PMID   32948606.