Ensembl genome database project

Description: Ensembl
Primary citation: Yates, et al. (2020) [1]
Website: www.ensembl.org

The Ensembl genome database project is a scientific project at the European Bioinformatics Institute that provides a centralized resource for geneticists, molecular biologists and other researchers studying the genomes of humans, other vertebrates and model organisms. [2] [3] [4] Ensembl is one of several well-known genome browsers for the retrieval of genomic information.

Similar databases and browsers are found at NCBI and the University of California, Santa Cruz (UCSC).

History

The human genome consists of three billion base pairs, which code for approximately 20,000–25,000 genes. However, the genome alone is of little use unless the locations and relationships of individual genes can be identified. One option is manual annotation, whereby a team of scientists tries to locate genes using experimental data from scientific journals and public databases. However, this is a slow, painstaking task. The alternative, known as automated annotation, is to use the power of computers to do the complex pattern-matching of protein to DNA. [5] [6] The Ensembl project was launched in 1999 in response to the imminent completion of the Human Genome Project, with the initial goals of automatically annotating the human genome, integrating this annotation with available biological data and making all this knowledge publicly available. [2]

In the Ensembl project, sequence data are fed into the gene annotation system (a collection of software "pipelines" written in Perl), which creates a set of predicted gene locations and saves them in a MySQL database for subsequent analysis and display. Ensembl makes these data freely accessible to the world research community. All the data and code produced by the Ensembl project are available to download, [7] and there is also a publicly accessible database server allowing remote access. In addition, the Ensembl website provides computer-generated visual displays of much of the data.

Over time the project has expanded to include additional species (including key model organisms such as mouse, fruitfly and zebrafish) as well as a wider range of genomic data, including genetic variations and regulatory features. Since April 2009, a sister project, Ensembl Genomes, has extended the scope of Ensembl into invertebrate metazoa, plants, fungi, bacteria, and protists, focusing on providing taxonomic and evolutionary context to genes, whilst the original project continues to focus on vertebrates. [8] [9]

As of 2020, Ensembl supported over 50,000 genomes across both the Ensembl and Ensembl Genomes databases, and had added new features such as Rapid Release, a website designed to make genome annotation data available to users more quickly, and a COVID-19 website providing access to the SARS-CoV-2 reference genome.

Displaying genomic data

Gene SGCB aligned to the human genome

Central to the Ensembl concept is the ability to automatically generate graphical views of the alignment of genes and other genomic data against a reference genome. These are shown as data tracks, and individual tracks can be turned on and off, allowing the user to customise the display to suit their research interests. The interface also enables the user to zoom in to a region or move along the genome in either direction.

Other displays show data at varying levels of resolution, from whole karyotypes down to text-based representations of DNA and amino acid sequences, or present other types of display such as trees of similar genes (homologues) across a range of species. The graphics are complemented by tabular displays, and in many cases data can be exported directly from the page in a variety of standard file formats such as FASTA.

Externally produced data can also be added to the display by uploading a suitable file in one of the supported formats, such as BAM, BED, or PSL.
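
As a rough illustration of what such an upload file might contain, the sketch below writes a few intervals in a six-column BED layout (tab-separated, with zero-based, end-exclusive coordinates). The helper names `to_bed_line` and `to_bed` and the example coordinates are invented for illustration; they are not real annotations and not part of Ensembl.

```python
def to_bed_line(chrom, start_1based, end_1based, name=".", score=0, strand="+"):
    """Format one feature as a 6-column BED line.

    BED uses 0-based, half-open coordinates, so a 1-based inclusive
    interval [start, end] becomes (start - 1, end).
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    fields = [chrom, start_1based - 1, end_1based, name, score, strand]
    return "\t".join(str(f) for f in fields)


def to_bed(features):
    """Join formatted features into BED text, one feature per line."""
    return "\n".join(to_bed_line(*f) for f in features) + "\n"


# Hypothetical features: (chromosome, start, end, name, score, strand)
example = [
    ("4", 1000, 2000, "region_a", 0, "+"),
    ("4", 5000, 5500, "region_b", 0, "-"),
]
bed_text = to_bed(example)
```

The resulting text could be saved to a `.bed` file and offered to the browser as a custom track; the coordinate conversion is the detail most often got wrong when preparing such files by hand.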

Graphics are generated using a suite of custom Perl modules based on GD, the standard Perl graphics display library.

Alternative access methods

In addition to its website, Ensembl provides a REST API and a Perl API (application programming interface). [10] The Perl API models biological objects such as genes and proteins, allowing simple scripts to be written to retrieve data of interest. The same API is used internally by the web interface to display the data. It is divided into sections such as the core API, the compara API (for comparative genomics data), the variation API (for accessing SNPs, SNVs, CNVs, etc.), and the functional genomics API (for accessing regulatory data). The Ensembl website provides extensive information on how to install and use the API.
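
A minimal sketch of calling the REST API from Python follows. It assumes the public server at `https://rest.ensembl.org` and its `lookup/symbol` endpoint, and that the JSON response carries `id`, `seq_region_name`, `start` and `end` fields; these details should be checked against the current REST documentation. The function names are illustrative only.

```python
import json
import urllib.parse
import urllib.request

REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server


def lookup_url(species, symbol):
    """Build the URL for the (assumed) 'lookup/symbol' endpoint."""
    path = "/lookup/symbol/{}/{}".format(
        urllib.parse.quote(species), urllib.parse.quote(symbol)
    )
    return REST_SERVER + path + "?content-type=application/json"


def summarise(record):
    """Reduce a lookup response to (stable ID, 'region:start-end')."""
    location = "{}:{}-{}".format(
        record["seq_region_name"], record["start"], record["end"]
    )
    return record["id"], location


def fetch_gene(species, symbol):
    """Perform the HTTP request and decode the JSON body (needs network)."""
    with urllib.request.urlopen(lookup_url(species, symbol), timeout=30) as resp:
        return json.load(resp)
```

With network access, `summarise(fetch_gene("homo_sapiens", "SGCB"))` would be expected to return the gene's stable identifier and genomic location. Keeping URL construction and response parsing separate from the network call makes both easy to check offline.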

This software can be used to access the public MySQL database, avoiding the need to download enormous datasets. Users can also retrieve data from the MySQL database with direct SQL queries, but this requires extensive knowledge of the current database schema.
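
The kind of query involved can be sketched against a toy in-memory SQLite database rather than the real MySQL server. The two tables below are a heavily simplified, assumed subset of the core schema (a `gene` table joined to a `seq_region` table); the real schema has many more tables and columns and may differ between releases, and every row here is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE seq_region (
        seq_region_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL
    );
    CREATE TABLE gene (
        gene_id          INTEGER PRIMARY KEY,
        stable_id        TEXT NOT NULL,
        biotype          TEXT NOT NULL,
        seq_region_id    INTEGER NOT NULL REFERENCES seq_region(seq_region_id),
        seq_region_start INTEGER NOT NULL,
        seq_region_end   INTEGER NOT NULL
    );
    """
)
# Invented rows for illustration only.
conn.executemany("INSERT INTO seq_region VALUES (?, ?)", [(1, "4"), (2, "X")])
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "GENE0001", "protein_coding", 1, 100, 900),
        (2, "GENE0002", "lncRNA", 1, 1500, 2500),
        (3, "GENE0003", "protein_coding", 2, 50, 400),
    ],
)


def genes_overlapping(conn, region, start, end):
    """Return (stable_id, biotype) for genes overlapping region:start-end."""
    rows = conn.execute(
        """
        SELECT g.stable_id, g.biotype
        FROM gene AS g
        JOIN seq_region AS sr ON sr.seq_region_id = g.seq_region_id
        WHERE sr.name = ?
          AND g.seq_region_start <= ?
          AND g.seq_region_end >= ?
        ORDER BY g.seq_region_start
        """,
        (region, end, start),
    )
    return rows.fetchall()


hits = genes_overlapping(conn, "4", 800, 1600)
```

Even in this toy form, the query has to know that gene coordinates live in one table and region names in another, which is the sort of schema knowledge the API otherwise hides.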

Large datasets can be retrieved using the BioMart data-mining tool. It provides a web interface for downloading datasets using complex queries.
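
BioMart queries can also be written as small XML documents and submitted programmatically. The sketch below only builds such a document with the Python standard library; the element layout follows the commonly used BioMart query format, and the dataset, filter and attribute names in the usage note are assumptions that should be checked against the live service.

```python
import xml.etree.ElementTree as ET


def build_query(dataset, filters, attributes):
    """Return a BioMart-style XML query string.

    dataset    -- dataset name (assumed, e.g. a human gene dataset)
    filters    -- mapping of filter name to value
    attributes -- list of attribute (output column) names
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")
```

For example, `build_query("hsapiens_gene_ensembl", {"chromosome_name": "4"}, ["ensembl_gene_id", "external_gene_name"])` would describe a request for gene identifiers and names on chromosome 4, assuming those names exist in the target mart.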

Lastly, there is an FTP server which can be used to download entire MySQL databases as well as some selected data sets in other formats.

Current species

The annotated genomes include most fully sequenced vertebrates and selected model organisms. All of them are eukaryotes; there are no prokaryotes. As of 2022, 271 species are registered, including: [11]

Species
Chordata
  Mammalia
    Euarchontoglires
      Primates: Angola colobus, black-capped squirrel monkey, black snub-nosed monkey, bonobo, bushbaby, capuchin, chimpanzee, common marmoset, Coquerel's sifaka, crab-eating macaque, drill, human, macaque, mouse lemur, gelada, gibbon, golden snub-nosed monkey, gorilla, greater bamboo lemur, green monkey, Ma's night monkey, olive baboon, orangutan, pig-tailed macaque, sooty mangabey, tarsier, Ugandan red colobus
      Scandentia: tree shrew
      Glires (Rodents + Lagomorphs): Algerian mouse, alpine marmot, american beaver, arctic ground squirrel, Brazilian guineapig, chinese hamster, damaraland mole rat, daurian ground squirrel, degu, eurasian red squirrel, golden hamster, ground squirrel, guineapig, kangaroo rat, lesser Egyptian jerboa, long-tailed chinchilla, mongolian gerbil, mouse, naked mole-rat, North American deermouse, rat, pika, prairie vole, rabbit, Ryukyu mouse, shrew mouse, steppe mouse, thirteen-lined ground squirrel, Upper Galilee mountains blind mole rat
    Laurasiatheria: Alpaca, american bison, american black bear, american mink, Arabian camel, asian black bear, beluga whale, blue whale, chacoan peccary, California sea lion, Canada lynx, cat, cow, dingo, dog, dolphin, domestic yak, donkey, goat, ferret, giant panda, greater horseshoe bat, hedgehog, horse, leopard, lesser hedgehog tenrec, lion, meerkat, megabat, microbat, narwhal, polar bear, pig, red fox, sheep, shrew, Siberian musk deer, sperm whale, Siberian tiger, vaquita, wild yak, yarkand deer, zebu
    Afrotheria: Elephant, hyrax, tenrec
    Xenarthra: Armadillo, sloth
    Marsupialia: Common wombat, koala, opossum, Tasmanian devil, wallaby
    Monotremes: Platypus
  Reptilia: Argentine black and white tegu, blue-ringed sea krait, central bearded dragon, chinese softshell turtle, common snapping turtle, common wall lizard, desert tortoise, eastern brown snake, saltwater crocodile, Goode's thornscrub tortoise, green anole, indian cobra, komodo dragon, mainland tiger snake, painted turtle, Pinta Island tortoise, three-toed box turtle, tuatara, West African mud turtle
  Birds: African ostrich, bengalese finch, blue-crowned manakin, blue tit, budgerigar, burrowing owl, chicken, chicken (Red junglefowl), chicken (maternal Broiler), chicken (paternal White leghorn layer), chilean tinamou, collared flycatcher, common canary, common kestrel, dark-eyed junco, duck, eastern buzzard, eastern spot-billed duck, emu, eurasian eagle-owl, eurasian sparrowhawk, golden eagle, golden pheasant, golden-collared manakin, gouldian finch, great tit, great spotted kiwi, helmeted guineafowl, indian peafowl, japanese quail, kakapo, little spotted kiwi, mallard, medium ground finch, muscovy duck, New Caledonian crow, northern spotted owl, okarito brown kiwi, oriental scops owl, pink-footed goose, ring-necked pheasant, ruff, rufous-capped babbler, silver-eye, small tree finch, spoon-billed sandpiper, superb fairywren, Swainson's thrush, swan goose, turkey, white-throated sparrow, yellow-billed amazon, zebra finch
  Lissamphibia: Leisan spiny toad, Xenopus tropicalis
  Teleosts: Amazon molly, asian arowana, atlantic cod, atlantic herring, atlantic salmon, ballan wrasse, barramundi perch, bicolor damselfish, blind barbel, blue tilapia, blunt-snouted clingfish, brown trout, Burton's mouthbrooder, channel bull blenny, channel catfish, chinese medaka, chinook salmon, climbing perch, clown anemonefish, coelacanth, coho salmon, common carp, denticle herring, eastern happy, electric eel, elephant shark, european bass, gilthead bream, golden-line barbel, goldfish, greater amberjack, guppy, horned golden-line barbel, huchen, indian glassy fish, indian medaka, japanese medaka, javanese ricefish, jewelled blenny, large yellow croaker, live sharksucker, lumpfish, lyretail cichlid, Makobe island cichlid, mangrove rivulus, mexican tetra, Midas cichlid, Monterrey platyfish, mummichog, Nile tilapia, northern pike, ocean sunfish, orange clownfish, orbiculate cardinalfish, Paramormyrops kingsleyae, Periophthalmus magnuspinnatus, pike-perch, pinecone soldierfish, platyfish, rainbow trout, red-bellied piranha, reedfish, round goby, sailfin molly, sheepshead minnow, shortfin molly, Siamese fighting fish, spiny chromis, spotted gar, swamp eel, tetraodon, three-spined stickleback, tiger tail seahorse, tongue sole, turbot, turquoise killifish, western mosquitofish, yellowtail amberjack, Takifugu rubripes (fugu), zebrafish, zebra mbuna, zigzag eel
  Cyclostomata: Petromyzon marinus (sea lamprey), hagfish
  Tunicates: Ciona intestinalis, Ciona savignyi
Invertebrates
  Insects: Drosophila melanogaster (fruitfly), Anopheles gambiae (mosquito), Aedes aegypti (mosquito)
  Worms: Caenorhabditis elegans
Yeast: Saccharomyces cerevisiae (baker's yeast)

Open source/mirrors

All data produced by the Ensembl project are open access and all software is open source, freely available to the scientific community under a CC BY 4.0 license. Currently, the Ensembl website is mirrored at four locations worldwide to improve the service.

Official mirror sites
UK (Sanger Institute): main website
US West (Amazon AWS): cloud-based mirror on the West Coast of the United States
US East (Amazon AWS): cloud-based mirror on the East Coast of the United States
Asia (Amazon AWS): cloud-based mirror in Singapore

References

  1. Yates A. D.; et al. (January 2020). "Ensembl 2020". Nucleic Acids Res. 48 (D1): D682–D688. doi:10.1093/nar/gkz966. PMC 7145704. PMID 31691826.
  2. Hubbard, T. (1 January 2002). "The Ensembl genome database project". Nucleic Acids Research. 30 (1): 38–41. doi:10.1093/nar/30.1.38. PMC 99161. PMID 11752248.
  3. Flicek P, Amode MR, Barrell D, et al. (November 2010). "Ensembl 2011". Nucleic Acids Res. 39 (Database issue): D800–D806. doi:10.1093/nar/gkq1064. PMC 3013672. PMID 21045057.
  4. Flicek P, Aken BL, Ballester B, et al. (January 2010). "Ensembl's 10th year". Nucleic Acids Res. 38 (Database issue): D557–62. doi:10.1093/nar/gkp972. PMC 2808936. PMID 19906699.
  5. Davis, Charles Patrick (29 March 2021). "Medical definition of Genome Annotation". Archived from the original on 14 June 2021. Retrieved 7 August 2022.
  6. Curwen, Val; Eyras, Eduardo; Andrews, T. Daniel; Clarke, Laura; Mongin, Emmanuel; Searle, Steven M. J.; Clamp, Michele (May 2004). "The Ensembl automatic gene annotation system". Genome Research. 14 (5): 942–950. doi:10.1101/gr.1858004. ISSN 1088-9051. PMC 479124. PMID 15123590.
  7. Ruffier, Magali; Kähäri, Andreas; Komorowska, Monika; Keenan, Stephen; Laird, Matthew; Longden, Ian; Proctor, Glenn; Searle, Steve; Staines, Daniel; Taylor, Kieron; Vullo, Alessandro; Yates, Andrew; Zerbino, Daniel; Flicek, Paul (January 2017). "Ensembl core software resources: storage and programmatic access for DNA sequence and genome annotation". Database. 2017 (1): bax020. doi:10.1093/database/bax020. PMC 5467575. PMID 28365736.
  8. Hubbard, T. J. P.; Aken, B. L.; Ayling, S.; Ballester, B.; Beal, K.; Bragin, E.; Brent, S.; Chen, Y.; Clapham, P.; Clarke, L.; Coates, G. (January 2009). "Ensembl 2009". Nucleic Acids Research. 37 (Database issue): D690–697. doi:10.1093/nar/gkn828. ISSN 1362-4962. PMC 2686571. PMID 19033362.
  9. Howe, Kevin L.; Contreras-Moreira, Bruno; De Silva, Nishadi; Maslen, Gareth; Akanni, Wasiu; Allen, James; Alvarez-Jarreta, Jorge; Barba, Matthieu; Bolser, Dan M.; Cambell, Lahcen; Carbajo, Manuel (8 January 2020). "Ensembl Genomes 2020-enabling non-vertebrate genomic research". Nucleic Acids Research. 48 (D1): D689–D695. doi:10.1093/nar/gkz890. ISSN 1362-4962. PMC 6943047. PMID 31598706.
  10. Stabenau A, McVicker G, Melsopp C, Proctor G, Clamp M, Birney E (February 2004). "The Ensembl Core Software Libraries". Genome Research. 14 (5): 929–933. doi:10.1101/gr.1857204. PMC 479122. PMID 15123588.
  11. "Species List". uswest.ensembl.org. Archived from the original on 6 August 2022. Retrieved 5 August 2022.