UCSC Genome Browser

Content
Description: The UCSC Genome Browser
Contact
Research center: University of California Santa Cruz
Laboratory: Center for Biomolecular Science and Engineering, Baskin School of Engineering
Primary citation: Navarro Gonzalez & al. (2021) [1]
Access
Website: genome.ucsc.edu

The UCSC Genome Browser is an online and downloadable genome browser hosted by the University of California, Santa Cruz (UCSC). [2] [3] [4] It is an interactive website offering access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. The Browser is a graphical viewer optimized to support fast interactive performance and is an open-source, web-based tool suite built on top of a MySQL database for rapid visualization, examination, and querying of the data at many levels. The Genome Browser Database, browsing tools, downloadable data files, and documentation can all be found on the UCSC Genome Bioinformatics website.

History

The UCSC Genome Browser was initially built in 2000 by Jim Kent, then a graduate student, and David Haussler, professor of Computer Science (now Biomolecular Engineering) at the University of California, Santa Cruz, both of whom still manage it. It began as a resource for distributing the initial fruits of the Human Genome Project. Funded by the Howard Hughes Medical Institute and the National Human Genome Research Institute (NHGRI, one of the US National Institutes of Health), the browser offered a graphical display of the first full-chromosome draft assembly of the human genome sequence. Today the browser is used by geneticists, molecular biologists and physicians, as well as students and teachers of evolution, for access to genomic information. [5]

Genomes

UCSC Genomes

In the years since its inception, the UCSC Browser has expanded to accommodate genome sequences of all vertebrate species and selected invertebrates for which high-coverage genomic sequence is available, [6] now including 108 species. High coverage is necessary so that overlapping sequence can guide the construction of larger contiguous regions. Genomic sequences with less coverage are included in multiple-alignment tracks on some browsers (see Comparative Genomics under Tracks), but the fragmented nature of these assemblies makes them unsuitable for building full-featured browsers. The species hosted with full-featured genome browsers are shown in the table. [7]

Species
Great apes: baboon, bonobo, chimpanzee, gibbon, gorilla, human, orangutan
Non-ape primates: bushbaby, golden snub-nosed monkey, green monkey, marmoset, mouse lemur, proboscis monkey, rhesus macaque, squirrel monkey, tarsier, tree shrew
Non-primate mammals: alpaca, armadillo, bison, brown kiwi, cat, Chinese hamster, Chinese pangolin, cow, dog, dolphin, elephant, ferret, guinea pig, Hawaiian monk seal, hedgehog, horse, kangaroo rat, little brown bat, Malayan flying lemur, manatee, megabat, Minke whale, mouse, naked mole-rat, opossum, panda, pig, pika, platypus, rabbit, rat, rock hyrax, sheep, shrew, sloth, squirrel, Tasmanian devil, tenrec, wallaby, white rhinoceros
Non-mammal chordates: African clawed frog, American alligator, Atlantic cod, budgerigar, chicken, coelacanth, elephant shark, Fugu, garter snake, golden eagle, lamprey, lizard, medaka, medium ground finch, Nile tilapia, painted turtle, stickleback, Tetraodon, Nanorana parkeri, turkey, Xenopus tropicalis, zebra finch, zebrafish
Invertebrates: Anopheles gambiae, Apis mellifera, Caenorhabditis spp. (5), California sea hare, Ciona intestinalis, Drosophila spp. (11), Lancelet, Pristionchus pacificus, sea squirt, sea urchin, yeast
Viruses: Ebolavirus, SARS-CoV-2 coronavirus
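
The "high-coverage" criterion used for inclusion (note [6] defines it as 6x, i.e. six times more total sequence than the size of the genome) is a simple ratio. The following is a minimal sketch of that arithmetic only; the function names and the example numbers are illustrative assumptions and not part of any UCSC software.

```python
def fold_coverage(total_sequenced_bases: int, genome_size_bases: int) -> float:
    """Return sequencing coverage as total sequenced bases / genome size."""
    if genome_size_bases <= 0:
        raise ValueError("genome size must be a positive number of bases")
    return total_sequenced_bases / genome_size_bases


def is_high_coverage(total_sequenced_bases: int,
                     genome_size_bases: int,
                     threshold: float = 6.0) -> bool:
    """Apply the 6x threshold quoted in note [6] (threshold is adjustable)."""
    return fold_coverage(total_sequenced_bases, genome_size_bases) >= threshold


if __name__ == "__main__":
    # Hypothetical example: 18 Gb of reads for a 3 Gb genome.
    print(fold_coverage(18_000_000_000, 3_000_000_000))
    print(is_high_coverage(18_000_000_000, 3_000_000_000))
```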

Apart from these 108 species and their assemblies, the UCSC Genome Browser also offers Assembly Hubs: web-accessible directories of genomic data that can be viewed on the browser and that include assemblies not hosted natively on it. There, users can load and annotate unique assemblies for which UCSC does not provide an annotation database. A full list of species and their assemblies can be viewed in the GenArk Portal, which includes 2,589 assemblies hosted by either the UCSC Genome Browser database or Assembly Hubs. The Vertebrate Genomes Project assembly hub is one example.

Browser functionality

The large amount of data about biological systems that is accumulating in the literature makes it necessary to collect and digest information using the tools of bioinformatics. The UCSC Genome Browser presents a diverse collection of annotation datasets (known as "tracks" and presented graphically), including mRNA alignments, mappings of DNA repeat elements, gene predictions, gene-expression data, disease-association data (representing the relationships of genes to diseases), and mappings of commercially available gene chips (e.g., Illumina and Agilent). The basic paradigm of display is to show the genome sequence in the horizontal dimension, and show graphical representations of the locations of the mRNAs, gene predictions, etc. Blocks of color along the coordinate axis show the locations of the alignments of the various data types. The ability to show this large variety of data types on a single coordinate axis makes the browser a handy tool for the vertical integration of the data. [8]

To find a specific gene or genomic region, the user may type in the gene name, a DNA sequence, an accession number for an RNA, the name of a genomic cytological band (e.g., 20p13 for band 13 on the short arm of chr20) or a chromosomal position (chr17:38,450,000-38,531,000 for the region around the gene BRCA1).
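
The chromosomal-position form shown above (chromosome name, colon, start, hyphen, end, with optional thousands separators) is easy to handle programmatically. The sketch below parses such a string into its parts; the `Region` type and `parse_position` function are hypothetical names introduced for illustration and do not reflect how the browser itself parses input.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# chrom ':' start '-' end, where start/end may contain thousands separators.
_POSITION_RE = re.compile(
    r"^\s*(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)\s*$"
)


def parse_position(text: str) -> Region:
    """Parse a position such as 'chr17:38,450,000-38,531,000'."""
    match = _POSITION_RE.match(text)
    if match is None:
        raise ValueError(f"not a chrom:start-end position: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start must not be greater than end")
    return Region(match["chrom"], start, end)


if __name__ == "__main__":
    region = parse_position("chr17:38,450,000-38,531,000")
    print(region, region.end - region.start)
```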

Presenting the data graphically allows the browser to link each annotation to detailed information about it. The gene details page of the UCSC Genes track provides a large number of links to more specific information about the gene at many other data resources, such as Online Mendelian Inheritance in Man (OMIM) and SwissProt.

Designed for the presentation of complex and voluminous data, the UCSC Browser is optimized for speed. By pre-aligning millions of RNA sequences from GenBank to each of the 244 genome assemblies (many of the 108 species have more than one assembly), the browser allows instant access to the alignments of any RNA to any of the hosted species.

Multiple gene products of FOXP2 gene (top) and evolutionary conservation shown in multiple alignment (bottom)

The juxtaposition of the many types of data allows researchers to display exactly the combination of data that will answer specific questions. A PDF/PostScript output function allows export of a camera-ready image for publication in academic journals.

One unique and useful feature that distinguishes the UCSC Browser from other genome browsers is the continuously variable nature of the display. Sequence of any size can be displayed, from a single DNA base up to the entire chromosome (human chr1 = 245 million bases, Mb) with full annotation tracks. Researchers can display a single gene, a single exon, or an entire chromosome band, showing dozens or hundreds of genes and any combination of the many annotations. A convenient drag-and-zoom feature allows the user to choose any region in the genome image and expand it to occupy the full screen.

Researchers may also use the browser to display their own data via the Custom Tracks tool. This feature allows users to upload a file of their own data and view the data in the context of the reference genome assembly. Users may also use the data hosted by UCSC, creating subsets of the data of their choosing with the Table Browser tool (such as only the SNPs that change the amino acid sequence of a protein) and display this specific subset of the data in the browser as a Custom Track.
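
As an illustration of what an uploaded file can look like, the sketch below writes a small custom track in BED format, one of the text formats accepted for custom tracks: a `track` header line followed by tab-separated chromosome, start, end and name columns (BED starts are 0-based and ends are exclusive). The function name, track name and feature coordinates are made up for the example.

```python
from typing import Iterable, Tuple

Feature = Tuple[str, int, int, str]  # (chrom, start, end, name)


def format_bed_custom_track(name: str,
                            description: str,
                            features: Iterable[Feature]) -> str:
    """Build the text of a minimal BED custom track."""
    lines = [f'track name="{name}" description="{description}"']
    for chrom, start, end, label in features:
        if start < 0 or end < start:
            raise ValueError(f"invalid interval for {label}: {start}-{end}")
        # BED columns are tab-separated: chrom, chromStart, chromEnd, name.
        lines.append(f"{chrom}\t{start}\t{end}\t{label}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    text = format_bed_custom_track(
        "myRegions",
        "Example regions",
        [("chr17", 38450000, 38531000, "regionA")],  # hypothetical feature
    )
    print(text, end="")
```

The resulting text can be saved to a file and uploaded, or pasted, through the Custom Tracks tool.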

Any browser view created by a user, including those containing Custom Tracks, may be shared with other users via the Saved Sessions tool.

Tracks

UCSC Genome Browser Tracks for Categories: Mapping and Sequencing, Genes and Gene Predictions, Phenotype and Literature, COVID-19, Single-Cell RNA-Seq, mRNA and EST.
UCSC Genome Browser Tracks for Categories: Regulation, Comparative Genomics, Variation, Repeats

Below the displayed images of the UCSC Genome browser are eleven categories of additional tracks that can be selected and displayed alongside the original data. Researchers can select tracks which best represent their query to allow for more applicable data to be displayed depending on the type and depth of research being done. These categories are as follows:

Categories

Category: Mapping and Sequencing
Description: Controls the style of sequence display (e.g., genomic coordinates, sequences, gaps). It can also display a percentage-based track showing whether a particular genetic element is more prevalent in the specified area.
Examples of tracks: Base Position, Mappability, Gap

Category: Genes and Gene Predictions
Description: Offers gene-prediction programs and a choice of databases from which to display known genes. The different tracks allow the user to display gene models, protein-coding regions, non-coding RNA and similar features. Users can quickly compare their query with pre-selected sets of genes to look for correlations between known sets of genes.
Examples of tracks: GENCODE v24, Geneid Genes, Pfam in UCSC Gene

Category: Phenotype and Literature
Description: Databases containing specific styles of phenotype data. These tracks are intended primarily for physicians and other professionals concerned with genetic disorders (e.g., genetics researchers, students in science and medicine). Users can display a track that shows the genomic positions of natural and artificial amino acid variants.
Examples of tracks: OMIM Alleles, Cancer Gene Expr Super-track

Category: COVID-19
Description: Shows data from Genome-Wide Association Studies (GWAS) and variant-calling experiments that identify genetic variants associated with severity of, and susceptibility to, COVID-19 disease.
Examples of tracks: COVID GWAS v3, COVID GWAS v4, Rare Harmful Vars

Category: Single Cell RNA-Seq
Description: Offers RNA expression data at the single-cell level (scRNA-Seq) from different human tissues (e.g., kidney, colon, heart, muscle, placenta, peripheral blood mononuclear cells).
Examples of tracks: Blood (PBMC), Heart Cell Atlas, Colon Wang

Category: mRNA and EST
Description: Shows Expressed Sequence Tags (ESTs) and messenger RNA. ESTs are single-read sequences, typically about 500 bases in length, that usually represent fragments of transcribed genes. The mRNA tracks display mRNA alignment data in humans as well as other species. There are also tracks allowing comparison with regions of ESTs that show signs of splicing when aligned with the genome.
Examples of tracks: Human ESTs, Other ESTs, Other mRNAs

Category: Expression
Description: Offers genetic data and related gene expression in tissues. This allows users to discover whether a particular gene or sequence is linked with various tissues throughout the body. The expression tracks also display consensus data about the tissues that express the query region.
Examples of tracks: GTEx Gene, Affy U133

Category: Regulation
Description: Information relevant to the regulation of transcription, drawn from different studies. Users can adjust the regulation tracks to add a display graph to the genome browser. These displays give more detail about regulatory regions, transcription factor binding sites, RNA binding sites, regulatory variants, haplotypes, and other regulatory elements.
Examples of tracks: ENCODE Regulation Super-track Settings, ORegAnno

Category: Comparative Genomics
Description: Shows sequence conservation data for primates, vertebrates and mammals, among others. The comparative alignments give a graphical view of the evolutionary relationships among species. This makes it a useful tool both for the researcher, who can visualize regions of conservation among a group of species and make predictions about functional elements in unknown DNA regions, and in the classroom as a tool to illustrate one of the most compelling arguments for the evolution of species. The Conservation track on the human assembly (which includes 100 species) clearly shows that the farther one goes back in evolutionary time, the less sequence homology remains, but functionally important regions of the genome (e.g., exons and control elements, but typically not introns) are conserved much farther back in evolutionary time.
Examples of tracks: Conservation, Cactus 241-way, Cons 30 Primates

Category: Variation
Description: Compares the searched sequence with known variations. For example, the entire contents of each release of the dbSNP database from NCBI are mapped to human, mouse and other genomes. This includes the fruits of the 1000 Genomes Project, as soon as they are released in dbSNP. Other types of variation data include copy-number variation (CNV) data and human population allele frequencies from the HapMap project.
Examples of tracks: Common SNPs(150), All SNPs(146), Flagged SNPs(144)

Category: Repeats
Description: Allows tracking of different kinds of repeated sequences in the query. Users can see at a glance whether their specified search contains large amounts of repeated sequence and adjust their search or track displays accordingly.
Examples of tracks: RepeatMasker, Microsatellite, WM + SDust

Analysis tools

The UCSC site hosts a set of genome analysis tools, including a full-featured graphical interface for mining the information in the browser database and BLAT, [9] a sequence alignment tool that takes FASTA-format sequence and is also useful for simply finding sequences within the massive sequence (human genome = 3.23 billion bases [Gb]) of any of the featured genomes.

A liftOver tool uses whole-genome alignments to convert the coordinates of sequences from one assembly to another or between species. The Genome Graphs tool allows users to view all chromosomes at once and display the results of genome-wide association studies (GWAS). The Gene Sorter displays genes grouped by parameters not linked to genome location, such as expression pattern in tissues.
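
The idea behind alignment-based conversion can be shown with a toy example. The sketch below assumes a list of ungapped, same-strand aligned blocks between an old and a new assembly and maps a single position through whichever block contains it. It is a deliberately simplified illustration: the `AlignedBlock` type, the function name and the coordinates are hypothetical, and the real tool works from far richer alignment data than this.

```python
from typing import NamedTuple, Optional, Sequence


class AlignedBlock(NamedTuple):
    old_start: int  # 0-based, inclusive, in the old assembly
    old_end: int    # exclusive, in the old assembly
    new_start: int  # position in the new assembly aligned to old_start


def lift_position(pos: int, blocks: Sequence[AlignedBlock]) -> Optional[int]:
    """Map a position to the new assembly, or None if it is unaligned."""
    for block in blocks:
        if block.old_start <= pos < block.old_end:
            # Inside an ungapped block the offset from the start is preserved.
            return block.new_start + (pos - block.old_start)
    return None


if __name__ == "__main__":
    blocks = [AlignedBlock(100, 200, 1100), AlignedBlock(250, 300, 1210)]
    for position in (150, 220, 250):
        print(position, lift_position(position, blocks))
```

Positions that fall between blocks have no counterpart in the new assembly under this model, which is why the function returns `None` for them rather than guessing.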

Open source / mirrors

The UCSC Browser code base is open-source for non-commercial use, and is mirrored locally by many research groups, allowing private display of data in the context of the public data. The UCSC Browser is mirrored at several locations worldwide, as shown in the table.

Official mirror sites
European mirror — maintained by UCSC at University of Bielefeld, Germany
Asian mirror — maintained by UCSC at RIKEN, Yokohama, Japan

The Browser code is also used in separate installations by the UCSC Malaria Genome Browser and the Archaea Browser.


References

  1. Navarro Gonzalez, J; Zweig, AS; Speir, ML; Schmelter, D; Rosenbloom, KR; Raney, BJ; Powell, CC; Nassar, LR; Maulding, ND; Lee, CM; Lee, BT; Hinrichs, AS; Fyfe, AC; Fernandes, JD; Diekhans, M; Clawson, H; Casper, J; Benet-Pagès, A; Barber, GP; Haussler, D; Kuhn, RM; Haeussler, M; Kent, WJ (8 January 2021). "The UCSC Genome Browser database: 2021 update". Nucleic Acids Research. 49 (D1): D1046–D1057. doi:10.1093/nar/gkaa1070. ISSN 0305-1048. PMC 7779060. PMID 33221922.
  2. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A, Diekhans M, Dreszer TR, Giardine BM, Harte RA, Hillman-Jackson J, Hsu F, Kirkup V, Kuhn RM, Learned K, Li CH, Meyer LR, Pohl A, Raney BJ, Rosenbloom KR, Smith KE, Haussler D, Kent WJ (Jan 2011). "The UCSC Genome Browser database: update 2011". Nucleic Acids Res. 39 (Database issue): D876–82. doi:10.1093/nar/gkq963. PMC 3242726. PMID 20959295.
  3. Kent WJ, Sugnet CW, Furey TS, Roskin KM, Pringle TH, Zahler AM, Haussler D (June 2002). "The human genome browser at UCSC". Genome Res. 12 (6): 996–1006. doi:10.1101/gr.229102. PMC 186604. PMID 12045153.
  4. Kuhn, R. M.; Karolchik, D.; Zweig, A. S.; Wang, T.; Smith, K. E.; Rosenbloom, K. R.; Rhead, B.; Raney, B. J.; Pohl, A.; Pheasant, M.; Meyer, L. (2009-01-01). "The UCSC Genome Browser Database: update 2009". Nucleic Acids Research. 37 (Database): D755–D761. doi:10.1093/nar/gkn875. ISSN 0305-1048. PMC 2686463. PMID 18996895.
  5. "History | Genomics Institute". genomics.ucsc.edu. Retrieved 2022-08-07.
  6. "High-coverage" here means 6x coverage, or six times more total sequence than the size of the genome.
  7. "UCSC Genome Browser: Acknowledgments". genome.ucsc.edu. Retrieved 2022-07-27.
  8. Navarro Gonzalez, Jairo; Zweig, Ann S.; Speir, Matthew L.; Schmelter, Daniel; Rosenbloom, Kate R.; Raney, Brian J.; Powell, Conner C.; Nassar, Luis R.; Maulding, Nathan D.; Lee, Christopher M.; Lee, Brian T. (2021-01-08). "The UCSC Genome Browser database: 2021 update". Nucleic Acids Research. 49 (D1): D1046–D1057. doi:10.1093/nar/gkaa1070. ISSN 1362-4962. PMC 7779060. PMID 33221922.
  9. Kent, WJ. (Apr 2002). "BLAT - the BLAST-like alignment tool". Genome Res. 12 (4): 656–64. doi:10.1101/gr.229202. PMC 187518. PMID 11932250.