UniGene

Last updated
UniGene
Unigene banner.jpg
Content
Description transcriptome
Contact
Research center NCBI
Access
Website https://www.ncbi.nlm.nih.gov/unigene

UniGene was a NCBI database of the transcriptome and thus, despite the name, not primarily a database for genes. Each entry is a set of transcripts that appear to stem from the same transcription locus (i.e. gene or expressed pseudogene). Information on protein similarities, gene expression, cDNA clones, and genomic location is included with each entry.

Contents

Descriptions of the UniGene transcript based and genome based build procedures are available.

A detailed description of UniGene database

The UniGene resource, developed at NCBI, clusters ESTs and other mRNA sequences, along with coding sequences (CDSs) annotated on genomic DNA, into subsets of related sequences. In most cases, each cluster is made up of sequences produced by a single gene, including alternatively spliced transcripts. However, some genes may be represented by more than one cluster. The clusters are organism specific and are currently available for human, mouse, rat, zebrafish, and cattle. They are built in several stages, using an automatic process based on special sequence comparison algorithms. First, the nucleotide sequences are searched for contaminants, such as mitochondrial, ribosomal, and vector sequence, repetitive elements, and low-complexity sequences. After a sequence is screened, it must contain at least 100 bases to be a candidate for entry into UniGene. mRNA and genomic DNA are clustered first into gene links. A second sequence comparison links ESTs to each other and to the gene links. At this stage, all clusters are ‘‘anchored,’’ and contain either a sequence with a polyadenylation site or two ESTs labeled as coming from the 3 end of a clone. Clone-based edges are added by linking the 5 and 3 ESTs that derive from the same clone. In some cases, this linking may merge clusters identified at a previous stage. Finally, unanchored ESTs and gene clusters of size 1 (which may represent rare transcripts) are compared with other UniGene clusters at lower stringency. The UniGene build is updated weekly, and the sequences that make up a cluster may change. Thus, it is not safe to refer to a UniGene cluster by its cluster identifier; instead, one should use the GenBank accession numbers of the sequences in the cluster.

As of July 2000, the human subset of UniGene contained 1.7 million sequences in 82,000 clusters; 98% of these clustered sequences were ESTs, and the remaining 2% were from mRNAs or CDSs annotated on genomic DNA. These human clusters could represent fragments of up to 82,000 unique human genes, implying that many human genes are now represented in a UniGene cluster. (This number is undoubtedly an overestimate of the number of genes in the human genome, as some genes may be represented by more than one cluster.) Only 1.4% of clusters totally lack ESTs, implying that most human genes are represented by at least one EST. Conversely, it appears that the majority of human genes have been identified only by ESTs; only 16% of clusters contain either an mRNA or a CDS annotated on a genomic DNA. Because fewer ESTs are available for mouse, rat, and zebrafish, the UniGene clusters are not as representative of the unique genes in the genome. Mouse UniGene contains 895,000 sequences in 88,000 clusters, and rat UniGene contains 170,000 sequences in 37,000 clusters.

A new UniGene resource, HomoloGene, includes curated and calculated orthologs and homologs for genes from human, mouse, rat, and zebrafish. Calculated orthologs and homologs are the result of nucleotide sequence comparisons between all UniGene clusters for each pair of organisms. Homologs are identified as the best match between a UniGene cluster in one organism and a cluster in a second organism. When two sequences in different organisms are best matches to one another (a reciprocal best match), the UniGene clusters corresponding to the pair of sequences are considered putative orthologs. A special symbol indicates that UniGene clusters in three or more organisms share a mutually consistent ortholog relationship. The calculated orthologs and homologs are considered putative, since they are based only on sequence comparisons. Curated orthologs are provided by the Mouse Genome Database (MGD) at the Jackson Laboratory and the Zebrafish Information Database (ZFIN) at the University of Oregon and can also be obtained from the scientific literature. Queries to UniGene are entered into a text box on any of the UniGene pages. Query terms can be, for example, the UniGene identifier, a gene name, a text term that is found somewhere in the UniGene record, or the accession number of an EST or gene sequence in the cluster. For example, the cluster entitled ‘‘A disintegrin and metalloprotease domain 10’’ that contains the sequence for human ADAM10 can be retrieved by entering ADAM10, disintegrin, AF009615 (the GenBank accession number of ADAM10), or H69859 (the GenBank accession number of an EST in the cluster). To query a specific part of the UniGene record, use the @ symbol. For example, @gene(symbol) looks for genes with the name of the symbol enclosed in the parentheses, @chr(num) searches for entries that map to chromosome num, @lib(id) returns entries in a cDNA library identified by id, and @pid(id) se- lects entries associated with a GenBank protein identifier id.

The query results page contains a list of all UniGene clusters that match the query. Each cluster is identified by an identifier, a description, and a gene symbol, if available. Cluster identifiers are prefixed with Hs for Homo sapiens, Rn for Rattus norvegicus, Mm for Mus musculus, or Dn for Danio rerio. The descriptions of UniGene clusters are taken from LocusLink, if available, or from the title of a sequence in the cluster. The UniGene report page for each cluster links to data from other NCBI resources (Fig. 12.5). At the top of the page are links to LocusLink, which provides descriptive information about genetic loci (Pruitt et al., 2000), OMIM, a catalog of human genes and genetic disorders, and HomoloGene. Next are listed similarities between the translations of DNA sequences in the cluster and protein sequences from model organisms, including human, mouse, rat, fruit fly, and worm. The subsequent section describes relevant mapping information. It is followed by ‘‘expression information,’’ which lists the tissues from which the ESTs in the cluster have been created, along with links to the SAGE database. Sequences making up the cluster are listed next, along with a link to download these sequences.

It is important to note that clusters that contain ESTs only (i.e., no mRNAs or annotated CDSs) will be missing some of these fields, such as LocusLink, OMIM, and mRNA/Gene links. UniGene titles for such clusters, such as ‘‘EST, weakly similar to ORF2 contains a reverse transcriptase domain [H. sapiens],’’ are derived from the title of a characterized protein with which the translated EST sequence aligns. The cluster title might be as simple as ‘‘EST’’ if the ESTs share no significant similarity with characterized proteins. [1]

Retirement of UniGene

On February 1, 2019, the NCBI announced that it was retiring the UniGene database because "reference genomes are available for most organisms with a sizable research community. Consequently, the usage of and need for UniGene has dropped significantly." [2] Access to the UniGene builds will remain available through FTP.

See also

Related Research Articles

<span class="mw-page-title-main">National Center for Biotechnology Information</span> Database branch of the US National Library of Medicine

The National Center for Biotechnology Information (NCBI) is part of the United States National Library of Medicine (NLM), a branch of the National Institutes of Health (NIH). It is approved and funded by the government of the United States. The NCBI is located in Bethesda, Maryland, and was founded in 1988 through legislation sponsored by US Congressman Claude Pepper.

In genetics, an expressed sequence tag (EST) is a short sub-sequence of a cDNA sequence. ESTs may be used to identify gene transcripts, and were instrumental in gene discovery and in gene-sequence determination. The identification of ESTs has proceeded rapidly, with approximately 74.2 million ESTs now available in public databases. EST approaches have largely been superseded by whole genome and transcriptome sequencing and metagenome sequencing.

In molecular biology, open reading frames (ORFs) are defined as spans of DNA sequence between the start and stop codons. Usually, this is considered within a studied region of a prokaryotic DNA sequence, where only one of the six possible reading frames will be "open". Such an ORF may contain a start codon and by definition cannot extend beyond a stop codon. That start codon indicates where translation may start. The transcription termination site is located after the ORF, beyond the translation stop codon. If transcription were to cease before the stop codon, an incomplete protein would be made during translation.

Sequence homology Shared ancestry between DNA, RNA or protein sequences

Sequence homology is the biological homology between DNA, RNA, or protein sequences, defined in terms of shared ancestry in the evolutionary history of life. Two segments of DNA can have shared ancestry because of three phenomena: either a speciation event (orthologs), or a duplication event (paralogs), or else a horizontal gene transfer event (xenologs).

The Rat Genome Database (RGD) is a database of rat genomics, genetics, physiology and functional data, as well as data for comparative genomics between rat, human and mouse. RGD is responsible for attaching biological information to the rat genome via structured vocabulary, or ontology, annotations assigned to genes and quantitative trait loci (QTL), and for consolidating rat strain data and making it available to the research community. They are also developing a suite of tools for mining and analyzing genomic, physiologic and functional data for the rat, and comparative data for rat, mouse, human, and five other species.

The Bioinformatic Harvester was a bioinformatic meta search engine created by the European Molecular Biology Laboratory and subsequently hosted and further developed by KIT Karlsruhe Institute of Technology for genes and protein-associated information. Harvester currently works for human, mouse, rat, zebrafish, drosophila and arabidopsis thaliana based information. Harvester cross-links >50 popular bioinformatic resources and allows cross searches. Harvester serves tens of thousands of pages every day to scientists and physicians. Since 2014 the service is down.

HomoloGene, a tool of the United States National Center for Biotechnology Information (NCBI), is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection of publicly available nucleotide sequences and their protein products. RefSeq was first introduced in 2000. This database is built by National Center for Biotechnology Information (NCBI), and, unlike GenBank, provides only a single record for each natural biological molecule for major organisms ranging from viruses to bacteria to eukaryotes.

TSR3

TSR3, or TSR3 Ribosome Maturation Factor, is a hypothetical human protein found on chromosome 16. Its protein is 312 amino acids long. and its cDNA has 1214 base pairs It was previously designated C16orf42.

GeneCards is a database of human genes that provides genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes. It is being developed and maintained by the Crown Human Genome Center at the Weizmann Institute of Science.

HIKESHI

HIKESHI is a protein important in lung and multicellular organismal development that, in humans, is encoded by the HIKESHI gene. HIKESHI is found on chromosome 11 in humans and chromosome 7 in mice. Similar sequences (orthologs) are found in most animal and fungal species. The mouse homolog, lethal gene on chromosome 7 Rinchik 6 protein is encoded by the l7Rn6 gene.

KIAA0895L

Uncharacterized protein KIAA0895-like also known as LOC653319, is a protein that in humans is encoded by the KIAA0895L gene.

OSER1

Chromosome 20 open reading frame 111, or C20orf111, is the hypothetical protein that in humans is encoded by the C20orf111 gene. C20orf111 is also known as Perit1, HSPC207, and dJ1183I21.1. It was originally located using genomic sequencing of chromosome 20. The National Center for Biotechnology Information, or NCBI, shows that it is located at q13.11 on chromosome 20, however the genome browser at the University of California-Santa Cruz (UCSC) website shows that it is at location q13.12, and within a million base pairs of the adenosine deaminase locus. It was also found to have an increase in expression in cells undergoing hydrogen peroxide(H
2
O
2
)-induced apoptosis. After analyzing the amino acid content of C20orf111, it was found to be rich in serine residues.

FAM185A Gene of the species Homo sapiens

The FAM185A is a protein that in humans is encoded by the FAM185A gene. The FAM185A gene is found on the positive strand of Chromosome 7 at 7q22.1. The gene begins 102,389,399bp from the p-terminus of the chromosome and ends at 102,449,672bp from the p-terminus; it covers a total of 73,308 basepairs. The protein encoded by this gene is characterized by the presence of multiple copies of DUF4098 near its C-terminus. It is described as a Long Interspersed Nuclear Element (LINE), a subclass of penaeid repetitive elements (PREs).

FAM203B Protein-coding gene in the species Homo sapiens

Family with Sequence Similarity 203, Member B (FAM203B) is a protein encoded by the FAM203B gene (8q24.3) in humans. While FAM203B is only found in humans and possibly non-human primates, its paralog, FAM203A, is highly conserved. The FAM203B protein contains two conserved domains of unknown function, DUF383 and DUF384, and no transmembrane domains. This protein has no known function yet, although the homolog of FAM203A in Caenorhabditis elegans (Y54H5A.2) is thought to help regulate the actin cytoskeleton.

Mark Boguski American pathologist (died 2021)

Mark S. Boguski was an American pathologist specializing in computational analysis and structural biology, He was elected in 2001 to the U.S. National Academy of Medicine, and was a Fellow of the American College of Medical Informatics.

Chromosome 19 open reading frame 18 (c19orf18) is a protein which in humans is encoded by the c19orf18 gene. The gene is exclusive to mammals and the protein is predicted to have a transmembrane domain and a coiled coil stretch. This protein has a function that is not yet fully understood by the scientific community.

C22orf23

C22orf23 is a protein which in humans is encoded by the C22orf23 gene. Its predicted secondary structure consists of alpha helices and disordered/coil regions. It is expressed in many tissues and highest in the testes and it is conserved across many orthologs.

<span class="mw-page-title-main">CCDC190</span> Protein-coding gene in the species Homo sapiens

Coiled-Coil Domain Containing 190, also known as C1orf110, the Chromosome 1 Open Reading Frame 110, MGC48998 and CCDC190, is found to be a protein coding gene widely expressed in vertebrates. RNA-seq gene expression profile shows that this gene selectively expressed in different organs of human body like lung brain and heart. The expression product of c1orf110 is often called Coiled-coil domain-containing protein 190 with a size of 302 aa. It may get the name because a coiled-coil domain is found from position 14 to 72. At least 6 spliced variants of its mRNA and 3 isoforms of this protein can be identified, which is caused by alternative splicing in human.

References

  1. Andreas D. Baxevanis and B. F. Francis Ouellette|BIOINFORMATICS A Practical Guide to the Analysis of Genes and Proteins(2001 2nd edition)||JOHN WILEY & SONS, INC.| ISBN   0-471-38391-0| ISBN   978-0-471-38391-8 |
  2. "NCBI to Retire the UniGene". February 2019. Retrieved 12 February 2019.