Gene nomenclature is the scientific naming of genes, the units of heredity in living organisms. It is also closely associated with protein nomenclature, as genes and the proteins they code for usually have similar nomenclature. An international committee published recommendations for genetic symbols and nomenclature in 1957. [1] The need to develop formal guidelines for human gene names and symbols was recognized in the 1960s and full guidelines were issued in 1979 (Edinburgh Human Genome Meeting). [2] Several other genus-specific research communities (e.g., Drosophila fruit flies, Mus mice) have adopted nomenclature standards, as well, and have published them on the relevant model organism websites and in scientific journals, including the Trends in Genetics Genetic Nomenclature Guide. [3] [4] Scientists familiar with a particular gene family may work together to revise the nomenclature for the entire set of genes when new information becomes available. [5] For many genes and their corresponding proteins, an assortment of alternate names is in use across the scientific literature and public biological databases, posing a challenge to effective organization and exchange of biological information. [6] Standardization of nomenclature thus tries to achieve the benefits of vocabulary control and bibliographic control, although adherence is voluntary. The advent of the information age has brought gene ontology, which in some ways is a next step of gene nomenclature, because it aims to unify the representation of gene and gene product attributes across all species.
Gene nomenclature and protein nomenclature are not separate endeavors; they are aspects of the same whole. Any name or symbol used for a protein can potentially also be used for the gene that encodes it, and vice versa.[ citation needed ] But owing to the nature of how science has developed (with knowledge being uncovered bit by bit over decades), proteins and their corresponding genes have not always been discovered simultaneously (and not always physiologically understood when discovered), which is the largest reason why protein and gene names do not always match, or why scientists tend to favor one symbol or name for the protein and another for the gene.[ citation needed ] Another reason is that many of the mechanisms of life are the same or very similar across species, genera, orders, and phyla (through homology, analogy, or some of both), so that a given protein may be produced in many kinds of organisms; and thus scientists naturally often use the same symbol and name for a given protein in one species (for example, mice) as in another species (for example, humans). Regarding the first duality (same symbol and name for gene or protein), the context usually makes the sense clear to scientific readers, and the nomenclatural systems also provide for some specificity by using italic for a symbol when the gene is meant and plain (roman) for when the protein is meant.[ citation needed ] Regarding the second duality (a given protein is endogenous in many kinds of organisms), the nomenclatural systems also provide for at least human-versus-nonhuman specificity by using different capitalization,[ citation needed ] although scientists often ignore this distinction, given that it is often biologically irrelevant.[ citation needed ]
Also owing to the nature of how scientific knowledge has unfolded, proteins and their corresponding genes often have several names and symbols that are synonymous. Some of the earlier ones may be deprecated in favor of newer ones, although such deprecation is voluntary. Some older names and symbols live on simply because they have been widely used in the scientific literature (including before the newer ones were coined) and are well established among users. For example, mentions of HER2 and ERBB2 are synonymous.
Lastly, the correlation between genes and proteins is not always one-to-one (in either direction); in some cases it is several-to-one or one-to-several, and the names and symbols may then be gene-specific or protein-specific to some degree, or overlapping in usage:
The HUGO Gene Nomenclature Committee is responsible for providing human gene naming guidelines and approving new, unique human gene names and symbols (short identifiers typically created by abbreviating). For some nonhuman species, model organism databases serve as central repositories of guidelines and help resources, including advice from curators and nomenclature committees. In addition to species-specific databases, approved gene names and symbols for many species can be located in the National Center for Biotechnology Information's "Entrez Gene" [7] database.
There are generally accepted rules and conventions used for naming genes in bacteria. Standards were proposed in 1966 by Demerec et al. [8]
Each bacterial gene is denoted by a mnemonic of three lower case letters which indicate the pathway or process in which the gene-product is involved, followed by a capital letter signifying the actual gene. In some cases, the gene letter may be followed by an allele number. All letters and numbers are underlined or italicised. For example, leuA is one of the genes of the leucine biosynthetic pathway, and leuA273 is a particular allele of this gene.
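As a rough illustration of the pattern just described (three lower-case letters, one capital letter, optional allele number), the following minimal sketch splits a designation into its parts. The names `BacterialGene`, `parse_bacterial_gene`, and `protein_symbol` are hypothetical and chosen only for this example; the sketch ignores italics and superscripts, and it rejects designations that do not follow the three-plus-one pattern (for example, bla).

```python
import re
from typing import NamedTuple, Optional


class BacterialGene(NamedTuple):
    mnemonic: str          # three lower-case letters, e.g. "leu"
    gene_letter: str       # capital letter identifying the gene, e.g. "A"
    allele: Optional[int]  # optional allele number, e.g. 273


# Three lower-case letters, one capital letter, optional digits.
_DESIGNATION = re.compile(r"([a-z]{3})([A-Z])(\d+)?")


def parse_bacterial_gene(designation: str) -> BacterialGene:
    """Split a designation such as 'leuA273' into its components."""
    match = _DESIGNATION.fullmatch(designation)
    if match is None:
        raise ValueError(f"not a three-letter-plus-capital designation: {designation!r}")
    mnemonic, letter, allele = match.groups()
    return BacterialGene(mnemonic, letter, int(allele) if allele else None)


def protein_symbol(gene: BacterialGene) -> str:
    """Gene product symbol: same letters, first letter capitalised (dnaA -> DnaA)."""
    return gene.mnemonic.capitalize() + gene.gene_letter


if __name__ == "__main__":
    print(parse_bacterial_gene("leuA273"))
    print(protein_symbol(parse_bacterial_gene("dnaA")))
```

A stricter parser would also need to handle genes named outside this pattern, which is why the sketch raises an error rather than guessing.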
Where the actual protein coded by the gene is known then it may become part of the basis of the mnemonic, thus:
Some gene designations refer to a known general function:
In a 1998 analysis of the E. coli genome, a large number of genes with unknown function were assigned names beginning with the letter y, followed by sequentially generated letters without a mnemonic meaning (e.g., ydiO and ydbK). [9] Since then, some y-genes have been confirmed to have a function [10] and have been assigned a synonym (alternative) name in recognition of this. However, because y-genes are not always renamed after being further characterised, a y-designation is not a reliable indicator of a gene's significance. [10]
Loss of gene activity leads to a nutritional requirement (auxotrophy) not exhibited by the wildtype (prototrophy).
Amino acids:
Some pathways produce metabolites that are precursors of more than one pathway. Hence, loss of one of these enzymes will lead to a requirement for more than one amino acid. For example:
Nucleotides:
Vitamins:
Loss of gene activity leads to loss of the ability to catabolise (use) the compound.
If the gene in question is the wildtype a superscript '+' sign is used:
If a gene is mutant, it is signified by a superscript '-':
By convention, if neither is used, it is considered to be mutant.
There are additional superscripts and subscripts which provide more information about the mutation:
Other modifiers:
When referring to the genotype (the gene) the mnemonic is italicized and not capitalised. When referring to the gene product or phenotype, the mnemonic is first-letter capitalised and not italicized (e.g. DnaA – the protein produced by the dnaA gene; LeuA− – the phenotype of a leuA mutant; AmpR – the ampicillin-resistance phenotype of the β-lactamase gene bla).
Protein names are generally the same as the gene names, but the protein names are not italicized, and the first letter is upper-case. For example, the β subunit of RNA polymerase is named RpoB, and this protein is encoded by the rpoB gene. [11]
Gene and protein symbol conventions ("sonic hedgehog" gene)

Species | Gene symbol | Protein symbol |
---|---|---|
Homo sapiens | SHH | SHH |
Mus musculus, Rattus norvegicus | Shh | SHH |
Gallus gallus | SHH | SHH |
Anolis carolinensis | shh | SHH |
Xenopus laevis, X. tropicalis | shh | Shh |
Danio rerio | shh | Shh |
The research communities of vertebrate model organisms have adopted guidelines whereby genes in these species are given, whenever possible, the same names as their human orthologs. The use of prefixes on gene symbols to indicate species (e.g., "Z" for zebrafish) is discouraged. The recommended formatting of printed gene and protein symbols varies between species.
Vertebrate genes and proteins have names (typically strings of words) and symbols, which are short identifiers (typically 3 to 8 characters). For example, the gene cytotoxic T-lymphocyte-associated protein 4 has the HGNC symbol CTLA4. These symbols are usually, but not always, coined by contraction or acronymic abbreviation of the name. They are pseudo-acronyms, however, in the sense that they are complete identifiers by themselves—short names, essentially. They are synonymous with (rather than standing for) the gene/protein name (or any of its aliases), regardless of whether the initial letters "match". For example, the symbol for the gene v-akt murine thymoma viral oncogene homolog 1, which is AKT1, cannot be said to be an acronym for the name, and neither can any of its various synonyms, which include AKT, PKB, PRKBA, and RAC. Thus, the relationship of a gene symbol to the gene name is functionally the relationship of a nickname to a formal name (both are complete identifiers)—it is not the relationship of an acronym to its expansion. In this sense they are similar to the symbols for units of measurement in the SI system (such as km for the kilometre), in that they can be viewed as true logograms rather than just abbreviations. Sometimes the distinction is academic, but not always. Although it is not wrong to say that "VEGFA" is an acronym standing for "vascular endothelial growth factor A", just as it is not wrong that "km" is an abbreviation for "kilometre", there is more to the formality of symbols than those statements capture.
The root portion of the symbols for a gene family (such as the "SERPIN" root in SERPIN1, SERPIN2, SERPIN3, and so on) is called a root symbol. [12]
The HUGO Gene Nomenclature Committee is responsible for providing human gene naming guidelines and approving new, unique human gene names and symbols (short identifiers typically created by abbreviating). All human gene names and symbols can be searched online at the HGNC [13] website, and the guidelines for their formation are available there. [14] The guidelines for humans fit logically into the larger scope of vertebrates in general, and the HGNC's remit has recently expanded to assigning symbols to all vertebrate species without an existing nomenclature committee, to ensure that vertebrate genes are named in line with their human orthologs/paralogs. Human gene symbols generally are italicised, with all letters in uppercase (e.g., SHH, for sonic hedgehog). Italics are not necessary in gene catalogs. Protein designations are the same as the gene symbol except that they are not italicised. Like the gene symbol, they are in all caps because they are human (human-specific or human homologs). mRNAs and cDNAs use the same formatting conventions as the gene symbol. [5] For naming families of genes, the HGNC recommends using a "root symbol" [15] as the root for the various gene symbols. For example, for the peroxiredoxin family, PRDX is the root symbol, and the family members are PRDX1, PRDX2, PRDX3, PRDX4, PRDX5, and PRDX6.
For mouse and rat (Mus musculus, Rattus norvegicus), gene symbols generally are italicised, with only the first letter in uppercase and the remaining letters in lowercase (Shh). Italics are not required on web pages. Protein designations are the same as the gene symbol, but are not italicised and all are upper case (SHH). [16]
For chicken (Gallus gallus), nomenclature generally follows the conventions of human nomenclature. Gene symbols generally are italicised, with all letters in uppercase (e.g., NLGN1, for neuroligin1). Protein designations are the same as the gene symbol, but are not italicised; all letters are in uppercase (NLGN1). mRNAs and cDNAs use the same formatting conventions as the gene symbol. [17]
For the anole lizard (Anolis carolinensis), gene symbols are italicised and all letters are in lowercase (shh). Protein designations are different from their gene symbol; they are not italicised, and all letters are in uppercase (SHH). [18]
For frogs (Xenopus laevis, X. tropicalis), gene symbols are italicised and all letters are in lowercase (shh). Protein designations are the same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase (Shh). [19]
For zebrafish (Danio rerio), gene symbols are italicised, with all letters in lowercase (shh). Protein designations are the same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase (Shh). [20]
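The letter-case rules summarised in the sonic hedgehog table can be sketched as a small lookup. This is only an illustration under stated assumptions: the names `Case`, `CONVENTIONS`, and `format_symbols` are hypothetical, italics are not represented, and real approved symbols can contain mixed case that simple case mapping would not reproduce.

```python
from enum import Enum
from typing import Dict, Tuple


class Case(Enum):
    UPPER = "upper"  # all letters uppercase, e.g. SHH
    TITLE = "title"  # first letter uppercase, rest lowercase, e.g. Shh
    LOWER = "lower"  # all letters lowercase, e.g. shh


# (gene symbol case, protein symbol case), transcribed from the table above.
CONVENTIONS: Dict[str, Tuple[Case, Case]] = {
    "Homo sapiens": (Case.UPPER, Case.UPPER),
    "Mus musculus": (Case.TITLE, Case.UPPER),
    "Rattus norvegicus": (Case.TITLE, Case.UPPER),
    "Gallus gallus": (Case.UPPER, Case.UPPER),
    "Anolis carolinensis": (Case.LOWER, Case.UPPER),
    "Xenopus laevis": (Case.LOWER, Case.TITLE),
    "Xenopus tropicalis": (Case.LOWER, Case.TITLE),
    "Danio rerio": (Case.LOWER, Case.TITLE),
}


def _apply(symbol: str, case: Case) -> str:
    if case is Case.UPPER:
        return symbol.upper()
    if case is Case.LOWER:
        return symbol.lower()
    return symbol[:1].upper() + symbol[1:].lower()


def format_symbols(symbol: str, species: str) -> Tuple[str, str]:
    """Return (gene symbol, protein symbol) in the species' letter case."""
    gene_case, protein_case = CONVENTIONS[species]
    return _apply(symbol, gene_case), _apply(symbol, protein_case)


if __name__ == "__main__":
    for name in CONVENTIONS:
        print(name, format_symbols("shh", name))
```

Keeping the conventions in a table rather than in branching logic makes it easier to check each entry against the published guidelines for that species.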
A nearly universal rule in copyediting of articles for medical journals and other health science publications is that abbreviations and acronyms must be expanded at first use, to provide a glossing type of explanation. Typically no exceptions are permitted except for small lists of especially well known terms (such as DNA or HIV). Although readers with high subject-matter expertise do not need most of these expansions, those with intermediate or (especially) low expertise are appropriately served by them.
One complication that gene and protein symbols bring to this general rule is that they are not, accurately speaking, abbreviations or acronyms, despite the fact that many were originally coined via abbreviating or acronymic etymology. They are pseudoacronyms (as SAT and KFC also are) because they do not "stand for" any expansion. Rather, the relationship of a gene symbol to the gene name is functionally the relationship of a nickname to a formal name (both are complete identifiers)—it is not the relationship of an acronym to its expansion. In fact, many official gene symbol–gene name pairs do not even share their initial-letter sequences (although some do). Nevertheless, gene and protein symbols "look just like" abbreviations and acronyms, which presents the problem that "failing" to "expand" them (even though it is not actually a failure and there are no true expansions) creates the appearance of violating the spell-out-all-acronyms rule.
One common way of reconciling these two opposing forces is simply to exempt all gene and protein symbols from the glossing rule. This is certainly fast and easy to do, and in highly specialized journals, it is also justified because the entire target readership has high subject matter expertise. (Experts are not confused by the presence of symbols, whether known or novel, and they know where to look them up online for further details if needed.) But for journals with broader and more general target readerships, this action leaves the readers without any explanatory annotation and can leave them wondering what the apparent abbreviation stands for and why it was not explained. Therefore, a good alternative solution is simply to put either the official gene name or a suitable short description (gene alias/other designation) in parentheses after the first use of the official gene/protein symbol. This meets both the formal requirement (the presence of a gloss) and the functional requirement (helping the reader to know what the symbol refers to). The same guideline applies to shorthand names for sequence variations; AMA says, "In general medical publications, textual explanations should accompany the shorthand terms at first mention." [21] Thus "188del11" is glossed as "an 11-bp deletion at nucleotide 188." This corollary rule (which forms an adjunct to the spell-everything-out rule) often also follows the "abbreviation-leading" style of expansion that has become more prevalent in recent years. Traditionally, the abbreviation always followed the fully expanded form in parentheses at first use. This is still the general rule. But for certain classes of abbreviations or acronyms (such as clinical trial acronyms [e.g., ECOG] or standardized polychemotherapy regimens [e.g., CHOP]), this pattern may be reversed, because the short form is more widely used and the expansion is merely parenthetical to the discussion at hand. The same is true of gene/protein symbols.
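The symbol-first glossing pattern described above (symbol, then the name or a short description in parentheses at first mention only) can be sketched in a few lines. The function name `gloss_first_mention` is hypothetical, and the sketch assumes plain text with whole-word symbol matches; it is not a description of any journal's actual tooling.

```python
import re
from typing import List, Mapping, Tuple


def gloss_first_mention(text: str, glosses: Mapping[str, str]) -> str:
    """Insert ' (gloss)' after the first whole-word occurrence of each symbol."""
    insertions: List[Tuple[int, str]] = []
    for symbol, gloss in glosses.items():
        # \b keeps "AKT" from matching inside "AKT1".
        match = re.search(rf"\b{re.escape(symbol)}\b", text)
        if match:
            insertions.append((match.end(), f" ({gloss})"))
    # Apply from the end of the text so earlier offsets stay valid.
    for position, addition in sorted(insertions, reverse=True):
        text = text[:position] + addition + text[position:]
    return text


if __name__ == "__main__":
    sample = "CTLA4 is a gene. CTLA4 and VEGFA are studied."
    print(gloss_first_mention(sample, {
        "CTLA4": "cytotoxic T-lymphocyte-associated protein 4",
        "VEGFA": "vascular endothelial growth factor A",
    }))
```

All match positions are located in the original text before anything is inserted, so a gloss that happens to contain another symbol is never itself glossed.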
The HUGO Gene Nomenclature Committee (HGNC) maintains an official symbol and name for each human gene, as well as a list of synonyms and previous symbols and names. For example, for AFF1 (AF4/FMR2 family, member 1), previous symbols and names are MLLT2 ("myeloid/lymphoid or mixed-lineage leukemia (trithorax (Drosophila) homolog); translocated to, 2") and PBM1 ("pre-B-cell monocytic leukemia partner 1"), and synonyms are AF-4 and AF4. Authors of journal articles often use the latest official symbol and name, but just as often they use synonyms and previous symbols and names, which are well established by earlier use in the literature. AMA style is that "authors should use the most up-to-date term" [22] and that "in any discussion of a gene, it is recommended that the approved gene symbol be mentioned at some point, preferably in the title and abstract if relevant." [22] Because copyeditors are not expected or allowed to rewrite the gene and protein nomenclature throughout a manuscript (except by rare express instructions on particular assignments), the middle ground in manuscripts using synonyms or older symbols is that the copyeditor will add a mention of the current official symbol at least as a parenthetical gloss at the first mention of the gene or protein, and query for confirmation.
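The copyediting middle ground described above, adding the current official symbol as a parenthetical gloss when an author uses a synonym or previous symbol, can be sketched as an alias lookup. The record below is hand-typed from the AFF1 example in the text, and the names `RECORDS`, `build_alias_index`, and `suggest_gloss` are hypothetical; a real workflow would draw on HGNC data rather than a hard-coded table.

```python
from typing import Dict, List, Optional

# Approved symbol -> synonyms and previous symbols (AFF1 example from the text).
RECORDS: Dict[str, Dict[str, List[str]]] = {
    "AFF1": {"previous": ["MLLT2", "PBM1"], "synonyms": ["AF-4", "AF4"]},
}


def build_alias_index(records: Dict[str, Dict[str, List[str]]]) -> Dict[str, str]:
    """Map every approved, previous, and synonym symbol to its approved symbol."""
    index: Dict[str, str] = {}
    for approved, names in records.items():
        index[approved] = approved
        for alias in names["previous"] + names["synonyms"]:
            index[alias] = approved
    return index


def suggest_gloss(symbol_used: str, index: Dict[str, str]) -> Optional[str]:
    """Return 'alias (APPROVED)' for a first mention, or None if no gloss is needed."""
    approved = index.get(symbol_used)
    if approved is None or approved == symbol_used:
        return None  # unknown symbol, or already the approved one
    return f"{symbol_used} ({approved})"


if __name__ == "__main__":
    index = build_alias_index(RECORDS)
    print(suggest_gloss("MLLT2", index))
```

Returning a suggestion instead of rewriting the manuscript mirrors the practice described above, in which the copyeditor adds the gloss and queries the author for confirmation.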
Some basic conventions, such as (1) that animal/human homolog (ortholog) pairs differ in letter case (title case and all caps, respectively) and (2) that the symbol is italicized when referring to the gene but nonitalic when referring to the protein, are often not followed by contributors to medical journals. Many journals have the copyeditors restyle the casing and formatting to the extent feasible, although in complex genetics discussions only subject-matter experts (SMEs) can effortlessly parse them all. One example that illustrates the potential for ambiguity among non-SMEs is that some official gene names have the word "protein" within them, so the phrases "brain protein I3 (BRI3)" with the symbol in italics (referring to the gene) and "brain protein I3 (BRI3)" with the symbol in roman type (referring to the protein) are both valid. The AMA Manual gives another example: both "the TH gene" with TH in roman type and "the TH gene" with TH in italics can validly be parsed as correct ("the gene for tyrosine hydroxylase"), because the first mentions the alias (description) and the latter mentions the symbol. This seems confusing on the surface, although it is easier to understand when explained as follows: in this gene's case, as in many others, the alias (description) "happens to use the same letter string" that the symbol uses. (The matching of the letters is of course acronymic in origin and thus the phrase "happens to" implies more coincidence than is actually present; but phrasing it that way helps to make the explanation clearer.) There is no way for a non-SME to know this is the case for any particular letter string without looking up every gene from the manuscript in a database such as NCBI Gene, reviewing its symbol, name, and alias list, and doing some mental cross-referencing and double-checking (plus it helps to have biochemical knowledge). Most medical journals do not (in some cases cannot) pay for that level of fact-checking as part of their copyediting service level; therefore, it remains the author's responsibility.
However, as pointed out earlier, many authors make little attempt to follow the letter case or italic guidelines; and regarding protein symbols, they often will not use the official symbol at all. For example, although the guidelines would call p53 protein "TP53" in humans or "Trp53" in mice, most authors call it "p53" in both (and even refuse to call it "TP53" if edits or queries try to), not least because of the biologic principle that many proteins are essentially or exactly the same molecules regardless of mammalian species. Regarding the gene, authors are usually willing to call it by its human-specific symbol and capitalization, TP53, and may even do so without being prompted by a query. But the end result of all these factors is that the published literature often does not follow the nomenclature guidelines completely.
Bacteria: Gene symbols are typically composed of three lower-case, italicized letters that serve as an abbreviation of the process or pathway in which the gene product is involved (e.g., rpo genes encode RNA polymerase). To distinguish among different alleles, the abbreviation is followed by an upper-case letter (e.g., the rpoB gene encodes the β subunit of RNA polymerase). Protein symbols are not italicized, and the first letter is upper-case (e.g., RpoB).