A domain of unknown function (DUF) is a protein domain that has no characterised function. These families have been collected together in the Pfam database using the prefix DUF followed by a number, with examples being DUF2992 and DUF1220. As of 2019, there were almost 4,000 DUF families within the Pfam database, representing over 22% of known families. Some families without a characterised function are known by other, popularly used names rather than the DUF prefix, but are nevertheless DUFs. [1]
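The naming convention described above (the prefix DUF followed by a number) can be checked mechanically. The following is a minimal sketch, not Pfam's own tooling: the function names `is_duf_name` and `split_by_naming` are hypothetical, and the pattern assumes an upper-case prefix followed by digits only.

```python
import re

# Assumed pattern: the literal prefix "DUF" followed by one or more digits.
DUF_NAME = re.compile(r"DUF\d+")


def is_duf_name(family_id: str) -> bool:
    """Return True if the identifier follows the DUF<number> convention."""
    return DUF_NAME.fullmatch(family_id) is not None


def split_by_naming(family_ids):
    """Partition identifiers into DUF-named and otherwise-named families."""
    duf_named, other = [], []
    for family_id in family_ids:
        (duf_named if is_duf_name(family_id) else other).append(family_id)
    return duf_named, other


# Hypothetical usage with identifiers mentioned in this article.
duf_named, other = split_by_naming(["DUF2992", "GGDEF", "DUF1220", "EAL"])
```

A name-based check of this kind would miss the DUFs that are known by other names, so it can only give a lower bound on the number of uncharacterised families.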
The DUF designation is tentative, and such families tend to be renamed to a more specific name (or merged into an existing domain) after a function is identified. [2] [3]
The DUF naming scheme was introduced by Chris Ponting, through the addition of DUF1 and DUF2 to the SMART database. [4] These two domains were found to be widely distributed in bacterial signaling proteins. Subsequently, the functions of these domains were identified and they have since been renamed as the GGDEF domain and EAL domain respectively. [2]
Structural genomics programmes have attempted to understand the function of DUFs through structure determination. The structures of over 250 DUF families have been solved. This work, published in 2009, showed that about two thirds of DUF families had a structure similar to a previously solved one and were therefore likely to be divergent members of existing protein superfamilies, whereas about one third possessed a novel protein fold. [5]
Some DUF families share remote sequence homology with domains that have characterized functions. Computational methods can be used to detect these relationships. A 2015 study was able to assign 20% of the DUFs to characterized structural superfamilies. [6] Pfam also continuously performs manually verified assignments of families to "clan" superfamily entries. [1]
More than 20% of all protein domains were annotated as DUFs in 2013. About 2,700 DUFs are found in bacteria compared with just over 1,500 in eukaryotes. Over 800 DUFs are shared between bacteria and eukaryotes, and about 300 of these are also present in archaea. A total of 2,786 bacterial Pfam domains even occur in animals, including 320 DUFs. [7]
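The counts above overlap, so they cannot simply be added. As a rough illustration, inclusion-exclusion can be applied to the rounded figures from the text. The source gives only approximate values ("about", "just over", "over"), so the results below are order-of-magnitude estimates, and the variable names are illustrative.

```python
def union_size(a: int, b: int, shared: int) -> int:
    """Inclusion-exclusion for two sets: |A or B| = |A| + |B| - |A and B|."""
    return a + b - shared


# Rounded figures taken from the text; the true values are approximate.
bacterial = 2700       # DUFs found in bacteria ("about 2,700")
eukaryotic = 1500      # DUFs found in eukaryotes ("just over 1,500")
shared = 800           # DUFs in both bacteria and eukaryotes ("over 800")
shared_archaea = 300   # of the shared DUFs, those also in archaea ("about 300")

in_either = union_size(bacterial, eukaryotic, shared)  # bacteria or eukaryotes
bacteria_only = bacterial - shared                     # not shared with eukaryotes
eukaryote_only = eukaryotic - shared                   # not shared with bacteria
shared_without_archaea = shared - shared_archaea       # shared but absent from archaea
```

With these rounded inputs, roughly 3,400 DUFs occur in bacteria or eukaryotes, of which about 1,900 are not shared with eukaryotes and about 700 are not shared with bacteria.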
Many DUFs are highly conserved, indicating an important role in biology. However, many such DUFs are not essential, hence their biological role often remains unknown. For instance, DUF143 is present in most bacterial and eukaryotic genomes. [8] However, when it was deleted in Escherichia coli, no obvious phenotype was detected. Later it was shown that the proteins that contain DUF143 are ribosomal silencing factors that block the assembly of the two ribosomal subunits. [8] While this function is not essential, it helps the cells to adapt to low-nutrient conditions by shutting down protein biosynthesis. As a result, these proteins and the DUF only become relevant when the cells starve. [8] It is thus believed that many DUFs (or proteins of unknown function, PUFs) are only required under certain conditions.
Goodacre et al. identified 238 DUFs in 355 essential proteins (in 16 model bacterial species), most of which represent single-domain proteins, clearly establishing the biological essentiality of DUFs. These DUFs are called "essential DUFs" or eDUFs. [7]
DNA primase is an enzyme involved in the replication of DNA and is a type of RNA polymerase. Primase catalyzes the synthesis of a short RNA segment called a primer complementary to a ssDNA template. After this elongation, the RNA piece is removed by a 5' to 3' exonuclease and refilled with DNA.
A protein family is a group of evolutionarily related proteins. In many cases, a protein family has a corresponding gene family, in which each gene encodes a corresponding protein with a 1:1 relationship. The term "protein family" should not be confused with family as it is used in taxonomy.
The Structural Classification of Proteins (SCOP) database is a largely manual classification of protein structural domains based on similarities of their structures and amino acid sequences. A motivation for this classification is to determine the evolutionary relationship between proteins. Proteins with the same shapes but having little sequence or functional similarity are placed in different superfamilies, and are assumed to have only a very distant common ancestor. Proteins having the same shape and some similarity of sequence and/or function are placed in "families", and are assumed to have a closer common ancestor.
Pfam is a database of protein families that includes their annotations and multiple sequence alignments generated using hidden Markov models. The most recent version, Pfam 36.0, was released in September 2023 and contains 20,795 families.
InterPro is a database of protein families, protein domains and functional sites in which identifiable features found in known proteins can be applied to new protein sequences in order to functionally characterise them.
A ribosomal protein is any of the proteins that, in conjunction with rRNA, make up the ribosomal subunits involved in the cellular process of translation. E. coli, other bacteria and Archaea have a 30S small subunit and a 50S large subunit, whereas humans and yeasts have a 40S small subunit and a 60S large subunit. Equivalent subunits are frequently numbered differently between bacteria, Archaea, yeasts and humans.
In molecular biology, a protein domain is a region of a protein's polypeptide chain that is self-stabilizing and that folds independently from the rest. Each domain forms a compact folded three-dimensional structure. Many proteins consist of several domains, and a domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. In general, domains vary in length from about 50 amino acids up to 250 amino acids. The shortest domains, such as zinc fingers, are stabilized by metal ions or disulfide bridges. Domains often form functional units, such as the calcium-binding EF hand domain of calmodulin. Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins.
In the field of molecular biology, a two-component regulatory system serves as a basic stimulus-response coupling mechanism to allow organisms to sense and respond to changes in many different environmental conditions. Two-component systems typically consist of a membrane-bound histidine kinase that senses a specific environmental stimulus and a corresponding response regulator that mediates the cellular response, mostly through differential expression of target genes. Although two-component signaling systems are found in all domains of life, they are most common by far in bacteria, particularly in Gram-negative bacteria and cyanobacteria; both histidine kinases and response regulators are among the largest gene families in bacteria. They are much less common in archaea and eukaryotes; although they do appear in yeasts, filamentous fungi, and slime molds, and are common in plants, two-component systems have been described as "conspicuously absent" from animals.
MALSU1 is a gene on chromosome 7 in humans that encodes the protein MALSU1. This protein localizes to mitochondria and is probably involved in mitochondrial translation or the biogenesis of the large subunit of the mitochondrial ribosome.
SUPERFAMILY is a database and search platform of structural and functional annotation for all proteins and genomes. It classifies amino acid sequences into known structural domains, especially into SCOP superfamilies. Domains are functional, structural, and evolutionary units that form proteins. Domains of common ancestry are grouped into superfamilies. The domains and domain superfamilies are defined and described in SCOP. Superfamilies are groups of proteins which have structural evidence to support a common evolutionary ancestor but may not have detectable sequence homology.
Richard Michael Durbin is a British computational biologist and Al-Kindi Professor of Genetics at the University of Cambridge. He also serves as an associate faculty member at the Wellcome Sanger Institute where he was previously a senior group leader.
Protein function prediction methods are techniques that bioinformatics researchers use to assign biological or biochemical roles to proteins. These proteins are usually ones that are poorly studied or predicted based on genomic sequence data. These predictions are often driven by data-intensive computational procedures. Information may come from nucleic acid sequence homology, gene expression profiles, protein domain structures, text mining of publications, phylogenetic profiles, phenotypic profiles, and protein-protein interaction. Protein function is a broad term: the roles of proteins range from catalysis of biochemical reactions to transport to signal transduction, and a single protein may play a role in multiple processes or cellular pathways.
OMPdb is a dedicated database that contains beta barrel (β-barrel) outer membrane proteins from Gram-negative bacteria. Such proteins are responsible for a broad range of important functions, like passive nutrient uptake, active transport of large molecules, protein secretion, as well as adhesion to host cells, through which bacteria expose their virulence activity.
The Protein Common Interface Database (ProtCID) is a database of similar protein-protein interfaces in crystal structures of homologous proteins.
In molecular biology, translation initiation factor IF-3 is one of the three factors required for the initiation of protein biosynthesis in bacteria. IF-3 is thought to function as a fidelity factor during the assembly of the ternary initiation complex, which consists of the 30S ribosomal subunit, the initiator tRNA and the messenger RNA. IF-3 is a basic protein that binds to the 30S ribosomal subunit. The chloroplast homolog enhances the poly(A,U,G)-dependent binding of the initiator tRNA to its ribosomal 30S subunits. IF1–IF3 may also perform ribosome recycling.
EamA is a protein domain found in a wide range of proteins including the Erwinia chrysanthemi PecM protein, which is involved in pectinase, cellulase and blue pigment regulation, the Salmonella typhimurium PagO protein, and some members of the solute carrier family group 35 (SLC35) nucleoside-sugar transporters. Many members of this family have no known function and are predicted to be integral membrane proteins and many of the proteins contain two copies of the domain.
The Methanosarcinales S-layer Tile Protein (MSTP) is a protein family found almost exclusively in Methanomicrobia members of the order Methanosarcinales. Typically, a tandem repeat of two DUF1608 domains is contained in a single MSTP protein chain, and these proteins self-assemble into the protective proteinaceous surface layer (S-layer) structure that encompasses the cell. The S-layer, which is found in most Archaea and in many bacteria, serves many crucial functions, including protection from deleterious extracellular substances.
Alexander George Bateman is a computational biologist and Head of Protein Sequence Resources at the European Bioinformatics Institute (EBI), part of the European Molecular Biology Laboratory (EMBL) in Cambridge, UK. He has led the development of the Pfam biological database and introduced the Rfam database of RNA families. He has also been involved in the use of Wikipedia for community-based annotation of biological databases.
The K+ Transporter (Trk) Family is a member of the voltage-gated ion channel (VIC) superfamily. The proteins of the Trk family are derived from Gram-negative and Gram-positive bacteria, yeast and plants.
Major facilitator superfamily domain containing 3 (MFSD3) is a protein belonging to the MFS Pfam clan. It is an atypical solute carrier localized to the neuronal plasma membrane.