C6orf222 is a protein that in humans is encoded by the C6orf222 gene (6p21.31). [1] [2] C6orf222 is conserved in mammals, birds and reptiles, with the most distant ortholog being the green sea turtle, Chelonia mydas. The C6orf222 protein contains one mammalian conserved domain: DUF3292. The protein is also predicted to contain a BH3 domain, which has predicted conservation in distant orthologs from the clade Aves. [3] [4]
C6orf222 is located on the negative DNA strand of the short arm of chromosome 6 at locus 21.31. The gene is 21,129 base pairs long and extends from base pair 36315757 to 36336977. The gene produces a single transcript that is 3,750 nucleotides long. [5] There is a unique upstream, in-frame stop codon in the 5′ UTR at nucleotides 86-88 of the mRNA. [6]
The gene PNPLA1 is found on the positive strand just upstream of C6orf222 and belongs to the patatin-like phospholipase family. PNPLA1 extends from base pairs 36239766 to 36313955. [7] ETV7 is found downstream on the negative strand and extends from base pairs 36354221 to 36387800. [8] Two further genes, LOC105375036 (upstream) and LOC105375037 (downstream), lie on the negative and positive strands, respectively. Both encode non-coding RNA. [9] [10]
Several lines of evidence suggest that C6orf222 is expressed at low levels in select tissues rather than ubiquitously, including ascites, bladder, intestine, testes, vagina, lymph nodes and skin. Ascites and bladder showed the strongest expression, with 25 and 66 transcripts per million, respectively. [11] [12]
The promoter region for C6orf222 is predicted to exist between 36304561 and 36305162 and has a length of 602 base pairs. [13]
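The stated promoter length follows from its coordinates when both end positions are counted (a closed, 1-based interval). The sketch below only illustrates that arithmetic; the function name `interval_length` is a hypothetical helper, not part of any tool cited here.

```python
def interval_length(start: int, end: int) -> int:
    """Length of a closed interval [start, end] in base pairs.

    Assumes 1-based, inclusive coordinates, so both end positions count.
    """
    if end < start:
        raise ValueError("end must not be smaller than start")
    return end - start + 1


# Predicted promoter coordinates as given in the text.
promoter_start, promoter_end = 36304561, 36305162
print(interval_length(promoter_start, promoter_end))  # 602
```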
C6orf222 consists of 652 amino acids and has a molecular weight of 71.9 kDa and an isoelectric point of 8.70 in humans. [1] [14] [15]
While the exact function of C6orf222 remains unknown, there is evidence that it is involved in apoptosis as a pro-apoptotic protein, owing to the conserved BH3 interacting-domain death agonist (BID) domain located at the C-terminus of the protein, between amino acid residues 535 and 611 in humans. [3] [4]
C6orf222 contains a single domain of unknown function, DUF3292, found between amino acid residues 38 and 110. The protein is proline-rich but asparagine- and tyrosine-poor. The charge distribution is fairly even: there are no clusters of positive or negative charge within the protein. [14]
The secondary structure of the protein was predicted using four bioinformatics servers. All four programs predicted extensive disorder, concentrated at the N-terminus and covering 79% to 88% of the protein. This extent of disorder is consistent with the proline-rich amino acid composition of the protein. Several α-helices were predicted at the C-terminus of the protein and are conserved in orthologs of the human protein. [14] [17] [18] [19] The protein is predicted to be predominantly soluble, with a majority of the amino acid residues exposed. [15] The average hydrophobicity of C6orf222 was -0.968, consistent with a soluble protein. [20] [21]
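An average hydrophobicity of this kind is commonly computed as a GRAVY score: the mean of per-residue hydropathy values over the whole sequence, with negative values indicating a hydrophilic protein. The sketch below assumes the Kyte-Doolittle scale, which the source does not name, and uses a short made-up peptide because the C6orf222 sequence is not reproduced here. The names `KYTE_DOOLITTLE`, `gravy` and `example_peptide` are illustrative only.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the source does not name one).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Mean hydropathy of a protein sequence (one-letter amino acid codes)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence is empty")
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"unsupported residues: {sorted(unknown)}")
    return sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)


# Hypothetical proline- and serine-rich peptide, not taken from C6orf222.
example_peptide = "MPPSPRSK"
print(round(gravy(example_peptide), 3))
```

A negative result from such a calculation points the same way as the reported value of -0.968: more hydrophilic than hydrophobic residues on average.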
C6orf222 is predicted to be extensively phosphorylated, with 42 phosphoserines and 7 phosphothreonines conserved in orthologs of the human C6orf222 protein. These results suggest that C6orf222 is a phosphoprotein. The protein contains only one nuclear export signal (NES) residue, found at 275-L; however, the NES score was rather low at 0.672. Structural analysis predicted the protein to be sequestered in the nucleus with a 95% probability. [22] This prediction is supported by the presence of a nuclear localization sequence (NLS) found between residues 142 and 151. [21] [23]
The bioinformatics resources MINT, STRING and GeneCards did not reveal any protein interactions with C6orf222. [24] [25] [26] However, a predicted BH3 domain in the C6orf222 protein was found to interact with both Bcl-2 and Bcl-xL, implicating C6orf222 in apoptosis. [4]
Table of Potential Transcription Factor Binding Sites in the Predicted C6orf222 Promoter: [13]
Transcription Factor | Start | End | Strand | Matrix Score | Sequence |
---|---|---|---|---|---|
Myeloid zinc finger protein MZF1 | 115 | 125 | - | 1.000 | gtGGGGaggga |
Heat shock factor 2 | 91 | 115 | - | 0.953 | aggtacctagaaAGAAaggtcagcg |
NF-κB | 106 | 120 | - | 0.894 | gaGGGAggtacctag |
GATA-binding factor 2 | 133 | 145 | + | 0.958 | caaaGATAactga |
Pleomorphic adenoma gene 1 | 168 | 190 | - | 1.000 | taGGGGgagaaagtcgaggtggc |
TF-yin yang 2 | 196 | 218 | + | 0.839 | catttcCCATtaagtcttgtttt |
RBPJ-κ | 196 | 208 | - | 0.951 | ttaaTGGGaaatg |
Human acute myelogenous leukemia factors | 281 | 295 | - | 0.834 | actGGGGtttgggt |
Smad3 TF involved in TGF-β signaling | 244 | 254 | - | 0.995 | aagGTCTggct |
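
Two rows of the table above overlap: the RBPJ-κ site (196-208, minus strand) and the first 13 bases of the TF-yin yang 2 site (196-218, plus strand). Assuming minus-strand sites are reported 5′ to 3′ as the reverse complement of the plus strand, the two sequences should mirror each other. The sketch below checks this; `reverse_complement` is a hypothetical helper written for this illustration.

```python
# Translation table for lowercase DNA bases.
_COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence, returned in lowercase."""
    return seq.lower().translate(_COMPLEMENT)[::-1]


yy2_site = "catttcCCATtaagtcttgtttt"  # positions 196-218, plus strand
rbpj_site = "ttaaTGGGaaatg"           # positions 196-208, minus strand

# Bases of the plus-strand site that fall within positions 196-208.
overlap = yy2_site[: 208 - 196 + 1]

print(reverse_complement(overlap) == rbpj_site.lower())  # True
```

The capitalised core bases in the table are lowercased before comparison, since case there marks only the core of each matrix match.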
Predicted orthologs of C6orf222 exist in 82 organisms. [2] The most distant ortholog is the green sea turtle, which diverged from humans 296 million years ago, suggesting that C6orf222 was already present in the common ancestor of mammals, reptiles and birds. [3] [27]
Table of C6orf222 Orthologs: [3]
Scientific Name | Common Name | Divergence Date from Humans (MYA) [28] | NCBI Protein Accession | Protein Length (amino acids) | Sequence Similarity (%) |
---|---|---|---|---|---|
Homo sapiens | Human | 0 | NP_001010903.3 | 652 | 100 |
Gorilla gorilla | Gorilla | 8.8 | XP_004043942.1 | 652 | 97 |
Callithrix jacchus | Common marmoset | 42.6 | XP_002746521.2 | 678 | 81 |
Camelus ferus | Bactrian camel | 94.2 | XP_006195372 | 658 | 62 |
Physeter catodon | Sperm whale | 94.2 | XP_007115001.1 | 638 | 60 |
Orcinus orca | Orca | 94.2 | XP_004267852.1 | 639 | 60 |
Canis lupus familiaris | Dog | 94.2 | XP_850456 | 662 | 58 |
Mustela putorius furo | Ferret | 94.2 | XP_004770808 | 664 | 57 |
Oryctolagus cuniculus | Rabbit | 92.3 | XP_008261475 | 716 | 49 |
Mus musculus | Mouse | 92.3 | NP_766038.1 | 669 | 47 |
Haliaeetus leucocephalus | Bald eagle | 296 | XP_010583084.0 | 407 | 28 |
Fulmarus glacialis | Northern fulmar | 296 | XP_009570924.1 | 435 | 28 |
Chelonia mydas | Green sea turtle | 296 | XP_007063595.1 | 387 | 33 |
There are no predicted paralogs for C6orf222 in either humans or mice. [3]
Multiple sequence alignments demonstrated amino acid residue conservation throughout the C6orf222 protein in a variety of orthologs, with the most extensive conservation found at the C-terminus of the protein. The BH3 interacting-domain death agonist (BID) domain was found to be conserved at the C-terminus in a multiple sequence alignment of both strict and distant orthologs. [27] The equal distribution of positively and negatively charged amino acid residues is conserved throughout orthologs of the human C6orf222 protein. [22]
C6orf222 is a fast-evolving gene, similar in this respect to fibrinogen alpha chain (FGA). This evolutionary trend contrasts with cytochrome c, which has been shown to be a slow-evolving gene. [28]