Putative gene

A putative gene is a segment of DNA that is believed to be a gene. Putative genes can share sequence similarities with already characterized genes and thus can be inferred to share a similar function, yet their exact function remains unknown. [1] Newly identified sequences are considered putative gene candidates when homologs of those sequences are found to be associated with the phenotype of interest. [2]

Examples

Examples of studies involving putative genes include the discovery of 30 putative receptor genes in the rat vomeronasal organ (VNO) [3] and the identification of putative TATA boxes in 79 plant genes. [4]

Practical importance

To define and characterize a biosynthetic gene cluster, all the putative genes within the cluster must first be identified and their functions characterized. This can be done through complementation and knockout experiments. As putative genes are characterized, the genome under study becomes increasingly well understood, because more interactions can be identified. [5] Identification of putative genes is also necessary for studying genomic evolution, as a significant proportion of a genome is made up of larger families of related genes. Genomic evolution occurs by processes such as duplication of individual genes, genome segments, or entire genomes. These processes can result in loss of function, altered function, or gain of function, and can have drastic effects on the phenotype. [6] [7]

DNA mutations outside of a putative gene can act through a position effect, in which they alter the expression of the gene. These alterations leave the transcription unit and promoter of the gene intact, but may involve distal promoters, enhancer or silencer elements, or the local chromatin environment. Such mutations can be associated with diseases or disorders linked to the gene.

Identification

Putative genes can be identified by clustering large groups of sequences by pattern and arranging them by mutual similarity [8] or can be inferred from the presence of potential TATA boxes. [9]
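The TATA-box approach mentioned above can be illustrated with a minimal sketch. It assumes the simplified consensus TATAWAW (W = A or T) and reports every position where a DNA string matches it. The function name, the example sequence, and the choice of consensus are illustrative assumptions, not the method used in the cited study; real promoter prediction typically relies on weighted models and positional context.

```python
import re

# Simplified TATA-box consensus TATAWAW, where W means A or T.
# This consensus is an assumption made for illustration only.
# The lookahead lets overlapping matches be reported as well.
TATA_PATTERN = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_putative_tata_boxes(sequence):
    """Return (start, motif) pairs, 0-based, for every TATAWAW match."""
    sequence = sequence.upper()
    return [(m.start(), m.group(1)) for m in TATA_PATTERN.finditer(sequence)]


# Hypothetical toy sequence, not taken from any real genome.
example = "GGCCTATAAATAGGCGCGTATATATACC"
for start, motif in find_putative_tata_boxes(example):
    print(start, motif)
# prints:
# 4 TATAAAT
# 18 TATATAT
```

A hit produced this way only marks a candidate promoter element; whether a gene actually lies downstream still has to be established by other evidence.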

Putative genes can also be identified by recognizing differences between well-known gene clusters and gene clusters with a unique profile. [10]

Software tools have been developed to identify putative genes automatically. They do this by searching for gene families and testing the validity of uncharacterized genes by comparison with already identified genes. [11]
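The idea of testing an uncharacterized sequence against already identified genes can be sketched as follows. This is a toy illustration and not the algorithm of the cited tool. It scores similarity as the Jaccard index of shared k-mers, where production tools would normally use alignment-based searches. The function names, the family labels, the sequences, and the 0.3 threshold are all hypothetical.

```python
def kmers(sequence, k=4):
    """Return the set of overlapping k-mers in a sequence."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_family(candidate, families, k=4, threshold=0.3):
    """Return (family_name, score) for the most similar known gene.

    Returns (None, score) when the best score is below the threshold,
    meaning the candidate cannot be assigned to any known family.
    """
    cand = kmers(candidate, k)
    best_name, best_score = None, 0.0
    for name, members in families.items():
        for member in members:
            score = jaccard(cand, kmers(member, k))
            if score > best_score:
                best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Hypothetical toy data: two "families" with one characterized gene each.
known_families = {
    "familyA": ["ATGGCCATTGTAATGGGCCGC"],
    "familyB": ["ATGTTTCCCAAAGGGTTTCCC"],
}
candidate = "ATGGCCATTGTAATGGGCCGA"  # differs from the familyA gene by one base
family, score = best_family(candidate, known_families)
print(family, round(score, 3))
```

A candidate that clears the threshold would be reported as a putative member of that family. Its function remains an inference until it is confirmed experimentally.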

Protein products can be identified and used to characterize the putative genes that code for them. [12]

References

  1. Alexandre S, Guyaux M, Murphy NB, Coquelet H, Pays A, Steinert M, Pays E (June 1988). "Putative genes of a variant-specific antigen gene transcription unit in Trypanosoma brucei". Molecular and Cellular Biology. 8 (6): 2367–78. doi:10.1128/mcb.8.6.2367. PMC 363435. PMID 3405209.
  2. Mishima K, Hirao T, Tsubomura M, Tamura M, Kurita M, Nose M, et al. (April 2018). "Identification of novel putative causative genes and genetic marker for male sterility in Japanese cedar (Cryptomeria japonica D.Don)". BMC Genomics. 19 (1): 277. doi:10.1186/s12864-018-4581-5. PMC 5914023. PMID 29685102.
  3. Dulac C, Axel R (October 1995). "A novel family of genes encoding putative pheromone receptors in mammals". Cell. 83 (2): 195–206. doi:10.1016/0092-8674(95)90161-2. PMID 7585937. S2CID 18784638.
  4. Joshi CP (August 1987). "An inspection of the domain between putative TATA box and translation start site in 79 plant genes". Nucleic Acids Research. 15 (16): 6643–53. doi:10.1093/nar/15.16.6643. PMC 306128. PMID 3628002.
  5. Wawrzyn GT, Bloch SE, Schmidt-Dannert C (2012-01-01). "Discovery and characterization of terpenoid biosynthetic pathways of fungi". In Hopwood (ed.). Natural Product Biosynthesis by Microorganisms and Plants, Part A. Methods in Enzymology. Vol. 515. pp. 83–105. doi:10.1016/b978-0-12-394290-6.00005-7. ISBN 9780123942906. PMID 22999171.
  6. Frank RL, Mane A, Ercal F (September 2006). "An automated method for rapid identification of putative gene family members in plants". BMC Bioinformatics. 7 Suppl 2 (2): S19. doi:10.1186/1471-2105-7-S2-S19. PMC 1683565. PMID 17118140.
  7. Emery AE (2013). "Personal Memories of David Rimoin". Emery and Rimoin's Principles and Practice of Medical Genetics. Elsevier. pp. i. doi:10.1016/b978-0-12-383834-6.11001-8. ISBN 978-0-12-383834-6.
  8. Aouf M, Liyanage L (2012-09-26). "Analysis of High Dimensionality Yeast Gene Expression Data Using Data Mining". Applied Mechanics and Materials. 197: 515–522. Bibcode:2012AMM...197..515A. doi:10.4028/www.scientific.net/amm.197.515. S2CID 109965976.
  9. Joshi CP (August 1987). "An inspection of the domain between putative TATA box and translation start site in 79 plant genes". Nucleic Acids Research. 15 (16): 6643–53. doi:10.1093/nar/15.16.6643. PMC 306128. PMID 3628002.
  10. Mihali TK, Carmichael WW, Neilan BA (February 2011). "A putative gene cluster from a Lyngbya wollei bloom that encodes paralytic shellfish toxin biosynthesis". PLOS ONE. 6 (2): e14657. Bibcode:2011PLoSO...614657M. doi:10.1371/journal.pone.0014657. PMC 3037375. PMID 21347365.
  11. Frank RL, Mane A, Ercal F (September 2006). "An automated method for rapid identification of putative gene family members in plants". BMC Bioinformatics. 7 Suppl 2 (2): S19. doi:10.1186/1471-2105-7-S2-S19. PMC 1683565. PMID 17118140.
  12. Denison M, Perlman S (April 1987). "Identification of putative polymerase gene product in cells infected with murine coronavirus A59". Virology. 157 (2): 565–8. doi:10.1016/0042-6822(87)90303-5. PMC 7131660. PMID 3029990.