Infologs

Infologs are independently designed synthetic genes derived from one or a few genes where substitutions are systematically incorporated to maximize information. Infologs are designed for perfect diversity distribution to maximize search efficiency.

Typical protein engineering methods rely on screening a large number (10^6 to 10^12 or more) of gene variants, using a surrogate high-throughput (HTP) screen to identify initial hits with improved activity. Unfortunately, results are defined by what is screened for, so the “hit” from the HTP screen often has very little real activity in a lower-throughput assay that is more indicative of the improved functionality for which the protein is being developed. Adapting the standard algorithms for engineering complex systems to work with biological systems yields a process that enables researchers to deconvolute how substitutions within a protein sequence modify its function. Combining these algorithms with an integrated query and ranking mechanism allows the identification of appropriate sequence substitutions. [1] The term Infologs refers to the set of designed genes; the singular, Infolog, describes an individual variant.

Figure: Infologs show full diversity distribution of space.

Ancestry

Homology between protein or DNA sequences is defined in terms of shared ancestry. Two segments of DNA can have shared ancestry because of either a speciation event (orthologs) or a duplication event (paralogs).

Homologs are similar genes and/or proteins which are related by ancestry.

Orthologs are the 'same' gene, but from different organisms. Homologous sequences are orthologous if they were separated by a speciation event: when a species diverges into two separate species, the copies of a single gene in the two resulting species are said to be orthologous. Orthologs, or orthologous genes, are genes in different species that originated by vertical descent from a single gene of the last common ancestor. The term "ortholog" was coined in 1970 by Walter Fitch. [2]

Paralogs are related genes that originate from a single gene which, through duplication, became two genes that over time have evolved separate functions (or, according to a 2012 paper in Science, [3] from a promiscuous starting gene that duplicated, with each copy evolving towards a different function). Paralogs typically have the same or similar function, but sometimes do not: because the original selective pressure no longer acts on one copy of the duplicated gene, that copy is free to mutate and acquire new functions. Paralogs usually occur within the same species.

Xenologs are homologs resulting from horizontal gene transfer between two organisms. Xenologs can have different functions, if the new environment is vastly different for the horizontally moving gene. In general, though, xenologs typically have similar function in both organisms.

Infologs are similar genes and/or proteins which are related by synthetic ancestry and designed to approach a perfect diversity distribution.
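As a rough illustration of what an even diversity distribution could look like, the sketch below builds a small balanced set of variants in which every substitution appears in exactly half of the variants and every pair of substitutions co-occurs equally often. The source does not describe the actual design algorithm; the parity-based (Sylvester-Hadamard) construction and the name `balanced_design` are assumptions chosen only for illustration.

```python
from itertools import combinations

def balanced_design(k):
    """Return 2**k variants over 2**k - 1 substitution positions.

    design[v][j - 1] is 1 when variant v carries substitution j. Presence is
    the parity of the bitwise AND of v and j (a Sylvester-Hadamard style
    construction, assumed here; not taken from the source).
    """
    n = 2 ** k
    return [[bin(v & j).count("1") % 2 for j in range(1, n)] for v in range(n)]

design = balanced_design(3)  # 8 variants, 7 substitutions
n_subs = len(design[0])

# Number of variants carrying each substitution
single_counts = [sum(row[j] for row in design) for j in range(n_subs)]

# Number of variants carrying each pair of substitutions together
pair_counts = {
    (a, b): sum(row[a] and row[b] for row in design)
    for a, b in combinations(range(n_subs), 2)
}

print(single_counts)              # [4, 4, 4, 4, 4, 4, 4]
print(set(pair_counts.values()))  # {2}
```

The first variant carries no substitutions and can be read as the parent gene. Naturally occurring homologs would generally not show such uniform single and pairwise counts.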

Features

Case study

Transforming protein engineering with Infologs:

Using independently designed synthetic genes where substitutions are systematically incorporated (Infologs) leads to uniform sampling, systematic variance and unrestricted, information-rich results. Wheat glutathione S-transferases (GSTs) with the ability to detoxify a panel of common herbicides were designed using this patented bioengineering method. The relative functional contribution of 60 amino acid substitutions against 14 herbicides was quantified using only 96 Infologs and dramatically improved by a small set (16) of second-generation Infologs. In addition, highly predictive GST sequence-function models against two commercially relevant herbicides were created, with quantification of the relative functional contribution of 60 amino acid substitutions in two dimensions. [4]
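To make the idea of quantifying each substitution's contribution from a small designed set more concrete, here is a minimal sketch under strong assumptions: a purely additive sequence-function model, noise-free measurements, and a balanced parity-based design. The effect values, the baseline and the function names are hypothetical and invented for illustration; they are not the GST data or the patented method.

```python
def parity_design(k):
    """Balanced design: 2**k variants over 2**k - 1 substitutions (assumed)."""
    n = 2 ** k
    return [[bin(v & j).count("1") % 2 for j in range(1, n)] for v in range(n)]

def estimate_contributions(design, activity):
    """Mean activity with each substitution minus mean activity without it."""
    effects = []
    for j in range(len(design[0])):
        with_sub = [y for row, y in zip(design, activity) if row[j]]
        without = [y for row, y in zip(design, activity) if not row[j]]
        effects.append(sum(with_sub) / len(with_sub) - sum(without) / len(without))
    return effects

# Hypothetical additive contributions of 7 substitutions (illustration only)
true_effects = [0.5, -0.25, 1.0, 0.0, 0.75, -0.5, 0.25]
baseline = 2.0  # hypothetical activity of the unsubstituted parent

design = parity_design(3)  # 8 variants
# Simulated activity: baseline plus the effects of the substitutions present
activity = [
    baseline + sum(e for e, present in zip(true_effects, row) if present)
    for row in design
]

estimated = estimate_contributions(design, activity)
print(estimated)
```

Because every other substitution is equally represented in the "with" and "without" groups, the difference of means isolates each contribution in this idealized setting. Real measurements are noisy and may include interactions between substitutions, so a regression model with more variants than parameters would normally be used.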

Figure: Comparison of homologs and Infologs.
Homologs: Naturally occurring genes that share common ancestry and traits. Diversity is biased and non-systematic.
Infologs: Independently designed synthetic genes derived from one or a few genes where substitutions are systematically incorporated to maximize information. Infologs are designed for perfect diversity distribution to maximize search efficiency.

Rational design of proteins

In rational protein design, the scientist uses detailed knowledge of the structure and function of the protein to make desired changes. This generally has the advantage of being technically easy and inexpensive, since site-directed mutagenesis techniques are well-developed. However, its major drawback is that detailed structural knowledge of a protein is often unavailable, and even when it is available, it can be extremely difficult to predict the effects of various mutations.

Computational protein design algorithms seek to identify novel amino acid sequences that are low in energy when folded to the pre-specified target structure. While the sequence-conformation space that needs to be searched is large, the most challenging requirement for computational protein design is a fast, yet accurate, energy function that can distinguish optimal sequences from similar suboptimal ones.
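The search problem described above can be sketched with a toy example: enumerate every sequence over a reduced alphabet and score each against a fixed target description. The four-letter alphabet, the burial pattern, the contact list and the energy terms below are all hypothetical simplifications, not a real force field or any published design program.

```python
from itertools import product

ALPHABET = "LKES"            # reduced alphabet: L hydrophobic; K, E charged; S polar
HYDROPHOBIC = {"L"}
CHARGE = {"K": +1, "E": -1}  # S and L are treated as uncharged

# Hypothetical target structure: burial state per position, one contacting pair
BURIED = [True, False, False, True]
CONTACTS = [(1, 2)]

def energy(seq):
    """Toy energy: burial preference per residue plus charge term per contact."""
    total = 0.0
    for residue, buried in zip(seq, BURIED):
        hydrophobic = residue in HYDROPHOBIC
        if buried:
            total += -1.0 if hydrophobic else 1.0
        else:
            total += 0.5 if hydrophobic else -0.5
    for i, j in CONTACTS:
        # Opposite charges give -1, like charges +1, anything else 0
        total += CHARGE.get(seq[i], 0) * CHARGE.get(seq[j], 0)
    return total

# Exhaustive search over all 4**4 = 256 sequences
scores = {"".join(s): energy(s) for s in product(ALPHABET, repeat=len(BURIED))}
best_energy = min(scores.values())
best_sequences = sorted(s for s, e in scores.items() if e == best_energy)

print(best_energy, best_sequences)
```

Exhaustive enumeration only works for very small examples. With 20 amino acids the number of sequences grows as 20 to the power of the chain length, which is why a fast and accurate energy function and an efficient search strategy matter in practice.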

References

  1. This technology is covered by United States issued patent US 8,005,620
  2. Fitch, Walter M. (1970). "Distinguishing Homologous from Analogous Proteins". Systematic Biology. 19 (2): 99–113. doi:10.2307/2412448. JSTOR 2412448. PMID 5449325.
  3. Nasvall, J.; Sun, L.; Roth, J. R.; Andersson, D. I. (2012). "Real-Time Evolution of New Genes by Innovation, Amplification, and Divergence". Science. 338 (6105): 384–7. Bibcode:2012Sci...338..384N. doi:10.1126/science.1226521. PMC 4392837. PMID 23087246.
  4. Enzyme Engineering Conference Presentation: "Using Infologs to Engineer Biological Systems"
