Functional genomics is a field of molecular biology that attempts to describe gene (and protein) functions and interactions. Functional genomics make use of the vast data generated by genomic and transcriptomic projects (such as genome sequencing projects and RNA sequencing). Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "candidate-gene" approach.
In order to understand functional genomics it is important to first define function. In their paper [1] Graur et al. define function in two possible ways. These are "selected effect" and "causal role". The "selected effect" function refers to the function for which a trait (DNA, RNA, protein etc.) is selected for. The "causal role" function refers to the function that a trait is sufficient and necessary for. Functional genomics usually tests the "causal role" definition of function.
The goal of functional genomics is to understand the function of genes or proteins, eventually all components of a genome. The term functional genomics is often used to refer to the many technical approaches to study an organism's genes and proteins, including the "biochemical, cellular, and/or physiological properties of each and every gene product" [2] while some authors include the study of nongenic elements in their definition. [3] Functional genomics may also include studies of natural genetic variation over time (such as an organism's development) or space (such as its body regions), as well as functional disruptions such as mutations.
The promise of functional genomics is to generate and synthesize genomic and proteomic knowledge into an understanding of the dynamic properties of an organism. This could potentially provide a more complete picture of how the genome specifies function compared to studies of single genes. Integration of functional genomics data is often a part of systems biology approaches.
Functional genomics includes function-related aspects of the genome itself such as mutation and polymorphism (such as single nucleotide polymorphism (SNP) analysis), as well as the measurement of molecular activities. The latter comprise a number of "-omics" such as transcriptomics (gene expression), proteomics (protein production), and metabolomics. Functional genomics uses mostly multiplex techniques to measure the abundance of many or all gene products such as mRNAs or proteins within a biological sample. A more focused functional genomics approach might test the function of all variants of one gene and quantify the effects of mutants by using sequencing as a readout of activity. Together these measurement modalities endeavor to quantitate the various biological processes and improve our understanding of gene and protein functions and interactions.
Systematic pairwise deletion of genes or inhibition of gene expression can be used to identify genes with related function, even if they do not interact physically. Epistasis refers to the fact that effects for two different gene knockouts may not be additive; that is, the phenotype that results when two genes are inhibited may be different from the sum of the effects of single knockouts.
Proteins formed by the translation of the mRNA (messenger RNA, a coded information from DNA for protein synthesis) play a major role in regulating gene expression. To understand how they regulate gene expression it is necessary to identify DNA sequences that they interact with. Techniques have been developed to identify sites of DNA-protein interactions. These include ChIP-sequencing, CUT&RUN sequencing and Calling Cards. [4]
Assays have been developed to identify regions of the genome that are accessible. These regions of accessible chromatin are candidate regulatory regions. These assays include ATAC-seq, DNase-Seq and FAIRE-Seq.
Microarrays measure the amount of mRNA in a sample that corresponds to a given gene or probe DNA sequence. Probe sequences are immobilized on a solid surface and allowed to hybridize with fluorescently labeled "target" mRNA. The intensity of fluorescence of a spot is proportional to the amount of target sequence that has hybridized to that spot and therefore to the abundance of that mRNA sequence in the sample. Microarrays allow for the identification of candidate genes involved in a given process based on variation between transcript levels for different conditions and shared expression patterns with genes of known function.
Serial analysis of gene expression (SAGE) is an alternate method of analysis based on RNA sequencing rather than hybridization. SAGE relies on the sequencing of 10–17 base pair tags which are unique to each gene. These tags are produced from poly-A mRNA and ligated end-to-end before sequencing. SAGE gives an unbiased measurement of the number of transcripts per cell, since it does not depend on prior knowledge of what transcripts to study (as microarrays do).
RNA sequencing has taken over microarray and SAGE technology in recent years, as noted in 2016, and has become the most efficient way to study transcription and gene expression. This is typically done by next-generation sequencing. [5]
A subset of sequenced RNAs are small RNAs, a class of non-coding RNA molecules that are key regulators of transcriptional and post-transcriptional gene silencing, or RNA silencing. Next-generation sequencing is the gold standard tool for non-coding RNA discovery, profiling and expression analysis.
Massively parallel reporter assays is a technology to test the cis-regulatory activity of DNA sequences. [6] [7] MPRAs use a plasmid with a synthetic cis-regulatory element upstream of a promoter driving a synthetic gene such as Green Fluorescent Protein. A library of cis-regulatory elements is usually tested using MPRAs, a library can contain from hundreds to thousands of cis-regulatory elements. The cis-regulatory activity of the elements is assayed by using the downstream reporter activity. The activity of all the library members is assayed in parallel using barcodes for each cis-regulatory element. One limitation of MPRAs is that the activity is assayed on a plasmid and may not capture all aspects of gene regulation observed in the genome.
STARR-seq is a technique similar to MPRAs to assay enhancer activity of randomly sheared genomic fragments. In the original publication, [8] randomly sheared fragments of the Drosophila genome were placed downstream of a minimal promoter. Candidate enhancers amongst the randomly sheared fragments will transcribe themselves using the minimal promoter. By using sequencing as a readout and controlling for input amounts of each sequence the strength of putative enhancers are assayed by this method.
Perturb-seq couples CRISPR mediated gene knockdowns with single-cell gene expression. Linear models are used to calculate the effect of the knockdown of a single gene on the expression of multiple genes.
A yeast two-hybrid screening (Y2H) tests a "bait" protein against many potential interacting proteins ("prey") to identify physical protein–protein interactions. This system is based on a transcription factor, originally GAL4, [9] whose separate DNA-binding and transcription activation domains are both required in order for the protein to cause transcription of a reporter gene. In a Y2H screen, the "bait" protein is fused to the binding domain of GAL4, and a library of potential "prey" (interacting) proteins is recombinantly expressed in a vector with the activation domain. In vivo interaction of bait and prey proteins in a yeast cell brings the activation and binding domains of GAL4 close enough together to result in expression of a reporter gene. It is also possible to systematically test a library of bait proteins against a library of prey proteins to identify all possible interactions in a cell.
Mass spectrometry (MS) can identify proteins and their relative levels, hence it can be used to study protein expression. When used in combination with affinity purification, mass spectrometry (AP/MS) can be used to study protein complexes, that is, which proteins interact with one another in complexes and in which ratios. In order to purify protein complexes, usually a "bait" protein is tagged with a specific protein or peptide that can be used to pull out the complex from a complex mix. The purification is usually done using an antibody or a compound that binds to the fusion part. The proteins are then digested into short peptide fragments and mass spectrometry is used to identify the proteins based on the mass-to-charge ratios of those fragments.
In deep mutational scanning every possible amino acid change in a given protein is first synthesized [10] . The activity of each of these protein variants is assayed in parallel using barcodes for each variant. [11] By comparing the activity to the wild-type protein, the effect of each mutation is identified. While it is possible to assay every possible single amino-acid change due to combinatorics two or more concurrent mutations are hard to test. Deep mutational scanning experiments have also been used to infer protein structure and protein-protein interactions
An important functional feature of genes is the phenotype caused by mutations. Mutants can be produced by random mutations or by directed mutagenesis, including site-directed mutagenesis, deleting complete genes, or other techniques.
Gene function can be investigated by systematically "knocking out" genes one by one. This is done by either deletion or disruption of function (such as by insertional mutagenesis) and the resulting organisms are screened for phenotypes that provide clues to the function of the disrupted gene. Knock-outs have been produced for whole genomes, i.e. by deleting all genes in a genome. For essential genes, this is not possible, so other techniques are used, e.g. deleting a gene while expressing the gene from a plasmid, using an inducible promoter, so that the level of gene product can be changed at will (and thus a "functional" deletion achieved).
Site-directed mutagenesis is used to mutate specific bases (and thus amino acids). This is critical to investigate the function of specific amino acids in a protein, e.g. in the active site of an enzyme.
RNA interference (RNAi) methods can be used to transiently silence or knockdown gene expression using ~20 base-pair double-stranded RNA typically delivered by transfection of synthetic ~20-mer short-interfering RNA molecules (siRNAs) or by virally encoded short-hairpin RNAs (shRNAs). RNAi screens, typically performed in cell culture-based assays or experimental organisms (such as C. elegans) can be used to systematically disrupt nearly every gene in a genome or subsets of genes (sub-genomes); possible functions of disrupted genes can be assigned based on observed phenotypes.
CRISPR-Cas9 has been used to delete genes in a multiplexed manner in cell-lines. Quantifying the amount of guide-RNAs for each gene before and after the experiment can point towards essential genes. If a guide-RNA disrupts an essential gene it will lead to the loss of that cell and hence there will be a depletion of that particular guide-RNA after the screen. In a recent CRISPR-cas9 experiment in mammalian cell-lines, around 2000 genes were found to be essential in multiple cell-lines. [13] [14] Some of these genes were essential in only one cell-line. Most of genes are part of multi-protein complexes. This approach can be used to identify synthetic lethality by using the appropriate genetic background. CRISPRi and CRISPRa enable loss-of-function and gain-of-function screens in a similar manner. CRISPRi identified ~2100 essential genes in the K562 cell-line. [15] [16] CRISPR deletion screens have also been used to identify potential regulatory elements of a gene. For example, a technique called ScanDel was published which attempted this approach. The authors deleted regions outside a gene of interest(HPRT1 involved in a Mendelian disorder) in an attempt to identify regulatory elements of this gene. [17] Gassperini et al. did not identify any distal regulatory elements for HPRT1 using this approach, however such approaches can be extended to other genes of interest.
Putative genes can be identified by scanning a genome for regions likely to encode proteins, based on characteristics such as long open reading frames, transcriptional initiation sequences, and polyadenylation sites. A sequence identified as a putative gene must be confirmed by further evidence, such as similarity to cDNA or EST sequences from the same organism, similarity of the predicted protein sequence to known proteins, association with promoter sequences, or evidence that mutating the sequence produces an observable phenotype.
The Rosetta stone approach is a computational method for de-novo protein function prediction. It is based on the hypothesis that some proteins involved in a given physiological process may exist as two separate genes in one organism and as a single gene in another. Genomes are scanned for sequences that are independent in one organism and in a single open reading frame in another. If two genes have fused, it is predicted that they have similar biological functions that make such co-regulation advantageous.
Because of the large quantity of data produced by these techniques and the desire to find biologically meaningful patterns, bioinformatics is crucial to analysis of functional genomics data. Examples of techniques in this class are data clustering or principal component analysis for unsupervised machine learning (class detection) as well as artificial neural networks or support vector machines for supervised machine learning (class prediction, classification). Functional enrichment analysis is used to determine the extent of over- or under-expression (positive- or negative- regulators in case of RNAi screens) of functional categories relative to a background sets. Gene ontology based enrichment analysis are provided by DAVID and gene set enrichment analysis (GSEA), [18] pathway based analysis by Ingenuity [19] and Pathway studio [20] and protein complex based analysis by COMPLEAT. [21]
New computational methods have been developed for understanding the results of a deep mutational scanning experiment. 'phydms' compares the result of a deep mutational scanning experiment to a phylogenetic tree. [22] This allows the user to infer if the selection process in nature applies similar constraints on a protein as the results of the deep mutational scan indicate. This may allow an experimenter to choose between different experimental conditions based on how well they reflect nature. Deep mutational scanning has also been used to infer protein-protein interactions. [23] The authors used a thermodynamic model to predict the effects of mutations in different parts of a dimer. Deep mutational structure can also be used to infer protein structure. Strong positive epistasis between two mutations in a deep mutational scan can be indicative of two parts of the protein that are close to each other in 3-D space. This information can then be used to infer protein structure. A proof of principle of this approach was shown by two groups using the protein GB1. [24] [25]
Results from MPRA experiments have required machine learning approaches to interpret the data. A gapped k-mer SVM model has been used to infer the kmers that are enriched within cis-regulatory sequences with high activity compared to sequences with lower activity. [26] These models provide high predictive power. Deep learning and random forest approaches have also been used to interpret the results of these high-dimensional experiments. [27] These models are beginning to help develop a better understanding of non-coding DNA function towards gene-regulation.
The ENCODE (Encyclopedia of DNA elements) project is an in-depth analysis of the human genome whose goal is to identify all the functional elements of genomic DNA, in both coding and non-coding regions. Important results include evidence from genomic tiling arrays that most nucleotides are transcribed as coding transcripts, non-coding RNAs, or random transcripts, the discovery of additional transcriptional regulatory sites, further elucidation of chromatin-modifying mechanisms.
The GTEx project is a human genetics project aimed at understanding the role of genetic variation in shaping variation in the transcriptome across tissues. The project has collected a variety of tissue samples (> 50 different tissues) from more than 700 post-mortem donors. This has resulted in the collection of >11,000 samples. GTEx has helped understand the tissue-sharing and tissue-specificity of eQTLs. [28] The genomic resource was developed to "enrich our understanding of how differences in our DNA sequence contribute to health and disease." [29]
The human genome is a complete set of nucleic acid sequences for humans, encoded as DNA within the 23 chromosome pairs in cell nuclei and in a small DNA molecule found within individual mitochondria. These are usually treated separately as the nuclear genome and the mitochondrial genome. Human genomes include both protein-coding DNA sequences and various types of DNA that does not encode proteins. The latter is a diverse category that includes DNA coding for non-translated RNA, such as that for ribosomal RNA, transfer RNA, ribozymes, small nuclear RNAs, and several types of regulatory RNAs. It also includes promoters and their associated gene-regulatory elements, DNA playing structural and replicatory roles, such as scaffolding regions, telomeres, centromeres, and origins of replication, plus large numbers of transposable elements, inserted viral DNA, non-functional pseudogenes and simple, highly repetitive sequences. Introns make up a large percentage of non-coding DNA. Some of this non-coding DNA is non-functional junk DNA, such as pseudogenes, but there is no firm consensus on the total amount of junk DNA.
Non-coding DNA (ncDNA) sequences are components of an organism's DNA that do not encode protein sequences. Some non-coding DNA is transcribed into functional non-coding RNA molecules. Other functional regions of the non-coding DNA fraction include regulatory sequences that control gene expression; scaffold attachment regions; origins of DNA replication; centromeres; and telomeres. Some non-coding regions appear to be mostly nonfunctional, such as introns, pseudogenes, intergenic DNA, and fragments of transposons and viruses. Regions that are completely nonfunctional are called junk DNA.
Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, proteins or non-coding RNA, and ultimately affect a phenotype. These products are often proteins, but in non-protein-coding genes such as transfer RNA (tRNA) and small nuclear RNA (snRNA), the product is a functional non-coding RNA. The process of gene expression is used by all known life—eukaryotes, prokaryotes, and utilized by viruses—to generate the macromolecular machinery for life.
Alternative splicing, or alternative RNA splicing, or differential splicing, is an alternative splicing process during gene expression that allows a single gene to code for multiple proteins. In this process, particular exons of a gene may be included within or excluded from the final, processed messenger RNA (mRNA) produced from that gene. This means the exons are joined in different combinations, leading to different (alternative) mRNA strands. Consequently, the proteins translated from alternatively spliced mRNAs usually contain differences in their amino acid sequence and, often, in their biological functions.
In bioinformatics, sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of a wide range of analytical methods to understand its features, function, structure, or evolution. It can be performed on the entire genome, transcriptome or proteome of an organism, and can also involve only selected segments or regions, like tandem repeats and transposable elements. Methodologies used include sequence alignment, searches against biological databases, and others.
Molecular genetics is a branch of biology that addresses how differences in the structures or expression of DNA molecules manifests as variation among organisms. Molecular genetics often applies an "investigative approach" to determine the structure and/or function of genes in an organism's genome using genetic screens.
In computational biology, gene prediction or gene finding refers to the process of identifying the regions of genomic DNA that encode genes. This includes protein-coding genes as well as RNA genes, but may also include prediction of other functional elements such as regulatory regions. Gene finding is one of the first and most important steps in understanding the genome of a species once it has been sequenced.
The Encyclopedia of DNA Elements (ENCODE) is a public research project which aims "to build a comprehensive parts list of functional elements in the human genome."
In biology, the word gene has two meanings. The Mendelian gene is a basic unit of heredity. The molecular gene is a sequence of nucleotides in DNA, that is transcribed to produce a functional RNA. There are two types of molecular genes: protein-coding genes and non-coding genes.
ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.
Epigenomics is the study of the complete set of epigenetic modifications on the genetic material of a cell, known as the epigenome. The field is analogous to genomics and proteomics, which are the study of the genome and proteome of a cell. Epigenetic modifications are reversible modifications on a cell's DNA or histones that affect gene expression without altering the DNA sequence. Epigenomic maintenance is a continuous process and plays an important role in stability of eukaryotic genomes by taking part in crucial biological mechanisms like DNA repair. Plant flavones are said to be inhibiting epigenomic marks that cause cancers. Two of the most characterized epigenetic modifications are DNA methylation and histone modification. Epigenetic modifications play an important role in gene expression and regulation, and are involved in numerous cellular processes such as in differentiation/development and tumorigenesis. The study of epigenetics on a global level has been made possible only recently through the adaptation of genomic high-throughput assays.
Cas9 is a 160 kilodalton protein which plays a vital role in the immunological defense of certain bacteria against DNA viruses and plasmids, and is heavily utilized in genetic engineering applications. Its main function is to cut DNA and thereby alter a cell's genome. The CRISPR-Cas9 genome editing technique was a significant contributor to the Nobel Prize in Chemistry in 2020 being awarded to Emmanuelle Charpentier and Jennifer Doudna.
STARR-seq is a method to assay enhancer activity for millions of candidates from arbitrary sources of DNA. It is used to identify the sequences that act as transcriptional enhancers in a direct, quantitative, and genome-wide manner.
CRISPR interference (CRISPRi) is a genetic perturbation technique that allows for sequence-specific repression of gene expression in prokaryotic and eukaryotic cells. It was first developed by Stanley Qi and colleagues in the laboratories of Wendell Lim, Adam Arkin, Jonathan Weissman, and Jennifer Doudna. Sequence-specific activation of gene expression refers to CRISPR activation (CRISPRa).
CRISPR-Display (CRISP-Disp) is a modification of the CRISPR/Cas9 system for genome editing. The CRISPR/Cas9 system uses a short guide RNA (sgRNA) sequence to direct a Streptococcus pyogenes Cas9 nuclease, acting as a programmable DNA binding protein, to cleave DNA at a site of interest.
FANTOM is an international research consortium first established in 2000 as part of the RIKEN research institute in Japan. The original meeting gathered international scientists from diverse backgrounds to help annotate the function of mouse cDNA clones generated by the Hayashizaki group. Since the initial FANTOM1 effort, the consortium has released multiple projects that look to understand the mechanisms governing the regulation of mammalian genomes. Their work has generated a large collection of shared data and helped advance biochemical and bioinformatic methodologies in genomics research.
H3K79me2 is an epigenetic modification to the DNA packaging protein Histone H3. It is a mark that indicates the di-methylation at the 79th lysine residue of the histone H3 protein. H3K79me2 is detected in the transcribed regions of active genes.
H3Y41P is an epigenetic modification to the DNA packaging protein histone H3. It is a mark that indicates the phosphorylation the 41st tyrosine residue of the histone H3 protein.
H3T45P is an epigenetic modification to the DNA packaging protein histone H3. It is a mark that indicates the phosphorylation the 45th threonine residue of the histone H3 protein.
H3T6P is an epigenetic modification to the DNA packaging protein histone H3. It is a mark that indicates the phosphorylation of the 6th threonine residue of the histone H3 protein.