Gene expression profiling

Last updated July 25, 2024

Heat maps of gene expression values show how experimental conditions influenced production (expression) of mRNA for a set of genes. Green indicates reduced expression. Cluster analysis has placed a group of down regulated genes in the upper left corner.

In the field of molecular biology, gene expression profiling is the measurement of the activity (the expression) of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish cells that are actively dividing from those that are not, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.

Several transcriptomics technologies can be used to generate the data needed for analysis. DNA microarrays^[1] measure the relative activity of previously identified target genes. Sequence-based techniques, like RNA-Seq, provide information on the sequences of genes in addition to their expression level.

Background

Expression profiling is a logical next step after sequencing a genome: the sequence tells us what the cell could possibly do, while the expression profile tells us what it is actually doing at a point in time. Genes contain the instructions for making messenger RNA (mRNA), but at any moment each cell makes mRNA from only a fraction of the genes it carries. If a gene is used to produce mRNA, it is considered "on", otherwise "off". Many factors determine whether a gene is on or off, such as the time of day, whether or not the cell is actively dividing, its local environment, and chemical signals from other cells. For instance, skin cells, liver cells and nerve cells turn on (express) somewhat different genes and that is in large part what makes them different. Therefore, an expression profile allows one to deduce a cell's type, state, environment, and so forth.

Expression profiling experiments often involve measuring the relative amount of mRNA expressed in two or more experimental conditions. This is because altered levels of a specific sequence of mRNA suggest a changed need for the protein coded by the mRNA, perhaps indicating a homeostatic response or a pathological condition. For example, higher levels of mRNA coding for alcohol dehydrogenase suggest that the cells or tissues under study are responding to increased levels of ethanol in their environment. Similarly, if breast cancer cells express higher levels of mRNA associated with a particular transmembrane receptor than normal cells do, it might be that this receptor plays a role in breast cancer. A drug that interferes with this receptor may prevent or treat breast cancer. In developing a drug, one may perform gene expression profiling experiments to help assess the drug's toxicity, perhaps by looking for changing levels in the expression of cytochrome P450 genes, which may be a biomarker of drug metabolism.^[2] Gene expression profiling may become an important diagnostic test.^[3]^[4]

Comparison to proteomics

The human genome contains on the order of 20,000 genes which work in concert to produce roughly 1,000,000 distinct proteins. This is due to alternative splicing, and also because cells make important changes to proteins through posttranslational modification after they first construct them, so a given gene serves as the basis for many possible versions of a particular protein. In any case, a single mass spectrometry experiment can identify about 2,000 proteins^[5] or 0.2% of the total. While knowledge of the precise proteins a cell makes (proteomics) is more relevant than knowing how much messenger RNA is made from each gene, gene expression profiling provides the most global picture possible in a single experiment. However, proteomics methodology is improving. In other species, such as yeast, it is possible to identify over 4,000 proteins in just over one hour.^[6]

Use in hypothesis generation and testing

Sometimes, a scientist already has an idea of what is going on, a hypothesis, and he or she performs an expression profiling experiment with the idea of potentially disproving this hypothesis. In other words, the scientist is making a specific prediction about levels of expression that could turn out to be false.

More commonly, expression profiling takes place before enough is known about how genes interact with experimental conditions for a testable hypothesis to exist. With no hypothesis, there is nothing to disprove, but expression profiling can help to identify a candidate hypothesis for future experiments. Most early expression profiling experiments, and many current ones, have this form,^[7] which is known as class discovery. A popular approach to class discovery involves grouping similar genes or samples together using one of the many existing clustering methods, such as the traditional k-means or hierarchical clustering, or the more recent MCL.^[8] Apart from selecting a clustering algorithm, the user usually has to choose an appropriate proximity measure (distance or similarity) between data objects.^[9] The figure above represents the output of a two-dimensional cluster analysis, in which similar samples (rows, above) and similar gene probes (columns) were organized so that they would lie close together. The simplest form of class discovery would be to list all the genes that changed by more than a certain amount between two experimental conditions.
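The simplest form of class discovery described above, listing the genes that changed by more than a certain amount, can be sketched in a few lines of Python. This is only an illustration: the gene names, expression values, function name and the twofold default are assumptions made for the example, not data or conventions from any study.

```python
import math

def fold_change_hits(cond_a, cond_b, min_fold=2.0):
    """Return {gene: log2 ratio} for genes whose mean expression
    differs by at least min_fold between two conditions."""
    threshold = math.log2(min_fold)
    hits = {}
    for gene in cond_a:
        mean_a = sum(cond_a[gene]) / len(cond_a[gene])
        mean_b = sum(cond_b[gene]) / len(cond_b[gene])
        log2_ratio = math.log2(mean_b / mean_a)
        # Keep both increases and decreases that reach the cutoff
        if abs(log2_ratio) >= threshold:
            hits[gene] = log2_ratio
    return hits

# Hypothetical replicate measurements (arbitrary units)
control = {
    "geneA": [100.0, 110.0, 90.0],
    "geneB": [50.0, 55.0, 45.0],
    "geneC": [200.0, 210.0, 190.0],
}
treated = {
    "geneA": [400.0, 410.0, 390.0],
    "geneB": [52.0, 50.0, 48.0],
    "geneC": [50.0, 55.0, 45.0],
}

hits = fold_change_hits(control, treated)
for gene, log2_ratio in sorted(hits.items()):
    print(gene, round(log2_ratio, 2))
```

Working on the log2 scale treats a fourfold increase and a fourfold decrease symmetrically, which is why the cutoff is applied to the absolute log ratio.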

Class prediction is more difficult than class discovery, but it allows one to answer questions of direct clinical significance such as, given this profile, what is the probability that this patient will respond to this drug? This requires many examples of profiles that responded and did not respond, as well as cross-validation techniques to discriminate between them.
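As a rough illustration of class prediction with cross-validation, the sketch below trains a nearest-centroid classifier on labelled profiles and estimates its accuracy by leaving one profile out at a time. The classifier choice, the two-gene profiles and the labels are all assumptions made for the example; real studies use many more genes and samples and often more elaborate models.

```python
def nearest_centroid_predict(train_profiles, train_labels, profile):
    """Predict the label whose mean training profile is closest."""
    centroids = {}
    for label in set(train_labels):
        members = [p for p, l in zip(train_profiles, train_labels) if l == label]
        # Average each gene (column) over the members of this class
        centroids[label] = [sum(col) / len(members) for col in zip(*members)]

    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda l: squared_distance(centroids[l], profile))

def leave_one_out_accuracy(profiles, labels):
    """Hold out each profile in turn, train on the rest, and score it."""
    correct = 0
    for i in range(len(profiles)):
        train_p = profiles[:i] + profiles[i + 1:]
        train_l = labels[:i] + labels[i + 1:]
        if nearest_centroid_predict(train_p, train_l, profiles[i]) == labels[i]:
            correct += 1
    return correct / len(profiles)

# Hypothetical two-gene expression profiles for six patients
profiles = [[8.0, 2.0], [7.5, 2.5], [8.2, 1.8],
            [3.0, 6.0], [2.5, 6.5], [3.2, 5.8]]
labels = ["responder"] * 3 + ["non-responder"] * 3

accuracy = leave_one_out_accuracy(profiles, labels)
print(accuracy)
```

Because each held-out profile plays no part in building the centroids that classify it, the accuracy estimate is not inflated by testing on the training data.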

Limitations

In general, expression profiling studies report those genes that showed statistically significant differences under changed experimental conditions. This is typically a small fraction of the genome for several reasons. First, different cells and tissues express a subset of genes as a direct consequence of cellular differentiation so many genes are turned off. Second, many of the genes code for proteins that are required for survival in very specific amounts so many genes do not change. Third, cells use many other mechanisms to regulate proteins in addition to altering the amount of mRNA, so these genes may stay consistently expressed even when protein concentrations are rising and falling. Fourth, financial constraints limit expression profiling experiments to a small number of observations of the same gene under identical conditions, reducing the statistical power of the experiment, making it impossible for the experiment to identify important but subtle changes. Finally, it takes a great amount of effort to discuss the biological significance of each regulated gene, so scientists often limit their discussion to a subset. Newer microarray analysis techniques automate certain aspects of attaching biological significance to expression profiling results, but this remains a very difficult problem.

The relatively short length of gene lists published from expression profiling experiments limits the extent to which experiments performed in different laboratories appear to agree. Placing expression profiling results in a publicly accessible microarray database makes it possible for researchers to assess expression patterns beyond the scope of published results, perhaps identifying similarity with their own work.

Validation of high throughput measurements

Both DNA microarrays and quantitative PCR exploit the preferential binding or "base pairing" of complementary nucleic acid sequences, and both are used in gene expression profiling, often in a serial fashion. While high throughput DNA microarrays lack the quantitative accuracy of qPCR, it takes about the same time to measure the gene expression of a few dozen genes via qPCR as it would to measure an entire genome using DNA microarrays. So it often makes sense to perform semi-quantitative DNA microarray analysis experiments to identify candidate genes, then perform qPCR on some of the most interesting candidate genes to validate the microarray results. Other experiments, such as a Western blot of some of the protein products of differentially expressed genes, make conclusions based on the expression profile more persuasive, since the mRNA levels do not necessarily correlate to the amount of expressed protein.

Statistical analysis

Data analysis of microarrays has become an area of intense research.^[10] Simply stating that a group of genes were regulated by at least twofold, once a common practice, lacks a solid statistical footing. With five or fewer replicates in each group, typical for microarrays, a single outlier observation can create an apparent difference greater than two-fold. In addition, arbitrarily setting the bar at two-fold is not biologically sound, as it eliminates from consideration many genes with obvious biological significance.
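The effect of a single outlier on a fold-change cutoff can be seen with a small numerical sketch. The replicate values below are invented for illustration, and the median is used only as one convenient outlier-resistant comparison, not as a recommended analysis method.

```python
from statistics import mean, median

# Hypothetical replicate measurements of one gene (arbitrary units)
control = [100.0, 105.0, 95.0]
treated = [100.0, 105.0, 450.0]   # the last value is a single outlier

fold_by_mean = mean(treated) / mean(control)
fold_by_median = median(treated) / median(control)

print(round(fold_by_mean, 2))    # driven above twofold by the outlier
print(round(fold_by_median, 2))  # close to 1: the typical value barely moved
```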

Rather than identify differentially expressed genes using a fold change cutoff, one can use a variety of statistical tests or omnibus tests such as ANOVA, all of which consider both fold change and variability to create a p-value, an estimate of how often we would observe the data by chance alone. Applying p-values to microarrays is complicated by the large number of multiple comparisons (genes) involved. For example, a p-value of 0.05 is typically thought to indicate significance, since it estimates a 5% probability of observing the data by chance. But with 10,000 genes on a microarray, 500 genes would be identified as significant at p < 0.05 even if there were no difference between the experimental groups. One obvious solution is to consider significant only those genes meeting a much more stringent p-value criterion, e.g., one could perform a Bonferroni correction on the p-values, or use a false discovery rate calculation to adjust p-values in proportion to the number of parallel tests involved. Unfortunately, these approaches may reduce the number of significant genes to zero, even when genes are in fact differentially expressed. Current statistics such as Rank products aim to strike a balance between false discovery of genes due to chance variation and non-discovery of differentially expressed genes. Commonly cited methods include the Significance Analysis of Microarrays (SAM),^[11] and a wide variety of methods are available from Bioconductor and in analysis packages from bioinformatics companies.
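The arithmetic behind the multiple-comparisons problem, and the two corrections named above, can be sketched as follows. The Benjamini-Hochberg procedure is used here as one common false discovery rate calculation (the text does not say which one is meant), and the five p-values are invented for illustration.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, one way to control the FDR."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-values decrease
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# With no real differences, 5% of 10,000 genes still pass p < 0.05
expected_false_positives = 10_000 * 0.05
print(expected_false_positives)

# Hypothetical raw p-values for five genes
pvals = [0.001, 0.008, 0.012, 0.041, 0.20]
n_bonferroni = sum(p < 0.05 for p in bonferroni(pvals))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(pvals))
print(n_bonferroni, n_bh)
```

In this toy case the Bonferroni correction keeps fewer genes than the false discovery rate adjustment, reflecting the trade-off between false discovery and non-discovery discussed above.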

Selecting a different test usually identifies a different list of significant genes^[12] since each test operates under a specific set of assumptions, and places a different emphasis on certain features in the data. Many tests begin with the assumption of a normal distribution in the data, because that seems like a sensible starting point and often produces results that appear more significant. Some tests consider the joint distribution of all gene observations to estimate general variability in measurements,^[13] while others look at each gene in isolation. Many modern microarray analysis techniques involve bootstrapping, machine learning or Monte Carlo methods.^[14]

As the number of replicate measurements in a microarray experiment increases, various statistical approaches yield increasingly similar results, but lack of concordance between different statistical methods makes array results appear less trustworthy. The MAQC Project^[15] makes recommendations to guide researchers in selecting more standard methods (e.g. using p-value and fold-change together for selecting the differentially expressed genes) so that experiments performed in different laboratories will agree better.

In contrast to the analysis of individual differentially expressed genes, another type of analysis focuses on differential expression or perturbation of pre-defined gene sets and is called gene set analysis.^[16]^[17] Gene set analysis has demonstrated several major advantages over individual gene differential expression analysis.^[16]^[17] Gene sets are groups of genes that are functionally related according to current knowledge. Therefore, gene set analysis is considered a knowledge-based analysis approach.^[16] Commonly used gene sets include those derived from KEGG pathways, Gene Ontology terms, and gene groups that share some other functional annotation, such as common transcriptional regulators. Representative gene set analysis methods include Gene Set Enrichment Analysis (GSEA),^[16] which estimates significance of gene sets based on permutation of sample labels, and Generally Applicable Gene-set Enrichment (GAGE),^[17] which tests the significance of gene sets based on permutation of gene labels or a parametric distribution.
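The idea of permuting sample labels can be illustrated with a toy example. The sketch below is not the GSEA statistic: it assumes each sample has already been reduced to a single hypothetical gene-set score, and it enumerates every way of relabelling the samples to see how often a group difference at least as large as the observed one arises.

```python
from itertools import combinations

def label_permutation_pvalue(values, n_group_a):
    """Two-sided p-value for a difference in group means, obtained by
    enumerating every assignment of n_group_a samples to group A.
    The first n_group_a entries of values are the real group A."""
    n = len(values)

    def mean_difference(idx_a):
        a = [values[i] for i in idx_a]
        b = [values[i] for i in range(n) if i not in idx_a]
        return sum(a) / len(a) - sum(b) / len(b)

    observed = abs(mean_difference(tuple(range(n_group_a))))
    splits = list(combinations(range(n), n_group_a))
    # Small tolerance so floating-point rounding does not hide exact ties
    extreme = sum(1 for s in splits if abs(mean_difference(s)) >= observed - 1e-12)
    return extreme / len(splits)

# Hypothetical gene-set scores: four control samples, then four treated
set_scores = [5.1, 5.3, 5.0, 5.2, 6.4, 6.6, 6.5, 6.3]
p_value = label_permutation_pvalue(set_scores, 4)
print(p_value)
```

With four samples per group there are only 70 possible relabellings, so the smallest attainable two-sided p-value is 2/70; this is one reason small experiments have limited statistical power.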

Gene annotation

While the statistics may identify which gene products change under experimental conditions, making biological sense of expression profiling rests on knowing which protein each gene product makes and what function this protein performs. Gene annotation provides functional and other information, for example the location of each gene within a particular chromosome. Some functional annotations are more reliable than others; some are absent. Gene annotation databases change regularly, and various databases refer to the same protein by different names, reflecting a changing understanding of protein function. Use of standardized gene nomenclature helps address the naming aspect of the problem, but exact matching of transcripts to genes^[18]^[19] remains an important consideration.

Categorizing regulated genes

Having identified some set of regulated genes, the next step in expression profiling involves looking for patterns within the regulated set. Do the proteins made from these genes perform similar functions? Are they chemically similar? Do they reside in similar parts of the cell? Gene ontology analysis provides a standard way to define these relationships. Gene ontologies start with very broad categories, e.g., "metabolic process" and break them down into smaller categories, e.g., "carbohydrate metabolic process" and finally into quite restrictive categories like "inositol and derivative phosphorylation".

Genes have other attributes besides biological function, chemical properties and cellular location. One can compose sets of genes based on proximity to other genes, association with a disease, and relationships with drugs or toxins. The Molecular Signatures Database^[20] and the Comparative Toxicogenomics Database^[21] are examples of resources to categorize genes in numerous ways.

Finding patterns among regulated genes

Ingenuity Gene Network Diagram which dynamically assembles genes with known relationships. Green indicates reduced expression, red indicates increased expression. The algorithm has included unregulated genes, white, to improve connectivity.

Once regulated genes are categorized in terms of what they are and what they do, important relationships between genes may emerge.^[23] For example, we might see evidence that a certain gene creates a protein to make an enzyme that activates a protein to turn on a second gene on our list. This second gene may be a transcription factor that regulates yet another gene from our list. Observing these links we may begin to suspect that they represent much more than chance associations in the results, and that they are all on our list because of an underlying biological process. On the other hand, it could be that if one selected genes at random, one might find many that seem to have something in common. In this sense, we need rigorous statistical procedures to test whether the emerging biological themes are significant or not. That is where gene set analysis^[16]^[17] comes in.

Cause and effect relationships

Fairly straightforward statistics provide estimates of whether associations between genes on lists are greater than what one would expect by chance. These statistics are interesting, even if they represent a substantial oversimplification of what is really going on. Here is an example. Suppose there are 10,000 genes in an experiment, only 50 (0.5%) of which play a known role in making cholesterol. The experiment identifies 200 regulated genes. Of those, 40 (20%) turn out to be on a list of cholesterol genes as well. Based on the overall prevalence of the cholesterol genes (0.5%) one expects an average of 1 cholesterol gene for every 200 regulated genes, that is, 0.005 times 200. This expectation is an average, so one expects to see more than one some of the time. The question becomes how often we would see 40 instead of 1 due to pure chance.

According to the hypergeometric distribution, one would expect to try about 10^57 times (10 followed by 56 zeroes) before picking 39 or more of the cholesterol genes from a pool of 10,000 by drawing 200 genes at random. Whether one pays much attention to how infinitesimally small the probability of observing this by chance is, one would conclude that the regulated gene list is enriched^[24] in genes with a known cholesterol association.
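The hypergeometric calculation for the cholesterol example can be reproduced with exact integer arithmetic. The sketch below uses only the Python standard library; the function name is an assumption made for the example. It evaluates the tail probability at both thresholds that appear in the text (39 or more, and the 40 actually observed).

```python
from fractions import Fraction
from math import comb

def hypergeom_tail(population, successes, draws, at_least):
    """Exact probability of drawing at_least or more 'success' genes
    when draws genes are picked at random without replacement."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(at_least, upper + 1)
    )
    return Fraction(favourable, total)

# 10,000 genes, 50 cholesterol genes, 200 regulated genes
expected = 200 * 50 / 10_000
p_39_or_more = float(hypergeom_tail(10_000, 50, 200, 39))
p_40_or_more = float(hypergeom_tail(10_000, 50, 200, 40))

print(expected)       # average number of cholesterol genes expected by chance
print(p_39_or_more)
print(p_40_or_more)
```

The reciprocal of a tail probability gives the approximate number of random draws one would expect to need before seeing an overlap that large.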

One might further hypothesize that the experimental treatment regulates cholesterol, because the treatment seems to selectively regulate genes associated with cholesterol. While this may be true, there are a number of reasons why making this a firm conclusion based on enrichment alone represents an unwarranted leap of faith. One previously mentioned issue has to do with the observation that gene regulation may have no direct impact on protein regulation: even if the proteins coded for by these genes do nothing other than make cholesterol, showing that their mRNA is altered does not directly tell us what is happening at the protein level. It is quite possible that the amount of these cholesterol-related proteins remains constant under the experimental conditions. Second, even if protein levels do change, perhaps there is always enough of them around to make cholesterol as fast as it can be possibly made, that is, another protein, not on our list, is the rate determining step in the process of making cholesterol. Finally, proteins typically play many roles, so these genes may be regulated not because of their shared association with making cholesterol but because of a shared role in a completely independent process.

Bearing the foregoing caveats in mind, while gene profiles do not in themselves prove causal relationships between treatments and biological effects, they do offer unique biological insights that would often be very difficult to arrive at in other ways.

Using patterns to find regulated genes

As described above, one can identify significantly regulated genes first and then find patterns by comparing the list of significant genes to sets of genes known to share certain associations. One can also work the problem in reverse order. Here is a very simple example. Suppose there are 40 genes associated with a known process, for example, a predisposition to diabetes. Looking at two groups of expression profiles, one for mice fed a high carbohydrate diet and one for mice fed a low carbohydrate diet, one observes that all 40 diabetes genes are expressed at a higher level in the high carbohydrate group than the low carbohydrate group. Regardless of whether any of these genes would have made it to a list of significantly altered genes, observing all 40 up, and none down appears unlikely to be the result of pure chance: flipping 40 heads in a row is predicted to occur about one time in a trillion attempts using a fair coin.
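The coin-flip argument is a sign test, and its arithmetic is easy to check. The sketch below assumes, as the text does, that under pure chance each gene is equally likely to be higher or lower and that genes behave independently; the function name is hypothetical.

```python
from math import comb

def sign_test_tail(n_genes, n_up):
    """Probability that n_up or more of n_genes are higher by chance,
    if each gene independently goes up or down with probability 1/2."""
    favourable = sum(comb(n_genes, k) for k in range(n_up, n_genes + 1))
    return favourable / 2 ** n_genes

p_all_40 = sign_test_tail(40, 40)   # all 40 genes higher
p_35_of_40 = sign_test_tail(40, 35) # a less extreme, still lopsided result

print(p_all_40)
print(1 / p_all_40)  # number of attempts per expected occurrence
print(p_35_of_40)
```

Real genes in a shared process are rarely independent, so a probability computed this way is best read as a rough guide rather than an exact significance level.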

For a type of cell, the group of genes whose combined expression pattern is uniquely characteristic of a given condition constitutes the gene signature of this condition. Ideally, the gene signature can be used to select a group of patients at a specific state of a disease with accuracy that facilitates selection of treatments.^[25]^[26] Gene Set Enrichment Analysis (GSEA)^[16] and similar methods^[17] take advantage of this kind of logic but use more sophisticated statistics, because component genes in real processes display more complex behavior than simply moving up or down as a group, and the amount the genes move up and down is meaningful, not just the direction. In any case, these statistics measure how different the behavior of some small set of genes is compared to genes not in that small set.

GSEA uses a Kolmogorov-Smirnov-style statistic to see whether any previously defined gene sets exhibited unusual behavior in the current expression profile. This leads to a multiple hypothesis testing challenge, but reasonable methods exist to address it.^[27]
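A Kolmogorov-Smirnov-style running-sum statistic can be sketched as follows. This is a simplified, unweighted illustration of the general idea, not the GSEA implementation: the published method weights steps by each gene's correlation with the phenotype and assesses significance by permutation. The gene names and sets are hypothetical.

```python
def running_sum_score(ranked_genes, gene_set):
    """Walk down a ranked gene list, stepping up at genes in the set and
    down at genes outside it; return the largest deviation from zero."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step sizes are chosen so the walk returns to zero at the end
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Genes ordered from most increased to most decreased (hypothetical)
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]

score_top = running_sum_score(ranked, {"g1", "g2", "g3"})     # clustered at the top
score_spread = running_sum_score(ranked, {"g1", "g4", "g8"})  # scattered through the list

print(score_top, score_spread)
```

A set whose members cluster at one end of the ranking produces a large score, while a set scattered through the list keeps the running sum near zero.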

Conclusions

Expression profiling provides new information about what genes do under various conditions. Overall, microarray technology produces reliable expression profiles.^[28] From this information one can generate new hypotheses about biology or test existing ones. However, the size and complexity of these experiments often result in a wide variety of possible interpretations. In many cases, analyzing expression profiling results takes far more effort than performing the initial experiments.

Most researchers use multiple statistical methods and exploratory data analysis before publishing their expression profiling results, coordinating their efforts with a bioinformatician or other expert in DNA microarrays. Good experimental design, adequate biological replication and follow up experiments play key roles in successful expression profiling experiments.

Related Research Articles

Proteomics is the large-scale study of proteins. Proteins are vital macromolecules of all living organisms, with many functions such as the formation of structural fibers of muscle tissue, enzymatic digestion of food, or synthesis and replication of DNA. In addition, other kinds of proteins include antibodies that protect an organism from infection, and hormones that send important signals throughout the body.

<span class="mw-page-title-main">Gene expression</span> Conversion of a genes sequence into a mature gene product or products

Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, proteins or non-coding RNA, and ultimately affect a phenotype. These products are often proteins, but in non-protein-coding genes such as transfer RNA (tRNA) and small nuclear RNA (snRNA), the product is a functional non-coding RNA. The process of gene expression is used by all known life—eukaryotes, prokaryotes, and utilized by viruses—to generate the macromolecular machinery for life.

Alternative splicing, or alternative RNA splicing, or differential splicing, is an alternative splicing process during gene expression that allows a single gene to produce different splice variants. For example, some exons of a gene may be included within or excluded from the final RNA product of the gene. This means the exons are joined in different combinations, leading to different splice variants. In the case of protein-coding genes, the proteins translated from these splice variants may contain differences in their amino acid sequence and in their biological functions.

A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles of a specific DNA sequence, known as probes. These can be a short section of a gene or other DNA element that are used to hybridize a cDNA or cRNA sample under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target. The original nucleic acid arrays were macro arrays approximately 9 cm × 12 cm and the first computerized image based analysis was published in 1981. It was invented by Patrick O. Brown. An example of its application is in SNPs arrays for polymorphisms in cardiovascular diseases, cancer, pathogens and GWAS analysis. It is also used for the identification of structural variations and the measurement of gene expression.

Regulation of gene expression, or gene regulation, includes a wide range of mechanisms that are used by cells to increase or decrease the production of specific gene products. Sophisticated programs of gene expression are widely observed in biology, for example to trigger developmental pathways, respond to environmental stimuli, or adapt to new food sources. Virtually any step of gene expression can be modulated, from transcriptional initiation, to RNA processing, and to the post-translational modification of a protein. Often, one gene regulator controls another, and so on, in a gene regulatory network.

Functional genomics is a field of molecular biology that attempts to describe gene functions and interactions. Functional genomics make use of the vast data generated by genomic and transcriptomic projects. Functional genomics focuses on the dynamic aspects such as gene transcription, translation, regulation of gene expression and protein–protein interactions, as opposed to the static aspects of the genomic information such as DNA sequence or structures. A key characteristic of functional genomics studies is their genome-wide approach to these questions, generally involving high-throughput methods rather than a more traditional "candidate-gene" approach.

The transcriptome is the set of all RNA transcripts, including coding and non-coding, in an individual or a population of cells. The term can also sometimes be used to refer to all RNAs, or just mRNA, depending on the particular experiment. The term transcriptome is a portmanteau of the words transcript and genome; it is associated with the process of transcript production during the biological process of transcription.

Regulome refers to the whole set of regulatory components in a cell. Those components can be regulatory elements, genes, mRNAs, proteins, and metabolites. The description includes the interplay of regulatory effects between these components, and their dependence on variables such as subcellular localization, tissue, developmental stage, and pathological state.

Genetic analysis is the overall process of studying and researching in fields of science that involve genetics and molecular biology. There are a number of applications that are developed from this research, and these are also considered parts of the process. The base system of analysis revolves around general genetics. Basic studies include identification of genes and inherited disorders. This research has been conducted for centuries on both a large-scale physical observation basis and on a more microscopic scale. Genetic analysis can be used generally to describe methods both used in and resulting from the sciences of genetics and molecular biology, or to applications resulting from this research.

<span class="mw-page-title-main">Microarray analysis techniques</span>

Microarray analysis techniques are used in interpreting the data generated from experiments on DNA, RNA, and protein microarrays, which allow researchers to investigate the expression state of a large number of genes – in many cases, an organism's entire genome – in a single experiment. Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult – if not impossible – to analyze without the help of computer programs.

ChIP-sequencing, also known as ChIP-seq, is a method used to analyze protein interactions with DNA. ChIP-seq combines chromatin immunoprecipitation (ChIP) with massively parallel DNA sequencing to identify the binding sites of DNA-associated proteins. It can be used to map global binding sites precisely for any protein of interest. Previously, ChIP-on-chip was the most common technique utilized to study these protein–DNA relations.

RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.

Escherichia coli contains a number of small RNAs located in intergenic regions of its genome. The presence of at least 55 of these has been verified experimentally. 275 potential sRNA-encoding loci were identified computationally using the QRNA program. These loci will include false positives, so the number of sRNA genes in E. coli is likely to be less than 275. A computational screen based on promoter sequences recognised by the sigma factor sigma 70 and on Rho-independent terminators predicted 24 putative sRNA genes, 14 of these were verified experimentally by northern blotting. The experimentally verified sRNAs included the well characterised sRNAs RprA and RyhB. Many of the sRNAs identified in this screen, including RprA, RyhB, SraB and SraL, are only expressed in the stationary phase of bacterial cell growth. A screen for sRNA genes based on homology to Salmonella and Klebsiella identified 59 candidate sRNA genes. From this set of candidate genes, microarray analysis and northern blotting confirmed the existence of 17 previously undescribed sRNAs, many of which bind to the chaperone protein Hfq and regulate the translation of RpoS. UptR sRNA transcribed from the uptR gene is implicated in suppressing extracytoplasmic toxicity by reducing the amount of membrane-bound toxic hybrid protein.

Immunomics is the study of immune system regulation and response to pathogens using genome-wide approaches. With the rise of genomic and proteomic technologies, scientists have been able to visualize biological networks and infer interrelationships between genes and/or proteins; recently, these technologies have been used to help better understand how the immune system functions and how it is regulated. Two thirds of the genome is active in one or more immune cell types and less than 1% of genes are uniquely expressed in a given type of cell. Therefore, it is critical that the expression patterns of these immune cell types be deciphered in the context of a network, and not as an individual, so that their roles be correctly characterized and related to one another. Defects of the immune system such as autoimmune diseases, immunodeficiency, and malignancies can benefit from genomic insights on pathological processes. For example, analyzing the systematic variation of gene expression can relate these patterns with specific diseases and gene networks important for immune functions.

The Cancer Genome Anatomy Project (CGAP), created by the National Cancer Institute (NCI) in 1997 and introduced by Al Gore, is an online database on normal, pre-cancerous and cancerous genomes. It also provides tools for viewing and analysis of the data, allowing for identification of genes involved in various aspects of tumor progression. The goal of CGAP is to characterize cancer at a molecular level by providing a platform with readily accessible updated data and a set of tools such that researchers can easily relate their findings to existing knowledge. There is also a focus on development of software tools that improve the usage of large and complex datasets. The project is directed by Daniela S. Gerhard, and includes sub-projects or initiatives, with notable ones including the Cancer Chromosome Aberration Project (CCAP) and the Genetic Annotation Initiative (GAI). CGAP contributes to many databases and organisations such as the NCBI contribute to CGAP's databases.

Gene set enrichment analysis (GSEA) (also called functional enrichment analysis or pathway enrichment analysis) is a method to identify classes of genes or proteins that are over-represented in a large set of genes or proteins, and may have an association with different phenotypes (e.g. different organism growth patterns or diseases). The method uses statistical approaches to identify significantly enriched or depleted groups of genes. Transcriptomics technologies and proteomics results often identify thousands of genes, which are used for the analysis.
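The idea of testing whether a class of genes is over-represented in a gene list can be illustrated with a one-sided hypergeometric test. This is a minimal sketch of the simplest over-representation test, not the running-sum statistic of the published GSEA method; the function name, the gene identifiers, and the gene set `cell_cycle` are made up for illustration.

```python
from math import comb


def overrepresentation_p(universe, gene_set, altered):
    """Return (overlap, p) where p = P(X >= overlap) under a hypergeometric null.

    universe: all genes measured in the experiment
    gene_set: the class of genes being tested (hypothetical annotation)
    altered:  the genes flagged as changed in the experiment
    """
    universe = set(universe)
    gene_set = set(gene_set) & universe   # ignore genes that were never measured
    altered = set(altered) & universe
    n_universe, n_set, n_altered = len(universe), len(gene_set), len(altered)
    overlap = len(gene_set & altered)

    # Sum the probabilities of seeing `overlap` or more set members among the
    # altered genes if the altered genes had been drawn at random.
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_altered - k)
        for k in range(overlap, min(n_set, n_altered) + 1)
    )
    return overlap, tail / comb(n_universe, n_altered)


# Toy data: 20 measured genes, a 5-gene set, 4 altered genes, 3 of them in the set.
universe = [f"g{i}" for i in range(20)]
cell_cycle = {"g0", "g1", "g2", "g3", "g4"}
altered = {"g0", "g1", "g2", "g10"}

overlap, p = overrepresentation_p(universe, cell_cycle, altered)
print(f"overlap={overlap}, p={p:.4f}")
```

In practice many gene sets are tested at once, so the resulting p-values would normally be corrected for multiple testing before any set is called enriched.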

The Expression Atlas is a database maintained by the European Bioinformatics Institute that provides information on gene expression patterns from RNA-Seq and microarray studies, and on protein expression from proteomics studies. The Expression Atlas allows searches by gene, splice variant, protein attribute, disease, treatment or organism part. Individual genes or gene sets can be searched for. All datasets in the Expression Atlas have their metadata manually curated and their data analysed through standardised analysis pipelines. There are two components to the Expression Atlas: the Baseline Atlas and the Differential Atlas.

Pathway analysis

Pathway is the term from molecular biology for a curated schematic representation of a well-characterized segment of the molecular physiological machinery, such as a metabolic pathway describing an enzymatic process within a cell or tissue, or a signaling pathway model representing a regulatory process that might, in turn, enable a metabolic or another regulatory process downstream. A typical pathway model starts with an extracellular signaling molecule that activates a specific receptor, thus triggering a chain of molecular interactions. A pathway is most often represented as a relatively small graph with gene, protein, and/or small molecule nodes connected by edges of known functional relations. While a simpler pathway might appear as a chain, complex pathway topologies with loops and alternative routes are much more common. Computational analyses employ special formats of pathway representation. In the simplest form, however, a pathway might be represented as a list of member molecules with order and relations unspecified. Such a representation, generally called a Functional Gene Set (FGS), can also refer to other functionally characterised groups such as protein families, Gene Ontology (GO) terms and Disease Ontology (DO) terms.

In bioinformatics, methods of pathway analysis might be used to identify key genes or proteins within a previously known pathway in relation to a particular experiment or pathological condition, or to build a pathway de novo from proteins that have been identified as key affected elements. By examining changes in, for example, gene expression in a pathway, its biological activity can be explored. Most frequently, however, pathway analysis refers to a method of initial characterization and interpretation of an experimental condition that was studied with omics tools or a genome-wide association study. Such studies might identify long lists of altered genes. A visual inspection is then challenging and the information is hard to summarize, since the altered genes map to a broad range of pathways, processes, and molecular functions. In such situations, the most productive way of exploring the list is to identify enrichment of specific FGSs in it. The general approach of enrichment analyses is to identify FGSs whose members were most frequently or most strongly altered in the given condition, in comparison to a gene set sampled by chance. In other words, enrichment can map canonical prior knowledge, structured in the form of FGSs, to the condition represented by altered genes.
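The comparison "to a gene set sampled by chance" can be made concrete with a small resampling sketch: count how many members of an FGS are altered, then see how often a random gene set of the same size does at least as well. The function name, the gene identifiers, and the two example sets below are hypothetical, and real tools typically use more refined statistics than this simple count.

```python
import random


def empirical_enrichment_p(universe, fgs, altered, n_samples=2000, seed=0):
    """Return (observed, p): altered members of `fgs` and an empirical p-value.

    The p-value is the fraction of random gene sets, of the same size as
    `fgs`, that contain at least as many altered genes as `fgs` does.
    """
    rng = random.Random(seed)              # fixed seed keeps the sketch reproducible
    universe = sorted(set(universe))
    altered = set(altered) & set(universe)
    fgs = set(fgs) & set(universe)
    observed = len(fgs & altered)

    at_least = 0
    for _ in range(n_samples):
        random_set = rng.sample(universe, len(fgs))
        if len(altered.intersection(random_set)) >= observed:
            at_least += 1

    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least + 1) / (n_samples + 1)


# Toy data: 200 measured genes, of which the first 20 are altered.
universe = [f"g{i}" for i in range(200)]
altered = {f"g{i}" for i in range(20)}
fgs_a = {f"g{i}" for i in range(10)}        # every member is altered
fgs_b = {f"g{i}" for i in range(100, 110)}  # no member is altered

obs_a, p_a = empirical_enrichment_p(universe, fgs_a, altered)
obs_b, p_b = empirical_enrichment_p(universe, fgs_b, altered)
print(f"FGS A: {obs_a} altered members, p={p_a:.4f}")
print(f"FGS B: {obs_b} altered members, p={p_b:.4f}")
```

Sampling makes the null model explicit and is easy to adapt, for instance to a score based on the strength of alteration rather than a count, at the cost of a p-value resolution limited by the number of samples.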

Transcriptomics technologies are the techniques used to study an organism's transcriptome, the sum of all of its RNA transcripts. The information content of an organism is recorded in the DNA of its genome and expressed through transcription. Here, mRNA serves as a transient intermediary molecule in the information network, whilst non-coding RNAs perform additional diverse functions. A transcriptome captures a snapshot in time of the total transcripts present in a cell. Transcriptomics technologies provide a broad account of which cellular processes are active and which are dormant. A major challenge in molecular biology is to understand how a single genome gives rise to a variety of cells. Another is how gene expression is regulated.

References

[1] "Microarrays Factsheet". Retrieved 2007-12-28.
[2] Suter L, Babiss LE, Wheeldon EB (2004). "Toxicogenomics in predictive toxicology in drug development". Chem. Biol. 11 (2): 161–71. doi:10.1016/j.chembiol.2004.02.003. PMID 15123278.
[3] Magic Z, Radulovic S, Brankovic-Magic M (2007). "cDNA microarrays: identification of gene signatures and their application in clinical practice". J BUON. 12 (Suppl 1): S39–44. PMID 17935276.
[4] Cheung AN (2007). "Molecular targets in gynaecological cancers". Pathology. 39 (1): 26–45. doi:10.1080/00313020601153273. PMID 17365821. S2CID 40896577.
[5] Mirza SP, Olivier M (2007). "Methods and approaches for the comprehensive characterization and quantification of cellular proteomes using mass spectrometry". Physiol Genomics. 33 (1): 3–11. doi:10.1152/physiolgenomics.00292.2007. PMC 2771641. PMID 18162499.
[6] Hebert AS, Richards AL, et al. (2014). "The One Hour Yeast Proteome". Mol Cell Proteomics. 13 (1): 339–347. doi:10.1074/mcp.M113.034769. PMC 3879625. PMID 24143002.
[7] Chen JJ (2007). "Key aspects of analyzing microarray gene-expression data". Pharmacogenomics. 8 (5): 473–82. doi:10.2217/14622416.8.5.473. PMID 17465711.
[8] van Dongen, Stijn (2000). Graph Clustering by Flow Simulation. University of Utrecht.
[9] Jaskowiak, Pablo A; Campello, Ricardo JGB; Costa, Ivan G (24 January 2014). "On the selection of appropriate distances for gene expression data clustering". BMC Bioinformatics. 15 (Suppl 2): S2. doi:10.1186/1471-2105-15-S2-S2. PMC 4072854. PMID 24564555.
[10] Vardhanabhuti S, Blakemore SJ, Clark SM, Ghosh S, Stephens RJ, Rajagopalan D (2006). "A comparison of statistical tests for detecting differential expression using Affymetrix oligonucleotide microarrays". OMICS. 10 (4): 555–66. doi:10.1089/omi.2006.10.555. PMID 17233564.
[11] "Significance Analysis of Microarrays". Archived from the original on 2008-01-20. Retrieved 2007-12-27.
[12] Yauk CL, Berndt ML (2007). "Review of the literature examining the correlation among DNA microarray technologies". Environ. Mol. Mutagen. 48 (5): 380–94. Bibcode:2007EnvMM..48..380Y. doi:10.1002/em.20290. PMC 2682332. PMID 17370338.
[13] Breitling R (2006). "Biological microarray interpretation: the rules of engagement" (PDF). Biochim. Biophys. Acta. 1759 (7): 319–27. doi:10.1016/j.bbaexp.2006.06.003. PMID 16904203. S2CID 1857997.
[14] Draminski M, Rada-Iglesias A, Enroth S, Wadelius C, Koronacki J, Komorowski J (2008). "Monte Carlo feature selection for supervised classification". Bioinformatics. 24 (1): 110–7. doi:10.1093/bioinformatics/btm486. PMID 18048398.
[15] Dr. Leming Shi, National Center for Toxicological Research. "MicroArray Quality Control (MAQC) Project". U.S. Food and Drug Administration. Retrieved 2007-12-26.
[16] Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP (2005). "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles". Proc. Natl. Acad. Sci. U.S.A. 102 (43): 15545–50. doi:10.1073/pnas.0506580102. PMC 1239896. PMID 16199517.
[17] Luo W, Friedman M, Shedden K, Hankenson KD, Woolf JP (2009). "GAGE: generally applicable gene set enrichment for pathway analysis". BMC Bioinformatics. 10: 161. doi:10.1186/1471-2105-10-161. PMC 2696452. PMID 19473525.
[18] Dai M, Wang P, Boyd AD, et al. (2005). "Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data". Nucleic Acids Res. 33 (20): e175. doi:10.1093/nar/gni179. PMC 1283542. PMID 16284200.
[19] Alberts R, Terpstra P, Hardonk M, et al. (2007). "A verification protocol for the probe sequences of Affymetrix genome arrays reveals high probe accuracy for studies in mouse, human and rat". BMC Bioinformatics. 8: 132. doi:10.1186/1471-2105-8-132. PMC 1865557. PMID 17448222.
[20] "GSEA – MSigDB". Retrieved 2008-01-03.
[21] "CTD: The Comparative Toxicogenomics Database". Retrieved 2008-01-03.
[22] "Ingenuity Systems". Retrieved 2007-12-27.
[23] Alekseev OM, Richardson RT, Alekseev O, O'Rand MG (2009). "Analysis of gene expression profiles in HeLa cells in response to overexpression or siRNA-mediated depletion of NASP". Reprod. Biol. Endocrinol. 7: 45. doi:10.1186/1477-7827-7-45. PMC 2686705. PMID 19439102.
[24] Curtis RK, Oresic M, Vidal-Puig A (2005). "Pathways to the analysis of microarray data". Trends Biotechnol. 23 (8): 429–35. doi:10.1016/j.tibtech.2005.05.011. PMID 15950303.
[25] Mook S, Van't Veer LJ, Rutgers EJ, Piccart-Gebhart MJ, Cardoso F (2007). "Individualization of therapy using Mammaprint: from development to the MINDACT Trial". Cancer Genomics Proteomics. 4 (3): 147–55. PMID 17878518.
[26] Corsello SM, Roti G, Ross KN, Chow KT, Galinsky I, DeAngelo DJ, Stone RM, Kung AL, Golub TR, Stegmaier K (June 2009). "Identification of AML1-ETO modulators by chemical genomics". Blood. 113 (24): 6193–205. doi:10.1182/blood-2008-07-166090. PMC 2699238. PMID 19377049.
[27] "GSEA". Retrieved 2008-01-09.
[28] Couzin J (2006). "Genomics. Microarray data reproduced, but some concerns remain". Science. 313 (5793): 1559. doi:10.1126/science.313.5793.1559a. PMID 16973852. S2CID 58528299.

External links

Comparative Transcriptomics Analysis in Reference Module in Life Sciences

