Gene co-expression network

A gene co-expression network constructed from a microarray dataset containing gene expression profiles of 7221 genes for 18 gastric cancer patients

A gene co-expression network (GCN) is an undirected graph in which each node corresponds to a gene, and a pair of nodes is connected by an edge if there is a significant co-expression relationship between them. [1] Given gene expression profiles of a number of genes for several samples or experimental conditions, a gene co-expression network can be constructed by looking for pairs of genes that show a similar expression pattern across samples, since the transcript levels of two co-expressed genes rise and fall together across samples. Gene co-expression networks are of biological interest because co-expressed genes are often controlled by the same transcriptional regulatory program, functionally related, or members of the same pathway or protein complex. [2]

The direction and type of co-expression relationships are not determined in gene co-expression networks, whereas in a gene regulatory network (GRN) a directed edge connects two genes and represents a biochemical process such as a reaction, transformation, interaction, activation or inhibition. [3] Unlike a GRN, a GCN does not attempt to infer causal relationships between genes; its edges represent only a correlation or dependency relationship among genes. [4] Modules, the highly connected subgraphs in gene co-expression networks, correspond to clusters of genes that have a similar function or are involved in a common biological process, which causes many interactions among them. [3]

The direction of edges is not captured in gene co-expression networks. When three genes X, Y and Z are found to be co-expressed, it is not determined whether X activates Y and Y activates Z, or Y activates X and Z, or another gene activates all three of them.

Gene co-expression networks are usually constructed from datasets generated by high-throughput gene expression profiling technologies such as microarrays or RNA-Seq. Co-expression networks are also used to analyze single-cell RNA-Seq data, in order to better characterize gene-to-gene relations in a cohort of cells of a specific cell type [citation needed].

History

The concept of gene co-expression networks was first introduced by Butte and Kohane in 1999 as relevance networks. [5] They gathered measurement data from medical laboratory tests (e.g. hemoglobin level) for a number of patients and calculated the Pearson correlation between the results for each pair of tests. Pairs of tests that showed a correlation higher than a certain level were connected in the network (e.g. insulin level with blood sugar). Butte and Kohane later used this approach with mutual information as the co-expression measure and gene expression data as input to construct the first gene co-expression network. [6]

Constructing gene co-expression networks

Many methods have been developed for constructing gene co-expression networks. In principle, they all follow a two-step approach: calculating a co-expression measure, and selecting a significance threshold. In the first step, a co-expression measure is selected and a similarity score is calculated for each pair of genes using this measure. Then, a threshold is determined, and gene pairs whose similarity score is higher than the selected threshold are considered to have a significant co-expression relationship and are connected by an edge in the network.

The two general steps for constructing a gene co-expression network: calculating a co-expression score (e.g. the absolute value of the Pearson correlation coefficient) for each pair of genes, and selecting a significance threshold (e.g. correlation > 0.8).

The input data for constructing a gene co-expression network is often represented as a matrix. Given the gene expression values of m genes for n samples (conditions), the input data is an m×n matrix called the expression matrix. For instance, in a microarray experiment the expression values of thousands of genes are measured for several samples. In the first step, a similarity score (co-expression measure) is calculated between each pair of rows in the expression matrix. The result is an m×m matrix called the similarity matrix. Each element of this matrix shows how similarly the expression levels of two genes change together. In the second step, the elements of the similarity matrix that are above a certain threshold (i.e. indicate significant co-expression) are replaced by 1 and the remaining elements are replaced by 0. The resulting matrix, called the adjacency matrix, represents the graph of the constructed gene co-expression network: each element shows whether two genes are connected in the network (the 1 elements) or not (the 0 elements).
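The two steps can be illustrated with a minimal Python sketch. It assumes NumPy, uses the absolute Pearson correlation as the co-expression measure and a cutoff of 0.8 (both taken from the example in the figure caption), and works on a made-up toy matrix. The function name `build_network` is hypothetical and chosen only for this illustration.

```python
import numpy as np

def build_network(expression, threshold=0.8):
    """Return (similarity, adjacency) for an m x n expression matrix."""
    # Step 1: absolute Pearson correlation between every pair of rows (genes).
    similarity = np.abs(np.corrcoef(expression))
    # Step 2: pairs scoring above the threshold become 1, all others 0.
    adjacency = (similarity > threshold).astype(int)
    # A gene is not treated as connected to itself.
    np.fill_diagonal(adjacency, 0)
    return similarity, adjacency

# Toy expression matrix (invented values): m = 4 genes, n = 6 samples.
expression = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],    # gene 0
    [2.1, 3.9, 6.2, 8.0, 9.9, 12.1],   # gene 1: rises with gene 0
    [6.0, 5.0, 4.0, 3.0, 2.0, 1.0],    # gene 2: falls as gene 0 rises
    [3.0, 1.0, 4.0, 1.0, 5.0, 2.0],    # gene 3: unrelated pattern
])

similarity, adjacency = build_network(expression)
print(adjacency)
```

In this toy case genes 0, 1 and 2 end up mutually connected, while gene 3 stays isolated. Because the absolute value is used, the anti-correlated gene 2 is connected just like the positively correlated gene 1.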

Co-expression measure

The expression values of a gene for different samples can be represented as a vector, so calculating the co-expression measure between a pair of genes is the same as calculating the selected measure for two vectors of numbers.

Pearson's correlation coefficient, mutual information, Spearman's rank correlation coefficient and Euclidean distance are the four most commonly used co-expression measures for constructing gene co-expression networks. Euclidean distance measures the geometric distance between two vectors, and so considers both the direction and the magnitude of the vectors of gene expression values. Mutual information measures how much knowing the expression levels of one gene reduces the uncertainty about the expression levels of another. Pearson's correlation coefficient measures the tendency of two vectors to increase or decrease together, giving a measure of their overall correspondence. Spearman's rank correlation is the Pearson correlation calculated for the ranks of gene expression values in a gene expression vector. [2] Several other measures, such as partial correlation, [7] regression, [8] and a combination of partial correlation and mutual information, [9] have also been used.
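As an illustration, the four measures can be computed for a pair of expression vectors with NumPy alone. This is only a sketch under stated assumptions: the rank computation assumes there are no tied values, and the mutual information is a simple plug-in estimate from a 2-D histogram whose bin count (4) is an arbitrary choice. All function names and data values are invented for the example.

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Pearson correlation of the ranks; argsort-of-argsort assumes no ties.
    rank = lambda v: np.argsort(np.argsort(v))
    return pearson(rank(x), rank(y))

def euclidean(x, y):
    return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

def mutual_information(x, y, bins=4):
    # Plug-in estimate in bits from a 2-D histogram (bin count is an assumption).
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px @ py)[nonzero])))

# Invented expression vectors: gene_b is a monotonic but non-linear
# function of gene_a.
gene_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
gene_b = gene_a ** 2

print("Pearson:  ", pearson(gene_a, gene_b))
print("Spearman: ", spearman(gene_a, gene_b))
print("Euclidean:", euclidean(gene_a, gene_b))
print("MI (bits):", mutual_information(gene_a, gene_b))
```

Because the relationship is monotonic, the rank-based Spearman coefficient is 1, whereas the Pearson coefficient is slightly lower since the relationship is not linear.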

Each of these measures has its own advantages and disadvantages. The Euclidean distance is not appropriate when the absolute levels of functionally related genes are very different. Furthermore, if two genes have consistently low expression levels but are otherwise randomly correlated, they might still appear close in Euclidean space. [2] One advantage of mutual information is that it can detect non-linear relationships; however, this can become a disadvantage, because it may detect sophisticated non-linear relationships that do not appear biologically meaningful. In addition, calculating mutual information requires estimating the distribution of the data, which needs a large number of samples for a good estimate. Spearman's rank correlation coefficient is more robust to outliers, but it is less sensitive to expression values and, in datasets with a small number of samples, may detect many false positives.
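The two weaknesses of the Euclidean distance described above can be seen in a small numerical sketch (Python with NumPy; all values are invented for illustration). The first pair of genes shares the same pattern at very different absolute levels, and the second pair consists of two weakly expressed genes with unrelated patterns.

```python
import numpy as np

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def euclidean(x, y):
    return float(np.linalg.norm(x - y))

# Same expression pattern, but one gene is expressed 100 times more strongly.
low = np.array([1.0, 2.0, 1.5, 3.0, 2.5])
high = 100.0 * low

# Two weakly expressed genes whose patterns are unrelated.
noise_a = np.array([0.1, 0.3, 0.2, 0.1, 0.3])
noise_b = np.array([0.2, 0.2, 0.1, 0.3, 0.3])

print("same pattern:   Pearson", pearson(low, high),
      "Euclidean", euclidean(low, high))
print("unrelated, low: Pearson", pearson(noise_a, noise_b),
      "Euclidean", euclidean(noise_a, noise_b))
```

The perfectly correlated pair is far apart in Euclidean space, while the uncorrelated pair appears close simply because both genes have small expression values.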

Pearson's correlation coefficient is the most popular co-expression measure used in constructing gene co-expression networks. It takes a value between -1 and 1, where absolute values close to 1 indicate strong correlation. Positive values correspond to an activation mechanism, where the expression of one gene increases as the expression of its co-expressed gene increases, and vice versa. When the expression of one gene decreases as the expression of its co-expressed gene increases, the correlation is negative and corresponds to an underlying suppression mechanism.

The Pearson correlation measure has two disadvantages: it can only detect linear relationships, and it is sensitive to outliers. Moreover, Pearson correlation assumes that the gene expression data follow a normal distribution. Song et al. [10] have suggested biweight midcorrelation (bicor) as a good alternative to Pearson's correlation: "Bicor is a median based correlation measure, and is more robust than the Pearson correlation but often more powerful than the Spearman's correlation". Furthermore, it has been shown that "most gene pairs satisfy linear or monotonic relationships", which indicates that "mutual information networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data". [10]
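A minimal sketch of the biweight midcorrelation in Python with NumPy is given below. It follows the commonly stated definition (deviations from the median, scaled by 9 times the median absolute deviation, with values beyond that range receiving zero weight). It is an illustration and not the WGCNA implementation: it assumes the median absolute deviation is non-zero, and the helper names and data are invented.

```python
import numpy as np

def _biweight_normalized(x):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))      # raw median absolute deviation
    u = (x - med) / (9.0 * mad)           # assumes mad > 0
    # Tukey biweight: points with |u| >= 1 (far from the median) get weight 0.
    w = (1.0 - u ** 2) ** 2 * (np.abs(u) < 1.0)
    weighted = (x - med) * w
    return weighted / np.sqrt(np.sum(weighted ** 2))

def bicor(x, y):
    return float(np.sum(_biweight_normalized(x) * _biweight_normalized(y)))

# Invented data: y tracks x closely except for one extreme outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 6.1, 6.9, 8.0, 9.1, -50.0])

print("Pearson:", float(np.corrcoef(x, y)[0, 1]))
print("bicor:  ", bicor(x, y))
```

In this example the single outlier drives the Pearson correlation negative, while bicor gives it zero weight and still reports a strong positive association.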

Threshold selection

Several methods have been used for selecting a threshold in constructing gene co-expression networks. A simple thresholding method is to choose a co-expression cutoff and select the relationships whose co-expression exceeds this cutoff. Another approach is to use Fisher's Z-transformation, which calculates a z-score for each correlation based on the number of samples. This z-score is then converted into a p-value for each correlation, and a cutoff is set on the p-value. Some methods permute the data and calculate a z-score using the distribution of correlations found between genes in the permuted dataset. [2] Other approaches have also been used, such as threshold selection based on the clustering coefficient [11] or random matrix theory. [12]
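The Fisher transformation approach can be sketched in a few lines of Python using only the standard library. The sketch relies on the usual approximation that the transformed correlation atanh(r) is normally distributed with standard error 1/sqrt(n - 3) when there is no true correlation. It requires |r| < 1 and more than three samples, and the function name is hypothetical.

```python
import math

def fisher_z_pvalue(r, n_samples):
    """Two-sided p-value for a correlation r observed over n_samples."""
    # Fisher's transformation, scaled by its approximate standard error.
    z = math.atanh(r) * math.sqrt(n_samples - 3)
    # Two-sided tail probability of a standard normal variable.
    return math.erfc(abs(z) / math.sqrt(2))

# Example with 18 samples, the number of patients in the figure at the top.
for r in (0.3, 0.5, 0.8):
    print(r, fisher_z_pvalue(r, 18))
```

With 18 samples, a correlation of 0.8 passes a p-value cutoff of 0.01, whereas a correlation of 0.3 does not pass even a cutoff of 0.05. The same correlation value therefore leads to different decisions depending on the number of samples.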

The problem with p-value based methods is that the final cutoff on the p-value is chosen based on statistical convention (e.g. a p-value of 0.01 or 0.05 is considered significant), not on biological insight.

WGCNA is a framework for constructing and analyzing weighted gene co-expression networks. [13] The WGCNA method selects the threshold for constructing the network based on the scale-free topology of gene co-expression networks. This method constructs the network for several thresholds and selects the threshold which leads to a network with scale-free topology. Moreover, the WGCNA method constructs a weighted network, which means that all possible edges appear in the network, but each edge has a weight that shows how significant the co-expression relationship corresponding to that edge is. Of note, this threshold selection is intended to coerce networks into a scale-free topology. However, the underlying premise that biological networks are scale-free is contentious. [14] [15] [16]
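The idea can be sketched in Python with NumPy as follows. This is a simplified illustration and not the WGCNA package (which is written in R): the weighted adjacency is taken as the absolute correlation raised to a power beta, and the scale-free fit is approximated by the squared correlation between log10 of binned connectivity and log10 of its frequency. The function names, the bin count, the candidate powers and the random data are all assumptions made for the example.

```python
import numpy as np

def soft_threshold_adjacency(expression, beta):
    # Weighted network: every pair of genes gets a weight |cor|^beta in [0, 1].
    adjacency = np.abs(np.corrcoef(expression)) ** beta
    np.fill_diagonal(adjacency, 0.0)
    return adjacency

def scale_free_fit(adjacency, n_bins=10):
    # Connectivity of a gene: sum of the weights of its edges.
    k = adjacency.sum(axis=1)
    counts, edges = np.histogram(k, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = (counts > 0) & (centers > 0)   # simplification: drop empty bins
    if keep.sum() < 3:
        return float("nan")
    log_k = np.log10(centers[keep])
    log_p = np.log10(counts[keep] / counts.sum())
    # R^2 of a straight-line fit of log p(k) against log k.
    return float(np.corrcoef(log_k, log_p)[0, 1] ** 2)

# Random placeholder data: 200 genes, 20 samples.
rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 20))

for beta in (1, 4, 8):
    adjacency = soft_threshold_adjacency(expression, beta)
    print(beta, adjacency.sum(axis=1).mean(), scale_free_fit(adjacency))
```

In practice one would scan a range of powers and pick one for which the fit is high while the mean connectivity remains reasonable. Uncorrelated random data, as used here, is not expected to show a good scale-free fit.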

lmQCM is an alternative to WGCNA that pursues the same goal of gene co-expression network analysis. lmQCM [17] stands for local maximal Quasi-Clique Merger; it aims to exploit the locally dense structures in the network, and can therefore mine smaller, densely co-expressed modules while allowing modules to overlap. The lmQCM algorithm is available as an R package and a Python module (bundled in Biolearns). The generally smaller size of the mined modules can also yield more meaningful gene ontology (GO) enrichment results.

Challenges

Co-expression networks try to estimate the direct and sometimes the indirect correlations between pairs of genes. However, several challenges arise. First, an individual gene may be controlled by multiple regulators. [18] Second, as discussed in the previous sections, each co-expression measure is designed to capture a specific feature and is not necessarily optimal for depicting all types of gene-to-gene transcriptional inter-relation: for example, Pearson correlation captures linear relations, Spearman correlation captures the ranking of the genes, and so on. Third and last, calculating gene-to-gene co-expression networks for a whole genome results in very large matrices that contain a considerable amount of noise, which makes it significantly difficult to explore how they differ across cohorts. These challenges should be considered when applying advanced co-expression methods to gene expression data.

Applications

References

  1. Stuart, Joshua M; Segal, Eran; Koller, Daphne; Kim, Stuart K (2003). "A gene-coexpression network for global discovery of conserved genetic modules". Science. 302 (5643): 249–55. Bibcode:2003Sci...302..249S. CiteSeerX 10.1.1.119.6331. doi:10.1126/science.1087447. PMID 12934013. S2CID 3131371.
  2. Weirauch, Matthew T (2011). "Gene coexpression networks for the analysis of DNA microarray data". Applied Statistics for Network Biology: Methods in Systems Biology. pp. 215–250. doi:10.1002/9783527638079.ch11. ISBN 9783527638079.
  3. Roy, Swarup; Bhattacharyya, Dhruba K; Kalita, Jugal K (2014). "Reconstruction of gene co-expression network from microarray data using local expression patterns". BMC Bioinformatics. 15 (Suppl 7): S10. doi:10.1186/1471-2105-15-s7-s10. PMC 4110735. PMID 25079873.
  4. De Smet, Riet; Marchal, Kathleen (2010). "Advantages and limitations of current network inference methods". Nature Reviews Microbiology. 8 (10): 717–29. doi:10.1038/nrmicro2419. PMID 20805835. S2CID 27629033.
  5. Butte, Atul J; Kohane, Isaac S (1999). "Unsupervised knowledge discovery in medical databases using relevance networks". Proceedings of the AMIA Symposium: 711–715. PMC 2232846. PMID 10566452.
  6. Butte, Atul J; Kohane, Isaac S (2000). "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements". Pac Symp Biocomput. 5.
  7. Villa-Vialaneix, Nathalie; Liaubet, Laurence; Laurent, Thibault; Cherel, Pierre; Gamot, Adrien; SanCristobal, Magali (2013). "The structure of a gene co-expression network reveals biological functions underlying eQTLs". PLOS ONE. 8 (4): 60045. Bibcode:2013PLoSO...860045V. doi:10.1371/journal.pone.0060045. PMC 3618335. PMID 23577081.
  8. Persson, Staffan; Wei, Hairong; Milne, Jennifer; Page, Grier P; Somerville, Christopher R (2005). "Identification of genes required for cellulose synthesis by regression analysis of public microarray data sets". Proceedings of the National Academy of Sciences of the United States of America. 102 (24): 8633–8. Bibcode:2005PNAS..102.8633P. doi:10.1073/pnas.0503392102. PMC 1142401. PMID 15932943.
  9. Reverter, Antonio; Chan, Eva KF (2008). "Combining partial correlation and an information theory approach to the reversed engineering of gene co-expression networks". Bioinformatics. 24 (21): 2491–2497. doi:10.1093/bioinformatics/btn482. PMID 18784117.
  10. Song, Lin; Langfelder, Peter; Horvath, Steve (2012). "Comparison of co-expression measures: mutual information, correlation, and model based indices". BMC Bioinformatics. 13 (1): 328. doi:10.1186/1471-2105-13-328. PMC 3586947. PMID 23217028.
  11. Elo, Laura L; Järvenpää, Henna; Orešič, Matej; Lahesmaa, Riitta; Aittokallio, Tero (2007). "Systematic construction of gene coexpression networks with applications to human T helper cell differentiation process". Bioinformatics. 23 (16): 2096–2103. doi:10.1093/bioinformatics/btm309. PMID 17553854.
  12. Luo, Feng; Yang, Yunfeng; Zhong, Jianxin; Gao, Haichun; Khan, Latifur; Thompson, Dorothea K; Zhou, Jizhong (2007). "Constructing gene co-expression networks and predicting functions of unknown genes by random matrix theory". BMC Bioinformatics. 8 (1): 299. doi:10.1186/1471-2105-8-299. PMC 2212665. PMID 17697349.
  13. Zhang, Bin; Horvath, Steve (2005). "A general framework for weighted gene co-expression network analysis". Statistical Applications in Genetics and Molecular Biology. 4 (1): Article17. CiteSeerX 10.1.1.471.9599. doi:10.2202/1544-6115.1128. PMID 16646834. S2CID 7756201.
  14. Khanin, R.; Wit, E. (2006). "How scale-free are biological networks". Journal of Computational Biology. 13 (3): 810–8. doi:10.1089/cmb.2006.13.810. PMID 16706727.
  15. Broido, Anna D.; Clauset, Aaron (2019). "Scale-free networks are rare". Nature Communications. 10 (1): 1017. arXiv:1801.03400. Bibcode:2019NatCo..10.1017B. doi:10.1038/s41467-019-08746-5. PMC 6399239. PMID 30833554. S2CID 24825063.
  16. Clote, P. (2020). "Are RNA networks scale-free?". Journal of Mathematical Biology. 80 (5): 1291–1321. doi:10.1007/s00285-019-01463-z. PMC 7052049. PMID 31950258.
  17. Zhang, Jie; Huang, Kun (2014). "Normalized lmQCM: An Algorithm for Detecting Weak Quasi-Cliques in Weighted Graph with Applications in Gene Co-Expression Module Discovery in Cancers". Cancer Informatics. 13 (3): 137–46. doi:10.4137/CIN.S14021. PMC 4962959. PMID 27486298.
  18. Alon, Uri (2006). Design Principles of Biological Circuits. doi:10.1201/9781420011432. ISBN 9780429092794.
  19. Mercatelli, Daniele; Ray, Forest; Giorgi, Federico M. (2019). "Pan-Cancer and Single-Cell Modeling of Genomic Alterations Through Gene Expression". Frontiers in Genetics. 10: 671. doi:10.3389/fgene.2019.00671. ISSN 1664-8021. PMC 6657420. PMID 31379928.
  20. Mercatelli, Daniele; Scalambra, Laura; Triboli, Luca; Ray, Forest; Giorgi, Federico M. (2020). "Gene regulatory network inference resources: A practical overview". Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms. 1863 (6): 194430. doi:10.1016/j.bbagrm.2019.194430. ISSN 1874-9399. PMID 31678629. S2CID 207895066.
  21. Usadel, Bjoern; Obayashi, Takeshi; Mutwil, Marek; Giorgi, Federico M.; Bassel, George W.; Tanimoto, Mimi; Chow, Amanda; Steinhauser, Dirk; Persson, Staffan; Provart, Nicholas J. (2009). "Co-expression tools for plant biology: opportunities for hypothesis generation and caveats". Plant, Cell & Environment. 32 (12): 1633–1651. doi:10.1111/j.1365-3040.2009.02040.x. ISSN 0140-7791. PMID 19712066.