A gene co-expression network (GCN) is an undirected graph, where each node corresponds to a gene, and a pair of nodes is connected with an edge if there is a significant co-expression relationship between them. [1] Having gene expression profiles of a number of genes for several samples or experimental conditions, a gene co-expression network can be constructed by looking for pairs of genes which show a similar expression pattern across samples, since the transcript levels of two co-expressed genes rise and fall together across samples. Gene co-expression networks are of biological interest since co-expressed genes are controlled by the same transcriptional regulatory program, functionally related, or members of the same pathway or protein complex. [2]
The direction and type of co-expression relationships are not determined in gene co-expression networks; whereas in a gene regulatory network (GRN) a directed edge connects two genes, representing a biochemical process such as a reaction, transformation, interaction, activation or inhibition. [3] Compared to a GRN, a GCN does not attempt to infer the causality relationships between genes and in a GCN the edges represent only a correlation or dependency relationship among genes. [4] Modules or the highly connected subgraphs in gene co-expression networks correspond to clusters of genes that have a similar function or involve in a common biological process which causes many interactions among themselves. [3]
Gene co-expression networks are usually constructed using datasets generated by high-throughput gene expression profiling technologies such as Microarray or RNA-Seq. Co-expression networks are used to analyze single cell RNA-Seq data, in order to better characterize the gene to gene relations in a cohort of cells from a specific cell type. [5]
The concept of gene co-expression networks was first introduced by Butte and Kohane in 1999 as relevance networks. [6] They gathered the measurement data of medical laboratory tests (e.g. hemoglobin level ) for a number of patients and they calculated the Pearson correlation between the results for each pair of tests and the pairs of tests which showed a correlation higher than a certain level were connected in the network (e.g. insulin level with blood sugar). Butte and Kohane used this approach later with mutual information as the co-expression measure and using gene expression data for constructing the first gene co-expression network. [7]
A good number of methods have been developed for constructing gene co-expression networks. In principle, they all follow a two step approach: calculating co-expression measure, and selecting significance threshold. In the first step, a co-expression measure is selected and a similarity score is calculated for each pair of genes using this measure. Then, a threshold is determined and gene pairs which have a similarity score higher than the selected threshold are considered to have a significant co-expression relationship and are connected by an edge in the network.
The input data for constructing a gene co-expression network is often represented as a matrix. If we have the gene expression values of m genes for n samples (conditions), the input data would be an m×n matrix, called expression matrix. For instance, in a microarray experiment the expression values of thousands of genes are measured for several samples. In first step, a similarity score (co-expression measure) is calculated between each pair of rows in expression matrix. The resulting matrix is an m×m matrix called the similarity matrix. Each element in this matrix shows how similarly the expression levels of two genes change together. In the second step, the elements in the similarity matrix which are above a certain threshold (i.e. indicate significant co-expression) are replaced by 1 and the remaining elements are replaced by 0. The resulting matrix, called the adjacency matrix, represents the graph of the constructed gene co-expression network. In this matrix, each element shows whether two genes are connected in the network (the 1 elements) or not (the 0 elements).
The expression values of a gene for different samples can be represented as a vector, thus calculating the co-expression measure between a pair of genes is the same as calculating the selected measure for two vectors of numbers.
Pearson's correlation coefficient, Mutual Information, Spearman's rank correlation coefficient and Euclidean distance are the four mostly used co-expression measures for constructing gene co-expression networks. Euclidean distance measures the geometric distance between two vectors, and so considers both the direction and the magnitude of the vectors of gene expression values. Mutual information measures how much knowing the expression levels of one gene reduces the uncertainty about the expression levels of another. Pearson’s correlation coefficient measures the tendency of two vectors to increase or decrease together, giving a measure of their overall correspondence. Spearman's rank correlation is the Pearson’s correlation calculated for the ranks of gene expression values in a gene expression vector. [2] Several other measures such as partial correlation, [8] regression, [9] and combination of partial correlation and mutual information [10] have also been used.
Each of these measures have their own advantages and disadvantages. The Euclidean distance is not appropriate when the absolute levels of functionally related genes are highly different. Furthermore, if two genes have consistently low expression levels but are otherwise randomly correlated, they might still appear close in Euclidean space. [2] One advantage to mutual information is that it can detect non-linear relationships; however this can turn into a disadvantage because of detecting sophisticated non-linear relationships which does not look biologically meaningful. In addition, for calculating mutual information one should estimate the distribution of the data which needs a large number of samples for a good estimate. Spearman’s rank correlation coefficient is more robust to outliers, but on the other hand it is less sensitive to expression values and in datasets with small number of samples may detect many false positives.
Pearson’s correlation coefficient is the most popular co-expression measure used in constructing gene co-expression networks. The Pearson's correlation coefficient takes a value between -1 and 1 where absolute values close to 1 show strong correlation. The positive values correspond to an activation mechanism where the expression of one gene increases with the increase in the expression of its co-expressed gene and vice versa. When the expression value of one gene decreases with the increase in the expression of its co-expressed gene, it corresponds to an underlying suppression mechanism and would have a negative correlation.
There are two disadvantages for Pearson correlation measure: it can only detect linear relationships and it is sensitive to outliers. Moreover, Pearson correlation assumes that the gene expression data follow a normal distribution. Song et al. [11] have suggested biweight midcorrelation (bicor) as a good alternative for Pearson’s correlation. "Bicor is a median based correlation measure, and is more robust than the Pearson correlation but often more powerful than the Spearman's correlation". Furthermore, it has been shown that "most gene pairs satisfy linear or monotonic relationships" which indicates that "mutual information networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data [11] ".
Several methods have been used for selecting a threshold in constructing gene co-expression networks. A simple thresholding method is to choose a co-expression cutoff and select relationships which their co-expression exceeds this cutoff. Another approach is to use Fisher’s Z-transformation which calculates a z-score for each correlation based on the number of samples. This z-score is then converted into a p-value for each correlation and a cutoff is set on the p-value. Some methods permute the data and calculate a z-score using the distribution of correlations found between genes in permuted dataset. [2] Some other approaches have also been used such as threshold selection based on clustering coefficient [12] or random matrix theory. [13]
The problem with p-value based methods is that the final cutoff on the p-value is chosen based on statistical routines(e.g. a p-value of 0.01 or 0.05 is considered significant), not based on a biological insight.
WGCNA is a framework for constructing and analyzing weighted gene co-expression networks. [14] The WGCNA method selects the threshold for constructing the network based on the scale-free topology of gene co-expression networks. This method constructs the network for several thresholds and selects the threshold which leads to a network with scale-free topology. Moreover, the WGCNA method constructs a weighted network which means all possible edges appear in the network, but each edge has a weight which shows how significant is the co-expression relationship corresponding to that edge. Of note, threshold selection is intended to coerce networks into a scale-free topology. However, the underlying premise that biological networks are scale-free is contentious. [15] [16] [17]
lmQCM is an alternative for WGCNA achieving the same goal of gene co-expression networks analysis. lmQCM, [18] stands for local maximal Quasi-Clique Merger, aiming to exploit the locally dense structures in the network, thus can mine smaller and densely co-expressed modules by allowing module overlapping. the algorithm lmQCM has its R package and python module (bundled in Biolearns). The generally smaller size of mined modules can also generate more meaningful gene ontology (GO) enrichment results.
Co-expression networks try to estimate the direct and sometimes the indirect correlations between pairs of genes. However, an individual gene may be controlled by multiple regulators. [19] Second, as discussed in the previous sections, each co-expression computational measure is designed specifically to capture a unique feature that is not necessarily optimal for depicting all types of gene-to-gene transcriptional inter-relation, for example, Pearson correlation for linear relations, Spearman for the ranking of the genes, and so on. Third and last, calculating the gene to gene co-expression networks for whole genome results in very large matrices which contain a considerable amount of noise, which raises a significant difficulty in exploring their differentiation across cohorts. These challenges should be referred when applying advanced methods of co-expression on gene expression data.
Biostatistics is a branch of statistics that applies statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experiments and the interpretation of the results.
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics it usually refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the height of parents and their offspring, and the correlation between the price of a good and the quantity the consumers are willing to purchase, as it is depicted in the so-called demand curve.
A DNA microarray is a collection of microscopic DNA spots attached to a solid surface. Scientists use DNA microarrays to measure the expression levels of large numbers of genes simultaneously or to genotype multiple regions of a genome. Each DNA spot contains picomoles of a specific DNA sequence, known as probes. These can be a short section of a gene or other DNA element that are used to hybridize a cDNA or cRNA sample under high-stringency conditions. Probe-target hybridization is usually detected and quantified by detection of fluorophore-, silver-, or chemiluminescence-labeled targets to determine relative abundance of nucleic acid sequences in the target. The original nucleic acid arrays were macro arrays approximately 9 cm × 12 cm and the first computerized image based analysis was published in 1981. It was invented by Patrick O. Brown. An example of its application is in SNPs arrays for polymorphisms in cardiovascular diseases, cancer, pathogens and GWAS analysis. It is also used for the identification of structural variations and the measurement of gene expression.
Biclustering, block clustering, Co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan.
Computational genomics refers to the use of computational and statistical analysis to decipher biology from genome sequences and related data, including both DNA and RNA sequence as well as other "post-genomic" data. These, in combination with computational and statistical approaches to understanding the function of the genes and statistical association analysis, this field is also often referred to as Computational and Statistical Genetics/genomics. As such, computational genomics may be regarded as a subset of bioinformatics and computational biology, but with a focus on using whole genomes to understand the principles of how the DNA of a species controls its biology at the molecular level and beyond. With the current abundance of massive biological datasets, computational studies have become one of the most important means to biological discovery.
In fluorescence microscopy, colocalization refers to observation of the spatial overlap between two different fluorescent labels, each having a separate emission wavelength, to see if the different "targets" are located in the same area of the cell or very near to one another. The definition can be split into two different phenomena, co-occurrence, which refers to the presence of two fluorophores in the same pixel, and correlation, a much more significant statistical relationship between the fluorophores indicative of a biological interaction. This technique is important to many cell biological and physiological studies during the demonstration of a relationship between pairs of bio-molecules.
In the field of molecular biology, gene expression profiling is the measurement of the activity of thousands of genes at once, to create a global picture of cellular function. These profiles can, for example, distinguish between cells that are actively dividing, or show how the cells react to a particular treatment. Many experiments of this sort measure an entire genome simultaneously, that is, every gene present in a particular cell.
In biochemistry, flux balance analysis (FBA) is a mathematical method for simulating the metabolism of cells or entire unicellular organisms, such as E. coli or yeast, using genome-scale reconstructions of metabolic networks. Genome-scale reconstructions describe all the biochemical reactions in an organism based on its entire genome. These reconstructions model metabolism by focusing on the interactions between metabolites, identifying which metabolites are involved in the various reactions taking place in a cell or organism, and determining the genes that encode the enzymes which catalyze these reactions. In comparison to traditional methods of modeling, FBA is less intensive in terms of the input data required for constructing the model. Simulations performed using FBA are computationally inexpensive and can calculate steady-state metabolic fluxes for large models in a few seconds on modern personal computers. The related method of metabolic pathway analysis seeks to find and list all possible pathways between metabolites.
Microarray analysis techniques are used in interpreting the data generated from experiments on DNA, RNA, and protein microarrays, which allow researchers to investigate the expression state of a large number of genes – in many cases, an organism's entire genome – in a single experiment. Such experiments can generate very large amounts of data, allowing researchers to assess the overall state of a cell or organism. Data in such large quantities is difficult – if not impossible – to analyze without the help of computer programs.
Biological network inference is the process of making inferences and predictions about biological networks. By using these networks to analyze patterns in biological systems, such as food-webs, we can visualize the nature and strength of these interactions between species, DNA, proteins, and more.
RNA-Seq is a technique that uses next-generation sequencing to reveal the presence and quantity of RNA molecules in a biological sample, providing a snapshot of gene expression in the sample, also known as transcriptome.
A biological network is a method of representing systems as complex sets of binary interactions or relations between various biological entities. In general, networks or graphs are used to capture relationships between entities or objects. A typical graphing representation consists of a set of nodes connected by edges.
In statistics, the phi coefficient is a measure of association for two binary variables.
A weighted network is a network where the ties among nodes have weights assigned to them. A network is a system whose elements are somehow connected. The elements of a system are represented as nodes and the connections among interacting elements are known as ties, edges, arcs, or links. The nodes might be neurons, individuals, groups, organisations, airports, or even countries, whereas ties can take the form of friendship, communication, collaboration, alliance, flow, or trade, to name a few.
In bioinformatics, alignment-free sequence analysis approaches to molecular sequence and structure data provide alternatives over alignment-based approaches.
Weighted correlation network analysis, also known as weighted gene co-expression network analysis (WGCNA), is a widely used data mining method especially for studying biological networks based on pairwise correlations between variables. While it can be applied to most high-dimensional data sets, it has been most widely used in genomic applications. It allows one to define modules (clusters), intramodular hubs, and network nodes with regard to module membership, to study the relationships between co-expression modules, and to compare the network topology of different networks. WGCNA can be used as a data reduction technique, as a clustering method, as a feature selection method, as a framework for integrating complementary (genomic) data, and as a data exploratory technique. Although WGCNA incorporates traditional data exploratory techniques, its intuitive network language and analysis framework transcend any standard analysis technique. Since it uses network methodology and is well suited for integrating complementary genomic data sets, it can be interpreted as systems biologic or systems genetic data analysis method. By selecting intramodular hubs in consensus modules, WGCNA also gives rise to network based meta analysis techniques.
Denoising Algorithm based on Relevance network Topology (DART) is an unsupervised algorithm that estimates an activity score for a pathway in a gene expression matrix, following a denoising step. In DART, a weighted average is used where the weights reflect the degree of the nodes in the pruned network. The denoising step removes prior information that is inconsistent with a data set. This strategy substantially improves unsupervised predictions of pathway activity that are based on a prior model, which was learned from a different biological system or context.
Single-cell transcriptomics examines the gene expression level of individual cells in a given population by simultaneously measuring the RNA concentration of hundreds to thousands of genes. Single-cell transcriptomics makes it possible to unravel heterogeneous cell populations, reconstruct cellular developmental pathways, and model transcriptional dynamics — all previously masked in bulk RNA sequencing.
Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.
Cellular deconvolution refers to computational techniques aiming at estimating the proportions of different cell types in samples collected from a tissue. For example, samples collected from the human brain are a mixture of various neuronal and glial cell types in different proportions, where each cell type has a diverse gene expression profile. Since most high-throughput technologies use bulk samples and measure the aggregated levels of molecular information for all cells in a sample, the measured values would be an aggregate of the values pertaining to the expression landscape of different cell types. Therefore, many downstream analyses such as differential gene expression might be confounded by the variations in cell type proportions when using the output of high-throughput technologies applied to bulk samples. The development of statistical methods to identify cell type proportions in large-scale bulk samples is an important step for better understanding of the relationship between cell type composition and diseases.