Denoising Algorithm based on Relevance network Topology


Denoising Algorithm based on Relevance network Topology (DART) is an unsupervised algorithm that estimates an activity score for a pathway in a gene expression matrix, following a denoising step. [1] In DART, a weighted average is used where the weights reflect the degree of the nodes in the pruned network. [1] The denoising step removes prior information that is inconsistent with a data set. This strategy substantially improves unsupervised predictions of pathway activity that are based on a prior model, which was learned from a different biological system or context. [1]


Pre-existing methods such as gene set enrichment analysis (GSEA) also attempt to infer pathway activity. [2] However, GSEA does not construct a structured list of genes. Signaling Pathway Impact Analysis (SPIA) [3] uses phenotype information to evaluate pathway activity between two phenotypes. However, it does not identify the pathway gene subset that could be used to differentiate individual samples. [3] CORG is used to identify a relevant gene subset, but it is a supervised method and does not perform as well as DART when analyzing independent data sets. [1]

Understanding molecular pathway activity is crucial for risk assessment, clinical diagnosis and treatment. Meta-analysis of complex genomic data is often hampered by difficulties such as extracting useful information from large data sets, eliminating confounding factors and providing a sensible interpretation. Different approaches have been taken to identify the relevant pathways in order to provide better predictions from gene expression data.

Method

Strategy

  1. Build a network of all genes that are involved in the pathway
  2. Evaluate the consistency of the prior regulatory information
  3. Remove inconsistent prior information (the denoising step)
  4. Estimate pathway activity

Pearson correlations are first computed between the pathway genes, at the level of transcription, across the samples of a gene expression data set. Each correlation coefficient then undergoes Fisher's transform:

$$\gamma_{ij} = \frac{1}{2}\ln\left(\frac{1 + c_{ij}}{1 - c_{ij}}\right)$$

where $c_{ij}$ is the correlation coefficient between genes $i$ and $j$. Under the null hypothesis, $\gamma_{ij}$ is normally distributed with mean zero and standard deviation $1/\sqrt{n_s - 3}$, where $n_s$ is the number of tumor samples. The p-value threshold was set at 0.0001, and gene pairs with a significant correlation are considered relevant and kept in the network. The activity score is then estimated on this network, so that neighboring genes are also taken into consideration.
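The relevance-network step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published implementation: the function names (`pearson`, `fisher_p_value`, `relevance_network`) and the dictionary input format are hypothetical, and the sketch covers only the significance threshold on the Fisher-transformed correlation, not the comparison with the prior regulatory directions.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def fisher_p_value(c, n_samples):
    """Two-sided p-value for a correlation c under the Fisher-transform null."""
    c = max(-1 + 1e-12, min(1 - 1e-12, c))   # keep atanh finite at |c| = 1
    y = math.atanh(c)                        # equals 0.5 * ln((1 + c) / (1 - c))
    z = y * math.sqrt(n_samples - 3)         # null standard deviation is 1 / sqrt(n_s - 3)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of a standard normal


def relevance_network(expr, p_threshold=1e-4):
    """Return {(gene_i, gene_j): correlation} for significantly correlated pairs.

    expr maps each gene name to its expression values across the same samples
    (more than three samples are needed for the transform to be usable).
    """
    edges = {}
    for g, h in combinations(sorted(expr), 2):
        c = pearson(expr[g], expr[h])
        if fisher_p_value(c, len(expr[g])) < p_threshold:
            edges[(g, h)] = c
    return edges
```

The default `p_threshold` mirrors the 0.0001 cut-off quoted above. Genes that end up with no significant partner simply do not appear in the returned edge set.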

The activity score $AV$ is a weighted average in which the weights reflect the degree of each gene in the network:

$$AV = \sum_i w_i \sigma_i z_i, \qquad w_i = \frac{k_i}{\sum_j k_j}$$

where $k_i$ is the number of neighbors of gene $i$, $z_i$ is its normalized z-score and $\sigma_i$ is a binary variable (1 if the gene is upregulated upon activation, -1 if it is downregulated). This step estimates the activation level of the pathway. A linear regression model was then applied to the estimated pathway activation levels, with $t_{ij}$ and $p_{ij}$ denoting the t-statistic and p-value of the association between pathways $i$ and $j$, and $p < 0.05$ indicating significance. To assess consistency in a validation data set $D$, a performance measure $V_{ij}$ is defined for each pair of pathways.
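As a rough illustration of the degree-weighted activity score, the sketch below assumes the weights are the node degrees normalized to sum to one. The function name `activity_scores` and the input layout (dictionaries keyed by gene name) are hypothetical choices made for this example.

```python
def activity_scores(z, sign, edges):
    """Degree-weighted pathway activity score for each sample.

    z     : gene -> list of normalized z-scores, one per sample
    sign  : gene -> +1 (upregulated on activation) or -1 (downregulated)
    edges : iterable of (gene_i, gene_j) pairs kept in the network
    """
    # Degree k_i = number of neighbors of each gene in the network.
    degree = {}
    for g, h in edges:
        degree[g] = degree.get(g, 0) + 1
        degree[h] = degree.get(h, 0) + 1

    total = sum(degree.values())
    if total == 0:
        raise ValueError("the network has no edges, so no weights can be formed")

    n_samples = len(next(iter(z.values())))
    scores = [0.0] * n_samples
    for gene, k in degree.items():
        w = k / total  # assumed normalization: weights sum to one
        for s in range(n_samples):
            scores[s] += w * sign[gene] * z[gene][s]
    return scores
```

Genes without any neighbor receive no weight here, so they drop out of the score. Highly connected genes contribute most.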

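Consistent with the definitions that follow, the measure can be written as

$$V_{ij} = S_{ij}\,\sigma_{ij}\,|t_{ij}|$$

where $S_{ij}$ is the threshold function for a given pair of pathways,

$$S_{ij} = \begin{cases} 1 & \text{if } p_{ij} < 0.05 \\ 0 & \text{otherwise} \end{cases}$$

and where

$$\sigma_{ij} = \begin{cases} 1 & \text{if the direction of the correlation in } D \text{ agrees with the prediction} \\ -1 & \text{otherwise} \end{cases}$$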
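The performance measure can be sketched as a product of a significance threshold, a direction term and the magnitude of the t-statistic. This is an assumed reading of the description in this section: the function name `performance_measure`, its argument names, and the use of the sign of a predicted t-statistic to represent the predicted direction are all hypothetical.

```python
def performance_measure(t_predicted, t_validation, p_validation, alpha=0.05):
    """Score one pathway pair in a validation data set.

    t_predicted  : t-statistic whose sign gives the predicted direction
    t_validation : t-statistic observed in the validation data set
    p_validation : p-value observed in the validation data set
    """
    # Threshold function: only significant correlations contribute.
    s = 1 if p_validation < alpha else 0
    # Direction term: an opposite direction is penalized with -1.
    sigma = 1 if t_predicted * t_validation > 0 else -1
    # The magnitude of the correlation weighs the score.
    return s * sigma * abs(t_validation)
```

For example, `performance_measure(2.5, 3.0, 0.01)` yields a positive score, whereas flipping the sign of the validation t-statistic yields a negative score of the same size, and a non-significant p-value yields zero.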

$\sigma_{ij}$ encodes the directionality of the correlation: a correlation whose direction is opposite to the prediction is penalized with a value of -1. $t_{ij}$ is the t-statistic of the inter-pathway correlation. The performance measure $V_{ij}$ therefore accounts for the significance of the correlation between pathways, the direction of the correlation, and the magnitude of the correlation. A two-tailed paired Wilcoxon test is then performed to compare the resulting distributions.

Advantages and limitations

DART gives improved performance and higher accuracy in inferring pathway activity from the prior information in pathway databases. However, it needs pre-existing information and a large database in order to run. In other words, DART requires well-established prior gene expression data to start with; only then can it evaluate consistency and remove any irrelevant information.

Application

DART has been applied successfully in cancer genomics. The algorithm has been shown to be a strong method for estimating pathway activity and perturbation signature activity in breast and lung cancer gene expression data sets. [1] Imaging traits such as those obtained by mammography (the use of low-energy X-rays to examine human breast tissue) play an important role in tumor diagnosis. Studies have shown that women with increased mammographic density (MMD) have a higher risk of developing breast cancer. [4] The estrogen receptor 1 gene (ESR1) encodes estrogen receptor alpha, which is activated by estrogen. Polymorphisms in ESR1 are associated with breast cancer risk through differences in breast density. DART successfully predicted an inverse correlation between ESR1 signaling and MMD. It can be used on both simulated and real multidimensional cancer genomic data, and it gives more reliable predictions of pathway activation, which is helpful in association studies.

References

  1. Jiao Y; Lawler K (19 October 2011). "DART: Denoising Algorithm based on Relevance network Topology improves molecular pathway activity inference". BMC Bioinformatics. 12: 403. doi:10.1186/1471-2105-12-403. PMC 3228554. PMID 22011170.
  2. Subramanian A; Tamayo P; Mukherjee S; Ebert BL (30 September 2005). "Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles". PNAS. 102 (43): 15545–50. doi:10.1073/pnas.0506580102. PMC 1239896. PMID 16199517.
  3. Tarca AL; Draghici S; Khatri P; Hassan SS (2009). "A novel signaling pathway impact analysis". Bioinformatics. 25 (1): 75–82. doi:10.1093/bioinformatics/btn577. PMC 2732297. PMID 18990722.
  4. Li J; Eriksson L; Humphreys K; Czene K (2010). "Genetic variation in the estrogen metabolic pathway and mammographic density as an intermediate phenotype of breast cancer". Breast Cancer Res. 12 (2): R19. doi:10.1186/bcr2488. PMC 2879563. PMID 20214802.