PageRank algorithm in biochemistry

The PageRank algorithm has several applications in biochemistry. PageRank is the algorithm used by Google Search to rank websites in its results, but it has also been adopted for other purposes. According to Google, PageRank works by "counting the number and quality of links to a page to determine a rough estimate of how important the website is," the underlying assumption being that more important websites are likely to receive more links from other websites. [1]

Application in analyzing protein networks

Because the PageRank link analysis algorithm measures the relative importance of nodes, it could be used to identify new possible drug targets among proteins. [2] A PageRank-based algorithm could identify important protein targets in the pathogen organism better than a method that considers only the number of incoming edges (in-degree) of a node in the metabolic network. There are two reasons for this: some already known, important protein targets do not have a high degree (they are not hubs), and perturbing some hubs could result in unwanted physiological effects. [3]

Description

The clinical use of most antibiotics results in mutations of the pathogen organism that lead to resistance against the drug. Therefore, the development of new drugs is always needed. A potential first step in developing new drugs against currently threatening diseases (e.g. tuberculosis) is to find new drug targets in the causative agent of the disease, i.e. the pathogen microorganism, whether it is a bacterium or a protozoan parasite. After finding the target protein in the bacterium (or protozoan parasite), one could design small-molecule drug compounds that bind to the protein and inhibit it.

Public availability of biological network data [4] [5] [6] [7] makes the process of searching for new drug targets easier than it was before. By using the available metabolic networks, it is possible to find important nodes with link analysis algorithms, like PageRank. In a recently published paper, [8] biochemical reactions are treated as nodes of the metabolic network. In this directed network, reaction A has a directed edge towards reaction B if the product of the former enters the latter reaction as a substrate or co-factor.
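The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the cited paper's implementation: the reaction names, the metabolite names, and the helpers `build_reaction_graph` and `in_degrees` are hypothetical, and excluding self-loops is an assumption of this sketch.

```python
# Hypothetical toy reactions: name -> (inputs, products).
# "Inputs" covers both substrates and co-factors.
REACTIONS = {
    "R1": ({"glucose"}, {"g6p"}),
    "R2": ({"g6p"}, {"f6p"}),
    "R3": ({"f6p", "atp"}, {"fbp"}),
    "R4": ({"g6p"}, {"6pg"}),
}


def build_reaction_graph(reactions):
    """Return {reaction: set of successor reactions}.

    An edge A -> B is added when some product of A is an input of B.
    Self-loops are skipped (an assumption of this sketch).
    """
    graph = {name: set() for name in reactions}
    for a, (_, products_a) in reactions.items():
        for b, (inputs_b, _) in reactions.items():
            if a != b and products_a & inputs_b:
                graph[a].add(b)
    return graph


def in_degrees(graph):
    """Count the incoming edges of every node."""
    degrees = {node: 0 for node in graph}
    for successors in graph.values():
        for b in successors:
            degrees[b] += 1
    return degrees


graph = build_reaction_graph(REACTIONS)
degrees = in_degrees(graph)
```

Here the product of R1 feeds both R2 and R4, so R1 has two outgoing edges, while R2, R3 and R4 each have exactly one incoming edge.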

To select important nodes that could serve as drug targets, one might think of selecting high in-degree nodes (hubs, i.e. nodes with many incoming edges). It was shown, however, [2] that targeting hub proteins with many vital functions may unintentionally harm the living cell as well. A PageRank-based scoring method could detect important nodes that are not hubs and therefore might be better drug targets.

The PageRank of a node A is the stationary (limit) probability that a random walker is at node A. [2] In its original application, the personalization vector w captured the personal interests of a web surfer: websites interesting to the surfer appeared with higher probability in the distribution given by w. [8] In the metabolic network, w is personalized to proteins: w is larger for proteins that appear in higher concentrations in the proteomics analysis of certain diseases. This personalized PageRank may identify other proteins related to the disease. [2] [8]
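The random-walk definition above can be illustrated with a plain power iteration. The sketch below is not taken from the cited papers. It makes several assumptions: the function name `personalized_pagerank` is hypothetical, the damping factor 0.85 is the conventional default rather than a value given in the source, the walker restarts according to w, and nodes without outgoing edges pass their score on according to w as well.

```python
def personalized_pagerank(graph, w, damping=0.85, tol=1e-12, max_iter=1000):
    """Approximate the stationary distribution of a random walk with restarts.

    graph: {node: iterable of successor nodes}; every successor must be a key.
    w:     {node: non-negative weight}, the personalization vector.
    damping is an assumed conventional value, not one given in the source.
    """
    nodes = list(graph)
    total = sum(w.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("personalization vector must have positive mass")
    w = {n: w.get(n, 0.0) / total for n in nodes}  # normalize to sum to 1

    rank = dict(w)  # start the walker from the personalization distribution
    for _ in range(max_iter):
        # Restart step: with probability (1 - damping) jump according to w.
        new = {n: (1.0 - damping) * w[n] for n in nodes}
        for n in nodes:
            successors = list(graph[n])
            if successors:
                share = damping * rank[n] / len(successors)
                for m in successors:
                    new[m] += share
            else:
                # Dangling node (assumption): redistribute according to w.
                for m in nodes:
                    new[m] += damping * rank[n] * w[m]
        delta = sum(abs(new[n] - rank[n]) for n in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical three-node cycle a -> b -> c -> a.
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
uniform = personalized_pagerank(cycle, {"a": 1, "b": 1, "c": 1})
biased = personalized_pagerank(cycle, {"a": 1})
```

With a uniform w every node of the cycle gets the same score. Putting all of the personalization weight on `a` raises the score of `a` and, to a decreasing extent, of the nodes downstream of it.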

However, when only the personalized PageRank is used to identify important nodes, hubs still get a high score on average. [9] To find important non-hub nodes instead, the nodes can be scored by their "relativized personalized PageRank", i.e. their personalized PageRank score divided by the number of edges pointing towards them (their in-degree).

The relativized personalized PageRank rPPR(v) of a node v is given by:

$$\mathrm{rPPR}(v) = \frac{\mathrm{PPageRank}(v)}{d_{\mathrm{in}}(v)}$$

where $\mathrm{PPageRank}(v)$ is the personalized PageRank score of node v, and $d_{\mathrm{in}}(v)$ is its in-degree. It was shown that numerous already validated drug targets can be found with this method (e.g. in Mycobacterium tuberculosis); new, currently unknown targets might therefore be detected as well. [8]
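The ratio of personalized PageRank to in-degree can be computed as in the sketch below. The function name `relativized_ppr`, the toy graph and the scores in `ppr` are hypothetical (the scores are made up for illustration, not computed). Leaving out nodes with zero in-degree is an assumption of this sketch, since the text does not say how such nodes are handled.

```python
def relativized_ppr(ppr, graph):
    """Divide each node's personalized PageRank by its in-degree.

    ppr:   {node: personalized PageRank score}
    graph: {node: iterable of successor nodes}
    Nodes with zero in-degree are omitted (assumption: the ratio is
    undefined for them).
    """
    in_degree = {n: 0 for n in graph}
    for successors in graph.values():
        for m in successors:
            in_degree[m] += 1
    return {n: ppr[n] / in_degree[n] for n in graph if in_degree[n] > 0}


# Hypothetical toy network: H is a hub with three incoming edges,
# X has a single incoming edge.
toy_graph = {"A": ["H", "X"], "B": ["H"], "C": ["H"], "H": [], "X": []}
# Made-up personalized PageRank scores, for illustration only.
ppr = {"A": 0.1, "B": 0.1, "C": 0.1, "H": 0.4, "X": 0.3}

rppr = relativized_ppr(ppr, toy_graph)
ranking = sorted(rppr, key=rppr.get, reverse=True)
```

In this toy case the hub H has the highest raw score, but after dividing by in-degree the low-degree node X ranks first, which is the effect the relativized score is meant to produce.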

References

  1. "Facts about Google and Competition". Archived from the original on 4 November 2011. Retrieved 12 July 2014.
  2. Iván, Gábor; Grolmusz, Vince (2010-12-12). "When the Web meets the cell: using personalized PageRank for analyzing protein interaction networks". Bioinformatics. 27 (3): 405–407. doi:10.1093/bioinformatics/btq680. ISSN 1367-4811. PMID 21149343.
  3. Russell RB, Aloy P (2008). "Targeting and tinkering with interaction networks". Nat Chem Biol 4: 666–673.
  4. Prasad, T. S.; Kandasamy, K.; Pandey, A. (2009). "Human Protein Reference Database and Human Proteinpedia as discovery tools for systems biology". Reverse Chemical Genetics. Methods in Molecular Biology (Clifton, N.J.). Vol. 577. pp. 67–79. doi:10.1007/978-1-60761-232-2_6. ISBN 978-1-60761-231-5. ISSN 1940-6029. PMID 19718509.
  5. "FEBS Lett 513: 135–140 - Search Results - PubMed". PubMed. Retrieved 2024-10-14.
  6. Xenarios, Ioannis; Salwínski, Łukasz; Duan, Xiaoqun Joyce; Higney, Patrick; Kim, Sul-Min; Eisenberg, David (2002-01-01). "DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions". Nucleic Acids Research. 30 (1): 303–305. doi:10.1093/nar/30.1.303. ISSN 0305-1048. PMC 99070. PMID 11752321.
  7. Farkas IJ, Korcsmaros T, Kovacs IA, Mihalik A, Palotai R, et al. (2011). "Network-based tools for the identification of novel drug targets". Sci Signal 4: pt3.
  8. Bánky, Dániel; Iván, Gábor; Grolmusz, Vince (2013-01-29). "Equal Opportunity for Low-Degree Network Nodes: A PageRank-Based Method for Protein Target Identification in Metabolic Graphs". PLOS ONE. 8 (1): e54204. Bibcode:2013PLoSO...854204B. doi:10.1371/journal.pone.0054204. ISSN 1932-6203. PMC 3558500. PMID 23382878.
  9. Fortunato, Santo; Boguñá, Marián; Flammini, Alessandro; Menczer, Filippo (2008). "Approximating PageRank from In-Degree". In Aiello, William; Broder, Andrei; Janssen, Jeannette; Milios, Evangelos (eds.). Algorithms and Models for the Web-Graph. Lecture Notes in Computer Science. Vol. 4936. Berlin, Heidelberg: Springer. pp. 59–71. doi:10.1007/978-3-540-78808-9_6. ISBN   978-3-540-78808-9.