Biological network inference is the process of making inferences and predictions about biological networks. [1] By using these networks to analyze patterns in biological systems, such as food webs, we can visualize the nature and strength of interactions between species, DNA, proteins, and more.
The analysis of biological networks with respect to diseases has led to the development of the field of network medicine. [2] Recent examples of the application of network theory in biology include applications to understanding the cell cycle [3] as well as a quantitative framework for developmental processes. [4] Good network inference requires proper planning and execution of an experiment, thereby ensuring quality data acquisition. Optimal experimental design in principle refers to the use of statistical and/or mathematical concepts to plan for data acquisition. This must be done in such a way that the data's information content is enriched and a sufficient amount of data is collected, with enough technical and biological replicates where necessary.[citation needed]
The general cycle for modeling biological networks is as follows:[citation needed]
A network is a set of nodes and a set of directed or undirected edges between the nodes. Many types of biological networks exist, including transcriptional, signalling and metabolic networks. Few such networks are known in anything approaching their complete structure, even in the simplest bacteria. Still less is known about the parameters governing the behavior of such networks over time, how the networks at different levels in a cell interact, and how to predict the complete state description of a eukaryotic cell or bacterial organism at a given point in the future. Systems biology, in this sense, is still in its infancy.[citation needed]
There is great interest in network medicine for the modelling of biological systems. This article focuses on inference of biological network structure using the growing sets of high-throughput expression data for genes, proteins, and metabolites. [10] Briefly, methods using high-throughput data for inference of regulatory networks rely on searching for patterns of partial correlation or conditional probabilities that indicate causal influence. [7] [11] Such patterns of partial correlations found in the high-throughput data, possibly combined with supplemental data on the genes or proteins in the proposed networks or with other information on the organism, form the basis upon which such algorithms work. Such algorithms can be of use in inferring the topology of any network where the change in state of one node can affect the state of other nodes.
Genes are the nodes and the edges are directed. A gene serves as the source of a direct regulatory edge to a target gene by producing an RNA or protein molecule that functions as a transcriptional activator or inhibitor of the target gene. If the gene is an activator, then it is the source of a positive regulatory connection; if an inhibitor, then it is the source of a negative regulatory connection. Computational algorithms take as primary input data measurements of mRNA expression levels of the genes under consideration for inclusion in the network, returning an estimate of the network topology. Such algorithms are typically based on linearity, independence or normality assumptions, which must be verified on a case-by-case basis. [12] Clustering or some form of statistical classification is typically employed to perform an initial organization of the high-throughput mRNA expression values derived from microarray experiments, in particular to select sets of genes as candidates for network nodes. [13] The question then arises: how can the clustering or classification results be connected to the underlying biology? Such results can be useful for pattern classification – for example, to classify subtypes of cancer, or to predict differential responses to a drug (pharmacogenomics). But to understand the relationships between the genes, that is, to more precisely define the influence of each gene on the others, the scientist typically attempts to reconstruct the transcriptional regulatory network.
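As a toy illustration of this kind of inference, the sketch below (all expression values invented) connects genes whose mRNA profiles are strongly correlated across samples. Real methods rely on partial correlations, significance testing, and supplemental data rather than a bare Pearson-correlation threshold, so treat this only as the simplest possible baseline:

```python
import numpy as np

def infer_edges(expr, gene_names, threshold=0.9):
    """Connect gene pairs whose expression profiles are strongly
    correlated across samples.

    expr: (n_genes, n_samples) array of expression measurements.
    Returns a list of (gene_a, gene_b, correlation) tuples.
    """
    corr = np.corrcoef(expr)  # pairwise Pearson correlations between rows
    edges = []
    n = len(gene_names)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(corr[i, j]) >= threshold:
                edges.append((gene_names[i], gene_names[j], corr[i, j]))
    return edges

# Invented toy data: gene A's profile tracks gene B's across five
# samples, while gene C varies independently.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0, 5.0],   # A
    [1.1, 2.1, 2.9, 4.2, 5.1],   # B (rises and falls with A)
    [3.0, 1.0, 4.0, 1.0, 5.0],   # C (unrelated)
])
print(infer_edges(expr, ["A", "B", "C"]))
```

Only the A-B edge survives the threshold; the undirected result is a co-expression network, and orienting such edges into a regulatory network requires the additional assumptions and data discussed above.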
A gene co-expression network is an undirected graph, where each node corresponds to a gene, and a pair of nodes is connected with an edge if there is a significant co-expression relationship between them.
Signal transduction networks use proteins for the nodes and directed edges to represent interactions in which the biochemical conformation of the child is modified by the action of the parent (e.g. mediated by phosphorylation, ubiquitylation, methylation, etc.). Primary input into the inference algorithm would be data from a set of experiments measuring protein activation / inactivation (e.g., phosphorylation / dephosphorylation) across a set of proteins. Inference for such signalling networks is complicated by the fact that total concentrations of signalling proteins will fluctuate over time due to transcriptional and translational regulation. Such variation can lead to statistical confounding. Accordingly, more sophisticated statistical techniques must be applied to analyse such datasets; [14] this is especially important in the biology of cancer.
Metabolite networks use nodes to represent chemical reactions and directed edges for the metabolic pathways and regulatory interactions that guide these reactions. Primary input into an algorithm would be data from a set of experiments measuring metabolite levels.
One of the most intensely studied networks in biology, protein-protein interaction networks (PINs) visualize the physical relationships between proteins inside a cell. In a PIN, proteins are the nodes and their interactions are the undirected edges. PINs can be discovered with a variety of methods, including two-hybrid screening, in vitro co-immunoprecipitation, [15] blue native gel electrophoresis, [16] and more. [17]
A neuronal network represents neurons as the nodes and synapses as the edges, which are typically weighted and directed. The weights of the edges are usually adjusted by the activation of the connected nodes. The network is usually organized into input layers, hidden layers, and output layers.
A food web is a directed graph of what eats what in an ecosystem. The members of the ecosystem are the nodes, and if one member eats another there is a directed edge between those two nodes.
These networks are defined by a set of pairwise interactions within and between species that are used to understand the structure and function of larger ecological networks. [18] Network analysis lets us discover and understand how these interactions link together within the system's network. It also allows us to quantify associations between individuals, which makes it possible to infer details about the network as a whole at the species and/or population level. [19]
DNA-DNA chromatin networks are used to clarify the activation or suppression of genes via the relative location of strands of chromatin. These interactions can be understood by analyzing commonalities amongst different loci, a fixed position on a chromosome where a particular gene or genetic marker is located. Network analysis can provide vital support in understanding relationships among different areas of the genome.
A gene regulatory network [20] is a set of molecular regulators that interact with each other and with other substances in the cell. The regulators can be DNA, RNA, proteins, or complexes of these. Gene regulatory networks can be modeled in numerous ways, including coupled ordinary differential equations, Boolean networks, continuous networks, and stochastic gene networks.
The initial data used to make the inference can have a huge impact on the accuracy of the final inference. Network data are inherently noisy and incomplete, sometimes because evidence from multiple sources does not overlap or is contradictory. Data can be sourced in multiple ways, including manual curation of the scientific literature into databases, high-throughput datasets, computational predictions, and text mining of scholarly articles from before the digital era.
A network's diameter is the maximum number of steps separating any two nodes. It can be used to determine how connected a graph is, and it is employed in topology analysis and clustering analysis.
The transitivity or clustering coefficient of a network is a measure of the tendency of the nodes to cluster together. High transitivity means that the network contains communities or groups of nodes that are densely connected internally. In biological networks, finding these communities is very important, because they can reflect functional modules and protein complexes. [21] Uncertainty about the connectivity may distort the results and should be taken into account when transitivity and other topological descriptors are computed for inferred networks. [9]
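A minimal way to compute the global clustering coefficient (transitivity) is to count, for every node, how many of the paths of length two centred on it close into triangles. The sketch below does this by brute force on a small adjacency-set representation; it is fine for toy graphs but far too slow for genome-scale networks:

```python
def transitivity(adj):
    """Global clustering coefficient of an undirected graph.

    adj: dict mapping each node to the set of its neighbours.
    Returns (#closed triples) / (#connected triples); both counts
    enumerate ordered pairs of neighbours, so the factor of two
    cancels in the ratio.
    """
    closed = 0   # paths u - v - w where u and w are also connected
    triples = 0  # all paths u - v - w with u != w
    for v in adj:
        neigh = list(adj[v])
        for i in range(len(neigh)):
            for j in range(len(neigh)):
                if i == j:
                    continue
                triples += 1
                if neigh[j] in adj[neigh[i]]:
                    closed += 1
    return closed / triples if triples else 0.0

# Toy graph: a triangle A-B-C with a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(transitivity(adj))  # 0.6
```

The pendant node opens two of the triples centred on C, pulling the coefficient below 1; a bare triangle would score exactly 1.0.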
Network confidence is a way to measure how sure one can be that the network represents a real biological interaction. We can do this via contextual biological information, by counting the number of times an interaction is reported in the literature, or by grouping different strategies into a single score. The MIscore method for assessing the reliability of protein-protein interaction data is based on the use of standards. [22] MIscore gives an estimation of confidence by weighting all available evidence for an interacting pair of proteins. The method allows weighting of evidence provided by different sources, provided the data is represented following the standards created by the IMEx consortium. The weights reflect the number of publications, the detection method, and the interaction evidence type.
Closeness, a.k.a. closeness centrality, is a measure of centrality in a network and is calculated as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph. This measure can be used to make inferences in all graph types and analysis methods.
Betweenness, a.k.a. betweenness centrality, is a measure of centrality in a graph based on shortest paths. The betweenness for each node is the number of these shortest paths that pass through the node.
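For unweighted graphs, closeness centrality can be computed directly from its definition with a breadth-first search; betweenness additionally requires tracking how many shortest paths route through each node (Brandes' algorithm is the standard choice) and is omitted here. A minimal closeness sketch on an invented toy graph:

```python
from collections import deque

def closeness(adj, source):
    """Closeness centrality of `source` in an unweighted, undirected
    graph: the reciprocal of the sum of shortest-path distances from
    `source` to every other reachable node, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return 1.0 / total if total else 0.0

# Path graph A - B - C: the middle node is closest to all others.
adj = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(closeness(adj, "B"))  # 0.5
print(closeness(adj, "A"))
```

On the path graph, B reaches both neighbours in one step (sum 2, closeness 0.5), while the endpoints need 1 + 2 = 3 steps, so the centre node correctly scores highest.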
For our purposes, network analysis is closely related to graph theory. By measuring the attributes in the previous section we can utilize many different techniques to create accurate inferences based on biological data.
Topology analysis analyzes the topology of a network to identify relevant participants and substructures that may be of biological significance. The term encompasses an entire class of techniques, such as network motif search, centrality analysis, topological clustering, and shortest-path analysis. These are but a few examples; each of these techniques uses the general idea of focusing on the topology of a network to make inferences.
A motif is defined as a frequent and unique sub-graph. By counting all the possible instances, listing all patterns, and testing isomorphisms, we can derive crucial information about a network. Motifs are suggested to be the basic building blocks of complex biological networks. Computational research has focused on improving existing motif-detection tools to assist biological investigations and allow larger networks to be analyzed. Several different algorithms have been developed for this purpose.
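As a concrete example, the feed-forward loop (a → b, a → c, b → c) is one of the best-known motifs in transcriptional networks, and its instances in a small directed graph can be counted by brute force over node triples. Real motif detection additionally tests whether the count is statistically over-represented relative to randomized networks, which this sketch omits:

```python
from itertools import permutations

def count_feed_forward_loops(edge_list):
    """Count feed-forward loops (a->b, a->c, b->c) in a directed
    graph given as a collection of (source, target) edges.  Each
    loop is counted once because the three roles (master regulator,
    intermediate, target) fix the ordering of the triple."""
    edge_set = set(edge_list)
    nodes = {v for e in edge_set for v in e}
    count = 0
    for a, b, c in permutations(nodes, 3):
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set:
            count += 1
    return count

# Invented toy network: X regulates Y and Z, Y also regulates Z,
# and Z regulates W.  Only (X, Y, Z) forms a feed-forward loop.
edges = {("X", "Y"), ("X", "Z"), ("Y", "Z"), ("Z", "W")}
print(count_feed_forward_loops(edges))  # 1
```

Note that a three-node cycle (X → Y → Z → X) does not match the pattern, since the loop requires the direct shortcut edge a → c.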
Centrality gives an estimation of how important a node or edge is for the connectivity or the information flow of the network. It is a useful parameter in signalling networks and is often used when trying to find drug targets. [23] It is most commonly used in PINs to determine important proteins and their functions. Centrality can be measured in different ways depending on the graph and the question that needs answering; measures include node degree (the number of edges connected to a node), global centrality measures, and measures based on random walks, such as the one used by the Google PageRank algorithm to assign a weight to each webpage. [24] Centrality measures may be affected by errors due to measurement noise and other causes. [25] Therefore, the topological descriptors should be defined as random variables with an associated probability distribution encoding the uncertainty in their values. [9]
Topological clustering or topological data analysis (TDA) provides a general framework for analyzing high-dimensional, incomplete, and noisy data in a way that reduces dimensionality and gives robustness to noise. The idea is that the shape of a data set contains relevant information. When this information is a homology group, there is a mathematical interpretation that assumes features persisting for a wide range of parameters are "true" features, while features persisting for only a narrow range of parameters are noise, although the theoretical justification for this is unclear. [26] This technique has been used for progression analysis of disease, [27] [28] viral evolution, [29] propagation of contagions on networks, [30] bacteria classification using molecular spectroscopy, [31] and much more in and outside of biology.
The shortest path problem is a common problem in graph theory that tries to find the path between two vertices (or nodes) in a graph such that the sum of the weights of its constituent edges is minimized. This method can be used to determine the network diameter or redundancy in a network. There are many algorithms for this, including Dijkstra's algorithm, the Bellman–Ford algorithm, and the Floyd–Warshall algorithm, just to name a few.
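A minimal Dijkstra implementation over a weighted adjacency-map representation is sketched below (toy graph invented for illustration); running it from every node and taking the largest resulting distance gives the network diameter mentioned above:

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path distances from `source` in a graph given as
    {node: {neighbour: edge_weight}} with non-negative weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry; a shorter path was found
        for w, weight in graph[v].items():
            nd = d + weight
            if nd < dist.get(w, float("inf")):
                dist[w] = nd
                heapq.heappush(heap, (nd, w))
    return dist

graph = {
    "A": {"B": 1, "C": 4},
    "B": {"A": 1, "C": 2},
    "C": {"A": 4, "B": 2},
}
print(dijkstra(graph, "A"))  # {'A': 0, 'B': 1, 'C': 3}
```

Note that the direct A-C edge (weight 4) loses to the two-hop route through B (weight 3), illustrating why greedy nearest-neighbour traversal alone is not enough.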
Cluster analysis groups objects (nodes) such that objects in the same cluster are more similar to each other than to those in other clusters. This can be used for pattern recognition, image analysis, information retrieval, statistical data analysis, and much more. It has applications in plant and animal ecology, sequence analysis, antimicrobial activity analysis, and many other fields. Cluster-analysis algorithms come in many forms as well, such as hierarchical clustering, k-means clustering, distribution-based clustering, density-based clustering, and grid-based clustering.
Gene annotation databases are commonly used to evaluate the functional properties of experimentally derived gene sets. Annotation Enrichment Analysis (AEA) is used to overcome biases from overlap statistical methods used to assess these associations. [32] It does this by using gene/protein annotations to infer which annotations are over-represented in a list of genes/proteins taken from a network.
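The statistic underlying most over-representation analyses is the one-sided hypergeometric (Fisher) test; the AEA method cited above is more elaborate, so the sketch below should be read as the baseline approach it improves upon, not as AEA itself. All counts are invented:

```python
from math import comb

def hypergeom_enrichment_p(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k): the chance of
    seeing at least k annotated genes in a list of n genes drawn
    from a universe of N genes, K of which carry the annotation."""
    return sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1)
    ) / comb(N, n)

# 3 of 5 listed genes carry a term annotating 5 of 20 genes overall.
print(round(hypergeom_enrichment_p(3, 5, 5, 20), 4))  # 0.0726
```

A small p-value suggests the annotation is over-represented in the gene list; in practice the test is repeated for many annotations, so multiple-testing correction is essential.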
Network | Analysis Tools
---|---
Transcriptional regulatory networks | FANMOD, [33] ChIP-on-chip, [34] position–weight matrix
Gene co-expression networks | FANMOD, Paired Design, [35] WGCNA [36]
Signal transduction | FANMOD, PathLinker [37]
Metabolic networks | FANMOD, Pathway Tools, Ergo, KEGGtranslator, ModelSEED
Protein-protein interaction networks | FANMOD, NETBOX, [38] text mining, [39] STRING
Neuronal networks | FANMOD, Neural Designer, Neuroph, Darknet
Food webs | FANMOD, RCN, R
Within- and between-species interaction networks | FANMOD, NETBOX
DNA-DNA chromatin networks | FANMOD
Gene regulatory networks | FANMOD
Bioinformatics is an interdisciplinary field of science that develops methods and software tools for understanding biological data, especially when the data sets are large and complex. Bioinformatics uses biology, chemistry, physics, computer science, computer programming, information engineering, mathematics and statistics to analyze and interpret biological data. The subsequent process of analyzing and interpreting data is referred to as computational biology.
Computational biology refers to the use of data analysis, mathematical modeling and computational simulations to understand biological systems and relationships. An intersection of computer science, biology, and big data, the field also has foundations in applied mathematics, chemistry, and genetics. It differs from biological computing, a subfield of computer science and engineering which uses bioengineering to build computers.
A gene regulatory network (GRN) is a collection of molecular regulators that interact with each other and with other substances in the cell to govern the gene expression levels of mRNA and proteins which, in turn, determine the function of the cell. GRNs also play a central role in morphogenesis, the creation of body structures, which in turn is central to evolutionary developmental biology (evo-devo).
In bioinformatics, sequence clustering algorithms attempt to group biological sequences that are somehow related. The sequences can be either of genomic, "transcriptomic" (ESTs) or protein origin. For proteins, homologous sequences are typically grouped into families. For EST data, clustering is important to group sequences originating from the same gene before the ESTs are assembled to reconstruct the original mRNA.
In molecular biology, an interactome is the whole set of molecular interactions in a particular cell. The term specifically refers to physical interactions among molecules but can also describe sets of indirect interactions among genes.
Biclustering, block clustering, Co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan.
Computational phylogenetics, phylogeny inference, or phylogenetic inference focuses on computational and optimization algorithms, heuristics, and approaches involved in phylogenetic analyses. The goal is to find a phylogenetic tree representing optimal evolutionary ancestry between a set of genes, species, or taxa. Maximum likelihood, parsimony, Bayesian, and minimum evolution are typical optimality criteria used to assess how well a phylogenetic tree topology describes the sequence data. Nearest Neighbour Interchange (NNI), Subtree Prune and Regraft (SPR), and Tree Bisection and Reconnection (TBR), known as tree rearrangements, are deterministic algorithms to search for optimal or the best phylogenetic tree. The space and the landscape of searching for the optimal phylogenetic tree is known as phylogeny search space.
In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to.
Network motifs are recurrent and statistically significant subgraphs or patterns of a larger graph. All networks, including biological networks, social networks, technological networks and more, can be represented as graphs, which include a wide variety of subgraphs.
In computational biology, power graph analysis is a method for the analysis and representation of complex networks. Power graph analysis is the computation, analysis and visual representation of a power graph from a graph (networks).
A biological network is a method of representing systems as complex sets of binary interactions or relations between various biological entities. In general, networks or graphs are used to capture relationships between entities or objects. A typical graphing representation consists of a set of nodes connected by edges.
Biological data visualization is a branch of bioinformatics concerned with the application of computer graphics, scientific visualization, and information visualization to different areas of the life sciences. This includes visualization of sequences, genomes, alignments, phylogenies, macromolecular structures, systems biology, microscopy, and magnetic resonance imaging data. Software tools used for visualizing biological data range from simple, standalone programs to complex, integrated systems.
Graphlets in mathematics are induced subgraph isomorphism classes in a graph: two occurrences of the same graphlet are isomorphic, whereas two different graphlets are non-isomorphic. Graphlets differ from network motifs in a statistical sense: network motifs are defined as over- or under-represented graphlets with respect to some random-graph null model.
Ron Shamir is an Israeli professor of computer science known for his work in graph theory and in computational biology. He holds the Raymond and Beverly Sackler Chair in Bioinformatics, and is the founder and former head of the Edmond J. Safra Center for Bioinformatics at Tel Aviv University.
A gene co-expression network (GCN) is an undirected graph, where each node corresponds to a gene, and a pair of nodes is connected with an edge if there is a significant co-expression relationship between them. Having gene expression profiles of a number of genes for several samples or experimental conditions, a gene co-expression network can be constructed by looking for pairs of genes which show a similar expression pattern across samples, since the transcript levels of two co-expressed genes rise and fall together across samples. Gene co-expression networks are of biological interest since co-expressed genes are controlled by the same transcriptional regulatory program, functionally related, or members of the same pathway or protein complex.
Network medicine is the application of network science towards identifying, preventing, and treating diseases. This field focuses on using network topology and network dynamics towards identifying diseases and developing medical drugs. Biological networks, such as protein-protein interactions and metabolic pathways, are utilized by network medicine. Disease networks, which map relationships between diseases and biological factors, also play an important role in the field. Epidemiology is extensively studied using network science as well; social networks and transportation networks are used to model the spreading of disease across populations. Network medicine is a medically focused area of systems biology.
Pathway is the term from molecular biology for a curated schematic representation of a well characterized segment of the molecular physiological machinery, such as a metabolic pathway describing an enzymatic process within a cell or tissue or a signaling pathway model representing a regulatory process that might, in its turn, enable a metabolic or another regulatory process downstream. A typical pathway model starts with an extracellular signaling molecule that activates a specific receptor, thus triggering a chain of molecular interactions. A pathway is most often represented as a relatively small graph with gene, protein, and/or small molecule nodes connected by edges of known functional relations. While a simpler pathway might appear as a chain, complex pathway topologies with loops and alternative routes are much more common. Computational analyses employ special formats of pathway representation. In the simplest form, however, a pathway might be represented as a list of member molecules with order and relations unspecified. Such a representation, generally called Functional Gene Set (FGS), can also refer to other functionally characterised groups such as protein families, Gene Ontology (GO) and Disease Ontology (DO) terms etc. In bioinformatics, methods of pathway analysis might be used to identify key genes/proteins within a previously known pathway in relation to a particular experiment/pathological condition or to build a pathway de novo from proteins that have been identified as key affected elements. By examining changes in e.g. gene expression in a pathway, its biological activity can be explored. However most frequently, pathway analysis refers to a method of initial characterization and interpretation of an experimental condition that was studied with omics tools or genome-wide association study. Such studies might identify long lists of altered genes.
A visual inspection is then challenging and the information is hard to summarize, since the altered genes map to a broad range of pathways, processes, and molecular functions. In such situations, the most productive way of exploring the list is to identify enrichment of specific FGSs in it. The general approach of enrichment analyses is to identify FGSs, members of which were most frequently or most strongly altered in the given condition, in comparison to a gene set sampled by chance. In other words, enrichment can map canonical prior knowledge structured in the form of FGSs to the condition represented by altered genes.
Machine learning in bioinformatics is the application of machine learning algorithms to bioinformatics, including genomics, proteomics, microarrays, systems biology, evolution, and text mining.
Trajectory inference or pseudotemporal ordering is a computational technique used in single-cell transcriptomics to determine the pattern of a dynamic process experienced by cells and then arrange cells based on their progression through the process. Single-cell protocols have much higher levels of noise than bulk RNA-seq, so a common step in a single-cell transcriptomics workflow is the clustering of cells into subgroups. Clustering can contend with this inherent variation by combining the signal from many cells, while allowing for the identification of cell types. However, some differences in gene expression between cells are the result of dynamic processes such as the cell cycle, cell differentiation, or response to an external stimuli. Trajectory inference seeks to characterize such differences by placing cells along a continuous path that represents the evolution of the process rather than dividing cells into discrete clusters. In some methods this is done by projecting cells onto an axis called pseudotime which represents the progression through the process.
In network theory, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a social network, predicting co-authorship links in a citation network, and predicting interactions between genes and proteins in a biological network. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time t, the goal is to predict the links at a later time t′. Link prediction is widely applicable. In e-commerce, link prediction is often a subtask for recommending items to users. In the curation of citation databases, it can be used for record deduplication. In bioinformatics, it has been used to predict protein-protein interactions (PPI). It is also used to identify hidden groups of terrorists and criminals in security-related applications.
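One of the simplest link-prediction heuristics scores each unconnected node pair by its number of common neighbours, on the intuition that, for example, two proteins sharing many interaction partners are likely to interact themselves. A minimal sketch on an invented toy network:

```python
def common_neighbors_scores(adj):
    """Score each non-adjacent node pair by its number of shared
    neighbours; a higher score suggests a more likely missing link.

    adj: dict mapping each node to the set of its neighbours.
    """
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:  # only score pairs not yet linked
                scores[(u, v)] = len(adj[u] & adj[v])
    return scores

# Toy PIN: A and D are not linked but share two interaction partners.
adj = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}
print(common_neighbors_scores(adj))  # {('A', 'D'): 2, ('B', 'C'): 2}
```

Refinements such as the Jaccard, Adamic–Adar, or preferential-attachment indices reweight the same neighbourhood information, and supervised methods learn to combine many such scores.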