In network theory, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a social network, predicting co-authorship links in a citation network, and predicting interactions between genes and proteins in a biological network. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time , the goal is to predict the links at time . Link prediction is widely applicable. In e-commerce, link prediction is often a subtask for recommending items to users. In the curation of citation databases, it can be used for record deduplication. In bioinformatics, it has been used to predict protein-protein interactions (PPI). It is also used to identify hidden groups of terrorists and criminals in security related applications. [1]
Consider a network , where represents the entity nodes in the network and x represents the set of "true" links across entities in the network. We are given the set of entities and a subset of true links which are referred to as observed links. The goal of link prediction is to identify the unobserved true links. In the temporal formulation of link prediction the observed links correspond to true links at a time , and the goal is to infer the set of true links at time Usually, we are also given a subset of unobserved links called potential links, and we need to identify true links among these potential links.
In the binary classification formulation of the link prediction task the potential links are classified as either true links or false links. Link prediction approaches for this setting learn a classifier that maps links in to positive and negative labels i.e. . In the probability estimation formulation, potential links are associated with existence probabilities. Link prediction approaches for this setting learn a model that maps links in to a probability i.e. .
Single link approaches learn a model that classifies each link independently. Structured prediction approaches capture the correlation between potential links by formulating the task as a collective link prediction task. Collective link prediction approaches learn a model that jointly identify all the true links among the set of potential links.
Link prediction task can also be formulated as an instance of missing value estimation task. Here, the graph is represented as an adjacency matrix with missing values. The task is to complete the matrix by identifying the missing values. Matrix factorization based methods commonly use this formulation.
The task of link prediction has attracted attention from several research communities ranging from statistics and network science to machine learning and data mining. In statistics, generative random graph models such as stochastic block models propose an approach to generate links between nodes in a random graph. For social networks, Liben-Nowell and Kleinberg proposed a link prediction models based on different graph proximity measures. [2] Several statistical models have been proposed for link prediction by the machine learning and data mining community. For example, Popescul et al. proposed a structured logistic regression model that can make use of relational features. [3] Local conditional probability models based on attribute and structural features were proposed by O’Madadhain et al. [4] Several models based on directed graphical models for collective link prediction have been proposed by Getoor. [5] Other approached based on random walks. [6] and matrix factorization have also been proposed [7] With the advent of deep learning, several graph embedding based approaches for link prediction have also been proposed. [8] For more information on link prediction refer to the survey by Getoor et al. [9] and Yu et al. [10]
Several link predication approaches have been proposed including unsupervised approaches such as similarity measures computed on the entity attributes, random walk and matrix factorization based approaches, and supervised approaches based on graphical models and deep learning. Link prediction approaches can be divided into two broad categories based on the type of the underlying network: (1) link prediction approaches for homogeneous networks (2) link prediction approaches for heterogeneous networks. Based on the type of information used to predict links, approaches can be categorized as topology-based approaches, content-based approaches, and mixed methods. [11]
Topology-based methods broadly make the assumption that nodes with similar network structure are more likely to form a link.
This is a common approach to link prediction that computes the number of common neighbors. Entities with more neighbors in common are more likely to have a link. It is computed as follows:
A weakness of this approach is that it does not take into account the relative number of common neighbors.
The Jaccard Measure addresses the problem of Common Neighbors by computing the relative number of neighbors in common:
The Adamic–Adar measure [12] is the sum of the log of the intersection of the neighbors of two nodes. This captures a two-hop similarity, which can yield better results than simple one-hop methods. It is computed as follows:
where is the set of nodes adjacent to .
Neighbor based methods can be effective when the number of neighbors is large, but this is not the case in sparse graphs. In these situations it is appropriate to use methods that account for longer walks. The Katz Measure [13] is one metric that captures this. It is computed by searching the graph for paths of length in the graph and adding the counts of each path length weighted by user specified weights.
Let A be the adjacency matrix of a network under consideration. Elements of A are variables that take a value 1 if a node i is connected to node j and 0 otherwise. The powers of A indicate the presence (or absence) of links between two nodes through intermediaries. For instance, in matrix , if element , it indicates that node 2 and node 12 are connected through some walk of length 3. If denotes Katz centrality of a node i, then mathematically:
Note that the above definition uses the fact that the element at location of reflects the total number of degree connections between nodes and .
Node-similarity methods predict the existence of a link based on the similarity of the node attributes.
The attribute values are represented as normalized vector and the distance between the vectors used to measure similarity. Small distances indicate higher similarity.
After normalizing the attribute values, computing the cosine between the two vectors is a good measure of similarity, with higher values indicating higher similarity.
Mixed methods combine attribute and topology based methods.
Graph embeddings also offer a convenient way to predict links. [8] Graph embedding algorithms, such as Node2vec, learn an embedding space in which neighboring nodes are represented by vectors so that vector similarity measures, such as dot product similarity, or euclidean distance, hold in the embedding space. These similarities are functions of both topological features and attribute-based similarity. One can then use other machine learning techniques to predict edges on the basis of vector similarity.
A probabilistic relational model (PRM) specifies a template for a probability distribution over a databases. The template describes the relational schema for the domain, and the probabilistic dependencies between attributes in the domain. A PRM, together with a particular database of entities and unobserved links, defines a probability distribution over the unobserved links. [5]
Probabilistic soft logic (PSL) is a probabilistic graphical model over hinge-loss Markov random field (HL-MRF). HL-MRFs are created by a set of templated first-order logic-like rules, which are then grounded over the data. PSL can combine attribute, or local, information with topological, or relational, information. While PSL can incorporate local predictors, such as cosine similarity, it also supports relational rules, such as triangle completion in a network. [14]
Markov logic networks (MLNs) is a probabilistic graphical model defined over Markov networks. These networks are defined by templated first-order logic-like rules, which is then grounded over the training data. MLNs are able to incorporate both local and relational rules for the purpose of link prediction. [15]
R-Models (RMLs) is a neural network model created to provide a deep learning approach to the link weight prediction problem. This model uses a node embedding technique that extracts node embeddings (knowledge of nodes) from the known links’ weights (relations between nodes) and uses this knowledge to predict the unknown links’ weights. [16]
Link prediction has found varied uses, but any domain in which entities interact in a structures way can benefit from link prediction. [17] A common applications of link prediction is improving similarity measures for collaborative filtering approaches to recommendation. Link prediction is also frequently used in social networks to suggest friends to users. It has also been used to predict criminal associations.
In biology, link prediction has been used to predict interactions between proteins in protein-protein interaction networks. [18] Link prediction has also been used to infer interactions between drugs and targets using link prediction [19] Another application is found in collaboration prediction in scientific co-authorship networks.
Entity resolution, also known as deduplication, commonly uses link prediction to predict whether two entities in a network are references to the same physical entity. Some authors have used context information in network structured domains to improve entity resolution. [20]
A semantic network, or frame network is a knowledge base that represents semantic relations between concepts in a network. This is often used as a form of knowledge representation. It is a directed or undirected graph consisting of vertices, which represent concepts, and edges, which represent semantic relations between concepts, mapping or connecting semantic fields. A semantic network may be instantiated as, for example, a graph database or a concept map. Typical standardized semantic networks are expressed as semantic triples.
A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.
Nonlinear dimensionality reduction, also known as manifold learning, is any of various related techniques that aim to project high-dimensional data, potentially existing across non-linear manifolds which cannot be adequately captured by linear decomposition methods, onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.
In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are N elements, this matrix will have size N×N. In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature. The term semantic similarity is often confused with semantic relatedness. Semantic relatedness includes any relation between two terms, while semantic similarity only includes "is a" relations. For example, "car" is similar to "bus", but is also related to "road" and "driving".
Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.
A method for pruning dense networks to highlight key links
In the study of complex networks, a network is said to have community structure if the nodes of the network can be easily grouped into sets of nodes such that each set of nodes is densely connected internally. In the particular case of non-overlapping community finding, this implies that the network divides naturally into groups of nodes with dense connections internally and sparser connections between groups. But overlapping communities are also allowed. The more general definition is based on the principle that pairs of nodes are more likely to be connected if they are both members of the same community(ies), and less likely to be connected if they do not share communities. A related but different problem is community search, where the goal is to find a community that a certain vertex belongs to.
In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.
Network science is an academic field which studies complex networks such as telecommunication networks, computer networks, biological networks, cognitive and semantic networks, and social networks, considering distinct elements or actors represented by nodes and the connections between the elements or actors as links. The field draws on theories and methods including graph theory from mathematics, statistical mechanics from physics, data mining and information visualization from computer science, inferential modeling from statistics, and social structure from sociology. The United States National Research Council defines network science as "the study of network representations of physical, biological, and social phenomena leading to predictive models of these phenomena."
SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model. SimRank is applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, SimRank is a measure that says "two objects are considered to be similar if they are referenced by similar objects." Although SimRank is widely adopted, it may output unreasonable similarity scores which are influenced by different factors, and can be solved in several ways, such as introducing an evidence weight factor, inserting additional terms that are neglected by SimRank or using PageRank-based alternatives.
Statistical relational learning (SRL) is a subdiscipline of artificial intelligence and machine learning that is concerned with domain models that exhibit both uncertainty and complex, relational structure. Typically, the knowledge representation formalisms developed in SRL use first-order logic to describe relational properties of a domain in a general manner and draw upon probabilistic graphical models to model the uncertainty; some also build upon the methods of inductive logic programming. Significant contributions to the field have been made since the late 1990s.
In machine learning, feature learning or representation learning is a set of techniques that allows a system to automatically discover the representations needed for feature detection or classification from raw data. This replaces manual feature engineering and allows a machine to both learn the features and use them to perform a specific task.
Probabilistic Soft Logic (PSL) is a statistical relational learning (SRL) framework for modeling probabilistic and relational domains. It is applicable to a variety of machine learning problems, such as collective classification, entity resolution, link prediction, and ontology alignment. PSL combines two tools: first-order logic, with its ability to succinctly represent complex phenomena, and probabilistic graphical models, which capture the uncertainty and incompleteness inherent in real-world knowledge. More specifically, PSL uses "soft" logic as its logical component and Markov random fields as its statistical model. PSL provides sophisticated inference techniques for finding the most likely answer (i.e. the maximum a posteriori (MAP) state). The "softening" of the logical formulas makes inference a polynomial time operation rather than an NP-hard operation.
In network science, a sparse network has much fewer links than the possible maximum number of links within that network. The study of sparse networks is a relatively new area primarily stimulated by the study of real networks, such as social and computer networks.
The following outline is provided as an overview of and topical guide to machine learning:
Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices. This family of methods became widely known during the Netflix prize challenge due to its effectiveness as reported by Simon Funk in his 2006 blog post, where he shared his findings with the research community. The prediction results can be improved by assigning different regularization weights to the latent factors based on items' popularity and users' activeness.
In network theory, collective classification is the simultaneous prediction of the labels for multiple objects, where each label is predicted using information about the object's observed features, the observed features and labels of its neighbors, and the unobserved labels of its neighbors. Collective classification problems are defined in terms of networks of random variables, where the network structure determines the relationship between the random variables. Inference is performed on multiple random variables simultaneously, typically by propagating information between nodes in the network to perform approximate inference. Approaches that use collective classification can make use of relational information when performing inference. Examples of collective classification include predicting attributes of individuals in a social network, classifying webpages in the World Wide Web, and inferring the research area of a paper in a scientific publication dataset.
In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL), or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.
A graph neural network (GNN) belongs to a class of artificial neural networks for processing data that can be represented as graphs.