In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL) or multi-relation learning, [1] is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. [1] [2] [3] Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction. [1] [4]
A knowledge graph $\mathcal{G} = \{E, R, F\}$ is a collection of entities $E$, relations $R$, and facts $F$. [5] A fact is a triple $(h, r, t) \in F$ that denotes a link $r \in R$ between the head $h \in E$ and the tail $t \in E$ of the triple. Another notation that is often used in the literature to represent a triple (or fact) is $\langle head, relation, tail \rangle$. This notation is called the resource description framework (RDF). [1] [5] A knowledge graph represents the knowledge related to a specific domain; leveraging this structured representation, it is possible to infer new knowledge from it after some refinement steps. [6] However, the sparsity of the data and the computational cost of using it directly make real-world applications challenging. [3] [7]
The embedding of a knowledge graph translates each entity and relation of a knowledge graph $\mathcal{G}$ into a vector of a given dimension $d$, called the embedding dimension. [7] In the general case, we can have different embedding dimensions for the entities ($d$) and the relations ($k$). [7] The collection of embedding vectors for all the entities and relations in the knowledge graph can then be used for downstream tasks.
A knowledge graph embedding is characterized by four different aspects: [1]
- Representation space: the low-dimensional space in which the entities and relations are represented.
- Scoring function: a measure of the goodness of a triple's embedded representation.
- Encoding models: the modality in which the embedded representation of the entities and relations interact with each other.
- Additional information: any auxiliary information coming from the knowledge graph that can enrich the embedded representation.
All the different knowledge graph embedding models follow roughly the same procedure to learn the semantic meaning of the facts. [7] First of all, to learn an embedded representation of a knowledge graph, the embedding vectors of the entities and relations are initialized to random values. [7] Then, starting from a training set, the algorithm continuously optimizes the embeddings until a stop condition is reached. [7] Usually, the stop condition is given by the overfitting over the training set. [7] For each iteration, a batch of size $b$ is sampled from the training set, and for each triple of the batch a random corrupted fact is sampled—i.e., a triple that does not represent a true fact in the knowledge graph. [7] The corruption of a triple involves substituting the head or the tail (or both) of the triple with another entity that makes the fact false. [7] The original triple and the corrupted triple are added to the training batch, and then the embeddings are updated by optimizing a scoring function. [5] [7] At the end of the algorithm, the learned embeddings should have extracted the semantic meaning of the triples and should correctly predict unseen true facts in the knowledge graph. [5]
The following is the pseudocode for the general embedding procedure. [9] [7]
algorithm Compute entity and relation embeddings is
    input: the training set S = {(h, r, t)}, entity set E, relation set R, embedding dimension k
    output: entity and relation embeddings

    initialization: the entity and relation embeddings (vectors) are randomly initialized

    while stop condition do
        S_batch ← sample(S, b)                                   // from the training set, randomly sample a batch of size b
        for each (h, r, t) in S_batch do
            (h′, r, t′) ← sample a corrupted fact of the triple  // substitute the head or the tail
            T_batch ← T_batch ∪ {((h, r, t), (h′, r, t′))}
        end for
        update the embeddings by minimizing the loss function over T_batch
    end while
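As a concrete illustration of this procedure, the following is a minimal Python sketch (not taken from the cited sources): it initializes random embeddings for a toy knowledge graph, corrupts heads or tails to generate false facts, and updates the embeddings with a margin-based loss. A TransE-style squared-distance score is used here only as one possible example of a scoring function; the triples, dimensions, and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy knowledge graph: facts as (head, relation, tail) index triples.
triples = [(0, 0, 1), (1, 0, 2), (2, 1, 0), (0, 1, 2)]
n_entities, n_relations, dim = 3, 2, 8

# Random initialization of the entity and relation embeddings.
E = rng.normal(scale=0.1, size=(n_entities, dim))
R = rng.normal(scale=0.1, size=(n_relations, dim))

def distance(h, r, t):
    # TransE-style score: squared distance of the translated head from the tail
    # (lower means more plausible).
    return float(np.sum((E[h] + R[r] - E[t]) ** 2))

def corrupt(h, r, t):
    # Replace the head or the tail with a random entity to obtain a (likely) false fact.
    if rng.random() < 0.5:
        return int(rng.integers(n_entities)), r, t
    return h, r, int(rng.integers(n_entities))

margin, lr, batch_size, epochs = 1.0, 0.05, 2, 200
for _ in range(epochs):                                  # stop condition: fixed epoch budget
    for i in rng.choice(len(triples), size=batch_size):  # sample a batch of size b
        h, r, t = triples[i]
        hc, rc, tc = corrupt(h, r, t)                    # corrupted counterpart of the fact
        if margin + distance(h, r, t) - distance(hc, rc, tc) <= 0:
            continue                                     # margin already satisfied, no update
        # Hand-derived gradient step of the margin loss for the squared-distance score.
        g_pos = 2 * (E[h] + R[r] - E[t])
        g_neg = 2 * (E[hc] + R[rc] - E[tc])
        E[h] -= lr * g_pos
        R[r] -= lr * g_pos
        E[t] += lr * g_pos
        E[hc] += lr * g_neg
        R[rc] += lr * g_neg
        E[tc] -= lr * g_neg
```

Real implementations replace the hand-derived gradient step with automatic differentiation, larger batches, and higher embedding dimensions, and usually stop on validation performance rather than a fixed epoch budget.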
These indexes are often used to measure the embedding quality of a model. The simplicity of the indexes makes them very suitable for evaluating the performance of an embedding algorithm even on a large scale. [10] Given $Q$ as the set of all ranked predictions of a model, it is possible to define three different performance indexes: Hits@K, MR, and MRR. [10]
Hits@K, or in short H@K, is a performance index that measures the probability of finding the correct prediction among the first K model predictions. [10] Usually, $K = 10$ is used. [10] Hits@K reflects the accuracy of an embedding model in predicting the relation between two given entities correctly. [10]
$\text{Hits@K} = \frac{|\{q \in Q : q \le K\}|}{|Q|} \in [0, 1]$
Larger values mean better predictive performance. [10]
Mean rank is the average ranking position of the items predicted by the model among all the possible items: $MR = \frac{1}{|Q|} \sum_{q \in Q} q$. [10]
The smaller the value, the better the model. [10]
Mean reciprocal rank measures how many triples are predicted correctly. [10] If the first predicted triple is correct, then 1 is added; if the second is correct, $\frac{1}{2}$ is summed; and so on: $MRR = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{q}$. [10]
Mean reciprocal rank is generally used to quantify the effect of search algorithms. [10]
The larger the index, the better the model. [10]
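A minimal Python sketch of the three indexes (illustrative, not from the cited sources), assuming that `ranks` holds the 1-based ranking position assigned by a model to the correct answer of each test query:

```python
def hits_at_k(ranks, k=10):
    # Fraction of test queries whose correct answer is ranked within the top k.
    return sum(1 for q in ranks if q <= k) / len(ranks)

def mean_rank(ranks):
    # Average ranking position of the correct answers (lower is better).
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    # Average of the reciprocal ranks of the correct answers (higher is better).
    return sum(1.0 / q for q in ranks) / len(ranks)

ranks = [1, 3, 12, 2, 7]  # example ranks for five test triples
print(hits_at_k(ranks), mean_rank(ranks), mean_reciprocal_rank(ranks))
# 0.8 5.0 ≈0.412
```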
Knowledge graph completion (KGC) is a collection of techniques to infer knowledge from an embedded knowledge graph representation. [11] In particular, this technique completes a triple by inferring the missing entity or relation. [11] The corresponding sub-tasks are named link or entity prediction (i.e., guessing an entity from the embedding given the other entity of the triple and the relation), and relation prediction (i.e., forecasting the most plausible relation that connects two entities). [11]
Triple classification is a binary classification problem. [1] Given a triple, the trained model evaluates the plausibility of the triple using the embedding to determine whether it is true or false. [11] The decision is made with the model's score function and a given threshold. [11] Clustering is another application that leverages the embedded representation of a sparse knowledge graph to place semantically similar entities close to each other in a 2D space. [4]
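Both link prediction and triple classification reduce to evaluating the score function of a trained model. The sketch below is illustrative and reuses the toy `distance` function and `n_entities` from the training sketch above; the threshold value is an arbitrary assumption.

```python
def predict_tail(h, r, k=3):
    # Link prediction: rank every entity as a candidate tail for the query (h, r, ?).
    ranked = sorted(range(n_entities), key=lambda t: distance(h, r, t))
    return ranked[:k]                            # the k most plausible tails

def classify(h, r, t, threshold=1.0):
    # Triple classification: the fact is predicted true if its distance is below a threshold.
    return distance(h, r, t) < threshold

print(predict_tail(0, 0))                        # candidate tails for head 0 and relation 0
print(classify(0, 0, 1), classify(0, 0, 2))      # plausibility of two candidate facts
```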
The use of knowledge graph embedding is increasingly pervasive in many applications. In the case of recommender systems, the use of knowledge graph embedding can overcome the limitations of the usual reinforcement learning approaches. [12] [13] Training this kind of recommender system requires a huge amount of information from the users; however, knowledge graph techniques can address this issue by using a graph already constructed over prior knowledge of the item correlations and using the embedding to infer the recommendation from it. [12] Drug repurposing is the use of an already approved drug for a therapeutic purpose different from the one for which it was initially designed. [14] The task of link prediction can be used to infer a new connection between an existing drug and a disease by using a biomedical knowledge graph built by leveraging the availability of massive literature and biomedical databases. [14] Knowledge graph embedding can also be used in the domain of social politics. [4]
Given a collection of triples (or facts) $\mathcal{F} = \{\langle h, r, t \rangle\}$, the knowledge graph embedding model produces, for each entity and relation present in the knowledge graph, a continuous vector representation. [7] $(h, r, t)$ is the corresponding embedding of a triple, with $h, t \in \mathbb{R}^{d}$ and $r \in \mathbb{R}^{k}$, where $d$ is the embedding dimension of the entities and $k$ that of the relations. [7] The score function of a given model is denoted by $f_{r}(h, t)$ and measures the distance of the embedding of the head from the embedding of the tail given the embedding of the relation; in other words, it quantifies the plausibility of the embedded representation of a given fact. [5]
Rossi et al. propose a taxonomy of the embedding models and identify three main families of models: tensor decomposition models, geometric models, and deep learning models. [5]
Tensor decomposition is a family of knowledge graph embedding models that use a multi-dimensional matrix (a tensor) to represent the knowledge graph, [1] [5] [17] which is only partially knowable because the knowledge graph cannot describe a particular domain exhaustively. [5] In particular, these models use a third-order (3D) tensor, which is then factorized into low-dimensional vectors that are the entity and relation embeddings. [5] [17] The third-order tensor is a suitable methodology to represent a knowledge graph because it records only the existence or the absence of a relation between entities, [17] and for this reason it is simple, and there is no need to know the network structure a priori, [15] making this class of embedding models light and easy to train, even though they suffer from data sparsity and high dimensionality. [5] [17]
This family of models uses a linear equation to embed the connection between the entities through a relation. [1] In particular, the embedded representation of the relations is a bidimensional matrix. [5] During the embedding procedure, these models only use single facts to compute the embedded representation and ignore the other associations to the same entity or relation. [18]
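As an illustration of this family (a sketch under assumed toy dimensions, not the exact formulation of any cited model), a fact can be scored bilinearly by multiplying the head and tail vectors through the relation matrix, as in RESCAL-style models; constraining the relation matrix to a diagonal yields a DistMult-style score.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # embedding dimension (illustrative)

h = rng.normal(size=d)                 # head entity embedding
t = rng.normal(size=d)                 # tail entity embedding
M_r = rng.normal(size=(d, d))          # relation embedded as a bidimensional (d x d) matrix

# Bilinear plausibility score of the fact (h, r, t).
score_bilinear = float(h @ M_r @ t)

# Restricting the relation matrix to its diagonal gives a DistMult-style score.
w_r = rng.normal(size=d)               # diagonal relation embedding
score_diagonal = float(np.sum(h * w_r * t))
```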
The geometric space defined by this family of models encodes the relation as a geometric transformation between the head and tail of a fact. [5] For this reason, to compute the embedding of the tail, it is necessary to apply a transformation to the head embedding, and a distance function is used to measure the goodness of the embedding or to score the reliability of a fact. [5]
Geometric models are similar to tensor decomposition models, but the main difference between the two is that geometric models have to preserve the applicability of the transformation in the geometric space in which it is defined. [5]
This class of models is inspired by the idea of translation invariance introduced in word2vec. [7] A pure translational model relies on the fact that the embedding vectors of the entities are close to each other after applying a proper relational translation in the geometric space in which they are defined. [18] In other words, given a fact, when the embedding of the head is added to the embedding of the relation, the expected result should be the embedding of the tail. [5] The closeness of the entity embeddings is given by some distance measure and quantifies the reliability of a fact. [17]
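For example, in a purely translational model such as TransE, [9] the plausibility of a fact is given by how close the translated head lands to the tail; the following minimal sketch uses the L2 norm and toy embedding values chosen for illustration.

```python
import numpy as np

def translational_score(h, r, t):
    # The fact (h, r, t) is plausible when head + relation is close to tail.
    return -np.linalg.norm(h + r - t)

h = np.array([0.2, 0.1, -0.3])
r = np.array([0.5, -0.1, 0.4])
t = np.array([0.7, 0.0, 0.1])
print(translational_score(h, r, t))   # close to 0 means a plausible fact
```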
It is possible to associate additional information with each element in the knowledge graph and with their common representation as facts. [1] Each entity and relation can be enriched with text descriptions, weights, constraints, and others in order to improve the overall description of the domain with a knowledge graph. [1] During the embedding of the knowledge graph, this information can be used to learn specialized embeddings for these characteristics together with the usual embedded representation of entities and relations, at the cost of learning a larger number of vectors. [5]
This family of models, in addition to or in substitution of a translation, employs a rotation-like transformation. [5]
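For instance, a rotation-based model in the spirit of RotatE [35] represents each relation as an element-wise rotation in the complex plane; the sketch below is illustrative, with assumed toy values.

```python
import numpy as np

def rotational_score(h, phase, t):
    # h and t are complex-valued embeddings; the relation rotates each coordinate
    # of the head by the given phase, and the result should land near the tail.
    return -np.linalg.norm(h * np.exp(1j * phase) - t)

h = np.array([1 + 0j, 0 + 1j])
t = np.array([0 + 1j, -1 + 0j])
phase = np.array([np.pi / 2, np.pi / 2])   # rotate every coordinate by 90 degrees
print(rotational_score(h, phase, t))       # ~0: the rotation maps h onto t
```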
This group of embedding models uses deep neural networks to learn patterns from the knowledge graph, which is the input data. [5] These models have the generality to distinguish the type of entity and relation, temporal information, path information, and underlying structured information, [18] and they resolve the limitations of distance-based and semantic-matching-based models in representing all the features of a knowledge graph. [1] The use of deep learning for knowledge graph embedding has shown good predictive performance, even though these models are more expensive in the training phase, data-hungry, and often require a pre-trained embedding representation of the knowledge graph coming from a different embedding model. [1] [5]
This family of models, instead of using fully connected layers, employs one or more convolutional layers that convolve the input data by applying a low-dimensional filter, capable of embedding complex structures with few parameters by learning nonlinear features. [1] [5] [18]
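A rough sketch of this idea (not the exact architecture of any cited model): the head and relation embeddings are stacked into a small 2D grid, convolved with a low-dimensional filter, and the resulting feature map is projected back to the embedding space and compared with the tail embedding. The filter size, projection matrix, and nonlinearity below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)
d = 8                                          # embedding dimension (illustrative)
h, r, t = rng.normal(size=(3, d))              # toy head, relation, and tail embeddings

image = np.stack([h, r])                       # stack head and relation into a 2 x d "image"
kernel = rng.normal(size=(2, 3))               # one small learned convolutional filter
feature_map = correlate2d(image, kernel, mode="valid").ravel()   # shape (d - 2,)

W = rng.normal(size=(feature_map.size, d))     # learned projection back to dimension d
score = float(np.tanh(feature_map) @ W @ t)    # plausibility of the fact (h, r, t)
```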
This family of models uses capsule neural networks to create a more stable representation that is able to recognize a feature in the input without losing spatial information. [5] The network is composed of convolutional layers, but they are organized into capsules, and the overall result of a capsule is sent to a higher-level capsule decided by a dynamic routing process. [5]
This class of models leverages the use of recurrent neural networks. [5] The advantage of this architecture is to memorize a sequence of facts, rather than just elaborating single events. [40]
The machine learning task for knowledge graph embedding that is most often used to evaluate the embedding accuracy of the models is link prediction. [1] [3] [5] [6] [7] [18] Rossi et al. [5] produced an extensive benchmark of the models, but other surveys produce similar results. [3] [7] [18] [25] The benchmark involves five datasets: FB15k, [9] WN18, [9] FB15k-237, [41] WN18RR, [36] and YAGO3-10. [42] More recently, it has been argued that these datasets are far from real-world applications, and that other datasets should be integrated as a standard benchmark. [43]
Dataset name | Number of different entities | Number of different relations | Number of triples |
---|---|---|---|
FB15k [9] | 14,951 | 1,345 | 584,113 |
WN18 [9] | 40,943 | 18 | 151,442 |
FB15k-237 [41] | 14,541 | 237 | 310,116 |
WN18RR [36] | 40,943 | 11 | 93,003 |
YAGO3-10 [42] | 123,182 | 37 | 1,089,040 |
Model name | FB15k (Hits@10) | FB15k (MR) | FB15k (MRR) | FB15k-237 (Hits@10) | FB15k-237 (MR) | FB15k-237 (MRR) | WN18 (Hits@10) | WN18 (MR) | WN18 (MRR) | WN18RR (Hits@10) | WN18RR (MR) | WN18RR (MRR) | YAGO3-10 (Hits@10) | YAGO3-10 (MR) | YAGO3-10 (MRR) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
DistMult [19] | 0.863 | 173 | 0.784 | 0.490 | 199 | 0.313 | 0.946 | 675 | 0.824 | 0.502 | 5913 | 0.433 | 0.661 | 1107 | 0.501 |
ComplEx [20] | 0.905 | 34 | 0.848 | 0.529 | 202 | 0.349 | 0.955 | 3623 | 0.949 | 0.521 | 4907 | 0.458 | 0.703 | 1112 | 0.576 |
HolE [23] | 0.867 | 211 | 0.800 | 0.476 | 186 | 0.303 | 0.949 | 650 | 0.938 | 0.487 | 8401 | 0.432 | 0.651 | 6489 | 0.502 |
ANALOGY [21] | 0.837 | 126 | 0.726 | 0.353 | 476 | 0.202 | 0.944 | 808 | 0.934 | 0.380 | 9266 | 0.366 | 0.456 | 2423 | 0.283 |
SimplE [22] | 0.836 | 138 | 0.726 | 0.343 | 651 | 0.179 | 0.945 | 759 | 0.938 | 0.426 | 8764 | 0.398 | 0.631 | 2849 | 0.453 |
TuckER [24] | 0.888 | 39 | 0.788 | 0.536 | 162 | 0.352 | 0.958 | 510 | 0.951 | 0.514 | 6239 | 0.459 | 0.680 | 2417 | 0.544 |
MEI [26] |  |  |  | 0.552 | 145 | 0.365 |  |  |  | 0.551 | 3268 | 0.481 | 0.709 | 756 | 0.578 |
MEIM [27] |  |  |  | 0.557 | 137 | 0.369 |  |  |  | 0.577 | 2434 | 0.499 | 0.716 | 747 | 0.585 |
TransE [9] | 0.847 | 45 | 0.628 | 0.497 | 209 | 0.310 | 0.948 | 279 | 0.646 | 0.495 | 3936 | 0.206 | 0.673 | 1187 | 0.501 |
STransE [32] | 0.796 | 69 | 0.543 | 0.495 | 357 | 0.315 | 0.934 | 208 | 0.656 | 0.422 | 5172 | 0.226 | 0.073 | 5797 | 0.049 |
CrossE [33] | 0.862 | 136 | 0.702 | 0.470 | 227 | 0.298 | 0.950 | 441 | 0.834 | 0.449 | 5212 | 0.405 | 0.654 | 3839 | 0.446 |
TorusE [34] | 0.839 | 143 | 0.746 | 0.447 | 211 | 0.281 | 0.954 | 525 | 0.947 | 0.535 | 4873 | 0.463 | 0.474 | 19455 | 0.342 |
RotatE [35] | 0.881 | 42 | 0.791 | 0.522 | 178 | 0.336 | 0.960 | 274 | 0.949 | 0.573 | 3318 | 0.475 | 0.570 | 1827 | 0.498 |
ConvE [36] | 0.849 | 51 | 0.688 | 0.521 | 281 | 0.305 | 0.956 | 413 | 0.945 | 0.507 | 4944 | 0.427 | 0.657 | 2429 | 0.488 |
ConvKB [38] | 0.408 | 324 | 0.211 | 0.517 | 309 | 0.230 | 0.948 | 202 | 0.709 | 0.525 | 3429 | 0.249 | 0.604 | 1683 | 0.420 |
ConvR [37] | 0.885 | 70 | 0.773 | 0.526 | 251 | 0.346 | 0.958 | 471 | 0.950 | 0.526 | 5646 | 0.467 | 0.673 | 2582 | 0.527 |
CapsE [39] | 0.217 | 610 | 0.087 | 0.356 | 405 | 0.160 | 0.950 | 233 | 0.890 | 0.559 | 720 | 0.415 | 0 | 60676 | 0.000 |
RSN [40] | 0.870 | 51 | 0.777 | 0.444 | 248 | 0.280 | 0.951 | 346 | 0.928 | 0.483 | 4210 | 0.395 | 0.664 | 1339 | 0.511 |
Related topics: binary relation, semantic network, nonlinear dimensionality reduction, multi-task learning, structure (mathematical logic), logical matrix, feature learning, word embedding, word2vec, stochastic block model, transformer (deep learning architecture), BERT, knowledge graph, link prediction, attention (machine learning), graph neural network, vision transformer, prompt engineering, tensor (machine learning), and topological deep learning.