Recommender systems |
---|
Concepts |
Methods and challenges |
Implementations |
Research |
Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. Matrix factorization algorithms work by decomposing the user-item interaction matrix into the product of two lower dimensionality rectangular matrices. [1] This family of methods became widely known during the Netflix prize challenge due to its effectiveness as reported by Simon Funk in his 2006 blog post, [2] where he shared his findings with the research community. The prediction results can be improved by assigning different regularization weights to the latent factors based on items' popularity and users' activeness. [3]
The idea behind matrix factorization is to represent users and items in a lower dimensional latent space. Since the initial work by Funk in 2006 a multitude of matrix factorization approaches have been proposed for recommender systems. Some of the most used and simpler ones are listed in the following sections.
The original algorithm proposed by Simon Funk in his blog post [2] factorized the user-item rating matrix as the product of two lower dimensional matrices, the first one has a row for each user, while the second has a column for each item. The row or column associated to a specific user or item is referred to as latent factors. [4] Note that, in Funk MF no singular value decomposition is applied, it is a SVD-like machine learning model. [2] The predicted ratings can be computed as , where is the user-item rating matrix, contains the user's latent factors and the item's latent factors.
Specifically, the predicted rating user u will give to item i is computed as:
It is possible to tune the expressive power of the model by changing the number of latent factors. It has been demonstrated [5] that a matrix factorization with one latent factor is equivalent to a most popular or top popular recommender (e.g. recommends the items with the most interactions without any personalization). Increasing the number of latent factors will improve personalization, therefore recommendation quality, until the number of factors becomes too high, at which point the model starts to overfit and the recommendation quality will decrease. A common strategy to avoid overfitting is to add regularization terms to the objective function. [6] [7] Funk MF was developed as a rating prediction problem, therefore it uses explicit numerical ratings as user-item interactions.
All things considered, Funk MF minimizes the following objective function:
Where is defined to be the frobenius norm whereas the other norms might be either frobenius or another norm depending on the specific recommending problem. [8]
While Funk MF is able to provide very good recommendation quality, its ability to use only explicit numerical ratings as user-items interactions constitutes a limitation. Modern day recommender systems should exploit all available interactions both explicit (e.g. numerical ratings) and implicit (e.g. likes, purchases, skipped, bookmarked). To this end SVD++ was designed to take into account implicit interactions as well. [9] [10] Compared to Funk MF, SVD++ takes also into account user and item bias.
The predicted rating user u will give to item i is computed as:
Where refers to the overall average rating over all items and and refers to the observed deviation of the item and the user respectively from the average. [11] SVD++ has however some disadvantages, with the main drawback being that this method is not model-based. This means that if a new user is added, the algorithm is incapable of modeling it unless the whole model is retrained. Even though the system might have gathered some interactions for that new user, its latent factors are not available and therefore no recommendations can be computed. This is an example of a cold-start problem, that is the recommender cannot deal efficiently with new users or items and specific strategies should be put in place to handle this disadvantage. [12]
A possible way to address this cold start problem is to modify SVD++ in order for it to become a model-based algorithm, therefore allowing to easily manage new items and new users.
As previously mentioned in SVD++ we don't have the latent factors of new users, therefore it is necessary to represent them in a different way. The user's latent factors represent the preference of that user for the corresponding item's latent factors, therefore user's latent factors can be estimated via the past user interactions. If the system is able to gather some interactions for the new user it is possible to estimate its latent factors. Note that this does not entirely solve the cold-start problem, since the recommender still requires some reliable interactions for new users, but at least there is no need to recompute the whole model every time. It has been demonstrated that this formulation is almost equivalent to a SLIM model, [13] which is an item-item model based recommender.
With this formulation, the equivalent item-item recommender would be . Therefore the similarity matrix is symmetric.
Asymmetric SVD aims at combining the advantages of SVD++ while being a model based algorithm, therefore being able to consider new users with a few ratings without needing to retrain the whole model. As opposed to the model-based SVD here the user latent factor matrix H is replaced by Q, which learns the user's preferences as function of their ratings. [14]
The predicted rating user u will give to item i is computed as:
With this formulation, the equivalent item-item recommender would be . Since matrices Q and W are different the similarity matrix is asymmetric, hence the name of the model.
A group-specific SVD can be an effective approach for the cold-start problem in many scenarios. [6] It clusters users and items based on dependency information and similarities in characteristics. Then once a new user or item arrives, we can assign a group label to it, and approximates its latent factor by the group effects (of the corresponding group). Therefore, although ratings associated with the new user or item are not necessarily available, the group effects provide immediate and effective predictions.
The predicted rating user u will give to item i is computed as:
Here and represent the group label of user u and item i, respectively, which are identical across members from the same group. And and are matrices of group effects. For example, for a new user whose latent factor is not available, we can at least identify their group label , and predict their ratings as:
This provides a good approximation to the unobserved ratings.
In recent years many other matrix factorization models have been developed to exploit the ever increasing amount and variety of available interaction data and use cases. Hybrid matrix factorization algorithms are capable of merging explicit and implicit interactions [15] or both content and collaborative data [16] [17] [18]
In recent years a number of neural and deep-learning techniques have been proposed, some of which generalize traditional Matrix factorization algorithms via a non-linear neural architecture. [19] While deep learning has been applied to many different scenarios: context-aware, sequence-aware, social tagging etc. its real effectiveness when used in a simple Collaborative filtering scenario has been put into question. Systematic analysis of publications applying deep learning or neural methods to the top-k recommendation problem, published in top conferences (SIGIR, KDD, WWW, RecSys, IJCAI), has shown that on average less than 40% of articles are reproducible, with as little as 14% in some conferences. Overall the studies identify 26 articles, only 12 of them could be reproduced and 11 of them could be outperformed by much older and simpler properly tuned baselines. The articles also highlights a number of potential problems in today's research scholarship and call for improved scientific practices in that area. [20] [21] Similar issues have been spotted also in sequence-aware recommender systems. [22]
Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in data analysis, visualization and data preprocessing. This is accomplished by linearly transforming the data onto a new coordinate system such that the directions capturing the largest variation in the data can be easily identified. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science.
In linear algebra, the singular value decomposition (SVD) is a factorization of a real or complex matrix into a rotation, followed by a rescaling followed by another rotation. It generalizes the eigendecomposition of a square normal matrix with an orthonormal eigenbasis to any matrix. It is related to the polar decomposition.
Collaborative filtering (CF) is a technique used by recommender systems. Collaborative filtering has two senses, a narrow one and a more general one.
A recommender system, or a recommendation system, is a subclass of information filtering system that provide suggestions for items that are most pertinent to a particular user. Recommender systems are particularly useful when an individual needs to choose an item from a potentially overwhelming number of items that a service may offer.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
Partial least squares regression is a statistical method that bears some relation to principal components regression; instead of finding hyperplanes of maximum variance between the response and independent variables, it finds a linear regression model by projecting the predicted variables and the observable variables to a new space. Because both the X and Y data are projected to new spaces, the PLS family of methods are known as bilinear factor models. Partial least squares discriminant analysis (PLS-DA) is a variant used when the Y is categorical.
The steering law in human–computer interaction and ergonomics is a predictive model of human movement that describes the time required to navigate, or steer, through a 2-dimensional tunnel. The tunnel can be thought of as a path or trajectory on a plane that has an associated thickness or width, where the width can vary along the tunnel. The goal of a steering task is to navigate from one end of the tunnel to the other as quickly as possible, without touching the boundaries of the tunnel. A real-world example that approximates this task is driving a car down a road that may have twists and turns, where the car must navigate the road as quickly as possible without touching the sides of the road. The steering law predicts both the instantaneous speed at which we may navigate the tunnel, and the total time required to navigate the entire tunnel.
In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to be a Markov random field if it satisfies Markov properties. The concept originates from the Sherrington–Kirkpatrick model.
Biclustering, block clustering, Co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan.
Non-negative matrix factorization, also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix V is factorized into (usually) two matrices W and H, with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.
In mathematics and computer algebra, factorization of polynomials or polynomial factorization expresses a polynomial with coefficients in a given field or in the integers as the product of irreducible factors with coefficients in the same domain. Polynomial factorization is one of the fundamental components of computer algebra systems.
Numerical linear algebra, sometimes called applied linear algebra, is the study of how matrix operations can be used to create computer algorithms which efficiently and accurately provide approximate answers to questions in continuous mathematics. It is a subfield of numerical analysis, and a type of linear algebra. Computers use floating-point arithmetic and cannot exactly represent irrational data, so when a computer algorithm is applied to a matrix of data, it can sometimes increase the difference between a number stored in the computer and the true number that it is an approximation of. Numerical linear algebra uses properties of vectors and matrices to develop computer algorithms that minimize the error introduced by the computer, and is also concerned with ensuring that the algorithm is as efficient as possible.
Cold start is a potential problem in computer-based information systems which involves a degree of automated data modelling. Specifically, it concerns the issue that the system cannot draw any inferences for users or items about which it has not yet gathered sufficient information.
In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.
Dynamic mode decomposition (DMD) is a dimensionality reduction algorithm developed by Peter J. Schmid and Joern Sesterhenn in 2008. Given a time series of data, DMD computes a set of modes each of which is associated with a fixed oscillation frequency and decay/growth rate. For linear systems in particular, these modes and frequencies are analogous to the normal modes of the system, but more generally, they are approximations of the modes and eigenvalues of the composition operator. Due to the intrinsic temporal behaviors associated with each mode, DMD differs from dimensionality reduction methods such as principal component analysis, which computes orthogonal modes that lack predetermined temporal behaviors. Because its modes are not orthogonal, DMD-based representations can be less parsimonious than those generated by PCA. However, they can also be more physically meaningful because each mode is associated with a damped sinusoidal behavior in time.
Gravity R&D is an IT vendor specialized in recommender systems. Gravity was founded by members of the Netflix Prize team "Gravity".
In applied mathematics, k-SVD is a dictionary learning algorithm for creating a dictionary for sparse representations, via a singular value decomposition approach. k-SVD is a generalization of the k-means clustering method, and it works by iteratively alternating between sparse coding the input data based on the current dictionary, and updating the atoms in the dictionary to better fit the data. It is structurally related to the expectation maximization (EM) algorithm. k-SVD can be found widely in use in applications such as image processing, audio processing, biology, and document analysis.
In network theory, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a social network, predicting co-authorship links in a citation network, and predicting interactions between genes and proteins in a biological network. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time , the goal is to predict the links at time . Link prediction is widely applicable. In e-commerce, link prediction is often a subtask for recommending items to users. In the curation of citation databases, it can be used for record deduplication. In bioinformatics, it has been used to predict protein-protein interactions (PPI). It is also used to identify hidden groups of terrorists and criminals in security related applications.
In representation learning, knowledge graph embedding (KGE), also referred to as knowledge representation learning (KRL), or multi-relation learning, is a machine learning task of learning a low-dimensional representation of a knowledge graph's entities and relations while preserving their semantic meaning. Leveraging their embedded representation, knowledge graphs (KGs) can be used for various applications such as link prediction, triple classification, entity recognition, clustering, and relation extraction.
A Network Coordinate System is a system for predicting characteristics such as the latency or bandwidth of connections between nodes in a network by assigning coordinates to nodes. More formally, It assigns a coordinate embedding to each node in a network using an optimization algorithm such that a predefined operation estimates some directional characteristic of the connection between node and .