Collective classification

In network theory, collective classification is the simultaneous prediction of the labels for multiple objects, where each label is predicted using information about the object's observed features, the observed features and labels of its neighbors, and the unobserved labels of its neighbors. [1] Collective classification problems are defined in terms of networks of random variables, where the network structure determines the relationship between the random variables. Inference is performed on multiple random variables simultaneously, typically by propagating information between nodes in the network to perform approximate inference. Approaches that use collective classification can make use of relational information when performing inference. Examples of collective classification include predicting attributes (e.g., gender, age, political affiliation) of individuals in a social network, classifying webpages in the World Wide Web, and inferring the research area of a paper in a scientific publication dataset.

Motivation and background

Traditionally, a major focus of machine learning has been to solve classification problems. (For example, given a collection of e-mails, we wish to determine which are spam, and which are not.) Many machine learning models for this task categorize each item independently, predicting its class label in isolation. However, the prediction accuracy for the labels whose values must be inferred can be improved with knowledge of the correct class labels for related items. For example, it is easier to predict the topic of a webpage if we know the topics of the webpages that link to it. Similarly, the chance of a particular word being a verb increases if we know that the previous word in the sentence is a noun; knowing the first few characters in a word can make it much easier to identify the remaining characters. Many researchers have proposed techniques that attempt to classify samples in a joint or collective manner, instead of treating each sample in isolation; these techniques have enabled significant gains in classification accuracy. [1] [2]

Example

Consider the task of inferring the political affiliation of users in a social network, where some portion of these affiliations are observed, and the remainder are unobserved. Each user has local features, such as their profile information, and links exist between users who are friends in this social network. An approach that does not collectively classify users will consider each user in the network independently and use their local features to infer party affiliations. An approach which performs collective classification might assume that users who are friends tend to have similar political views, and could then jointly infer all unobserved party affiliations while making use of the rich relational structure of the social network.

Definition

Consider the semi-supervised learning problem of assigning labels to nodes in a network by using knowledge of a subset of the nodes' labels. Specifically, we are given a network represented by a graph G = (V, E) with a set of nodes V and an edge set E representing relationships among nodes. Each node v_i is described by its attributes: a feature vector x_i and its label (or class) y_i.

The node set V can further be divided into two sets: X, the nodes for which we know the correct label values (observed variables), and Y, the nodes whose labels must be inferred. The collective classification task is to label the nodes in Y with a label from a label set L.

In such settings, traditional classification algorithms assume that the data are drawn independently and identically from some distribution (iid). This means that the labels inferred for nodes whose label is unobserved are independent of each other. One does not make this assumption when performing collective classification. Instead, there are three distinct types of correlations that can be utilized to determine the classification, or label, of a node v_i:

  1. The correlations between the label of v_i and the observed attributes of v_i. Traditional iid classifiers that make use of feature vectors are an example of approaches that use this correlation.
  2. The correlations between the label of v_i and the observed attributes (including observed labels) of nodes in the neighborhood of v_i.
  3. The correlations between the label of v_i and the unobserved labels of nodes in the neighborhood of v_i.

Collective classification refers to the combined classification of a set of interlinked objects using the three above types of information.
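
One common way to make this precise, consistent with the notation above, is to view collective classification as joint (for example, maximum a posteriori) inference over the unobserved labels; this formulation is an illustrative restatement rather than one spelled out in the cited sources:

```latex
% X = observed nodes, Y = unobserved nodes, \mathcal{L} = label set,
% x_i = feature vector of node v_i, y_X = the observed labels.
\hat{y}_Y \;=\; \operatorname*{arg\,max}_{y_Y \in \mathcal{L}^{|Y|}}
  P\bigl(y_Y \,\big|\, \{x_i\}_{v_i \in V},\; y_X,\; G\bigr)
```

The three types of correlation listed above correspond to the ways in which this conditional distribution couples each unobserved label to the node's own attributes, to its neighbors' observed attributes and labels, and to its neighbors' unobserved labels.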

Methods

There are several existing approaches to collective classification. The two major classes of methods are iterative methods and methods based on probabilistic graphical models. [3]

Iterative methods

The general idea of iterative methods is to iteratively combine and revise individual node predictions until an equilibrium is reached. When updating the prediction for an individual node is a fast operation, the running time of these iterative methods is dominated by the number of iterations needed for convergence. Though convergence and optimality are not always mathematically guaranteed, in practice these approaches typically converge quickly to a good solution, depending on the graph structure and problem complexity. The methods presented in this section are representative of this iterative approach.

Label propagation

A natural assumption in network classification is that adjacent nodes are likely to have the same label (i.e., contagion or homophily). Under the label propagation method, the predicted label of a node is a weighted average of the labels of its neighbors. [4]
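
The following sketch (not taken from the cited sources; the function name and parameters are illustrative) shows one minimal way to implement this weighted-averaging scheme: label distributions are repeatedly multiplied by a row-normalized adjacency matrix, and the observed labels are clamped after every update.

```python
import numpy as np

def label_propagation(adj, labels, observed_mask, num_classes, iters=100):
    """Minimal label propagation sketch.

    adj: (n, n) nonnegative weight/adjacency matrix.
    labels: length-n integer array (only entries where observed_mask is True are used).
    observed_mask: length-n boolean array marking nodes with known labels.
    """
    n = adj.shape[0]
    # Label distributions: observed nodes one-hot, unobserved nodes uniform.
    f = np.full((n, num_classes), 1.0 / num_classes)
    f[observed_mask] = np.eye(num_classes)[labels[observed_mask]]

    # Row-normalize the weights so each update is a weighted average.
    p = adj / np.maximum(adj.sum(axis=1, keepdims=True), 1e-12)

    for _ in range(iters):
        f = p @ f                                                       # average neighbors' labels
        f[observed_mask] = np.eye(num_classes)[labels[observed_mask]]   # re-clamp observed labels
    return f.argmax(axis=1)                                             # hard label per node
```

Variants of label propagation differ mainly in how the edge weights are chosen and in whether observed labels are clamped, as here, or only softly preferred.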

Iterative Classification Algorithms (ICA)

While label propagation is surprisingly effective, it may sometimes fail to capture complex relational dynamics. More sophisticated approaches can use richer predictors. Suppose we have a classifier that has been trained to classify a node given its features and the features and labels of its neighbors. Iterative classification applies this local classifier to each node, using current predictions and any ground-truth information about the node's neighbors, and iterates until the local predictions converge to a global solution. Iterative classification is an "algorithmic framework", in that it is agnostic to the choice of predictor; this makes it a very versatile tool for collective classification. [5] [6] [7]
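
A minimal sketch of this framework is given below; it assumes an already-trained local classifier clf with a scikit-learn-style predict() method, and the relational_features helper (a node's own features concatenated with counts of its neighbors' current labels) is just one common aggregation choice, not one prescribed by the sources.

```python
import numpy as np

def relational_features(v, graph, features, pred, num_classes):
    """A node's own features concatenated with counts of its neighbors'
    currently assigned labels (one common, illustrative aggregation)."""
    counts = np.zeros(num_classes)
    for u in graph[v]:                 # graph: dict mapping node -> neighbor list
        if u in pred:                  # neighbors without a current label are skipped
            counts[pred[u]] += 1
    return np.concatenate([features[v], counts])

def iterative_classification(graph, features, observed_labels, clf,
                             num_classes, iters=10):
    """ICA sketch: repeatedly re-classify every unobserved node from its own
    features plus its neighbors' current labels, keeping observed labels fixed."""
    pred = dict(observed_labels)       # observed labels stay clamped
    unobserved = [v for v in graph if v not in observed_labels]

    for _ in range(iters):
        for v in unobserved:           # early sweeps see only some neighbor labels;
            x = relational_features(v, graph, features, pred, num_classes)
            pred[v] = clf.predict(x.reshape(1, -1))[0]   # later sweeps refine them
    return pred
```

Training clf (for example, on the observed subgraph with the same relational features) is omitted here; any classifier with a compatible interface can be plugged in, which is what makes ICA an algorithmic framework.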

Collective classification with graphical models

Another approach to collective classification is to represent the problem with a graphical model and use learning and inference techniques for graphical models to arrive at the classifications. Graphical models are tools for joint, probabilistic inference, making them well suited to collective classification. They are characterized by a graphical representation of a probability distribution P, in which the random variables are the nodes of a graph G. Graphical models can be broadly categorized by whether the underlying graph is directed (e.g., Bayesian networks or collections of local classifiers) or undirected (e.g., Markov random fields (MRFs)).
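
As a concrete (illustrative) special case, a pairwise Markov random field over the node labels factorizes the joint distribution into unary potentials, which score each node's label against its attributes, and pairwise potentials, which score label compatibility across edges:

```latex
% Pairwise MRF over labels y = (y_1, ..., y_n):
% \phi_i scores label y_i against node i's attributes x_i,
% \psi_{ij} scores the compatibility of labels across edge (v_i, v_j).
P(y \mid x, G) \;\propto\;
  \prod_{v_i \in V} \phi_i(y_i; x_i)
  \prod_{(v_i, v_j) \in E} \psi_{ij}(y_i, y_j)
```

The inference methods below (Gibbs sampling and loopy belief propagation) can be read as different ways of approximating the marginals or the most probable assignment under such a model.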

Gibbs sampling

Gibbs sampling is a general framework for approximating a distribution. It is a Markov chain Monte Carlo algorithm: it iteratively resamples each variable conditioned on the current values of the others, constructing a Markov chain that converges to the target (stationary) distribution. For collective classification, the basic idea is to sample a label for each unobserved node, conditioned on the current labels of the nodes in its neighborhood, using the local classifier; this is repeated for a fixed number of initial (burn-in) iterations. After that, we continue sampling a label for each node and maintain count statistics of how many times each label was sampled for that node. After collecting a predefined number of such samples, we output the final assignment for each node by choosing the label that was sampled for it most often. [8] [9]
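
The sketch below illustrates this sampling loop; it reuses the illustrative relational_features helper from the ICA sketch above, assumes the local classifier exposes a predict_proba() method, and the burn-in and sample counts are arbitrary choices.

```python
import random
from collections import Counter, defaultdict

def gibbs_collective_classification(graph, features, observed_labels, clf,
                                     num_classes, burn_in=200, samples=1000):
    """Gibbs-sampling sketch: resample each unobserved node's label from the
    local classifier's distribution given its neighbors' current labels,
    then report the most frequently sampled label per node."""
    pred = dict(observed_labels)
    unobserved = [v for v in graph if v not in observed_labels]
    for v in unobserved:                             # arbitrary initialization
        pred[v] = random.randrange(num_classes)

    counts = defaultdict(Counter)
    for step in range(burn_in + samples):
        for v in unobserved:
            x = relational_features(v, graph, features, pred, num_classes)
            probs = clf.predict_proba(x.reshape(1, -1))[0]
            pred[v] = random.choices(range(num_classes), weights=probs)[0]
            if step >= burn_in:                      # only count post-burn-in samples
                counts[v][pred[v]] += 1

    return {v: counts[v].most_common(1)[0][0] for v in unobserved}
```

Because only per-node counts of sampled labels are kept, the output approximates a maximum-marginal assignment rather than the single most probable joint labeling.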

Loopy belief propagation

For certain undirected graphical models, it is possible to perform exact inference efficiently via message passing, or belief propagation, algorithms. [10] These algorithms follow a simple iterative pattern: each variable passes messages to its neighbors expressing its "beliefs" about their marginal distributions, and then uses the incoming messages about its own value to update its belief. Convergence to the true marginals is guaranteed for tree-structured MRFs, but not for MRFs with cycles; running the same message-passing updates on a graph with cycles anyway is known as loopy belief propagation, and it often produces good approximate marginals in practice.
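
For concreteness, here is a small sum-product sketch for a pairwise MRF with a single edge potential shared by all edges; the data structures, the shared potential, and the fixed number of synchronous sweeps are illustrative simplifications rather than a reference implementation.

```python
import numpy as np
from collections import defaultdict

def loopy_belief_propagation(edges, node_pot, edge_pot, iters=50):
    """Sum-product sketch for a pairwise MRF.

    edges: list of undirected (i, j) pairs.
    node_pot: dict mapping node -> length-k float array of unary potentials.
    edge_pot: (k, k) float array of pairwise potentials shared by all edges.
    Returns approximate marginal beliefs per node.
    """
    neighbors = defaultdict(list)
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    k = edge_pot.shape[0]
    # One message per directed edge, initialized uniformly.
    msgs = {(i, j): np.full(k, 1.0 / k) for i in neighbors for j in neighbors[i]}

    for _ in range(iters):
        new_msgs = {}
        for (i, j) in msgs:
            # Combine node i's potential with all incoming messages except j's ...
            prod = node_pot[i].copy()
            for u in neighbors[i]:
                if u != j:
                    prod *= msgs[(u, i)]
            # ... then marginalize over node i's states and normalize.
            m = edge_pot.T @ prod
            new_msgs[(i, j)] = m / m.sum()
        msgs = new_msgs                               # synchronous update

    beliefs = {}
    for i in neighbors:
        b = node_pot[i].copy()
        for u in neighbors[i]:
            b *= msgs[(u, i)]
        beliefs[i] = b / b.sum()
    return beliefs
```

On a tree these beliefs coincide with the exact marginals; on a graph with cycles they are only approximations, and the loop may fail to converge.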

Statistical relational learning (SRL) is often used to address collective classification problems. A variety of SRL methods have been applied to the collective classification setting. These include direct methods such as probabilistic relational models (PRMs), [11] coupled conditional models such as link-based classification, [12] and indirect methods such as Markov logic networks (MLNs) [13] and Probabilistic Soft Logic (PSL). [14]

Applications

Collective classification is applied in many domains that exhibit relational structure, such as the analysis of social networks, [15] spammer detection in evolving multi-relational social networks, [16] entity resolution in relational data, [17] named entity recognition in biomedical text, [18] document classification, [19] gene network inference in computational biology, [20] and the labeling of places and objects in 2D and 3D range data. [21]

Related Research Articles

A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent Markov process, call it X. An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X in a known way. Since X cannot be observed directly, the goal is to learn about the state of X by observing Y. By the definition of a Markov model, an HMM has the additional requirement that the outcome of Y at time t must be "influenced" exclusively by the outcome of X at time t, and that the outcomes of X and Y before time t must be conditionally independent of Y at time t given X at time t. Estimation of the parameters in an HMM can be performed using maximum likelihood. For linear chain HMMs, the Baum–Welch algorithm can be used to estimate the parameters.

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM) which may possess (PR) capabilities but their primary function is to distinguish and create emergent pattern. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

A graphical model or probabilistic graphical model (PGM) or structured probabilistic model is a probabilistic model for which a graph expresses the conditional dependence structure between random variables. They are commonly used in probability theory, statistics—particularly Bayesian statistics—and machine learning.

In statistics, Gibbs sampling or a Gibbs sampler is a Markov chain Monte Carlo (MCMC) algorithm for sampling from a specified multivariate probability distribution when direct sampling from the joint distribution is difficult, but sampling from the conditional distributions is practical. The resulting sequence of samples can be used to approximate the joint distribution; to approximate the marginal distribution of one of the variables, or some subset of the variables; or to compute an integral (such as the expected value of one of the variables). Typically, some of the variables correspond to observations whose values are known, and hence do not need to be sampled.

Decision tree learning is a supervised learning approach used in statistics, data mining and machine learning. In this formalism, a classification or regression decision tree is used as a predictive model to draw conclusions about a set of observations.

In statistics and machine learning, when one wants to infer a random variable from a set of variables, usually a subset of them is sufficient, and the other variables are uninformative. Such a subset that contains all the useful information is called a Markov blanket. If a Markov blanket is minimal, meaning that it cannot drop any variable without losing information, it is called a Markov boundary. Identifying a Markov blanket or a Markov boundary helps to extract useful features. The terms Markov blanket and Markov boundary were coined by Judea Pearl in 1988. A Markov blanket can be constituted by a set of Markov chains.

In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

  1. A generative model is a statistical model of the joint probability distribution P(X, Y) on a given observable variable X and target variable Y;
  2. A discriminative model is a model of the conditional probability P(Y | X = x) of the target Y, given an observation x; and
  3. Classifiers computed without using a probability model are also referred to loosely as "discriminative".

In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to be a Markov random field if it satisfies Markov properties. The concept originates from the Sherrington–Kirkpatrick model.

A Markov logic network (MLN) is a probabilistic logic which applies the ideas of a Markov network to first-order logic, defining probability distributions on possible worlds on any given domain.

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

Discriminative models, also referred to as conditional models, are a class of models used for classification or regression. They learn decision boundaries from observed data, such as pass/fail, win/lose, alive/dead or healthy/sick.

Statistical relational learning (SRL) is a subdiscipline of artificial intelligence and machine learning that is concerned with domain models that exhibit both uncertainty and complex, relational structure. Typically, the knowledge representation formalisms developed in SRL use first-order logic to describe relational properties of a domain in a general manner and draw upon probabilistic graphical models to model the uncertainty; some also build upon the methods of inductive logic programming. Significant contributions to the field have been made since the late 1990s.

In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes.

Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values.

In machine learning, a probabilistic classifier is a classifier that is able to predict, given an observation of an input, a probability distribution over a set of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into ensembles.

Probabilistic Soft Logic (PSL) is a statistical relational learning (SRL) framework for modeling probabilistic and relational domains. It is applicable to a variety of machine learning problems, such as collective classification, entity resolution, link prediction, and ontology alignment. PSL combines two tools: first-order logic, with its ability to succinctly represent complex phenomena, and probabilistic graphical models, which capture the uncertainty and incompleteness inherent in real-world knowledge. More specifically, PSL uses "soft" logic as its logical component and Markov random fields as its statistical model. PSL provides sophisticated inference techniques for finding the most likely answer (i.e. the maximum a posteriori (MAP) state). The "softening" of the logical formulas makes inference a polynomial time operation rather than an NP-hard operation.

The following outline is provided as an overview of and topical guide to machine learning:

Dependency networks (DNs) are graphical models, similar to Markov networks, wherein each vertex (node) corresponds to a random variable and each edge captures dependencies among variables. Unlike Bayesian networks, DNs may contain cycles. Each node is associated to a conditional probability table, which determines the realization of the random variable given its parents.

In network theory, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a social network, predicting co-authorship links in a citation network, and predicting interactions between genes and proteins in a biological network. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time t, the goal is to predict the links at time t + 1. Link prediction is widely applicable. In e-commerce, link prediction is often a subtask for recommending items to users. In the curation of citation databases, it can be used for record deduplication. In bioinformatics, it has been used to predict protein-protein interactions (PPI). It is also used to identify hidden groups of terrorists and criminals in security-related applications.

References

  1. Sen, Prithviraj; Namata, Galileo; Bilgic, Mustafa; Getoor, Lise; Galligher, Brian; Eliassi-Rad, Tina (2008). "Collective Classification in Network Data". AI Magazine. 29 (3): 93. doi:10.1609/aimag.v29i3.2157. hdl:1903/7546. ISSN 0738-4602.
  2. Kajdanowicz, Tomasz; Kazienko, Przemysław (2018). "Collective Classification". Encyclopedia of Social Network Analysis and Mining. pp. 253–265. doi:10.1007/978-1-4939-7131-2_45. ISBN   978-1-4939-7130-5.
  3. London, Ben; Getoor, Lise (2014). "Collective Classification of Network Data". Data Classification: Algorithms and Applications. 29: 399–416.
  4. Zhu, Xiaojin (2002). Learning From Labeled and Unlabeled Data With Label Propagation (Technical report). CiteSeerX   10.1.1.14.3864 .
  5. Neville, Jennifer; Jensen, David (2000). Iterative classification in relational data (PDF). AAAI Workshop on Learning Statistical Models From Relational Data (SRL). AAAI. p. 8.
  6. Chakrabarti, Soumen; Dom, Byron; Indyk, Piotr (1998). Enhanced Hypertext Categorization Using Hyperlinks. Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. Association for Computing Machinery (ACM). pp. 307–318. doi: 10.1145/276304.276332 .
  7. Jensen, David; Neville, Jennifer; Gallagher, Brian (2004). Why collective inference improves relational classification. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Association for Computing Machinery (ACM). p. 5. doi:10.1145/1014052.1014125.
  8. Macskassy, Sofus; Provost, Foster (2007). "Classification in Network Data: A Toolkit and a Univariate Case Study" (PDF). Journal of Machine Learning Research: 935–983.
  9. Geman, Stuart; Geman, Donald (1990). "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images". Readings in Uncertain Reasoning. Morgan Kaufmann Publishers Inc. pp. 452–472.
  10. Yedidia, J.S.; Freeman, W.T.; Weiss, Y. (January 2003). "Understanding Belief Propagation and Its Generalizations". In Lakemeyer, Gerhard; Nebel, Bernhard (eds.). Exploring Artificial Intelligence in the New Millennium. Morgan Kaufmann. pp. 239–269. ISBN 978-1-55860-811-5. Retrieved 2009-03-30.
  11. Getoor, Lise; Friedman, Nir; Koller, Daphne; Taskar, Benjamin (2002). "Learning Probabilistic Models of Link Structure". J. Mach. Learn. Res. 3: 679–707.
  12. Lu, Qing; Getoor, Lise (2003). Link based Classification (PDF). International Conference on Machine Learning (ICML).
  13. Richardson, Matthew; Domingos, Pedro M. (2006). "Markov logic networks". Mach. Learn. 62 (1–2): 107–136. doi:10.1007/S10994-006-5833-1.
  14. Bach, Stephen; Broecheler, Matthias; Huang, Bert; Getoor, Lise (2017). "Hinge-Loss Markov Random Fields and Probabilistic Soft Logic". Journal of Machine Learning Research. 18: 1–67.
  15. Jaafor, Omar; Birregah, Babiga (2017-07-31). "Collective classification in social networks". Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2017. New York, NY, USA: ACM. pp. 827–835. doi:10.1145/3110025.3110128. ISBN   978-1-4503-4993-2.
  16. Fakhraei, Shobeir; Foulds, James; Shashanka, Madhusudana; Getoor, Lise (2015). "Collective Spammer Detection in Evolving Multi-Relational Social Networks". Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, New York, USA: ACM Press. pp. 1769–1778. doi:10.1145/2783258.2788606. ISBN   978-1-4503-3664-2.
  17. Bhattacharya, Indrajit; Getoor, Lise (2007). "Collective entity resolution in relational data". ACM Transactions on Knowledge Discovery from Data. 1 (1). Association for Computing Machinery (ACM): 5. doi:10.1145/1217299.1217304. hdl: 1903/4241 . ISSN   1556-4681. S2CID   488972.
  18. Luo, Ling; Yang, Zhihao; Yang, Pei; Zhang, Yin; Wang, Lei; Lin, Hongfei; Wang, Jian (2017-11-24). Wren, Jonathan (ed.). "An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition". Bioinformatics. 34 (8). Oxford University Press (OUP): 1381–1388. doi: 10.1093/bioinformatics/btx761 . ISSN   1367-4803. PMID   29186323.
  19. Burford, Clint; Bird, Steven; Baldwin, Timothy (2015). "Collective Document Classification with Implicit Inter-document Semantic Relationships". Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics. Stroudsburg, PA, USA: Association for Computational Linguistics. pp. 106–116. doi: 10.18653/v1/s15-1012 .
  20. Žitnik, Marinka; Zupan, Blaž (2015). "Gene network inference by fusing data from diverse distributions". Bioinformatics. 31 (12): i230–i239. doi:10.1093/bioinformatics/btv258. PMC 4542780. PMID 26072487.
  21. Triebel, Rudolph; Mozos, Óscar Martínez; Burgard, Wolfram (2008). "Collective Classification for Labeling of Places and Objects in 2D and 3D Range Data" (PDF). Data Analysis, Machine Learning and Applications. Berlin, Heidelberg: Springer Berlin Heidelberg. pp. 293–300. doi:10.1007/978-3-540-78246-9_35. ISBN   978-3-540-78239-1. ISSN   1431-8814.