Constrained conditional model

A constrained conditional model (CCM) is a machine learning and inference framework that augments the learning of conditional (probabilistic or discriminative) models with declarative constraints. The constraints can be used to incorporate expressive prior knowledge into the model and to bias the assignments made by the learned model toward satisfying these constraints. The framework can be used to support decisions in an expressive output space while maintaining modularity and tractability of training and inference.

Models of this kind have attracted considerable attention within the natural language processing (NLP) community. Formulating problems as constrained optimization problems over the output of learned models has several advantages. It allows one to focus on the modeling of problems by providing the opportunity to incorporate domain-specific knowledge as global constraints using a first-order language. Using this declarative framework frees the developer from low-level feature engineering while capturing the problem's domain-specific properties and guaranteeing exact inference. From a machine learning perspective it allows decoupling the model generation (learning) stage from the constrained inference stage, thus helping to simplify the learning stage while improving the quality of the solutions. For example, in the case of generating compressed sentences, rather than simply relying on a language model to retain the most commonly used n-grams in the sentence, constraints can be used to ensure that if a modifier is kept in the compressed sentence, its subject will also be kept.
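Such a requirement translates directly into a linear constraint over the binary decision variables used at inference time. As an illustrative sketch (the variable names are assumptions of this example, not taken from the cited work), if $x_{mod}$ and $x_{subj}$ are 0/1 indicators of whether the modifier and its subject are retained in the compressed sentence, the requirement becomes $x_{mod} \le x_{subj}$, which forbids keeping the modifier while dropping its subject.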

Motivation

Making decisions in many domains (such as natural language processing and computer vision problems) often involves assigning values to sets of interdependent variables where the expressive dependency structure can influence, or even dictate, what assignments are possible. These settings apply not only to structured learning problems such as semantic role labeling, but also to cases that require making use of multiple pre-learned components, such as summarization, textual entailment and question answering. In all these cases, it is natural to formulate the decision problem as a constrained optimization problem, with an objective function that is composed of learned models, subject to domain- or problem-specific constraints.

Constrained conditional models form a learning and inference framework that augments the learning of conditional (probabilistic or discriminative) models with declarative constraints (written, for example, using a first-order representation) as a way to support decisions in an expressive output space while maintaining modularity and tractability of training and inference. These constraints can express either hard restrictions, completely prohibiting some assignments, or soft restrictions, penalizing unlikely assignments. In most applications of this framework in NLP, following [1], integer linear programming (ILP) has been used as the inference framework, although other algorithms can be used for that purpose.

Formal definition

Given a set of feature functions $\{\phi_i(x,y)\}$ and a set of constraints $\{C_i(x,y)\}$, defined over an input structure $x$ and an output structure $y$, a constrained conditional model is characterized by two weight vectors, $w$ and $\rho$, and is defined as the solution to the following optimization problem:

$$\arg\max_{y} \sum_i w_i \phi_i(x,y) \;-\; \sum_i \rho_i \, d_{C_i}(x,y).$$

Each constraint $C_i$ is a boolean mapping indicating whether the joint assignment $(x,y)$ violates the constraint, $d_{C_i}(x,y)$ measures the degree of that violation, and $\rho_i$ is the penalty incurred for violating it. Constraints assigned an infinite penalty are known as hard constraints, and represent infeasible assignments to the optimization problem.
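For a small output space the definition can be evaluated directly by enumeration. The sketch below is purely illustrative (the feature function, the single soft constraint and the weights are invented for the example, not taken from any cited system): it scores every candidate output and returns the one that maximizes the CCM objective.

    from itertools import product

    # Toy CCM inference by exhaustive enumeration (illustrative only).
    # Each output y assigns one label per token; the features and the
    # constraint below are invented for the example.

    LABELS = ["O", "ARG"]

    def features(x, y):
        # phi_i(x, y): one indicator feature per (word, label) pair.
        return {(w, l): 1.0 for w, l in zip(x, y)}

    def constraint_violations(x, y):
        # d_{C_i}(x, y): one soft constraint saying "at least one ARG label".
        return {"has_arg": 0.0 if "ARG" in y else 1.0}

    def ccm_score(x, y, w, rho):
        score = sum(w.get(f, 0.0) * v for f, v in features(x, y).items())
        penalty = sum(rho[c] * d for c, d in constraint_violations(x, y).items())
        return score - penalty

    def ccm_argmax(x, w, rho):
        # Enumerate all label assignments and keep the best-scoring one.
        return max(product(LABELS, repeat=len(x)),
                   key=lambda y: ccm_score(x, y, w, rho))

    x = ["John", "saw", "Mary"]
    w = {("John", "ARG"): 1.0, ("Mary", "ARG"): 0.8, ("saw", "O"): 0.5}
    rho = {"has_arg": 2.0}   # use float("inf") to make the constraint hard
    print(ccm_argmax(x, w, rho))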

Training paradigms

Learning local vs. global models

The objective function used by CCMs can be decomposed and learned in several ways, ranging from complete joint training of the model along with the constraints to completely decoupling the learning and the inference stages. In the latter case, several local models are learned independently and the dependencies between these models are considered only at decision time via a global decision process. The advantages of each approach are discussed in [2], which studies the two training paradigms: (1) local models: L+I (learning + inference) and (2) global models: IBT (inference-based training), and shows both theoretically and experimentally that while IBT (joint training) is best in the limit, under some conditions (basically, "good" components) L+I can generalize better.
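The contrast between the two regimes can be sketched as follows. This is a minimal, self-contained toy (the data, features and constraint are invented, not taken from [2]): the L+I variant trains one independent classifier per output position and applies the constraint only inside the shared inference routine, while the IBT variant runs constrained inference during training and updates the weights from the global prediction.

    from itertools import product

    # Toy comparison of L+I and IBT on a two-variable problem (illustrative only).

    LABELS = ["A", "B"]
    DATA = [(("x1", "x2"), ("A", "B")), (("x1", "x3"), ("A", "B"))]

    def phi(x, y):
        # Indicator features over (input token, label) pairs.
        return {(tok, lab): 1.0 for tok, lab in zip(x, y)}

    def violates(y):
        # Hard constraint: the two labels must differ.
        return y[0] == y[1]

    def inference(x, w):
        # Constrained argmax over the (tiny) output space.
        candidates = [y for y in product(LABELS, repeat=len(x)) if not violates(y)]
        return max(candidates,
                   key=lambda y: sum(w.get(f, 0.0) for f in phi(x, y)))

    def train_ibt(data, epochs=5):
        # IBT: perceptron updates driven by *constrained* global predictions.
        w = {}
        for _ in range(epochs):
            for x, gold in data:
                pred = inference(x, w)
                if pred != gold:
                    for f in phi(x, gold):
                        w[f] = w.get(f, 0.0) + 1.0
                    for f in phi(x, pred):
                        w[f] = w.get(f, 0.0) - 1.0
        return w

    def train_local(data, epochs=5):
        # L+I: each position is trained as an independent classifier (no
        # constraint during training); constraints enter only in inference().
        w = {}
        for _ in range(epochs):
            for x, gold in data:
                for i, tok in enumerate(x):
                    pred = max(LABELS, key=lambda l: w.get((tok, l), 0.0))
                    if pred != gold[i]:
                        w[(tok, gold[i])] = w.get((tok, gold[i]), 0.0) + 1.0
                        w[(tok, pred)] = w.get((tok, pred), 0.0) - 1.0
        return w

    for name, w in [("L+I", train_local(DATA)), ("IBT", train_ibt(DATA))]:
        print(name, [inference(x, w) for x, _ in DATA])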

The ability of CCM to combine local models is especially beneficial in cases where joint learning is computationally intractable or when training data are not available for joint learning. This flexibility distinguishes CCM from other learning frameworks that also combine statistical information with declarative constraints, such as Markov logic networks, which emphasize joint training.

Minimally supervised CCM

CCM can help reduce supervision by using domain knowledge (expressed as constraints) to drive learning. These settings were studied in [3] and [4], which introduce semi-supervised constraint-driven learning (CODL) and show that by incorporating domain knowledge the performance of the learned model improves significantly.
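A simplified, self-training-style sketch of this idea is given below; it is an assumption-based illustration rather than the exact algorithm of [3], which uses top-K constrained inference and interpolates the newly trained model with the previous one. The point is only to show where the constraints enter: they guide the labeling of unlabeled data, and the model is retrained on the result.

    from itertools import product

    # Simplified constraint-driven learning loop on a toy labeling task
    # (illustrative only; not the exact CODL algorithm of [3]).

    LABELS = ["PER", "LOC", "O"]

    def phi(x, y):
        return {(tok, lab): 1.0 for tok, lab in zip(x, y)}

    def violates(y):
        # Domain-knowledge constraint: at most one LOC label per example.
        return sum(lab == "LOC" for lab in y) > 1

    def constrained_inference(x, w):
        candidates = [y for y in product(LABELS, repeat=len(x)) if not violates(y)]
        return max(candidates,
                   key=lambda y: sum(w.get(f, 0.0) for f in phi(x, y)))

    def train(examples, epochs=5):
        # Simple perceptron over (input, labeled output) pairs.
        w = {}
        for _ in range(epochs):
            for x, gold in examples:
                pred = constrained_inference(x, w)
                if pred != gold:
                    for f in phi(x, gold):
                        w[f] = w.get(f, 0.0) + 1.0
                    for f in phi(x, pred):
                        w[f] = w.get(f, 0.0) - 1.0
        return w

    def codl(labeled, unlabeled, rounds=3):
        w = train(labeled)
        for _ in range(rounds):
            # Label the unlabeled data with constrained inference, then
            # retrain on labeled + automatically labeled examples.
            auto = [(x, constrained_inference(x, w)) for x in unlabeled]
            w = train(labeled + auto)
        return w

    labeled = [(("paris", "visited", "john"), ("LOC", "O", "PER"))]
    unlabeled = [("john", "visited", "london"), ("mary", "met", "john")]
    print(constrained_inference(("mary", "met", "paris"), codl(labeled, unlabeled)))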

Learning over latent representations

CCMs have also been applied to latent learning frameworks, where the learning problem is defined over a latent representation layer. Since the notion of a correct representation is inherently ill-defined, no gold-standard labeled data regarding the representation decision is available to the learner. Identifying the correct (or optimal) learning representation is viewed as a structured prediction process and therefore modeled as a CCM. This problem was covered in several papers, in both supervised [5] and unsupervised [6] settings. In all cases, the research showed that explicitly modeling the interdependencies between representation decisions via constraints results in improved performance.

Integer linear programming for natural language processing applications

The advantages of the CCM declarative formulation and the availability of off-the-shelf solvers have led to a large variety of natural language processing tasks being formulated within the framework, including semantic role labeling, [7] syntactic parsing, [8] coreference resolution, [9] summarization, [10] [11] [12] transliteration, [13] natural language generation [14] and joint information extraction. [15] [16]

Most of these works use an integer linear programming (ILP) solver to solve the decision problem. Although solving an integer linear program is, in the worst case, exponential in the size of the decision problem, in practice state-of-the-art solvers and approximate inference techniques [17] allow large-scale problems to be solved efficiently.

The key advantage of using an ILP solver for solving the optimization problem defined by a constrained conditional model is the declarative formulation used as input for the ILP solver, consisting of a linear objective function and a set of linear constraints.
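As an illustration of what that declarative input can look like, the sketch below encodes a toy CCM decision as an ILP using the open-source PuLP modeling library (the choice of PuLP, the scores and the implication constraint are assumptions of this example, not taken from the cited systems). Binary variables select one label per token, the linear objective sums the learned scores of the selected labels, and domain knowledge is added as linear constraints.

    # A toy CCM decision written as an integer linear program with PuLP
    # (pip install pulp). Scores and constraints are illustrative only.
    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, value

    tokens = ["John", "saw", "Mary"]
    labels = ["ARG", "O"]
    score = {("John", "ARG"): 1.0, ("John", "O"): 0.2,
             ("saw", "ARG"): 0.1, ("saw", "O"): 0.9,
             ("Mary", "ARG"): 0.8, ("Mary", "O"): 0.3}

    prob = LpProblem("ccm_inference", LpMaximize)

    # One binary indicator per (token, label) decision.
    x = {(t, l): LpVariable(f"x_{t}_{l}", cat=LpBinary)
         for t in tokens for l in labels}

    # Linear objective: sum of learned scores of the selected labels.
    prob += lpSum(score[t, l] * x[t, l] for t in tokens for l in labels)

    # Structural constraint: each token receives exactly one label.
    for t in tokens:
        prob += lpSum(x[t, l] for l in labels) == 1

    # Declarative domain constraint: if "Mary" is labeled ARG, "John" must be
    # too (an implication written as a linear inequality).
    prob += x["Mary", "ARG"] <= x["John", "ARG"]

    prob.solve()
    print({t: next(l for l in labels if value(x[t, l]) == 1) for t in tokens})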

Related Research Articles

<span class="mw-page-title-main">Linear programming</span> Method to solve some optimization problems

Linear programming (LP), also called linear optimization, is a method to achieve the best outcome in a mathematical model whose requirements are represented by linear relationships. Linear programming is a special case of mathematical programming.

<span class="mw-page-title-main">Mathematical optimization</span> Study of mathematical algorithms for optimization problems

Mathematical optimization or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfields: discrete optimization and continuous optimization. Optimization problems arise in all quantitative disciplines from computer science and engineering to operations research and economics, and the development of solution methods has been of interest in mathematics for centuries.

In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.

A Bayesian network is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). While it is one of several forms of causal notation, causal networks are special cases of Bayesian networks. Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

An integer programming problem is a mathematical optimization or feasibility program in which some or all of the variables are restricted to be integers. In many settings the term refers to integer linear programming (ILP), in which the objective function and the constraints are linear.

<span class="mw-page-title-main">Markov random field</span> Set of random variables

In the domain of physics and probability, a Markov random field (MRF), Markov network or undirected graphical model is a set of random variables having a Markov property described by an undirected graph. In other words, a random field is said to be a Markov random field if it satisfies Markov properties. The concept originates from the Sherrington–Kirkpatrick model.

<span class="mw-page-title-main">Conditional random field</span> Class of statistical modeling methods

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.

In mathematical optimization, constrained optimization is the process of optimizing an objective function with respect to some variables in the presence of constraints on those variables. The objective function is either a cost function or energy function, which is to be minimized, or a reward function or utility function, which is to be maximized. Constraints can be either hard constraints, which set conditions for the variables that are required to be satisfied, or soft constraints, which have some variable values that are penalized in the objective function if, and based on the extent that, the conditions on the variables are not satisfied.

In computer science and mathematical logic, satisfiability modulo theories (SMT) is the problem of determining whether a mathematical formula is satisfiable. It generalizes the Boolean satisfiability problem (SAT) to more complex formulas involving real numbers, integers, and/or various data structures such as lists, arrays, bit vectors, and strings. The name is derived from the fact that these expressions are interpreted within ("modulo") a certain formal theory in first-order logic with equality. SMT solvers are tools that aim to solve the SMT problem for a practical subset of inputs. SMT solvers such as Z3 and cvc5 have been used as a building block for a wide range of applications across computer science, including in automated theorem proving, program analysis, program verification, and software testing.

Limited-memory BFGS is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algorithm for parameter estimation in machine learning. The algorithm's target problem is to minimize $f(\mathbf{x})$ over unconstrained values of the real vector $\mathbf{x}$, where $f$ is a differentiable scalar function.

In mathematics, a constraint is a condition of an optimization problem that the solution must satisfy. There are several types of constraints—primarily equality constraints, inequality constraints, and integer constraints. The set of candidate solutions that satisfy all constraints is called the feasible set.

<span class="mw-page-title-main">COIN-OR</span>

Computational Infrastructure for Operations Research (COIN-OR) is a project that aims to "create for mathematical software what the open literature is for mathematical theory." The open literature provides the operations research (OR) community with a peer-review process and an archive. Papers in operations research journals on mathematical theory often contain supporting numerical results from computational studies. The software implementations, models, and data used to produce the numerical results are typically not published. This status quo impedes researchers who need to reproduce computational results, make fair comparisons, and extend the state of the art.

Discriminative models, also referred to as conditional models, are a class of models frequently used for classification or regression. They distinguish decision boundaries through observed data, such as pass/fail, win/lose, alive/dead or healthy/sick.

The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.

<span class="mw-page-title-main">Structured prediction</span> Supervised machine learning techniques

Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involve predicting structured objects, rather than scalar discrete or real values.

Augmented Lagrangian methods are a certain class of algorithms for solving constrained optimization problems. They have similarities to penalty methods in that they replace a constrained optimization problem by a series of unconstrained problems and add a penalty term to the objective, but the augmented Lagrangian method adds yet another term designed to mimic a Lagrange multiplier. The augmented Lagrangian is related to, but not identical with, the method of Lagrange multipliers.

In constrained least squares one solves a linear least squares problem with an additional constraint on the solution. This means that the unconstrained equation $\mathbf{X}\boldsymbol{\beta} = \mathbf{y}$ must be fit as closely as possible (in the least squares sense) while ensuring that some other property of $\boldsymbol{\beta}$ is maintained.

<span class="mw-page-title-main">Dan Roth</span> Professor of Computer Science at University of Pennsylvania

Dan Roth is the Eduardo D. Glandt Distinguished Professor of Computer and Information Science at the University of Pennsylvania.

Dependency networks (DNs) are graphical models, similar to Markov networks, wherein each vertex (node) corresponds to a random variable and each edge captures dependencies among variables. Unlike Bayesian networks, DNs may contain cycles. Each node is associated with a conditional probability table, which determines the realization of the random variable given its parents.

References

  1. Dan Roth and Wen-tau Yih, "A Linear Programming Formulation for Global Inference in Natural Language Tasks." CoNLL, (2004).
  2. Vasin Punyakanok, Dan Roth, Wen-tau Yih and Dav Zimak, "Learning and Inference over Constrained Output." IJCAI, (2005).
  3. Ming-Wei Chang, Lev Ratinov and Dan Roth, "Guiding Semi-Supervision with Constraint-Driven Learning." ACL, (2007).
  4. Ming-Wei Chang, Lev Ratinov and Dan Roth, "Constraints as Prior Knowledge." ICML Workshop on Prior Knowledge for Text and Language Processing, (2008).
  5. Ming-Wei Chang, Dan Goldwasser, Dan Roth and Vivek Srikumar, "Discriminative Learning over Constrained Latent Representations." NAACL, (2010).
  6. Ming-Wei Chang, Dan Goldwasser, Dan Roth and Yuancheng Tu, "Unsupervised Constraint Driven Learning For Transliteration Discovery." NAACL, (2009).
  7. Vasin Punyakanok, Dan Roth, Wen-tau Yih and Dav Zimak, "Semantic Role Labeling via Integer Linear Programming Inference." COLING, (2004).
  8. Kenji Sagae, Yusuke Miyao and Jun’ichi Tsujii, "HPSG Parsing with Shallow Dependency Constraints." ACL, (2007).
  9. Pascal Denis and Jason Baldridge, "Joint Determination of Anaphoricity and Coreference Resolution using Integer Programming." NAACL-HLT, (2007).
  10. James Clarke and Mirella Lapata, "Global Inference for Sentence Compression: An Integer Linear Programming Approach." Journal of Artificial Intelligence Research (JAIR), (2008).
  11. Katja Filippova and Michael Strube, "Dependency Tree Based Sentence Compression." INLG, (2008).
  12. Katja Filippova and Michael Strube, "Sentence Fusion via Dependency Graph Compression." EMNLP, (2008).
  13. Dan Goldwasser and Dan Roth, "Transliteration as Constrained Optimization." EMNLP, (2008).
  14. Regina Barzilay and Mirella Lapata, "Aggregation via Set Partitioning for Natural Language Generation." NAACL, (2006).
  15. Dan Roth and Wen-tau Yih, "A Linear Programming Formulation for Global Inference in Natural Language Tasks." CoNLL, (2004).
  16. Yejin Choi, Eric Breck and Claire Cardie, "Joint Extraction of Entities and Relations for Opinion Recognition." EMNLP, (2006).
  17. André F. T. Martins, Noah A. Smith and Eric P. Xing, "Concise Integer Linear Programming Formulations for Dependency Parsing." ACL, (2009).