Sliding window based part-of-speech tagging

Sliding window based part-of-speech tagging is a method used to assign part-of-speech tags to the words of a text.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

A high percentage of the words in a natural language can, out of context, be assigned more than one part of speech. The proportion of such ambiguous words is typically around 30%, although it depends greatly on the language. Resolving this ambiguity is important in many areas of natural language processing; in machine translation, for example, changing the part of speech of a word can dramatically change its translation.

Sliding window based part-of-speech taggers are programs which assign a single part of speech to a given lexical form of a word by looking at a fixed-size "window" of words around the word to be disambiguated.

The two main advantages of this approach are:

- the tagger can be trained automatically, without the need for a manually tagged corpus, and
- the tagger can be implemented as a finite-state machine (a Mealy machine), which makes tagging very fast.

Formal definition

Let

$$\Gamma = \{\gamma_1, \gamma_2, \ldots, \gamma_{|\Gamma|}\}$$

be the set of grammatical tags of the application, that is, the set of all possible tags which may be assigned to a word, and let

$$W = \{w_1, w_2, \ldots\}$$

be the vocabulary of the application. Let

$$T : W \to \mathcal{P}(\Gamma)$$

be a function for morphological analysis which assigns each word $w$ its set of possible tags $T(w) \subseteq \Gamma$; it can be implemented by a full-form lexicon or by a morphological analyser. Let $\Sigma$ be the set of word classes, which in general will be a partition of $W$, with the restriction that all of the words in a class $\sigma \in \Sigma$ receive the same set of tags, that is, all of the words in each word class belong to the same ambiguity class.

Normally, $\Sigma$ is constructed in such a way that for high-frequency words each word class contains a single word, while for low-frequency words each word class corresponds to a single ambiguity class. This gives good performance for high-frequency ambiguous words without requiring too many parameters for the tagger.

With these definitions, it is possible to state the problem in the following way: given a text $w[1] w[2] \cdots w[L] \in W^{*}$, each word $w[t]$ is assigned a word class $\sigma(w[t]) \in \Sigma$ (either by using the lexicon or the morphological analyser) in order to get an ambiguously tagged text $\sigma[1] \sigma[2] \cdots \sigma[L] \in \Sigma^{*}$. The job of the tagger is to get a tagged text $\gamma[1] \gamma[2] \cdots \gamma[L]$ (with $\gamma[t] \in T(\sigma[t])$) which is as correct as possible.
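
As a concrete illustration of these definitions, here is a minimal Python sketch that maps a text to its ambiguously tagged counterpart. The tagset and the full-form lexicon are toy values invented for illustration, not taken from any real corpus.

```python
# A minimal sketch of the definitions above. The tagset and full-form
# lexicon are toy values invented for illustration.

# Gamma: the set of grammatical tags of the application.
TAGS = {"PRON", "NOUN", "VERB", "ADP"}

# T : W -> P(Gamma), implemented here as a full-form lexicon that
# assigns each word its set of possible tags.
LEXICON = {
    "he":     {"PRON"},
    "runs":   {"NOUN", "VERB"},  # ambiguous out of context
    "from":   {"ADP"},
    "danger": {"NOUN"},
}

def word_class(word):
    """sigma(w): here every word is mapped to its ambiguity class,
    i.e. the (frozen) set of tags the lexicon allows for it."""
    return frozenset(LEXICON[word.lower()])

def ambiguously_tag(text):
    """Map a text w[1]..w[L] to the ambiguously tagged text sigma[1]..sigma[L]."""
    return [word_class(w) for w in text]

print(ambiguously_tag(["He", "runs", "from", "danger"]))
# [frozenset({'PRON'}), frozenset({'NOUN', 'VERB'}),
#  frozenset({'ADP'}), frozenset({'NOUN'})]
```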

A statistical tagger looks for the most probable tag sequence for an ambiguously tagged text $\sigma[1] \sigma[2] \cdots \sigma[L]$:

$$\gamma^{*}[1] \cdots \gamma^{*}[L] = \operatorname*{arg\,max}_{\gamma[t] \in T(\sigma[t])} p(\gamma[1] \cdots \gamma[L] \mid \sigma[1] \cdots \sigma[L])$$

Using Bayes' formula, and dropping the denominator $p(\sigma[1] \cdots \sigma[L])$, which is the same for every candidate tag sequence, this is converted into:

$$\gamma^{*}[1] \cdots \gamma^{*}[L] = \operatorname*{arg\,max}_{\gamma[t] \in T(\sigma[t])} p(\gamma[1] \cdots \gamma[L]) \; p(\sigma[1] \cdots \sigma[L] \mid \gamma[1] \cdots \gamma[L])$$

where $p(\gamma[1] \gamma[2] \cdots \gamma[L])$ is the probability of a particular tag sequence (the syntactic probability) and $p(\sigma[1] \sigma[2] \cdots \sigma[L] \mid \gamma[1] \gamma[2] \cdots \gamma[L])$ is the probability that this tag sequence corresponds to the text (the lexical probability).

In a Markov model, these probabilities are approximated as products. The syntactic probabilities are modelled by a first-order Markov process:

$$p(\gamma[1] \gamma[2] \cdots \gamma[L]) = \prod_{t=1}^{L+1} p(\gamma[t] \mid \gamma[t-1])$$

where $\gamma[0]$ and $\gamma[L+1]$ are delimiter symbols.

Lexical probabilities are independent of context:

$$p(\sigma[1] \sigma[2] \cdots \sigma[L] \mid \gamma[1] \gamma[2] \cdots \gamma[L]) = \prod_{t=1}^{L} p(\sigma[t] \mid \gamma[t])$$
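
To make this factorisation concrete, the following Python sketch scores candidate tag sequences with the two products above and picks the argmax by brute force. Every probability value is invented for illustration and would in practice be estimated from a corpus.

```python
from itertools import product

DELIM = "#"  # the delimiter symbol gamma[0] = gamma[L+1]

# Syntactic (transition) probabilities p(gamma[t] | gamma[t-1]),
# stored as {(previous tag, current tag): probability}. Toy values.
P_TRANS = {
    ("#", "PRON"): 0.6, ("#", "NOUN"): 0.3, ("#", "VERB"): 0.1,
    ("PRON", "VERB"): 0.8, ("PRON", "NOUN"): 0.2,
    ("VERB", "ADP"): 0.5, ("VERB", "NOUN"): 0.3, ("VERB", "#"): 0.2,
    ("NOUN", "ADP"): 0.4, ("NOUN", "VERB"): 0.2, ("NOUN", "#"): 0.4,
    ("ADP", "NOUN"): 0.9, ("ADP", "#"): 0.1,
}

# Lexical probabilities p(sigma[t] | gamma[t]), one entry per
# (word class, tag) pair of the toy lexicon. Toy values.
P_LEX = {
    (frozenset({"PRON"}), "PRON"): 1.0,
    (frozenset({"NOUN", "VERB"}), "VERB"): 1.0,
    (frozenset({"NOUN", "VERB"}), "NOUN"): 0.3,
    (frozenset({"NOUN"}), "NOUN"): 0.7,
    (frozenset({"ADP"}), "ADP"): 1.0,
}

def score(tags, classes):
    """p(gamma[1..L]) * p(sigma[1..L] | gamma[1..L]) under the factorisation."""
    padded = [DELIM] + list(tags) + [DELIM]
    p = 1.0
    for prev, cur in zip(padded, padded[1:]):   # syntactic product
        p *= P_TRANS.get((prev, cur), 0.0)
    for sigma, gamma in zip(classes, tags):     # lexical product
        p *= P_LEX.get((sigma, gamma), 0.0)
    return p

def tag(classes):
    """Brute-force argmax over all tag sequences with gamma[t] in T(sigma[t])."""
    candidates = product(*(sorted(c) for c in classes))
    return max(candidates, key=lambda tags: score(tags, classes))

# The ambiguously tagged text for "He runs from danger".
classes = [frozenset({"PRON"}), frozenset({"NOUN", "VERB"}),
           frozenset({"ADP"}), frozenset({"NOUN"})]
print(tag(classes))  # ('PRON', 'VERB', 'ADP', 'NOUN')
```

A real tagger would use dynamic programming (the Viterbi algorithm) rather than enumerating all $\prod_t |T(\sigma[t])|$ candidate sequences; brute force is used here only to keep the sketch short.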

One form of tagging is to approximate the first probability formula:

$$p(\gamma[1] \gamma[2] \cdots \gamma[L]) \simeq \prod_{t=1}^{L} p\left(\gamma[t] \mid C^{(-)}_{t}\, C^{(+)}_{t}\right)$$

where $C^{(-)}_{t} = \gamma[t-N_{(-)}] \cdots \gamma[t-1]$ is the left context of size $N_{(-)}$ and $C^{(+)}_{t} = \gamma[t+1] \cdots \gamma[t+N_{(+)}]$ is the right context of size $N_{(+)}$.

In this way the sliding-window algorithm only has to take into account a context of size $N_{(-)} + N_{(+)}$. For most applications $N_{(-)} = N_{(+)} = 1$. For example, to tag the ambiguous word "runs" in the sentence "He runs from danger", only the tags of the neighbouring words "He" and "from" need to be taken into account.
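
The following Python sketch shows a window of size $N_{(-)} = N_{(+)} = 1$ applied to this example. The window probability table is invented for illustration, and the context is represented here by the neighbouring word classes, which for unambiguous neighbours such as "He" and "from" amounts to conditioning on their tags.

```python
# A minimal sketch of a sliding-window tagger with N(-) = N(+) = 1.
# The probability table below is a toy value set, not estimated
# from a real corpus.

BOUNDARY = frozenset({"#"})  # delimiter pseudo-class at both text ends

# p(gamma | left context, right context), needed only for ambiguous
# classes; unambiguous words keep their single possible tag.
P_WINDOW = {
    (frozenset({"PRON"}), "NOUN", frozenset({"ADP"})): 0.1,
    (frozenset({"PRON"}), "VERB", frozenset({"ADP"})): 0.9,
}

def tag_sliding_window(classes):
    """Tag each position using only its size-1 left and right context."""
    padded = [BOUNDARY] + classes + [BOUNDARY]
    tags = []
    for t, sigma in enumerate(classes, start=1):
        if len(sigma) == 1:                  # unambiguous: nothing to decide
            tags.append(next(iter(sigma)))
            continue
        left, right = padded[t - 1], padded[t + 1]
        tags.append(max(sigma, key=lambda g: P_WINDOW.get((left, g, right), 0.0)))
    return tags

# The ambiguously tagged text for "He runs from danger".
classes = [frozenset({"PRON"}), frozenset({"NOUN", "VERB"}),
           frozenset({"ADP"}), frozenset({"NOUN"})]
print(tag_sliding_window(classes))  # ['PRON', 'VERB', 'ADP', 'NOUN']
```

Because each decision depends only on a bounded window, the tagger runs in a single left-to-right pass over the text, which is what allows it to be realised as a finite-state (Mealy) machine.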
