Markovian discrimination

Markovian discrimination is a class of spam filtering methods, used in CRM114 and other spam filters, that classify messages based on statistical patterns in the transition probabilities between words or other lexical tokens, patterns that simple bag-of-words naive Bayes spam filtering would not capture. [1]

Markovian Discrimination vs. Bag-of-Words Discrimination

A bag-of-words model contains only a dictionary of legal words and their relative probabilities in spam and in genuine messages, whereas a Markovian model adds to this the relative transition probabilities between words, that is, the relative probability of each next word given the current word, in spam and in genuine messages. In other words, a bag-of-words filter discriminates based on the relative probabilities of single words alone, losing all phrase structure, while a word-based Markovian filter discriminates based on the relative probabilities of either ordered pairs of words (the simplest case) or other short sequences of words (more common in practice), giving the Markovian filter greater sensitivity to phrase structure.
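
As a concrete sketch (not drawn from the cited sources), the Python fragment below contrasts the two statistics on tiny invented corpora; the corpora and function names are illustrative only, and a real filter would add smoothing and work in log-probabilities:

```python
from collections import Counter

def unigram_model(corpus):
    """Bag-of-words statistic: relative frequency of each single word."""
    words = [w for msg in corpus for w in msg.split()]
    counts = Counter(words)
    return {w: c / len(words) for w, c in counts.items()}

def bigram_model(corpus):
    """Markovian statistic: relative frequency of each ordered word pair."""
    pairs = []
    for msg in corpus:
        words = msg.split()
        pairs.extend(zip(words, words[1:]))
    counts = Counter(pairs)
    return {p: c / len(pairs) for p, c in counts.items()}

# Hypothetical toy corpora, for illustration only.
spam = ["buy cheap pills now", "cheap pills now available"]
ham = ["meeting notes attached", "please review meeting notes"]

# A bag-of-words filter compares unigram_model(spam) with unigram_model(ham);
# a word-level Markovian filter also compares the transition statistics, so
# the ordered pair ("cheap", "pills") carries evidence that the unordered
# words "cheap" and "pills" alone would not.
print(bigram_model(spam)[("cheap", "pills")])  # 0.333... in this toy corpus
```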

Neither naive Bayes nor Markovian filters are limited to word-level tokenization. They can also treat letters, partial words, or phrases as tokens; in such cases the specific "bag-of-words" methods become general "bag-of-tokens" methods. Modelers can parameterize Markovian spam filters based on the relative probabilities of any such token transitions appearing in spam or in legitimate messages. [2]
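
For instance, the same message can be tokenized at several granularities; the helper functions below are hypothetical names sketching the idea:

```python
def word_tokens(text):
    """Whole words as tokens."""
    return text.split()

def char_ngram_tokens(text, n=3):
    """Partial words as tokens: overlapping character n-grams."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def phrase_tokens(text, n=2):
    """Phrases as tokens: overlapping word n-grams."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

msg = "buy cheap pills"
print(word_tokens(msg))        # ['buy', 'cheap', 'pills']
print(char_ngram_tokens(msg))  # ['buy', 'uy ', 'y c', ' ch', ...]
print(phrase_tokens(msg))      # ['buy cheap', 'cheap pills']
```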

Visible and Hidden Markov Models

There are two primary classes of Markov models: visible Markov models and hidden Markov models. They differ in whether the state of the Markov chain generating the token sequence is assumed to be fully determined by each generated token (a visible Markov model) or may carry additional hidden state (a hidden Markov model). With a visible Markov model, each current token is modeled as if it contained all of the information about previous tokens of the message that is relevant to the probability of future tokens, whereas a hidden Markov model allows for more obscure conditional relationships. [3] Since those more obscure conditional relationships are more typical of natural language, in genuine messages and spam alike, hidden Markov models are generally preferred over visible Markov models for spam filtering. Due to storage constraints, the most commonly employed model is a specific type of hidden Markov model known as a Markov random field, typically with a "sliding window" or clique size of between four and six tokens. [4]
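
As a hedged sketch of the sliding-window idea, the fragment below pairs each token with every token that follows it within a five-token window, keeping the gap as part of the feature. This is a deliberate simplification for illustration; it is not CRM114's exact feature construction or weighting scheme [4]:

```python
def window_features(tokens, window=5):
    """Emit (token, gap, later_token) features from a sliding window.

    Each feature records an ordered token pair together with the distance
    between the two tokens, a crude stand-in for the cliques of a
    Markov-random-field style model."""
    feats = []
    for i in range(len(tokens)):
        for gap in range(1, window):
            if i + gap < len(tokens):
                feats.append((tokens[i], gap, tokens[i + gap]))
    return feats

tokens = "click here to claim your prize".split()
for feat in window_features(tokens):
    print(feat)  # e.g. ('click', 1, 'here'), ('click', 2, 'to'), ...
```

A filter could then learn per-feature spam and ham statistics just as in the bag-of-tokens case, but over these window features instead of single tokens.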

Related Research Articles

Naive Bayes classifier: probabilistic classification algorithm

In statistics, naive Bayes classifiers are a family of linear "probabilistic classifiers" that assume the features are conditionally independent given the target class. The strength (naivety) of this assumption is what gives the classifier its name. These classifiers are among the simplest Bayesian network models.
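
As a minimal numeric sketch of that assumption, with per-word likelihoods invented for illustration, the likelihood of a message under each class is taken to be the product of the per-word likelihoods:

```python
# Hypothetical per-class word likelihoods, for illustration only.
p_word_given_spam = {"cheap": 0.05, "pills": 0.04, "meeting": 0.001}
p_word_given_ham = {"cheap": 0.002, "pills": 0.001, "meeting": 0.03}
p_spam, p_ham = 0.5, 0.5  # assumed equal priors

def naive_bayes_spam_score(words):
    """P(spam | words) under the conditional-independence assumption."""
    like_spam, like_ham = p_spam, p_ham
    for w in words:
        like_spam *= p_word_given_spam.get(w, 1e-4)  # tiny default for unseen words
        like_ham *= p_word_given_ham.get(w, 1e-4)
    return like_spam / (like_spam + like_ham)

print(naive_bayes_spam_score(["cheap", "pills"]))  # approx. 0.999
```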

A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent (hidden) Markov process, say X. An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X in a known way. Since X cannot be observed directly, the goal is to learn about the state of X by observing Y. By definition of being a Markov model, an HMM has the additional requirement that the outcome of Y at time t must be influenced exclusively by the outcome of X at time t, and that the outcomes of X and Y at earlier times must be conditionally independent of Y at time t given X at time t. Estimation of the parameters in an HMM can be performed using maximum likelihood; for linear chain HMMs, the Baum–Welch algorithm can be used to estimate the parameters.
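
In symbols, writing X_t for the hidden state and Y_t for the observation at time t, the two defining requirements can be stated as:

```latex
% The hidden chain is Markov; each observation depends only on the
% current hidden state.
P(X_t \mid X_{t-1}, \dots, X_1) = P(X_t \mid X_{t-1})
P(Y_t \mid X_1, \dots, X_t,\, Y_1, \dots, Y_{t-1}) = P(Y_t \mid X_t)
```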

In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.
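
In symbols, for a feature vector x, a learned weight vector w, and a bias term b, a binary linear classifier decides:

```latex
% Decision rule of a binary linear classifier.
\hat{y} = \operatorname{sign}(\mathbf{w} \cdot \mathbf{x} + b)
```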

CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.

In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.

In statistical classification, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

  1. A generative model is a statistical model of the joint probability distribution on a given observable variable X and target variable Y. A generative model can be used to "generate" random instances (outcomes) of an observation x (see the formulas after this list).
  2. A discriminative model is a model of the conditional probability of the target Y, given an observation x. It can be used to "discriminate" the value of the target variable Y, given an observation x.
  3. Classifiers computed without using a probability model are also referred to loosely as "discriminative".
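
In symbols, the first two approaches model different distributions, which are related through Bayes' rule:

```latex
% Generative: model the joint distribution of observation X and target Y.
P(X, Y) = P(Y)\, P(X \mid Y)
% Discriminative: model the conditional distribution of Y given X directly;
% a generative model yields the same quantity via Bayes' rule.
P(Y \mid X) = \frac{P(X, Y)}{\sum_{y} P(X, Y = y)}
```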

In probability theory, statistics, and machine learning, recursive Bayesian estimation, also known as a Bayes filter, is a general probabilistic approach for estimating an unknown probability density function (PDF) recursively over time using incoming measurements and a mathematical process model. The process relies heavily upon mathematical concepts and models that are theorized within a study of prior and posterior probabilities known as Bayesian statistics.
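
In symbols, for a latent state x_t and measurements z_1, ..., z_t, one recursion of the Bayes filter consists of a prediction step followed by an update step:

```latex
% Prediction: propagate the previous posterior through the process model.
p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, \mathrm{d}x_{t-1}
% Update: reweight the prediction by the likelihood of the new measurement.
p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})
```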

Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning and used for structured prediction. Whereas a classifier predicts a label for a single sample without considering "neighbouring" samples, a CRF can take context into account. To do so, the predictions are modelled as a graphical model, which represents the presence of dependencies between the predictions. What kind of graph is used depends on the application. For example, in natural language processing, "linear chain" CRFs are popular, for which each prediction is dependent only on its immediate neighbours. In image processing, the graph typically connects locations to nearby and/or similar locations to enforce that they receive similar predictions.
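
In one standard formulation, a linear-chain CRF scores a label sequence y against an observation sequence x as a normalized exponential of weighted feature functions over adjacent label pairs:

```latex
% f_k are feature functions, \lambda_k their learned weights, and
% Z(x) the normalizing constant summing over all label sequences.
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \exp\Big( \sum_{t} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big)
```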

Bitext word alignment: identifying translation relationships among the words in a bitext

Bitext word alignment or simply word alignment is the natural language processing task of identifying translation relationships among the words in a bitext, resulting in a bipartite graph between the two sides of the bitext, with an arc between two words if and only if they are translations of one another. Word alignment is typically done after sentence alignment has already identified pairs of sentences that are translations of one another.

Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering. Bayesian filtering relies on Bayesian probability to determine whether an incoming mail is spam or is not spam. The spammer hopes that the addition of random words that are unlikely to appear in a spam message will cause the spam filter to believe the message to be legitimate—a statistical type II error.

Discriminative models, also referred to as conditional models, are a class of models frequently used for classification. They are typically used to solve binary classification problems, i.e. assign labels, such as pass/fail, win/lose, alive/dead or healthy/sick, to existing datapoints.

In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it. Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property.
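
In symbols, the Markov property states that the next state depends only on the current state:

```latex
P(X_{t+1} = x \mid X_t, X_{t-1}, \dots, X_1) = P(X_{t+1} = x \mid X_t)
```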

In statistics, a maximum-entropy Markov model (MEMM), or conditional Markov model (CMM), is a graphical model for sequence labeling that combines features of hidden Markov models (HMMs) and maximum entropy (MaxEnt) models. An MEMM is a discriminative model that extends a standard maximum entropy classifier by assuming that the unknown values to be learnt are connected in a Markov chain rather than being conditionally independent of each other. MEMMs find applications in natural language processing, specifically in part-of-speech tagging and information extraction.
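
In one common formulation, an MEMM factors the label sequence into per-step maximum entropy (log-linear) distributions chained along the sequence:

```latex
% Each factor is a log-linear model with feature functions f_k, weights
% \lambda_k, and a per-step normalizer Z(y_{t-1}, x).
P(\mathbf{y} \mid \mathbf{x}) = \prod_{t} P(y_t \mid y_{t-1}, \mathbf{x}),
\qquad
P(y_t \mid y_{t-1}, \mathbf{x}) =
  \frac{\exp\big( \sum_k \lambda_k\, f_k(y_t, y_{t-1}, \mathbf{x}) \big)}{Z(y_{t-1}, \mathbf{x})}
```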

Gary Robinson: American software engineer and mathematician

Gary Robinson is an American software engineer, mathematician, and inventor, notable for his mathematical algorithms to fight spam. In addition, he patented a method to use web browser cookies to track consumers across different web sites, allowing marketers to better match advertisements with consumers. The patent was bought by DoubleClick, and then DoubleClick was bought by Google. He is credited as being one of the first to use automated collaborative filtering technologies to turn word-of-mouth recommendations into useful data.

In machine learning, sequence labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. A common example of a sequence labeling task is part of speech tagging, which seeks to assign a part of speech to each word in an input sentence or document. Sequence labeling can be treated as a set of independent classification tasks, one per member of the sequence. However, accuracy is generally improved by making the optimal label for a given element dependent on the choices of nearby elements, using special algorithms to choose the globally best set of labels for the entire sequence at once.

Bayesian programming: statistics concept

Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available.

The following outline is provided as an overview of and topical guide to machine learning.

References

  1. Chhabra, S., Yerazunis, W. S., & Siefkes, C. (2004). Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004), November 1–4, 2004. Washington, DC: IEEE Computer Society.
  2. Duarte, J. P., Leite, V. C., & Frutuoso e Melo, P. F. F. (2013, January). A comparison between Markovian models and Bayesian networks for treating some dependent events in reliability evaluations. 2013 International Nuclear Atlantic Conference, Recife, Brazil.
  3. Jurafsky, D., & Martin, J. H. (2023). Speech and Language Processing. Stanford University.
  4. Yerazunis, W. S. (2004, January). The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It. Mitsubishi Electric Research Laboratories, TR2004-091.