Markovian discrimination

Markovian discrimination is a spam-filtering method, used in CRM114 and other spam filters, that models the statistical behavior of spam and non-spam more accurately than simple Bayesian methods. [1] A simple Bayesian model of written text contains only a dictionary of legitimate words and their relative probabilities. A Markovian model adds relative transition probabilities: given one word, the probability of each word that may follow it. The approach is based on the theory of Markov chains by Andrey Markov, hence the name. In essence, a Bayesian filter works on single words alone, while a Markovian filter works on phrases or entire sentences.
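
As an illustration of the contrast, the following sketch scores a message first with single-word (Bayesian) evidence and then adds word-to-word transition (Markovian) evidence. All probability tables here are invented toy values, not taken from any real filter.

```python
from math import log

# Toy per-token probabilities, P(token | spam) and P(token | ham).
# These numbers are invented for illustration only.
P_SPAM = {"free": 0.05, "money": 0.04, "now": 0.03}
P_HAM  = {"free": 0.01, "money": 0.02, "now": 0.03}

# Toy transition probabilities, P(next | current, class).
T_SPAM = {("free", "money"): 0.30, ("money", "now"): 0.20}
T_HAM  = {("free", "money"): 0.02, ("money", "now"): 0.05}

def bayesian_score(words):
    """Log-likelihood ratio using single words only (bag-of-words)."""
    return sum(log(P_SPAM.get(w, 1e-4) / P_HAM.get(w, 1e-4)) for w in words)

def markovian_score(words):
    """Adds evidence from word-to-word transitions (first-order Markov)."""
    score = bayesian_score(words)
    for pair in zip(words, words[1:]):
        score += log(T_SPAM.get(pair, 1e-4) / T_HAM.get(pair, 1e-4))
    return score

msg = ["free", "money", "now"]
print(bayesian_score(msg), markovian_score(msg))  # transition evidence raises the score
```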

Comparison with the Bayesian Model and Application to Spam Filtering

In contrast to a simple Bayesian model, which represents a message as a bag of words and consults only a dictionary of legitimate words with their relative probabilities, a Markovian model also incorporates the transition probabilities between consecutive words, so phrases and sentence fragments carry evidence of their own. Markovian filters used in spam filtering are also not limited to the word level: tokens may be letters or partial words, and each token can be assigned a weight based on its probability of appearing in spam or legitimate content, further improving the accuracy of the filter. [2]
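
A minimal sketch of sub-word tokenization with per-token weights; the n-gram size and the weight table are invented for illustration.

```python
def char_ngrams(text, n=4):
    """Split text into overlapping character n-grams (sub-word tokens)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Hypothetical per-token weights: positive values indicate spam.
WEIGHTS = {"viag": 3.0, "agra": 2.5, "free": 1.2, "invo": -1.5}

def weighted_score(text):
    return sum(WEIGHTS.get(tok, 0.0) for tok in char_ngrams(text.lower()))

print(weighted_score("Free Viagra"))  # sub-word tokens fire even on partial words
```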

Markov Models

There are two variants of Markov models: the visible Markov model and the hidden Markov model (HMM). The difference lies in the treatment of the current word: a visible Markov model assumes the current word carries the entire information about the previous states of the language model, whereas a hidden Markov model obscures this relationship, assuming that the current word is only probabilistically linked to the words read before it. [3]
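
A minimal sketch of the hidden case, assuming invented transition and emission tables: the spam/ham states drive the process, but a reader of the output sees only the emitted words, never the states themselves.

```python
import random

# States are hidden; only emitted words are observed. All numbers invented.
STATES = ["spam", "ham"]
TRANS = {"spam": {"spam": 0.8, "ham": 0.2},
         "ham":  {"spam": 0.1, "ham": 0.9}}
EMIT  = {"spam": {"free": 0.6, "hello": 0.4},
         "ham":  {"free": 0.1, "hello": 0.9}}

def sample(n, state="ham"):
    observed = []
    for _ in range(n):
        state = random.choices(STATES, weights=[TRANS[state][s] for s in STATES])[0]
        word = random.choices(list(EMIT[state]), weights=list(EMIT[state].values()))[0]
        observed.append(word)  # the state itself is never revealed
    return observed

print(sample(5))
```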

Application of Hidden Markov Models

To illustrate: in a visible Markov model, the word "the" should predict the subsequent word with a high degree of accuracy. In a hidden Markov model, by contrast, the entire preceding text only probabilistically indicates the actual state, and hence the words to come, with no absolute guarantee of either. As spam filtering typically encounters the latter scenario, hidden Markov models are predominantly employed. Moreover, due to storage constraints, a related Markov model, the Markov random field, proves particularly suitable, typically with a 'sliding window' or clique size of between four and six tokens. [4]
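
A sketch of sliding-window feature extraction in that spirit, using a five-token window; the exact pairing scheme here is simplified relative to CRM114's actual feature generator.

```python
def window_features(tokens, width=5):
    """Pair the first token of each sliding window with every later token
    in the window, keeping the gap, so short phrases become features."""
    feats = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + width, len(tokens))):
            feats.append((tokens[i], j - i, tokens[j]))
    return feats

print(window_features(["click", "here", "for", "free", "money", "now"]))
```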

Related Research Articles

Markov chain – Random process independent of past history

A Markov chain or Markov process is a stochastic model describing a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. Informally, this may be thought of as, "What happens next depends only on the state of affairs now." A countably infinite sequence, in which the chain moves state at discrete time steps, gives a discrete-time Markov chain (DTMC). A continuous-time process is called a continuous-time Markov chain (CTMC). It is named after the Russian mathematician Andrey Markov.
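
A small worked example of a discrete-time Markov chain, with an invented two-state transition matrix; iterating the transition step drives the state distribution toward its stationary value.

```python
# A two-state DTMC: the next state depends only on the current one.
P = [[0.9, 0.1],   # P(stay in 0), P(0 -> 1)
     [0.5, 0.5]]   # P(1 -> 0), P(stay in 1)

dist = [1.0, 0.0]  # start in state 0 with certainty
for _ in range(50):
    dist = [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

print(dist)  # approaches the stationary distribution (~[0.833, 0.167])
```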

A hidden Markov model (HMM) is a Markov model in which the observations are dependent on a latent Markov process X. An HMM requires that there be an observable process Y whose outcomes depend on the outcomes of X in a known way. Since X cannot be observed directly, the goal is to learn about the state of X by observing Y. By the definition of being a Markov model, an HMM has an additional requirement that the outcome of Y at time t must be "influenced" exclusively by the outcome of X at time t, and that the outcomes of X and Y at times before t must be conditionally independent of Y at time t given X at time t. Estimation of the parameters in an HMM can be performed using maximum likelihood. For linear chain HMMs, the Baum–Welch algorithm can be used to estimate the parameters.
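
In standard HMM notation (with X the hidden process and Y the observations, as above), these conditional-independence requirements amount to the factorization:

```latex
P(X_{1:T},\, Y_{1:T}) \;=\; P(X_1)\,\prod_{t=2}^{T} P(X_t \mid X_{t-1})\,\prod_{t=1}^{T} P(Y_t \mid X_t)
```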

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. While similar, pattern recognition (PR) is not to be confused with pattern machines (PM), which may possess PR capabilities but whose primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, due to the increased availability of big data and a new abundance of processing power.

Markov property – Memoryless property of a stochastic process

In probability theory and statistics, the term Markov property refers to the memoryless property of a stochastic process, which means that, conditional on its present state, its future evolution is independent of its history. It is named after the Russian mathematician Andrey Markov. The term strong Markov property is similar to the Markov property, except that the meaning of "present" is defined in terms of a random variable known as a stopping time.
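
For a discrete-time process, the Markov property is commonly written as:

```latex
\Pr(X_{n+1} = x \mid X_1 = x_1,\, X_2 = x_2,\, \dots,\, X_n = x_n)
\;=\; \Pr(X_{n+1} = x \mid X_n = x_n)
```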

CRM114 is a program based upon a statistical approach for classifying data, and especially used for filtering email spam.

Naive Bayes classifiers are a popular statistical technique of e-mail filtering. They typically use bag-of-words features to identify email spam, an approach commonly used in text classification.
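
A minimal naive Bayes sketch in this bag-of-words spirit, trained on an invented two-message corpus with add-one smoothing:

```python
from collections import Counter
from math import log

def train(messages):
    """Estimate P(word | class) with Laplace (add-one) smoothing."""
    counts = {"spam": Counter(), "ham": Counter()}
    for label, text in messages:
        counts[label].update(text.lower().split())
    vocab = set(counts["spam"]) | set(counts["ham"])
    totals = {c: sum(counts[c].values()) for c in counts}
    return {c: {w: (counts[c][w] + 1) / (totals[c] + len(vocab))
                for w in vocab} for c in counts}

def classify(model, text):
    score = sum(log(model["spam"].get(w, 1e-6) / model["ham"].get(w, 1e-6))
                for w in text.lower().split())
    return "spam" if score > 0 else "ham"

model = train([("spam", "free money now"), ("ham", "meeting at noon")])
print(classify(model, "free money"))  # -> spam
```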

A word salad is a "confused or unintelligible mixture of seemingly random words and phrases", most often used to describe a symptom of a neurological or mental disorder. The name schizophasia is used in particular to describe the confused language that may be evident in schizophrenia. The words may or may not be grammatically correct, but they are semantically confused to the point that the listener cannot extract any meaning from them. The term is often used in psychiatry as well as in theoretical linguistics to describe a type of grammatical acceptability judgement by native speakers, and in computer programming to describe textual randomization.

In electrical engineering, statistical computing and bioinformatics, the Baum–Welch algorithm is a special case of the expectation–maximization algorithm used to find the unknown parameters of a hidden Markov model (HMM). It makes use of the forward-backward algorithm to compute the statistics for the expectation step.
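
A sketch of the forward pass alone, which computes the probability of each observation prefix ending in each state; the model tables are invented toy values.

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """Forward pass: alpha[t][s] is the probability of the first t+1
    observations with the hidden chain ending in state s."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for o in obs[1:]:
        alpha.append({s: emit_p[s][o] *
                      sum(alpha[-1][r] * trans_p[r][s] for r in states)
                      for s in states})
    return alpha

states = ["spam", "ham"]
start = {"spam": 0.5, "ham": 0.5}
trans = {"spam": {"spam": 0.8, "ham": 0.2}, "ham": {"spam": 0.1, "ham": 0.9}}
emit = {"spam": {"free": 0.6, "hello": 0.4}, "ham": {"free": 0.1, "hello": 0.9}}
print(forward(["free", "free", "hello"], states, start, trans, emit)[-1])
```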

In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient.

In probability theory, statistics, and machine learning, recursive Bayesian estimation, also known as a Bayes filter, is a general probabilistic approach for estimating an unknown probability density function (PDF) recursively over time using incoming measurements and a mathematical process model. The process relies heavily upon mathematical concepts and models that are theorized within a study of prior and posterior probabilities known as Bayesian statistics.
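
A sketch of the discrete case, alternating a prediction step through an invented process model with a measurement update:

```python
def bayes_filter(prior, transition, likelihood, measurements):
    """Discrete recursive Bayesian estimation: push the belief through the
    process model, then reweight by the measurement likelihood and normalize."""
    belief = list(prior)
    n = len(belief)
    for z in measurements:
        # Predict: belief'[j] = sum_i belief[i] * transition[i][j]
        belief = [sum(belief[i] * transition[i][j] for i in range(n))
                  for j in range(n)]
        # Update: multiply by P(z | state j) and renormalize.
        belief = [b * likelihood[j][z] for j, b in enumerate(belief)]
        total = sum(belief)
        belief = [b / total for b in belief]
    return belief

# Two hidden states, binary measurements; all numbers invented.
print(bayes_filter([0.5, 0.5],
                   [[0.7, 0.3], [0.3, 0.7]],
                   [{0: 0.9, 1: 0.1}, {0: 0.2, 1: 0.8}],
                   [1, 1, 0]))
```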

Bayesian poisoning is a technique used by e-mail spammers to attempt to degrade the effectiveness of spam filters that rely on Bayesian spam filtering. Bayesian filtering relies on Bayesian probability to determine whether an incoming mail is spam or is not spam. The spammer hopes that the addition of random words that are unlikely to appear in a spam message will cause the spam filter to believe the message to be legitimate—a statistical type II error.
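
A toy demonstration of the effect, assuming an invented per-word log-odds table: padding a spammy message with hammy words drags its score toward the legitimate side.

```python
# Invented log-odds, log(P(w|spam)/P(w|ham)): positive means spam-indicating.
LOG_ODDS = {"free": 2.0, "money": 1.5, "meeting": -1.8, "weather": -1.2}

def score(words):
    return sum(LOG_ODDS.get(w, 0.0) for w in words)

spam = ["free", "money"]
poisoned = spam + ["meeting", "weather", "weather"]  # padded with hammy words
print(score(spam), score(poisoned))  # 3.5 vs -0.7: the poisoned copy looks legitimate
```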

In the mathematical theory of stochastic processes, variable-order Markov (VOM) models are an important class of models that extend the well-known Markov chain models. In contrast to the Markov chain models, where each random variable in a sequence with a Markov property depends on a fixed number of random variables, in VOM models this number of conditioning random variables may vary based on the specific observed realization.
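
A sketch of the variable-order idea, assuming an invented context table: prediction uses the longest context actually observed, falling back to shorter ones.

```python
# Contexts of varying order, keyed by tuples of preceding words. Invented.
CONTEXTS = {
    ("free", "money"): {"now": 0.9, "back": 0.1},  # order 2
    ("money",): {"now": 0.5, "back": 0.5},         # order 1
    (): {"now": 0.3, "back": 0.7},                 # order 0 fallback
}

def predict(history, max_order=2):
    for k in range(min(max_order, len(history)), -1, -1):
        ctx = tuple(history[len(history) - k:])
        if ctx in CONTEXTS:
            return CONTEXTS[ctx]
    return CONTEXTS[()]

print(predict(["free", "money"]))   # matches the order-2 context
print(predict(["cheap", "money"]))  # falls back to the order-1 context
```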

In probability theory, a Markov model is a stochastic model used to model pseudo-randomly changing systems. It is assumed that future states depend only on the current state, not on the events that occurred before it. Generally, this assumption enables reasoning and computation with the model that would otherwise be intractable. For this reason, in the fields of predictive modelling and probabilistic forecasting, it is desirable for a given model to exhibit the Markov property.

Gary Robinson – American software engineer and mathematician

Gary Robinson is an American software engineer and mathematician and inventor notable for his mathematical algorithms to fight spam. In addition, he patented a method to use web browser cookies to track consumers across different web sites, allowing marketers to better match advertisements with consumers. The patent was bought by DoubleClick, and then DoubleClick was bought by Google. He is credited as being one of the first to use automated collaborative filtering technologies to turn word-of-mouth recommendations into useful data.

In machine learning, sequence labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. A common example of a sequence labeling task is part-of-speech tagging, which seeks to assign a part of speech to each word in an input sentence or document. Sequence labeling can be treated as a set of independent classification tasks, one per member of the sequence. However, accuracy is generally improved by making the optimal label for a given element dependent on the choices of nearby elements, using special algorithms to choose the globally best set of labels for the entire sequence at once.
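
A compact Viterbi-style sketch of such global decoding, with an invented two-tag model; it picks the jointly most probable label sequence rather than labeling each word independently.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        best = {s: max(((p * trans_p[r][s] * emit_p[s][o], path + [s])
                        for r, (p, path) in best.items()),
                       key=lambda t: t[0])
                for s in states}
    return max(best.values(), key=lambda t: t[0])[1]

states = ["Noun", "Verb"]
start = {"Noun": 0.6, "Verb": 0.4}
trans = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit = {"Noun": {"dogs": 0.7, "bark": 0.3}, "Verb": {"dogs": 0.1, "bark": 0.9}}
print(viterbi(["dogs", "bark"], states, start, trans, emit))  # ['Noun', 'Verb']
```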

In mathematics, specifically in the theory of Markovian stochastic processes in probability theory, the Chapman–Kolmogorov equation (CKE) is an identity relating the joint probability distributions of different sets of coordinates on a stochastic process. The equation was derived independently by both the British mathematician Sydney Chapman and the Russian mathematician Andrey Kolmogorov. The CKE is prominently used in recent Variational Bayesian methods.
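
For a discrete-state Markov process, the identity takes the standard form below, where p_ij(s, t) denotes the probability of moving from state i at time s to state j at time t:

```latex
p_{ik}(s, t) \;=\; \sum_{j} p_{ij}(s, u)\, p_{jk}(u, t), \qquad s < u < t
```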

Bayesian programming – Statistics concept

Bayesian programming is a formalism and a methodology for specifying probabilistic models and solving problems when less than the necessary information is available.

Éric Moulines – French researcher in statistical learning

Éric Moulines is a French researcher in statistical learning and signal processing. He received the silver medal from the CNRS in 2010 and the France Télécom prize, awarded in collaboration with the French Academy of Sciences, in 2011. He was appointed a Fellow of the European Association for Signal Processing in 2012 and of the Institute of Mathematical Statistics in 2016. He is General Engineer of the Corps des Mines (X81).

References

  1. Chhabra, S., Yerazunis, W. S., and Siefkes, C. (2004). "Spam Filtering using a Markov Random Field Model with Variable Weighting Schemas". In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004), November 1–4, 2004. IEEE Computer Society, Washington, DC.
  2. Duarte, J. P., Leite, V. C., and Frutuoso e Melo, P. F. F. (2013). "A comparison between Markovian models and Bayesian networks for treating some dependent events in reliability evaluations". In Proceedings of the 2013 International Nuclear Atlantic Conference, Recife, Brazil.
  3. Jurafsky, Daniel, and Martin, James H. (2023). Speech and Language Processing. Stanford University.
  4. Yerazunis, W. S. (2004). "The Spam-Filtering Accuracy Plateau at 99.9% Accuracy and How to Get Past It". Mitsubishi Electric Research Laboratories, TR2004-091.