Divergence-from-randomness model

In the field of information retrieval, divergence from randomness (DFR) is a generalization of one of the very first models, Harter's 2-Poisson indexing model. [1] It is one type of probabilistic model, used to measure the amount of information carried by terms in documents. The 2-Poisson model is based on the hypothesis that the informativeness of a term in a document is related to an elite set of documents, in which the term occurs to a relatively greater extent than in the rest of the documents. DFR is not a single 'model' but a framework for weighting terms using probabilistic methods, and its approach to term weighting is built on this notion of eliteness.

Term weights are treated as evidence of whether a specific word belongs to that elite set or not. Term weights are computed by measuring the divergence between a term distribution produced by a random process and the actual term distribution.

Divergence from randomness models are set up by instantiating the three main components of the framework: first selecting a basic randomness model, then applying the first normalization, and finally normalizing the term frequencies. The basic models are listed in the following tables.

Definition

The divergence from randomness is based on this idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in document d. In other words, the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness." [1]

 w(t, d) = -log2 ProbM(tf | d)   (Formula 1)

  1. M represents the type of randomness model used to calculate the probability.
  2. d is the total number of words in the document.
  3. tf is the number of occurrences of a specific word t in d.
  4. k is a value defined by M.

Different urn models can be used to choose the appropriate model M of randomness. In Information Retrieval we have documents instead of urns, and terms instead of colors. There are several ways to choose M, and each of them has a basic divergence from randomness model to support it. [1]
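
To make Formula 1 concrete, the following sketch computes the informative content -log2 ProbM(tf) under one possible choice of M, a Poisson randomness model whose mean is the collection frequency spread uniformly over the documents. The function and variable names are illustrative only and are not part of the framework.

import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing k occurrences under a Poisson model with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def informative_content(tf: int, coll_freq: int, num_docs: int) -> float:
    """-log2 Prob_M(tf), with M taken here to be a Poisson randomness model."""
    lam = coll_freq / num_docs  # expected term frequency under randomness
    return -math.log2(poisson_pmf(tf, lam))

# A term occurring 5 times in a document, but rare in the collection,
# diverges strongly from randomness and therefore carries much information.
print(informative_content(tf=5, coll_freq=200, num_docs=1000))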

Model

Basic Models

D      Divergence approximation of the binomial
P      Approximation of the binomial
BE     Bose-Einstein distribution
G      Geometric approximation of the Bose-Einstein
I(n)   Inverse Document Frequency model
I(F)   Inverse Term Frequency model
I(ne)  Inverse Expected Document Frequency model

DFR Models

BB2       Bernoulli-Einstein model with Bernoulli after-effect and normalization 2.
IFB2      Inverse Term Frequency model with Bernoulli after-effect and normalization 2.
In-expB2  Inverse Expected Document Frequency model with Bernoulli after-effect and normalization 2. The logarithms are base 2. This model can be used for classic ad-hoc tasks.
In-expC2  Inverse Expected Document Frequency model with Bernoulli after-effect and normalization 2. The logarithms are base e. This model can be used for classic ad-hoc tasks.
InL2      Inverse Document Frequency model with Laplace after-effect and normalization 2. This model can be used for tasks that require early precision.
PL2       Poisson model with Laplace after-effect and normalization 2. This model can be used for tasks that require early precision.
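
As an illustration of how a basic model, an after-effect, and a term-frequency normalization combine, here is a sketch of the PL2 weighting (Poisson basic model via a Stirling-type approximation, Laplace after-effect, normalization 2) in the form in which it is commonly published, for example in the Terrier documentation; the exact constants should be treated as an assumption and checked against a reference implementation before use.

import math

def pl2_weight(tf: float, dl: float, sl: float,
               F: float, N: float, c: float = 1.0) -> float:
    """Sketch of a PL2 score for one term in one document.

    tf: within-document term frequency   dl: document length
    sl: standard (average) length        F: occurrences of the term in the collection
    N: number of documents               c: free parameter of normalization 2
    """
    tfn = tf * math.log2(1.0 + c * sl / dl)       # normalization 2
    lam = F / N                                   # Poisson mean under randomness
    log2e = math.log2(math.e)
    inf1 = (tfn * math.log2(tfn / lam)            # Poisson basic model P
            + (lam + 1.0 / (12.0 * tfn) - tfn) * log2e
            + 0.5 * math.log2(2.0 * math.pi * tfn))
    inf2 = 1.0 / (tfn + 1.0)                      # Laplace after-effect L
    return inf2 * inf1

print(pl2_weight(tf=3, dl=150, sl=300, F=200, N=1000))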

First Normalization

When a specific rare term cannot be found in a document, that term has approximately zero probability of being informative for the document. On the other hand, if a rare term occurs frequently in a document, it can have a very high probability, close to 100%, of being informative for the topic described by the document. Applying Ponte and Croft's language model can also provide further insight. A risk component is therefore considered in DFR. Logically speaking, if the term-frequency in the document is relatively high, then the risk of the term not being informative is relatively small. If Formula 1 gives a high value, there is only a minimal risk of the negative effect of a small information gain. As a result, the weight given by Formula 1 is adjusted to consider only the portion that corresponds to the amount of information gained with the term. The more the term occurs in the elite set, the less its term-frequency is due to randomness, and thus the smaller the associated risk. We use two models to compute the information gain with a term within a document: the Laplace model L and the ratio of two Bernoulli processes B. [2]
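
A minimal sketch of the two first-normalization factors follows; the expressions used here, 1/(tfn + 1) for the Laplace model L and (F + 1)/(nt * (tfn + 1)) for the Bernoulli ratio B, with F the collection frequency and nt the document frequency, are the forms commonly quoted for the framework and should be read as assumptions rather than the definitive formulation.

def laplace_after_effect(tfn: float) -> float:
    """Laplace model L: the smaller the factor, the smaller the risk
    of the term not being informative (assumed form 1 / (tfn + 1))."""
    return 1.0 / (tfn + 1.0)

def bernoulli_after_effect(tfn: float, F: int, nt: int) -> float:
    """Ratio of two Bernoulli processes B (assumed form):
    (F + 1) / (nt * (tfn + 1)), with F the occurrences of the term in the
    collection and nt the number of documents containing it."""
    return (F + 1.0) / (nt * (tfn + 1.0))

# The term weight is the information content of Formula 1 multiplied by
# one of these gain factors.
print(laplace_after_effect(4.0), bernoulli_after_effect(4.0, F=200, nt=100))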

Term frequency normalization

Before using the within-document frequency tf of a term, the document length dl is normalized to a standard length sl. The term frequencies tf are therefore recalculated with respect to the standard document length, that is:

 tfn = tf * log(1+ sl/dl) (normalization 1)

tfn represents the normalized term frequency. Another version of the normalization formula is the following:

 tfn = tf * log(1 + c*(sl/dl)) (normalization 2)

Normalization 2 is usually considered more flexible, since c is a free parameter rather than a fixed value.

  1. tf is the term-frequency of the term t in the document d
  2. dl is the document-length.
  3. sl is the standard length.
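
Both normalizations translate directly into code. This sketch simply restates the two formulas above; the logarithm base is not fixed by them, so the natural logarithm is assumed here.

import math

def tfn_norm1(tf: float, dl: float, sl: float) -> float:
    """Normalization 1: tfn = tf * log(1 + sl/dl)."""
    return tf * math.log(1.0 + sl / dl)

def tfn_norm2(tf: float, dl: float, sl: float, c: float = 1.0) -> float:
    """Normalization 2: tfn = tf * log(1 + c * sl/dl), with c a free parameter."""
    return tf * math.log(1.0 + c * sl / dl)

# The same raw count is boosted in a short document and damped in a long one.
print(tfn_norm1(3, dl=100, sl=500), tfn_norm1(3, dl=1000, sl=500))
print(tfn_norm2(3, dl=100, sl=500, c=2.0))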

Mathematical and statistical tools

The probability space

Sampling space V

Utility-Theoretic Indexing, developed by Cooper and Maron, is a theory of indexing based on utility theory. Index terms are assigned to documents in order to reflect the value the users expect from those documents. Utility-Theoretic Indexing is also related to an "event space" in the statistical sense. There are several basic spaces Ω in Information Retrieval. A very simple basic space Ω is the set V of terms t, which is called the vocabulary of the document collection. Since Ω = V is the set of all mutually exclusive events, Ω is also the certain event, with probability [3]

 P(V) = Σ t∈V P(t) = 1

Thus P, the probability distribution, assigns probabilities to all sets of terms of the vocabulary. Notice that the basic problem of Information Retrieval is to find an estimate for P(t). Estimates are computed on the basis of sampling, and the experimental text collection furnishes the samples needed for the estimation. The main concern now is how to treat two arbitrary but heterogeneous pieces of text appropriately, for example a chapter in a science magazine and an article from a sports newspaper. They can be considered two different samples, since they aim at different populations.
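
The estimation problem can be made concrete on a toy collection. The sketch below computes a simple maximum-likelihood estimate of P(t) over the vocabulary V; the two "documents" are invented for the example and only illustrate the idea of heterogeneous samples.

from collections import Counter

# Two heterogeneous samples aiming at different populations.
science_chapter = "the gene encodes a protein and the protein folds".split()
sports_article = "the team won the match and the crowd cheered".split()

collection = science_chapter + sports_article
counts = Counter(collection)
total = sum(counts.values())

# Maximum-likelihood estimate of P(t) for every term of the vocabulary V.
P = {t: n / total for t, n in counts.items()}

assert abs(sum(P.values()) - 1.0) < 1e-9   # P is a distribution over Omega = V
print(P["the"], P["protein"])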

Sampling with a document

The relationship of a document with the experiment is given by the way in which the sample space is chosen. In IR, the word experiment, or trial, is used here with a technical meaning rather than in the common sense. For example, a document could be an experiment, meaning that the document is a sequence of outcomes t ∈ V, or just a sample of a population. We will talk about the event of observing a number Xt = tf of occurrences of a given word t in a sequence of experiments. In order to introduce this event space, we introduce the product of the probability spaces associated with the experiments of the sequence. We can introduce our sample space so as to associate a point with the possible configurations of the outcomes. The one-to-one correspondence for the sample space can be defined as:

 Ω = V × ... × V = V^ld

where ld is the number of trials of the experiment or, in this example, the length of the document. We can assume that each outcome may or may not depend on the outcomes of the previous experiments. If the experiments are designed so that an outcome influences the next outcomes, then the probability distribution on V is different at each trial. But, more commonly, in order to establish the simpler case in which the probability space is invariant, the term-independence assumption is often made in IR. Therefore, all possible configurations of Ω = V^ld are considered equiprobable. Under this assumption, we can consider each document a Bernoulli process. The probability spaces of the product are invariant, and the probability of a given sequence is the product of the probabilities at each trial. Consequently, if p = P(t) is the prior probability that the outcome is t and the number of experiments is ld, we obtain that the probability of Xt = tf is equal to:

 P(Xt = tf | p) = C(ld, tf) * p^tf * (1 - p)^(ld - tf)

This is the sum of the probabilities of all possible configurations having tf outcomes out of ld. P(Xt = tf | p) is a probability distribution, because

 Σ (tf = 0, ..., ld) P(Xt = tf | p) = (p + (1 - p))^ld = 1

  1. ld is the length of document d.
  2. tf is the term frequency of t in document d.
  3. Xt is the number of occurrences of a specific word in the sequence.
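
The Bernoulli-process view can be checked numerically. This sketch evaluates the binomial probability above for an invented document length and prior, and verifies that the values sum to one over tf = 0, ..., ld.

from math import comb

def prob_tf(tf: int, ld: int, p: float) -> float:
    """P(Xt = tf | p): probability of tf occurrences of t in ld independent trials."""
    return comb(ld, tf) * p ** tf * (1.0 - p) ** (ld - tf)

ld, p = 200, 0.01           # document of 200 tokens, prior P(t) = 0.01
print(prob_tf(2, ld, p))    # probability of observing the term exactly twice

# P(Xt = tf | p) is a probability distribution: its values sum to 1.
assert abs(sum(prob_tf(tf, ld, p) for tf in range(ld + 1)) - 1.0) < 1e-9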

Multiple samplings

Having considered the hypothesis of a single sample, we also need to consider the case of several samples, for example a collection D of documents. The situation of having a collection of N documents is abstractly equivalent to the scheme of placing a certain number Tot of balls of V different colors in a collection of N cells. For each term t ∈ V, a possible configuration of ball placement satisfies the equation:

 tf1+...+tfN=Ft

And the condition

 F1+...+FV=Tot

where Ft is the number of balls of the same color t to be distributed in the N cells. We have thus changed the basic space. The outcome of our experiment will be the documents d in which the balls are placed. Moreover, there are many possible configurations consistent with the numbers of colored balls.

  1. Ft The total number of tokens of t in the collection.
  2. Tot The total number of tokens in the collection D
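
The ball-placement constraints can be verified on a toy collection; the three documents below are invented for the example.

from collections import Counter

# A toy collection D of N = 3 documents (the "cells").
docs = [
    "a b a c".split(),
    "b b c".split(),
    "a c c c".split(),
]

per_doc = [Counter(d) for d in docs]           # tf of every term in every document
F = Counter(tok for d in docs for tok in d)    # F_t: tokens of color t in the collection
Tot = sum(F.values())                          # Tot: all tokens in the collection

# tf_1 + ... + tf_N = F_t for every term t, and F_1 + ... + F_V = Tot.
assert all(sum(c[t] for c in per_doc) == F[t] for t in F)
assert sum(F.values()) == Tot
print(dict(F), Tot)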

Distributions

Binomial distribution

Hypergeometric distribution

Bose-Einstein statistics

Fat-tailed distribution

Conclusion

The divergence from randomness model is based on the Bernoulli model and its limiting forms, the hypergeometric distribution, Bose-Einstein statistics and its limiting forms, the compound of the binomial distribution with the beta distribution, and the fat-tailed distribution. It provides a unifying framework with the potential to construct many different effective models of IR.

Applications

Applications and characteristics

  1. The divergence from randomness model can be applied to automatic indexing in Information Retrieval. This can be explained through the notion of eliteness, that is, of the informative content of a term within a document.
  2. The effectiveness of the models based on divergence from randomness is very high in comparison with both BM25 and language models. For short queries, the performance of the divergence from randomness models is definitely better than that of the BM25 model, which since 1994 has been used as a standard baseline for the comparison of models.
  3. The divergence from randomness model can achieve its best performance with only a few documents, compared to other query expansion techniques.
  4. The framework of the divergence from randomness model is very general and flexible. With query expansion provided for each component, different techniques can be applied in order to obtain the best performance.

Proximity

Proximity can be handled within divergence from randomness, to consider the number of occurrences of a pair of query terms within a window of pre-defined size. Specifically, the DFR Dependence Score Modifier (DSM) implements both the pBiL and pBiL2 models, which calculate the randomness divided by the document's length, rather than from the statistics of the pair in the corpus.

Examples of divergence from randomness

Let t be a term and c a collection. Let the term occur at tfc = nL(t,c) = 200 locations and in df(t,c) = nD(t,c) = 100 documents. The expected average term frequency is avgtf(t,c) = 200/100 = 2; this is the average over the documents in which the term occurs. Let ND(c) = 1000 be the total number of documents. The term then occurs in 10% of the documents: PD(t|c) = 100/1000. The expected average term frequency over all documents is 200/1000 = 1/5. The term-frequency values considered are kt = 0, ..., 6.

[Graph for Example 1]

The following table shows, in the column nD, the number of documents that contain kt occurrences of t, denoted nD(t,c,kt). The column nL gives the number of locations at which the term occurs, given by nL = kt * nD. The columns to the right show the observed and Poisson probabilities. Pobs,elite(kt) is the observed probability over all documents. PPoisson,all,lambda(kt) is the Poisson probability, where lambda(t,c) = nL(t,c)/ND(c) = 0.20 is the Poisson parameter. The table illustrates how the observed probability differs from the Poisson probability: PPoisson(1) is greater than Pobs(1), whereas for kt > 1 the observed probabilities are greater than the Poisson probabilities. There is more mass in the tail of the observed distribution than the Poisson distribution assumes. Moreover, the columns to the right illustrate the use of the elite documents instead of all documents. Here, the single-event probability is based on the locations of the elite documents only.
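
The quantities of the example can be recomputed with a few lines of code; the observed document counts of the table are not reproduced here, so only the Poisson side of the comparison is shown.

import math

n_L, n_D, N_D = 200, 100, 1000     # locations, documents containing t, all documents

avgtf = n_L / n_D                  # 2.0: average tf over the elite documents
P_D = n_D / N_D                    # 0.1: fraction of documents containing the term
lam = n_L / N_D                    # 0.2: Poisson parameter lambda(t, c)

def poisson(k: int, lam: float) -> float:
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Poisson probabilities over all documents for kt = 0, ..., 6; the observed
# distribution of the table puts more mass in the tail than these values.
for kt in range(7):
    print(kt, round(poisson(kt, lam), 6))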

Further examples of interest

  1. Adjusting document length.
  2. Applying DFR in content-only XML Documents
  3. Introduction to DFR models

References

  1. "Divergence From Randomness (DFR) Framework". terrier.org. University of Glasgow School of Computing Science. Retrieved 14 Sep 2024.
  2. He, Ben (27 April 2005). "DivergenceFromRandomness". ir.dcs.gla.ac.uk. Archived from the original on 11 Sep 2019.
  3. Amati, Giambattista (9 June 2003). Probabilistic Models of Information Retrieval Based on Measuring the Divergence from Randomness (PDF) (PhD thesis). University of Glasgow. Retrieved 14 Sep 2024 via theses.gla.ac.uk.