Discounted cumulative gain

Last updated May 13, 2024

Discounted cumulative gain (DCG) is a measure of ranking quality in information retrieval. It is often normalized so that it is comparable across queries, giving Normalized DCG (nDCG or NDCG). NDCG is often used to measure effectiveness of search engine algorithms and related applications. Using a graded relevance scale of documents in a search-engine result set, DCG sums the usefulness, or gain, of the results discounted by their position in the result list.^[1] NDCG is DCG normalized by the maximum possible DCG of the result set when ranked from highest to lowest gain, thus adjusting for the different numbers of relevant results for different queries.

Overview

Two assumptions are made in using DCG and its related measures.

Highly relevant documents are more useful when appearing earlier in a search engine result list (have higher ranks)
Highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents.

Cumulative Gain

DCG is a refinement of a simpler measure, Cumulative Gain (CG).^[2] Cumulative Gain is the sum of the graded relevance values of all results in a search result list. CG does not take into account the rank (position) of a result in the result list. The CG at a particular rank position $p$ is defined as:

\mathrm {CG_{p}} =\sum _{i=1}^{p}rel_{i}

Where $rel_{i}$ is the graded relevance of the result at position $i$ .

The value computed with the CG function is unaffected by changes in the ordering of search results. That is, moving a highly relevant document $d_{i}$ above a higher ranked, less relevant, document $d_{j}$ does not change the computed value for CG (assuming $i,j\leq p$ ). Based on the two assumptions made above about the usefulness of search results, (N)DCG is usually preferred over CG. Cumulative Gain is sometimes called Graded Precision.

Discounted Cumulative Gain

The premise of DCG is that highly relevant documents appearing lower in a search result list should be penalized, as the graded relevance value is reduced logarithmically proportional to the position of the result.

The usual formula of DCG accumulated at a particular rank position $p$ is defined as:^[1]

\mathrm {DCG_{p}} =\sum _{i=1}^{p}{\frac {rel_{i}}{\log _{2}(i+1)}}=rel_{1}+\sum _{i=2}^{p}{\frac {rel_{i}}{\log _{2}(i+1)}}

Until 2013, there was no theoretically sound justification for using a logarithmic reduction factor^[3] other than the fact that it produces a smooth reduction. But Wang et al. (2013)^[2] gave theoretical guarantee for using the logarithmic reduction factor in Normalized DCG (NDCG). The authors show that for every pair of substantially different ranking functions, the NDCG can decide which one is better in a consistent manner.

An alternative formulation of DCG^[4] places stronger emphasis on retrieving relevant documents:

\mathrm {DCG_{p}} =\sum _{i=1}^{p}{\frac {2^{rel_{i}}-1}{\log _{2}(i+1)}}

The latter formula is commonly used in industrial applications including major web search companies^[5] and data science competition platforms such as Kaggle.^[6]

These two formulations of DCG are the same when the relevance values of documents are binary;^[3]^: 320 $rel_{i}\in \{0,1\}$ .

Note that Croft et al. (2010) and Burges et al. (2005) present the second DCG with a log of base e, while both versions of DCG above use a log of base 2. When computing NDCG with the first formulation of DCG, the base of the log does not matter, but the base of the log does affect the value of NDCG for the second formulation. Clearly, the base of the log affects the value of DCG in both formulations.

Convex and smooth approximations to DCG have also been developed, for use as an objective function in gradient based learning methods.^[7]

Normalized DCG

Search result lists vary in length depending on the query. Comparing a search engine's performance from one query to the next cannot be consistently achieved using DCG alone, so the cumulative gain at each position for a chosen value of $p$ should be normalized across queries. This is done by sorting all relevant documents in the corpus by their relative relevance, producing the maximum possible DCG through position $p$ , also called Ideal DCG (IDCG) through that position. For a query, the normalized discounted cumulative gain, or nDCG, is computed as:

\mathrm {nDCG_{p}} ={\frac {DCG_{p}}{IDCG_{p}}}

,

where IDCG is ideal discounted cumulative gain,

\mathrm {IDCG_{p}} =\sum _{i=1}^{|REL_{p}|}{\frac {2^{rel_{i}}-1}{\log _{2}(i+1)}}

and $REL_{p}$ represents the list of relevant documents (ordered by their relevance) in the corpus up to position p.

The nDCG values for all queries can be averaged to obtain a measure of the average performance of a search engine's ranking algorithm. Note that in a perfect ranking algorithm, the $DCG_{p}$ will be the same as the $IDCG_{p}$ producing an nDCG of 1.0. All nDCG calculations are then relative values on the interval 0.0 to 1.0 and so are cross-query comparable.

The main difficulty encountered in using nDCG is the unavailability of an ideal ordering of results when only partial relevance feedback is available.

Example

Presented with a list of documents in response to a search query, an experiment participant is asked to judge the relevance of each document to the query. Each document is to be judged on a scale of 0-3 with 0 meaning not relevant, 3 meaning highly relevant, and 1 and 2 meaning "somewhere in between". For the documents ordered by the ranking algorithm as

D_{1},D_{2},D_{3},D_{4},D_{5},D_{6}

the user provides the following relevance scores:

3,2,3,0,1,2

That is: document 1 has a relevance of 3, document 2 has a relevance of 2, etc. The Cumulative Gain of this search result listing is:

\mathrm {CG_{6}} =\sum _{i=1}^{6}rel_{i}=3+2+3+0+1+2=11

Changing the order of any two documents does not affect the CG measure. If $D_{3}$ and $D_{4}$ are switched, the CG remains the same, 11. DCG is used to emphasize highly relevant documents appearing early in the result list. Using the logarithmic scale for reduction, the DCG for each result in order is:

$i$	$rel_{i}$	$\log _{2}(i+1)$	${\frac {rel_{i}}{\log _{2}(i+1)}}$
1	3	1	3
2	2	1.585	1.262
3	3	2	1.5
4	0	2.322	0
5	1	2.585	0.387
6	2	2.807	0.712

So the $DCG_{6}$ of this ranking is:

\mathrm {DCG_{6}} =\sum _{i=1}^{6}{\frac {rel_{i}}{\log _{2}(i+1)}}=3+1.262+1.5+0+0.387+0.712=6.861

Now a switch of $D_{3}$ and $D_{4}$ results in a reduced DCG because a less relevant document is placed higher in the ranking; that is, a more relevant document is discounted more by being placed in a lower rank.

The performance of this query to another is incomparable in this form since the other query may have more results, resulting in a larger overall DCG which may not necessarily be better. In order to compare, the DCG values must be normalized.

To normalize DCG values, an ideal ordering for the given query is needed. For this example, that ordering would be the monotonically decreasing sort of all known relevance judgments. In addition to the six from this experiment, suppose we also know there is a document $D_{7}$ with relevance grade 3 to the same query and a document $D_{8}$ with relevance grade 2 to that query. Then the ideal ordering is:

3,3,3,2,2,2,1,0

The ideal ranking is cut again to length 6 to match the depth of analysis of the ranking:

3,3,3,2,2,2

The DCG of this ideal ordering, or IDCG (Ideal DCG) , is computed to rank 6:

\mathrm {IDCG_{6}} =8.740

And so the nDCG for this query is given as:

\mathrm {nDCG_{6}} ={\frac {DCG_{6}}{IDCG_{6}}}={\frac {6.861}{8.740}}=0.785

Limitations

Normalized DCG does not penalize containing bad documents in the result. For example, if a query returns two results with scores $1,1,1$ and $1,1,1,0$ respectively, both would be considered equally good, even if the latter contains a bad document. For the ranking judgments $Excellent, Fair, Bad$ one might use numerical scores $1,0,-1$ instead of $2,1,0$ . This would cause the score to be lowered if bad results are returned, prioritizing the precision of the results over the recall; however, this approach can result in an overall negative score.
Normalized DCG does not penalize missing documents in the result. For example, if a query returns two results with scores $1,1,1$ and $1,1,1,1,1$ respectively, both would be considered equally good, assuming ideal DCG is computed to rank 3 for the former and rank 5 for the latter. One way to take into account this limitation is to enforce a fixed set size for the result set and use minimum scores for the missing documents. In the previous example, we would use the scores $1,1,1,0,0$ and $1,1,1,1,1$ and quote nDCG as nDCG@5.
Normalized DCG may not be suitable to measure the performance of queries that may have several equally good results. This is especially true when this metric is limited to only the first few results, as it is often done in practice. For example, for queries such as "restaurants" nDCG@1 accounts for only the first result. If one result set contains only 1 restaurant from the nearby area while the other contains 5, both would end up having the same score even though the latter is more comprehensive.

Related Research Articles

The Pareto distribution, named after the Italian civil engineer, economist, and sociologist Vilfredo Pareto, is a power-law probability distribution that is used in description of social, quality control, scientific, geophysical, actuarial, and many other types of observable phenomena; the principle originally applied to describing the distribution of wealth in a society, fitting the trend that a large portion of wealth is held by a small fraction of the population. The Pareto principle or "80-20 rule" stating that 80% of outcomes are due to 20% of causes was named in honour of Pareto, but the concepts are distinct, and only Pareto distributions with shape value of log₄5 ≈ 1.16 precisely reflect it. Empirical observation has shown that this 80-20 distribution fits a wide range of cases, including natural phenomena and human activities.

In probability and statistics, an exponential family is a parametric set of probability distributions of a certain form, specified below. This special form is chosen for mathematical convenience, including the enabling of the user to calculate expectations, covariances using differentiation based on some useful algebraic properties, as well as for generality, as exponential families are in a sense very natural sets of distributions to consider. The term exponential class is sometimes used in place of "exponential family", or the older term Koopman–Darmois family. Sometimes loosely referred to as "the" exponential family, this class of distributions is distinct because they all possess a variety of desirable properties, most importantly the existence of a sufficient statistic.

<span class="mw-page-title-main">Mutual information</span> Measure of dependence between two variables

In probability theory and information theory, the mutual information (MI) of two random variables is a measure of the mutual dependence between the two variables. More specifically, it quantifies the "amount of information" obtained about one random variable by observing the other random variable. The concept of mutual information is intimately linked to that of entropy of a random variable, a fundamental notion in information theory that quantifies the expected "amount of information" held in a random variable.

In information geometry, the Fisher information metric is a particular Riemannian metric which can be defined on a smooth statistical manifold, i.e., a smooth manifold whose points are probability measures defined on a common probability space. It can be used to calculate the informational difference between measurements.

<span class="mw-page-title-main">Logistic distribution</span> Continuous probability distribution

In probability theory and statistics, the logistic distribution is a continuous probability distribution. Its cumulative distribution function is the logistic function, which appears in logistic regression and feedforward neural networks. It resembles the normal distribution in shape but has heavier tails. The logistic distribution is a special case of the Tukey lambda distribution.

The equilibrium constant of a chemical reaction is the value of its reaction quotient at chemical equilibrium, a state approached by a dynamic chemical system after sufficient time has elapsed at which its composition has no measurable tendency towards further change. For a given set of reaction conditions, the equilibrium constant is independent of the initial analytical concentrations of the reactant and product species in the mixture. Thus, given the initial composition of a system, known equilibrium constant values can be used to determine the composition of the system at equilibrium. However, reaction parameters like temperature, solvent, and ionic strength may all influence the value of the equilibrium constant.

Feature selection is the process of selecting a subset of relevant features for use in model construction. Stylometry and DNA microarray analysis are two cases where feature selection is used. It should be distinguished from feature extraction.

In probability theory and statistics, the generalized extreme value (GEV) distribution is a family of continuous probability distributions developed within extreme value theory to combine the Gumbel, Fréchet and Weibull families also known as type I, II and III extreme value distributions. By the extreme value theorem the GEV distribution is the only possible limit distribution of properly normalized maxima of a sequence of independent and identically distributed random variables. Note that a limit distribution needs to exist, which requires regularity conditions on the tail of the distribution. Despite this, the GEV distribution is often used as an approximation to model the maxima of long (finite) sequences of random variables.

Hyperlink-Induced Topic Search is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages. In other words, a good hub represents a page that pointed to many other pages, while a good authority represents a page that is linked by many different hubs.

In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a measure of importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general. It was often used as a weighting factor in searches of information retrieval, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used tf–idf.

<span class="mw-page-title-main">F-score</span> Statistical measure of a tests accuracy

In statistical analysis of binary classification and information retrieval systems, the F-score or F-measure is a measure of predictive performance. It is calculated from the precision and recall of the test, where the precision is the number of true positive results divided by the number of all samples predicted to be positive, including those not identified correctly, and the recall is the number of true positive results divided by the number of all samples that should have been identified as positive. Precision is also known as positive predictive value, and recall is also known as sensitivity in diagnostic binary classification.

In information retrieval, Okapi BM25 is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

<span class="mw-page-title-main">Precision and recall</span> Pattern-recognition performance metrics

In pattern recognition, information retrieval, object detection and classification, precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space.

Ranking of query is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. Given a query $q$ and a collection $D$ of documents that match the query, the problem is to rank, that is, sort, the documents in $D$ according to some criterion so that the "best" results appear early in the result list displayed to the user. Ranking in terms of information retrieval is an important concept in computer science and is used in many different applications such as search engine queries and recommender systems. A majority of search engines use ranking algorithms to provide users with accurate and relevant results.

Vector space model or term vector model is an algebraic model for representing text documents as vectors such that the distance between vectors represents the relevance between the documents. It is used in information filtering, information retrieval, indexing and relevancy rankings. Its first use was in the SMART Information Retrieval System.

Learning to rank or machine-learned ranking (MLR) is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems. Training data may, for example, consist of lists of items with some partial order specified between items in each list. This order is typically induced by giving a numerical or ordinal score or a binary judgment for each item. The goal of constructing the ranking model is to rank new, unseen lists in a similar way to rankings in the training data.

The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algorithm was developed using the vector space model. Its underlying assumption is that most users have a general conception of which documents should be denoted as relevant or irrelevant. Therefore, the user's search query is revised to include an arbitrary percentage of relevant and irrelevant documents as a means of increasing the search engine's recall, and possibly the precision as well. The number of relevant and irrelevant documents allowed to enter a query is dictated by the weights of the a, b, c variables listed below in the Algorithm section.

The Binary Independence Model (BIM) in computing and information science is a probabilistic information retrieval technique. The model makes some simple assumptions to make the estimation of document/query similarity probable and feasible.

<span class="mw-page-title-main">PageRank</span> Algorithm used by Google Search to rank web pages

PageRank (PR) is an algorithm used by Google Search to rank web pages in their search engine results. It is named after both the term "web page" and co-founder Larry Page. PageRank is a way of measuring the importance of website pages. According to Google:

PageRank works by counting the number and quality of links to a page to determine a rough estimate of how important the website is. The underlying assumption is that more important websites are likely to receive more links from other websites.

Evaluation measures for an information retrieval (IR) system assess how well an index, search engine, or database returns results from a collection of resources that satisfy a user's query. They are therefore fundamental to the success of information systems and digital platforms.

References

1 2 Kalervo Järvelin, Jaana Kekäläinen, "Cumulated gain-based evaluation of IR techniques". ACM Transactions on Information Systems20(4), 422–446 (2002)
1 2 Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, Tie-Yan Liu. 2013. A Theoretical Analysis of Normalized Discounted Cumulative Gain (NDCG) Ranking Measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).
1 2 B. Croft; D. Metzler; T. Strohman (2010). Search Engines: Information Retrieval in Practice. Addison Wesley.
↑ Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning (ICML '05). ACM, New York, NY, USA, 89-96. DOI=10.1145/1102351.1102363 http://doi.acm.org/10.1145/1102351.1102363
↑ "Introduction to Information Retrieval - Evaluation" (PDF). Stanford University. 21 April 2013. Retrieved 23 March 2014.
↑ "Normalized Discounted Cumulative Gain". Archived from the original on 23 March 2014. Retrieved 23 March 2014.
↑ D. Cossock and T. Zhang, "Statistical Analysis of Bayes Optimal Subset Ranking," in IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 5140-5154, Nov. 2008, doi: 10.1109/TIT.2008.929939.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[jarvelin_2002_cumulated-1] 1 2 Kalervo Järvelin, Jaana Kekäläinen, "Cumulated gain-based evaluation of IR techniques". ACM Transactions on Information Systems20(4), 422–446 (2002)

[:0-2] 1 2 Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, Tie-Yan Liu. 2013. A Theoretical Analysis of Normalized Discounted Cumulative Gain (NDCG) Ranking Measures. In Proceedings of the 26th Annual Conference on Learning Theory (COLT 2013).

[CMS2009-3] 1 2 B. Croft; D. Metzler; T. Strohman (2010). Search Engines: Information Retrieval in Practice. Addison Wesley.

[4] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning (ICML '05). ACM, New York, NY, USA, 89-96. DOI=10.1145/1102351.1102363 http://doi.acm.org/10.1145/1102351.1102363

[stanfordireval-5] "Introduction to Information Retrieval - Evaluation" (PDF). Stanford University. 21 April 2013. Retrieved 23 March 2014.

[6] "Normalized Discounted Cumulative Gain". Archived from the original on 23 March 2014. Retrieved 23 March 2014.

[7] D. Cossock and T. Zhang, "Statistical Analysis of Bayes Optimal Subset Ranking," in IEEE Transactions on Information Theory, vol. 54, no. 11, pp. 5140-5154, Nov. 2008, doi: 10.1109/TIT.2008.929939.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

v t e Machine learning evaluation metrics
Regression	MSE MAE sMAPE MAPE MASE MSPE RMS RMSE/RMSD R² MDA MAD
Classification	F-score P4 Accuracy Precision Recall Kappa MCC AUC ROC Sensitivity and specificity Logarithmic Loss
Clustering	Silhouette Calinski-Harabasz index Davies-Bouldin Dunn index Hopkins statistic Jaccard index Rand index Similarity measure SMC SimHash
Ranking	MRR NDCG AP
Computer Vision	PSNR SSIM IoU
NLP	Perplexity BLEU
Deep Learning Related Metrics	Inception score FID
Recommender system	Coverage Intra-list Similarity
Similarity	Cosine similarity Euclidean distance Pearson correlation coefficient
Confusion matrix