nDCG - Normalized Discounted Cumulative Gain
What is nDCG
nDCG is a rank-aware evaluation metric that measures the quality of search results by comparing the actual ranking of documents against an ideal ranking. It accounts for both relevance and position in the ranked list.
nDCG is calculated by dividing the Discounted Cumulative Gain (DCG) by the Ideal Discounted Cumulative Gain (IDCG).
When to use nDCG
- Position-sensitive: Heavily weights top-ranked results.
- Normalized: Enables comparison across different queries.
- Handles graded relevance: Works with multi-level relevance judgments (not just binary), e.g., relevance grades of 3, 2, 1, and 0.
- Industry standard: Widely used in information retrieval, recommendation systems, and major benchmarks such as MTEB (Massive Text Embedding Benchmark).
Key Components
1. Cumulative Gain (CG)
Sum of relevance scores of all retrieved documents
\(CG = \sum_{i=1}^{k} rel_i\)
(where \(rel_i\) is the relevance of document at position \(i\))
2. Discounted Cumulative Gain (DCG)
Applies a logarithmic discount to lower-ranked positions:
\(DCG = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}\)
Higher-ranked documents contribute more to the score.
3. Ideal DCG (IDCG)
DCG of the perfect ranking (documents sorted by relevance in descending order). This represents the maximum possible DCG for a given query.
4. Normalized DCG (nDCG)
\(nDCG = \frac{DCG}{IDCG}\)
Range: 0 to 1 (1 = perfect ranking)
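A minimal sketch of these formulas in Python (the helper names `dcg` and `ndcg` are illustrative, not from any particular library):

```python
import math
from typing import Sequence

def dcg(relevances: Sequence[float]) -> float:
    """Discounted Cumulative Gain of relevance scores given in rank order."""
    # i is 0-based, so the formula's 1/log2(rank + 1) becomes 1/log2(i + 2)
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg(relevances: Sequence[float]) -> float:
    """nDCG = DCG of the system ranking / DCG of the ideal (sorted) ranking."""
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0
```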
Example Calculation
Consider a query with 5 documents ranked by a system, with relevance judgments:
| Rank | Document | Relevance Score | 1/log₂(rank+1) | Contribution to DCG |
|---|---|---|---|---|
| 1 | Doc A | 3 | 1.000 | 3.000 |
| 2 | Doc B | 1 | 0.631 | 0.631 |
| 3 | Doc C | 2 | 0.500 | 1.000 |
| 4 | Doc D | 0 | 0.431 | 0.000 |
| 5 | Doc E | 1 | 0.387 | 0.387 |
DCG = 3.000 + 0.631 + 1.000 + 0.000 + 0.387 = 5.018
Ideal ranking (sorted by relevance): Doc A (3), Doc C (2), Doc B (1), Doc E (1), Doc D (0).
IDCG = 3.000 + 1.262 + 0.500 + 0.431 + 0.000 = 5.193
nDCG = 5.018 / 5.193 = 0.966
This high nDCG score indicates the ranking is very close to optimal, with only Doc B and Doc C swapped from the ideal order.
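The worked example can be verified in a few lines of Python (standard library only; the relevance scores are hard-coded from the table above):

```python
import math

relevances = [3, 1, 2, 0, 1]              # system ranking: Doc A..Doc E
ideal = sorted(relevances, reverse=True)  # [3, 2, 1, 1, 0]

dcg  = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))
idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))

print(round(dcg, 3), round(idcg, 3), round(dcg / idcg, 3))
# 5.018 5.193 0.966
```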
nDCG@k Variant
nDCG@k (e.g., nDCG@10) evaluates only the top k documents in the ranking:
- Focuses on precision at the top of the results
- More practical than evaluating entire result sets
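A self-contained sketch of nDCG@k, assuming (as is common) that IDCG@k is the DCG of the k most relevant judged documents; the function name `ndcg_at_k` is illustrative:

```python
import math

def ndcg_at_k(relevances, k):
    """nDCG@k: evaluate only the top-k ranked positions.

    relevances: graded relevance of the system's results, in rank order.
    IDCG@k is taken as the DCG of the k most relevant judged documents.
    """
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    idcg_k = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / idcg_k if idcg_k > 0 else 0.0

# Top-3 of the worked example: [3, 1, 2] scored against ideal top-3 [3, 2, 1]
print(ndcg_at_k([3, 1, 2, 0, 1], k=3))  # ≈ 0.9725
```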
Dataset-Level Evaluation
When evaluating on a dataset with multiple queries, compute the overall nDCG as follows:
- Compute nDCG for each query individually using that query's relevance judgments.
- Average across all queries to get the dataset-level score (sketched in code after the considerations below):
\(\text{Dataset nDCG} = \frac{1}{|Q|} \sum_{q \in Q} \text{nDCG}_q\), where \(Q\) is the set of all queries.
Important considerations:
- Each query contributes equally to the final score regardless of the number of relevant documents
- Queries with no relevant documents (IDCG = 0) are typically excluded from the average
- Report the median nDCG alongside the mean when the score distribution is skewed by a few unusually easy or hard queries.
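A sketch of the dataset-level average, reusing the `ndcg_at_k` helper sketched above; the data layout (a dict of per-query relevance lists) is an illustrative assumption, and queries whose IDCG would be zero are skipped as noted:

```python
def dataset_ndcg(per_query_relevances, k=10):
    """Mean nDCG@k over a set of queries.

    per_query_relevances: dict mapping query id -> relevance grades (non-negative)
    of the documents the system returned for that query, in rank order.
    """
    scores = []
    for rels in per_query_relevances.values():
        if sum(rels) == 0:   # no relevant documents -> IDCG = 0, skip the query
            continue
        scores.append(ndcg_at_k(rels, k))
    return sum(scores) / len(scores) if scores else 0.0

# Illustrative usage with made-up judgments for three queries (q3 is skipped)
print(dataset_ndcg({"q1": [3, 1, 2, 0, 1], "q2": [0, 2, 1], "q3": [0, 0, 0]}, k=3))
```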
Interpretation
- nDCG = 1.0 represents a perfect ranking, in which all relevant documents appear in descending order of their relevance scores
- nDCG scores cannot be directly compared across different datasets. Dataset characteristics affect achievable scores: number of relevant documents, relevance distribution, and query difficulty.
- Use basic retrieval methods (e.g., BM25, TF-IDF) as baselines to gauge the difficulty of a dataset.
Reference
https://docs.cohere.com/docs/rerank-understanding-the-results#ndcg10