MRR - Mean Reciprocal Rank
What is MRR
Mean Reciprocal Rank (MRR) is a rank-aware relevance evaluation metric that measures how well a system ranks the first relevant document. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1/2 for second place, and 1/n for the nth place. It focuses specifically on the position of the highest-ranked relevant item: only the rank of the first relevant answer counts, and any further relevant answers are ignored.
When to use MRR
- First-hit and ranking focused: Evaluates how well a system ranks results, with particular emphasis on placing the first relevant result as high as possible.
- Question answering: For scenarios where users typically need only one correct answer. The metric assumes that once a user finds the first relevant item, their search task is complete.
- Binary relevance: When relevance labels are binary (relevant or not relevant), with no intermediate levels of relevance.
Key Components
1. Reciprocal Rank (RR)
For a single query, the reciprocal rank is the inverse of the position of the first relevant document:
\(RR = \frac{1}{\text{rank of first relevant document}}\)
- If the first relevant document is at rank 1: RR = 1.0
- If the first relevant document is at rank 3: RR = 1/3 = 0.333
- If no relevant documents are found: RR = 0.0
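A minimal Python sketch of this definition (the `reciprocal_rank` helper and its 0/1 relevance-label input are illustrative assumptions, not part of any particular library):

```python
def reciprocal_rank(relevance):
    """Reciprocal rank for a single ranked result list.

    `relevance` is a list of 0/1 labels ordered by rank (index 0 = rank 1).
    Returns 1 / rank of the first relevant item, or 0.0 if none is relevant.
    """
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


print(reciprocal_rank([0, 1, 0, 1]))  # 0.5  (first relevant at rank 2)
print(reciprocal_rank([0, 0, 0]))     # 0.0  (no relevant document)
```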
2. Mean Reciprocal Rank (MRR)
For a dataset with multiple queries, MRR is the average of the per-query reciprocal ranks:
\(MRR = \frac{1}{|Q|} \sum_{q=1}^{|Q|} RR_q\)
where \(|Q|\) is the total number of queries and \(RR_q\) is the reciprocal rank for query \(q\).
Range: 0 to 1, where 1 is perfect performance (every query's first relevant document is ranked at position 1)
Important considerations:
- Each query contributes equally regardless of how many relevant documents it has
- Queries with no relevant documents contribute 0 to the average
- MRR treats all relevant documents beyond the first as having no additional value
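Continuing the sketch above, the dataset-level metric is just the arithmetic mean of the per-query values. This reuses the illustrative `reciprocal_rank` helper defined earlier; `mean_reciprocal_rank` is likewise a hypothetical name, not a library function:

```python
def mean_reciprocal_rank(relevance_lists):
    """Average the per-query reciprocal ranks over a dataset.

    Queries with no relevant documents contribute 0, as noted above.
    """
    if not relevance_lists:
        return 0.0
    return sum(reciprocal_rank(r) for r in relevance_lists) / len(relevance_lists)
```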
Example Calculation
Consider a dataset with 4 queries and their search results:
| Query ID | Query Text | Ranked Results | Relevance Labels | First Relevant Position | Reciprocal Rank |
|---|---|---|---|---|---|
| Q1 | python | [R1,R2,R3,R4] | [0,1,0,1] | 2 | 0.500 |
| Q2 | ML | [R5,R6,R7,R8] | [1,0,1,0] | 1 | 1.000 |
| Q3 | data | [R9,R10,R11] | [0,0,1] | 3 | 0.333 |
| Q4 | AI | [R1,R2,R8,R12] | [0,0,0,0] | None found | 0.000 |
MRR = (0.500 + 1.000 + 0.333 + 0.000) / 4 = 0.458
This MRR score indicates that the first relevant document appears, in a harmonic-mean sense, around position 2.2 (1/0.458 ≈ 2.18) in the rankings.
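Using the hypothetical helpers sketched earlier, the example table can be reproduced directly from the relevance labels:

```python
queries = {
    "Q1": [0, 1, 0, 1],  # first relevant at rank 2 -> RR = 0.500
    "Q2": [1, 0, 1, 0],  # first relevant at rank 1 -> RR = 1.000
    "Q3": [0, 0, 1],     # first relevant at rank 3 -> RR = 0.333
    "Q4": [0, 0, 0, 0],  # no relevant document     -> RR = 0.000
}

print(round(mean_reciprocal_rank(list(queries.values())), 3))  # 0.458
```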
MRR@k Variant
MRR@k (e.g., MRR@10) only considers the top k documents in the ranking:
- If the first relevant document appears beyond rank k, the reciprocal rank is 0
- More practical for evaluating systems with large result sets
- Focuses evaluation on the most visible results to users
Example with MRR@3:
- Query with first relevant document at rank 5: RR = 0 (beyond top 3)
- Query with first relevant document at rank 2: RR = 0.5 (within top 3)
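A sketch of the cutoff variant, assuming the `reciprocal_rank` helper from above: truncating the ranking to the top k positions before computing the reciprocal rank gives exactly this behavior.

```python
def reciprocal_rank_at_k(relevance, k):
    """Reciprocal rank that only credits the first relevant item if it
    appears within the top k positions; otherwise returns 0.0."""
    return reciprocal_rank(relevance[:k])


print(reciprocal_rank_at_k([0, 0, 0, 0, 1], k=3))  # 0.0 (first relevant at rank 5)
print(reciprocal_rank_at_k([0, 1, 0, 0, 1], k=3))  # 0.5 (first relevant at rank 2)
```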
Interpretation
- MRR = 1.0 represents perfect performance where every query's first relevant document is ranked at position 1
- MRR = 0.5 indicates that the first relevant document typically appears around rank 2 (in a harmonic-mean sense)
- MRR = 0.0 means no relevant documents were found for any query
MRR scores are directly interpretable since 1/MRR gives the harmonic mean of the rank of the first relevant document. However, MRR scores cannot be directly compared across different datasets due to varying query difficulty and relevance patterns.
Best Practices
- Use established baselines (e.g., BM25, random ranking) to understand the relative difficulty of datasets when reporting MRR scores
- Consider MRR@k variants for practical evaluation scenarios
References
- https://en.wikipedia.org/wiki/Mean_reciprocal_rank
- https://docs.cohere.com/docs/rerank-understanding-the-results#mrr10