MRR - Mean Reciprocal Rank
What is MRR
Mean Reciprocal Rank (MRR) is a rank-aware relevance evaluation metric that measures how well a system ranks the first relevant document. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1/2 for second place, and 1/n for the nth place. It focuses specifically on the position of the highest-ranked relevant item: only the rank of the first relevant answer counts, and any further relevant answers are ignored.
When to use MRR
- First-hit and ranking focused: Evaluates how well a system ranks results, with particular emphasis on placing the first relevant result as high as possible.
- Question answering: For scenarios where users typically need only one correct answer. The metric assumes that once a user finds the first relevant item, their search task is complete.
- Binary relevance: When relevance labels are binary (relevant or not relevant), with no intermediate levels of relevance.
Key Components
1. Reciprocal Rank (RR)
For a single query, the reciprocal rank is the inverse of the position of the first relevant document:
\(RR = \frac{1}{\text{rank of first relevant document}}\)
- If the first relevant document is at rank 1: RR = 1.0
- If the first relevant document is at rank 3: RR = 1/3 = 0.333
- If no relevant documents are found: RR = 0.0
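A minimal Python sketch of this definition (the `reciprocal_rank` helper and its 0/1 relevance-label input are illustrative assumptions, not part of any particular library):

```python
def reciprocal_rank(relevance):
    """Reciprocal rank for a single ranked result list.

    `relevance` is a list of 0/1 labels ordered by rank (index 0 = rank 1).
    Returns 1 / rank of the first relevant item, or 0.0 if none is relevant.
    """
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


print(reciprocal_rank([0, 1, 0, 1]))  # 0.5  (first relevant at rank 2)
print(reciprocal_rank([0, 0, 0]))     # 0.0  (no relevant document)
```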
2. Mean Reciprocal Rank (MRR)
For a dataset with multiple queries, MRR is the average of the per-query reciprocal ranks:
\(MRR = \frac{1}{|Q|} \sum_{q=1}^{|Q|} RR_q\)
where \(|Q|\) is the total number of queries and \(RR_q\) is the reciprocal rank for query \(q\).
Range: 0 to 1, where 1 is perfect performance (every query's first relevant document is ranked at position 1)
Important considerations:
- Each query contributes equally regardless of how many relevant documents it has
- Queries with no relevant documents contribute 0 to the average
- MRR treats all relevant documents beyond the first as having no additional value
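Continuing the sketch above, the dataset-level metric is just the arithmetic mean of the per-query values. This reuses the illustrative `reciprocal_rank` helper defined earlier; `mean_reciprocal_rank` is likewise a hypothetical name, not a library function:

```python
def mean_reciprocal_rank(relevance_lists):
    """Average the per-query reciprocal ranks over a dataset.

    Queries with no relevant documents contribute 0, as noted above.
    """
    if not relevance_lists:
        return 0.0
    return sum(reciprocal_rank(r) for r in relevance_lists) / len(relevance_lists)
```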
Example Calculation
Consider a dataset with 4 queries and their search results:
| Query ID | Query Text | Ranked Results | Relevance Labels | First Relevant Position | Reciprocal Rank |
|---|---|---|---|---|---|
| Q1 | python | [R1,R2,R3,R4] | [0,1,0,1] | 2 | 0.500 |
| Q2 | ML | [R5,R6,R7,R8] | [1,0,1,0] | 1 | 1.000 |
| Q3 | data | [R9,R10,R11] | [0,0,1] | 3 | 0.333 |
| Q4 | AI | [R1,R2,R8,R12] | [0,0,0,0] | None found | 0.000 |
MRR = (0.500 + 1.000 + 0.333 + 0.000) / 4 = 0.458
This MRR score indicates that the first relevant document appears, in a harmonic-mean sense, around position 2.2 (1/0.458 ≈ 2.18) in the rankings.
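Using the hypothetical helpers sketched earlier, the example table can be reproduced directly from the relevance labels:

```python
queries = {
    "Q1": [0, 1, 0, 1],  # first relevant at rank 2 -> RR = 0.500
    "Q2": [1, 0, 1, 0],  # first relevant at rank 1 -> RR = 1.000
    "Q3": [0, 0, 1],     # first relevant at rank 3 -> RR = 0.333
    "Q4": [0, 0, 0, 0],  # no relevant document     -> RR = 0.000
}

print(round(mean_reciprocal_rank(list(queries.values())), 3))  # 0.458
```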
MRR@k Variant
MRR@k (e.g., MRR@10) only considers the top k documents in the ranking:
- If the first relevant document appears beyond rank k, the reciprocal rank is 0
- More practical for evaluating systems with large result sets
- Focuses evaluation on the most visible results to users
Example with MRR@3:
- Query with first relevant document at rank 5: RR = 0 (beyond top 3)
- Query with first relevant document at rank 2: RR = 0.5 (within top 3)
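A sketch of the cutoff variant, assuming the `reciprocal_rank` helper from above: truncating the ranking to the top k positions before computing the reciprocal rank gives exactly this behavior.

```python
def reciprocal_rank_at_k(relevance, k):
    """Reciprocal rank that only credits the first relevant item if it
    appears within the top k positions; otherwise returns 0.0."""
    return reciprocal_rank(relevance[:k])


print(reciprocal_rank_at_k([0, 0, 0, 0, 1], k=3))  # 0.0 (first relevant at rank 5)
print(reciprocal_rank_at_k([0, 1, 0, 0, 1], k=3))  # 0.5 (first relevant at rank 2)
```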
Interpretation
- MRR = 1.0 represents perfect performance where every query's first relevant document is ranked at position 1
- MRR = 0.5 indicates that the first relevant document typically appears around rank 2 (in a harmonic-mean sense)
- MRR = 0.0 means no relevant documents were found for any query
MRR scores are directly interpretable since 1/MRR gives the harmonic mean of the rank of the first relevant document. However, MRR scores cannot be directly compared across different datasets due to varying query difficulty and relevance patterns.
Best Practices
- Use established baselines (e.g., BM25, random ranking) to understand the relative difficulty of datasets when reporting MRR scores
- Consider MRR@k variants for practical evaluation scenarios
References
- https://en.wikipedia.org/wiki/Mean_reciprocal_rank
- https://docs.cohere.com/docs/rerank-understanding-the-results#mrr10