MAP - Mean Average Precision

What is MAP

Mean Average Precision (MAP) is a rank-aware evaluation metric that measures the quality of ranked retrieval results by considering both precision and recall across all relevant documents. MAP evaluates the entire ranked list and rewards systems that rank multiple relevant documents highly.

MAP calculates the average precision for each query and then averages these scores across all queries in a dataset. It provides a single metric that captures how well a system retrieves and ranks all relevant documents.

When to use MAP

  • Multiple relevant documents: Ideal when queries have multiple correct answers and you want to evaluate how well the system retrieves all of them. MAP assumes users benefit from finding many relevant documents per query, not just the first one.
  • Ranking quality assessment: Evaluates both the ability to find relevant documents and rank them highly.
  • Binary relevance: Suitable when documents are labelled simply as relevant or not relevant, with no graded levels of relevance.

Key Components

1. Precision at k (P@k)

The proportion of relevant documents in the top k results:

\(P@k = \frac{\text{number of relevant documents in top k}}{k}\)
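As an illustration, here is a minimal Python sketch of P@k. It assumes relevance judgements are given as a list of 0/1 labels ordered by rank; the function name is ours, not from any particular library.

```python
def precision_at_k(relevance, k):
    """P@k: fraction of the top-k results that are relevant.

    `relevance` is a list of 0/1 labels in rank order (1 = relevant).
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(relevance[:k]) / k

# Ranked results with relevance labels [0, 1, 1, 0, 1]:
print(precision_at_k([0, 1, 1, 0, 1], 3))  # 2 relevant in the top 3 -> 0.667
```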

2. Average Precision (AP)

For a single query, AP is the average of precision values calculated at each position where a relevant document is retrieved:

\(AP = \frac{1}{R} \sum_{k=1}^{n} P@k \cdot rel_k\)

where:

  • \(R\) is the total number of relevant documents for the query
  • \(n\) is the total number of retrieved documents
  • \(rel_k\) is 1 if the document at rank \(k\) is relevant, 0 otherwise

Simplified calculation: AP is the sum of the precision values at the ranks where relevant documents appear, divided by the total number of relevant documents for the query.
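A sketch of AP under the same assumptions (0/1 relevance labels in rank order). The optional `total_relevant` argument is our addition: pass it explicitly if some relevant documents were never retrieved, since \(R\) in the formula counts all relevant documents for the query, not just the retrieved ones.

```python
def average_precision(relevance, total_relevant=None):
    """AP = (1/R) * sum over k of P@k * rel_k for a single query."""
    # R is the total number of relevant documents for the query; by default
    # we assume every relevant document appears somewhere in the ranking.
    r = total_relevant if total_relevant is not None else sum(relevance)
    if r == 0:
        return 0.0
    score, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k  # P@k at a rank holding a relevant document
    return score / r

print(average_precision([0, 1, 1, 0, 1]))  # (1/2 + 2/3 + 3/5) / 3 ≈ 0.589
```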

3. Mean Average Precision (MAP)

The average of AP scores across all queries:

\(MAP = \frac{1}{|Q|} \sum_{q=1}^{|Q|} AP_q\)

where \(|Q|\) is the total number of queries and \(AP_q\) is the average precision for query \(q\).

Range: 0 to 1 (1 = perfect performance where all relevant documents are ranked at the top)
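MAP is then simply the mean of the per-query AP values. A short sketch, reusing the average_precision helper from the previous section:

```python
def mean_average_precision(relevance_lists):
    """MAP = (1/|Q|) * sum of AP_q over all queries."""
    if not relevance_lists:
        return 0.0
    return sum(average_precision(rels) for rels in relevance_lists) / len(relevance_lists)

# One relevance list per query:
print(mean_average_precision([[0, 1, 1, 0, 1], [1, 0, 1], [0, 0, 0, 1]]))  # ≈ 0.557
```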

Example Calculation

Consider a dataset with 3 queries and their search results:

Query ID   Query Text   Ranked Results           Relevance Labels   Relevant Count
Q1         python       [R1, R2, R3, R4, R5]     [0, 1, 1, 0, 1]    3
Q2         ML           [R6, R7, R8]             [1, 0, 1]          2
Q3         data         [R9, R10, R11, R12]      [0, 0, 0, 1]       1

Query Q1 (python):

  • Relevant documents at positions: 2, 3, 5
  • P@2 = 1/2 = 0.500, P@3 = 2/3 = 0.667, P@5 = 3/5 = 0.600
  • AP₁ = (0.500 + 0.667 + 0.600) / 3 = 0.589

Query Q2 (ML):

  • Relevant documents at positions: 1, 3
  • P@1 = 1/1 = 1.000, P@3 = 2/3 = 0.667
  • AP₂ = (1.000 + 0.667) / 2 = 0.833

Query Q3 (data):

  • Relevant document at position: 4
  • P@4 = 1/4 = 0.250
  • AP₃ = 0.250 / 1 = 0.250

MAP = (0.589 + 0.833 + 0.250) / 3 = 0.557
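The same numbers can be reproduced with the sketches above; the query IDs and relevance labels are taken from the table:

```python
queries = {
    "Q1": [0, 1, 1, 0, 1],  # python
    "Q2": [1, 0, 1],        # ML
    "Q3": [0, 0, 0, 1],     # data
}
ap = {qid: average_precision(rels) for qid, rels in queries.items()}
print({qid: round(score, 3) for qid, score in ap.items()})
# {'Q1': 0.589, 'Q2': 0.833, 'Q3': 0.25}
print(round(sum(ap.values()) / len(ap), 3))  # MAP = 0.557
```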

MAP@k Variant

MAP@k (e.g., MAP@10) only considers the top k documents in the ranking (see the sketch after the list below):

  • Relevant documents beyond rank k are ignored in the calculation
  • More practical for evaluating systems with large result sets
  • Focuses on the most visible results to users
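A sketch of AP@k and MAP@k, building on the functions above. Note that implementations differ on the normalisation term: some divide by \(R\), others by \(\min(R, k)\). The version below uses \(\min(R, k)\) so that a perfect top-k ranking can reach 1.0; this is one common convention, not a universal definition.

```python
def average_precision_at_k(relevance, k, total_relevant=None):
    """AP@k: AP computed on the ranking truncated to the top k results."""
    r = total_relevant if total_relevant is not None else sum(relevance)
    denom = min(r, k)  # normalisation choice; some tools divide by r instead
    if denom == 0:
        return 0.0
    score, hits = 0.0, 0
    for i, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / i
    return score / denom

def map_at_k(relevance_lists, k):
    """MAP@k: mean of AP@k over all queries."""
    return sum(average_precision_at_k(rels, k) for rels in relevance_lists) / len(relevance_lists)

# With the example queries and the min(R, k) convention above:
print(round(map_at_k([[0, 1, 1, 0, 1], [1, 0, 1], [0, 0, 0, 1]], 3), 3))  # ≈ 0.407
```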

Interpretation

  • MAP = 1.0 represents perfect performance where, for every query, all relevant documents are ranked above all non-relevant ones (the order among the relevant documents does not matter)
  • MAP = 0.0 means no relevant documents were retrieved for any query

MAP scores provide insight into both retrieval effectiveness (finding relevant documents) and ranking quality (positioning them highly). Higher MAP scores indicate better overall system performance across multiple relevant documents per query.

Best Practices

  • Use established baselines (e.g., BM25, TF-IDF) to contextualise MAP scores and understand dataset difficulty.
  • Consider MAP@k variants for practical evaluation scenarios where users typically examine only the top k results.

References

https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision

https://web.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf

https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Average_precision