Retrieval
Retrieval is a text-to-text search task: both the query and the documents being searched are text. The goal is to find the documents that are most relevant to a given query.
Core Process
- Query and/or Document Encoding: Convert the text query and/or the documents into embedding vectors.
- Similarity Computation: Calculate the distance between the query vector and each document vector.
- Ranking: Return the documents ordered by distance, closest first.
- Evaluation: Calculate the selected performance metrics (for example, NDCG, MAP, Recall@K) when ground truth is available.
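The similarity, ranking, and evaluation steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: it assumes the encoding step has already produced vectors (the toy two-dimensional vectors below are made up), it uses cosine similarity as the measure (ranking by descending similarity is equivalent to ranking by ascending cosine distance), and the helper names `rank_documents`, `recall_at_k`, and `ndcg_at_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as unrelated to everything
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs, most similar first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevance, k):
    """NDCG@k with linear gain; `relevance` maps doc_id -> graded score."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(score / math.log2(i + 2) for i, score in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy embeddings standing in for the output of an encoder (assumption).
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [1.0, 1.0]}
query_vec = [1.0, 0.1]

# Ground truth for this query: d3 is highly relevant, d2 somewhat relevant.
relevance = {"d3": 2, "d2": 1}

ranked_ids = [doc_id for doc_id, _ in rank_documents(query_vec, doc_vecs)]
recall = recall_at_k(ranked_ids, set(relevance), k=2)
ndcg = ndcg_at_k(ranked_ids, relevance, k=3)
```

In this toy setup the irrelevant document `d1` ranks first, so both metrics fall below 1.0. With a real encoder, the vectors would come from the model rather than being written by hand.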
When an external database is used to store the documents, the core process is:
- Search: Query the database. Query encoding may be needed, depending on how the external database is set up.
- Evaluation: Calculate the selected performance metrics.
For example, when RAG is used, the documentation can be embedded and stored in a Qdrant database.
Data Schema Specifications
Optional metadata can be added to the datasets.
Queries Dataset Schema
| Column | Type | Description | Required |
|---|---|---|---|
| `_id` | string | Unique query identifier | ✓ |
| `query` | string | Query text/question | ✓ |
Documents Dataset Schema
| Column | Type | Description | Required |
|---|---|---|---|
| `_id` | string | Unique document identifier | ✓ |
| `doc` | string | Document text content | ✓ |
Relevances Dataset Schema
| Column | Type | Description | Required |
|---|---|---|---|
| `query_id` | string | References `queries._id` | ✓ |
| `doc_id` | string | References `documents._id` | ✓ |
| `score` | int/float | Relevance score | ✓ |
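To make the three schemas concrete, here is a small sketch of rows that follow them, along with a check that every relevance row points at an existing query and document. The sample texts and the `build_qrels` helper are hypothetical and used only for illustration. How the datasets are actually loaded and validated is not specified here.

```python
# Rows shaped like the three schemas above (contents are made-up examples).
queries = [
    {"_id": "q1", "query": "How do I reset my password?"},
]
documents = [
    {"_id": "d1", "doc": "To reset your password, open the account settings page."},
    {"_id": "d2", "doc": "Billing statements are issued at the start of each month."},
]
relevances = [
    {"query_id": "q1", "doc_id": "d1", "score": 1},
]


def build_qrels(queries, documents, relevances):
    """Group relevance rows as {query_id: {doc_id: score}}.

    Raises ValueError if a row references an unknown query or document.
    """
    query_ids = {row["_id"] for row in queries}
    doc_ids = {row["_id"] for row in documents}
    qrels = {}
    for row in relevances:
        if row["query_id"] not in query_ids:
            raise ValueError(f"unknown query_id: {row['query_id']}")
        if row["doc_id"] not in doc_ids:
            raise ValueError(f"unknown doc_id: {row['doc_id']}")
        qrels.setdefault(row["query_id"], {})[row["doc_id"]] = row["score"]
    return qrels


qrels = build_qrels(queries, documents, relevances)
```

The nested mapping is a convenient shape for evaluation, since each query's ground truth can be looked up directly by its identifier.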
Supported Models
Only Hugging Face embedding models that are compatible with SentenceTransformer (e.g., SentenceTransformer, BERT, and RoBERTa models) are supported.