Introduction to Evals Hub
Evals Hub is a unified, scriptable evaluation framework for comparing NLP/LLM model and application performance across diverse task types including Retrieval, Re‑ranking, Classification, Natural Language Inference (NLI), and Question Answering (QA). It targets:
- Fast local iteration
- Reproducible benchmarking (deterministic seeding & Langfuse tracing to track exact prompts, models used, and LLM conversations)
- Transparent metric computation (MAP, MRR, nDCG, Precision, Recall, F1, Accuracy, ...)
- Easy extensibility (add tasks, datasets, or metrics with minimal boilerplate)

What Evals Hub Provides
| Principle | What It Means | Example |
|---|---|---|
| Uniform Interface | Same CLI shape across tasks | evals-hub run-benchmark --task-name reranking ... |
| Config + Override | YAML + ad‑hoc CLI overrides | Tune --evaluation.top-k without editing the file |
| Deterministic | Seeded runs for reproducibility | --evaluation.seed 42 |
| Extensible | Drop‑in dataset / metric modules | Add metric -> reuse pipeline |
At a Glance
| Task | Description | Inputs | Typical Metrics |
|---|---|---|---|
| Retrieval | Find the most relevant documents to a user query | queries, corpus, relevance labels | MAP, MRR, nDCG, Recall, Precision |
| Reranking | Re‑order candidate documents by their relevance to a user query | query, candidate list, relevance | MAP, MRR, nDCG |
| Classification | Assign discrete label(s) to pieces of text | text, label | Accuracy, Precision, Recall, F1 (micro/macro) |
| NLI | Predict relation (entail / neutral / contradict) | premise, hypothesis, label | Accuracy (+ optional F1) |
| Question Answering | Generate an answer to a user query | question, reference answer | Accuracy, Confidence |
See: Task Overview → individual task pages.
Metrics Snapshot
Supported metrics (see Metrics pages for formulas):
- Mean Average Precision (MAP)
- Mean Reciprocal Rank (MRR)
- Normalized Discounted Cumulative Gain (nDCG)
- Precision & Recall
- F1 (micro & macro)
- Accuracy
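For a concrete sense of how the ranking metrics are computed (full formulas live on the Metrics pages), here is a minimal, self-contained sketch of Mean Reciprocal Rank; the function name and input shape are illustrative, not Evals Hub's internal API:

```python
# Illustrative only: computes Mean Reciprocal Rank (MRR) from per-query
# relevance vectors ordered by the model's ranking (1 = relevant, 0 = not).
# This mirrors the metric's definition, not the hub's internal implementation.

def mean_reciprocal_rank(ranked_relevance: list[list[int]]) -> float:
    """Average of 1/rank of the first relevant item per query (0 if none)."""
    reciprocal_ranks = []
    for relevance in ranked_relevance:
        rr = 0.0
        for position, is_relevant in enumerate(relevance, start=1):
            if is_relevant:
                rr = 1.0 / position
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks) if reciprocal_ranks else 0.0

# Two queries: first hit at rank 2 and rank 1 -> MRR = (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
```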
Installation
git clone https://github.com/tomoro-ai/tomoro-evals.git evals_hub
cd evals_hub
uv venv
source .venv/bin/activate
uv pip install -e .
Upgrade / refresh locked dependencies:
uv sync --upgrade
Run tests:
uv run pytest -v
Serve the documentation locally:
uv run mkdocs serve -f docs/mkdocs.yml
Need offline usage patterns? See Offline Usage.
Quickstart (CLI)
Explore base help:
evals-hub --help
Inspect benchmark parameters:
evals-hub run-benchmark --help
Run a reranking benchmark using a provided YAML config:
evals-hub run-benchmark --config reranking_config.yaml
Override selected options inline (CLI > YAML precedence):
evals-hub run-benchmark \
--task-name reranking \
--dataset.name nfcorpus \
--model.checkpoint sentence-transformers/all-MiniLM-L6-v2 \
--metrics.map map \
--metrics.mrr mrr \
--evaluation.top-k 20 \
--output.results-file benchmark_results/reranking_evaluation_results.json
Result JSON is written to the path given by
--output.results-file—ideal for diffing in PRs or storing as build artifacts.
Configuration Model
You can drive experiments via:
- YAML config: versionable baseline (e.g. reranking_config.yaml)
- CLI flags: rapid overrides / automation
- Hybrid: YAML + CLI (flags overwrite YAML values)
This layered approach encourages stable experiment templates while enabling controlled parameter sweeps.
Core parameter groups (abridged):
| Group | Examples |
|---|---|
| Task | --task-name |
| Dataset | --dataset.name, --dataset.split, --dataset.hf-subset |
| Model | --model.checkpoint |
| Evaluation | --evaluation.batch-size, --evaluation.top-k, --evaluation.seed, --evaluation.n-experiments |
| Output | --output.results-file |
Full authoritative list: evals-hub run-benchmark --help.
Models Snapshot
Evals Hub supports various models across different tasks, providing flexible evaluation options:
By Task Type
| Task | Model Types | Example Models |
|---|---|---|
| Retrieval | Embedding models (HuggingFace) | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, sentence-transformers/all-mpnet-base-v2 |
| Reranking | Embedding models (HuggingFace), API-based reranking | sentence-transformers/all-MiniLM-L6-v2, sdadas/mmlw-e5-small, cohere_rerank_3_5 (API) |
| Classification | Embedding models (HuggingFace) | sentence-transformers/all-MiniLM-L6-v2 |
| NLI | Embedding models (HuggingFace) | sentence-transformers/all-MiniLM-L6-v2 |
| QA | API-based models, custom function models | API-based: openai_gpt41, gemini_2_5_pro; function-based: custom wrapper function implementing the QAOutput interface |
| BR (Biological Reasoning) | API-based models | gemini_2_5_pro, openai_gpt41, gemini_2_0_flash_lite |
| PL (Patent Landscape) | LLM service via PLAN API | gpt-4.1 (configured in plan_service settings) |
Configuration Options
Models can be configured in various ways:

- Embedding Models (HuggingFace): specified via the model.checkpoint parameter; limited to SentenceTransformer-compatible models.

        model:
          checkpoint: "sentence-transformers/all-MiniLM-L6-v2"

- API-based Models: for reranking and LLM tasks, configured with API options.

        model:
          checkpoint: "cohere_rerank_3_5"
          reranking_method: "api"

- Custom Function Models: for QA tasks, provided via an import path to a callable function (a minimal sketch follows this list).

        model:
          import_path: "your.module.path:function_name"

- PLAN Service: for Patent Landscape evaluation, configured with LLM service options.

        plan_service:
          url: "http://localhost:8000/analysis/patents"
          llm_service: "openai"
          llm_model: "gpt-4.1"
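For the custom function route, the callable must implement the hub's QAOutput interface; its exact fields are documented with the QA task, so the sketch below assumes a simple answer-plus-confidence shape and a hypothetical module path purely for illustration:

```python
# your/module/path.py (hypothetical; referenced as "your.module.path:answer_question")
# Sketch of a custom QA function model. The real QAOutput interface is defined
# by Evals Hub; here we assume it carries at least an answer string and a
# confidence score, matching the QA metrics listed above (Accuracy, Confidence).
from dataclasses import dataclass


@dataclass
class QAOutput:  # assumed shape, stand-in for the hub's own QAOutput
    answer: str
    confidence: float


def answer_question(question: str) -> QAOutput:
    """Toy model: returns a canned answer; replace with a real LLM or API call."""
    return QAOutput(answer=f"Stub answer to: {question}", confidence=0.5)
```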
Environment Configuration
API-based models require environment variables:
- API_KEY: for authentication
- BASE_URL: the base URL of the API endpoint
Dataset Snapshots
Evals Hub uses the Hugging Face Hub for dataset management, implementing a Medallion Architecture for data organization. All datasets are stored privately within the organization on Hugging Face.
Supported Datasets by Task
| Task | Dataset Name | Split | Description |
|---|---|---|---|
| Reranking | fr-reranking-alloprof-s2p_gold, patentmatch_silver | test | French language reranking dataset and patent matching dataset |
| Retrieval | nfcorpus_gold, BioASQ_12b | test/train | Retrieval datasets based on medical abstracts and biomedical questions |
| QA | hle_futurehouse_gold | test | Question answering dataset |
| Classification | amazon_counterfactual_gold | test | Classification dataset based on Amazon reviews |
| NLI | xnli_gold | test | Natural Language Inference dataset |
| PL (Patent Landscape) | paecter_gold | test | Patent landscape dataset |
| BR (Biological Reasoning) | hle_futurehouse_gold | test | Biological reasoning datasets including healthcare data and scientific citations |
Dataset Configuration
Datasets can be configured in benchmark runs using the following YAML structure:
    run-benchmark:
      dataset:
        name: "dataset_name_gold"
        split: "test"  # The split to use (test, train, validation)
Data Organization
Datasets follow the Medallion Architecture pattern:
- Bronze: Raw data in original format
- Silver: Optional intermediate processing layer
- Gold: Cleaned and standardized data meeting task specifications
For details on dataset access and structure, see the Datasets documentation.
Example Output Schema (Illustrative)
{
"task": "reranking",
"dataset": { "name": "nfcorpus", "split": "test" },
"model": { "checkpoint": "sentence-transformers/all-MiniLM-L6-v2" },
"metrics": { "map": 0.3412, "mrr": 0.5123, "ndcg": 0.4789 },
"config": { "top_k": 20, "batch_size": 16, "seed": 42 },
"timestamp": "2025-08-18T12:34:56Z"
}
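Because the results file is plain JSON, it is easy to consume in CI checks or ad-hoc comparison scripts. A minimal sketch (field names follow the illustrative schema above, so adjust to your actual output):

```python
# Reads a results file produced via --output.results-file and prints its metrics.
# Field names follow the illustrative schema above; adjust if your runs differ.
import json
from pathlib import Path

results_path = Path("benchmark_results/reranking_evaluation_results.json")
results = json.loads(results_path.read_text())

print(f"{results['task']} on {results['dataset']['name']} ({results['dataset']['split']})")
for metric, value in results["metrics"].items():
    print(f"  {metric}: {value:.4f}")
```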
Extending the Hub
Add capability in four lightweight steps:

1. Implement / register a dataset loader (see Data Loader API).
2. Provide / adapt a model interface (embedding / classifier) in src/evals_hub/models/.
3. Add metrics (see Metrics API and the sketch below).
4. Reference the new parts inside a YAML config and run the CLI.
Because the execution pipeline is task‑abstract, new metrics or datasets rarely require changes elsewhere.
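As a rough idea of what a drop-in metric for step 3 looks like, here is a plain precision@k function; the name, signature, and registration mechanism are illustrative assumptions, and the Metrics API page remains the authoritative reference:

```python
# Illustrative metric implementation: precision@k over per-query relevance
# vectors ordered by the model's ranking. How it gets registered with Evals Hub
# (decorator, registry module entry, etc.) is defined by the Metrics API.

def precision_at_k(ranked_relevance: list[list[int]], k: int = 10) -> float:
    """Mean fraction of relevant items among each query's top-k results."""
    per_query = [
        sum(relevance[:k]) / k for relevance in ranked_relevance if relevance
    ]
    return sum(per_query) / len(per_query) if per_query else 0.0

# One query with 2 relevant docs in its top 5 -> precision@5 = 0.4
print(precision_at_k([[1, 0, 1, 0, 0]], k=5))
```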
Offline & Enterprise Notes
Pre-download datasets / models and point configs to the local cache. For SSL chain issues behind corporate proxies (see Datasets Q&A) set:
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
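To pre-download a model ahead of time, a minimal sketch using huggingface_hub is shown below; whether model.checkpoint accepts a local directory is an assumption here, and Offline Usage documents the supported patterns:

```python
# Pre-download a model for offline use, then point model.checkpoint at the
# local folder. Accepting a local path in model.checkpoint is an assumption;
# see the Offline Usage page for the supported configuration.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    local_dir="./model_cache/all-MiniLM-L6-v2",
)
print(f"Model cached at: {local_dir}")
```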
More details: see Offline Usage.
Langfuse External Evaluation
Evals Hub can emit rich Langfuse traces during evaluations and then score those traces later via a decoupled external pipeline. This lets you separate model inference from metric experimentation.
Key benefits:
- Decoupled: Add / change metrics without re‑running embeddings
- Scalable: Batch trace fetching
- Observable: Full history & per‑query similarity + relevance vectors
- Reproducible: Deterministic evaluation over stored outputs
Minimal setup:

1. Set credentials (env or .env): LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST.
2. Enable tracing in your config (reranking_config.yaml):

        langfuse:
          enabled: true
          name: reranking_evaluation

3. Run the benchmark (generates traces):

        evals-hub run-benchmark --config reranking_config.yaml

4. Run external scoring (adds MRR@10 and Average Precision scores back onto the traces):

        uv run langfuse_trace_evaluation.py
Adding a new trace‑level metric: implement the function (see the sketch below) → call it inside evaluate_and_update_trace_scores → post scores via update_trace_scores (or a new helper) → re‑run only the external script.
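For instance, a new trace-level metric can be as small as the sketch below. It covers only the metric itself, leaving trace fetching and score posting to evaluate_and_update_trace_scores and update_trace_scores as described above, and it assumes each trace exposes a ranked relevance vector:

```python
# Sketch of a trace-level metric: Average Precision for one trace's ranked
# relevance vector (1 = relevant, 0 = not). Call it from
# evaluate_and_update_trace_scores and post the value via update_trace_scores;
# the list-of-ints input shape is an assumption for illustration.

def average_precision(relevance: list[int]) -> float:
    hits, precision_sum = 0, 0.0
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0

# Relevant docs at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 ≈ 0.833
print(average_precision([1, 0, 1, 0]))
```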
See full guide: Langfuse External Evaluation.
What Next?
Jump to: Tasks · Metrics · Offline Usage · Evaluator API
Happy evaluating!