
Introduction to Evals Hub

Evals Hub is a unified, scriptable evaluation framework for comparing NLP/LLM model and application performance across diverse task types including Retrieval, Re‑ranking, Classification, Natural Language Inference (NLI), and Question Answering (QA). It targets:

  • Fast local iteration
  • Reproducible benchmarking (deterministic seeding & Langfuse tracing to track exact prompts, models used, and LLM conversations)
  • Transparent metric computation (MAP, MRR, nDCG, Precision, Recall, F1, Accuracy, ...)
  • Easy extensibility (add tasks, datasets, or metrics with minimal boilerplate)

Evals Hub Overview


What Evals Hub Provides

Principle         | What It Means                    | Example
------------------|----------------------------------|---------------------------------------------------
Uniform Interface | Same CLI shape across tasks      | evals-hub run-benchmark --task-name reranking ...
Config + Override | YAML + ad‑hoc CLI overrides      | Tune --evaluation.top-k without editing file
Deterministic     | Seeded runs for reproducibility  | --evaluation.seed 42
Extensible        | Drop‑in dataset / metric modules | Add metric -> reuse pipeline

At a Glance

Task               | Description                                          | Inputs                            | Typical Metrics
-------------------|------------------------------------------------------|-----------------------------------|-----------------------------------------------
Retrieval          | Find the documents most relevant to a user query     | queries, corpus, relevance labels | MAP, MRR, nDCG, Recall, Precision
Reranking          | Re‑order candidate documents by relevance to a query | query, candidate list, relevance  | MAP, MRR, nDCG
Classification     | Assign discrete label(s) to pieces of text           | text, label                       | Accuracy, Precision, Recall, F1 (micro/macro)
NLI                | Predict relation (entail / neutral / contradict)     | premise, hypothesis, label        | Accuracy (+ optional F1)
Question Answering | Generate an answer to a user question                | question, answer                  | Accuracy, Confidence

See: Task Overview → individual task pages.


Metrics Snapshot

Supported metrics (see Metrics pages for formulas):

  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)
  • Normalized Discounted Cumulative Gain (nDCG)
  • Precision & Recall
  • F1 (micro & macro)
  • Accuracy
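
For intuition, the ranking metrics reduce to simple list computations. The sketch below (plain Python, illustration only, not the library's implementation) shows single-query reciprocal rank and average precision; MRR and MAP are their means over all queries.

# Illustration only: single-query building blocks behind MRR and MAP.
# `relevance` is a binary list ordered by the model's ranking (1 = relevant).

def reciprocal_rank(relevance: list[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance: list[int]) -> float:
    """Mean of precision@k over the positions k that hold a relevant item."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(reciprocal_rank([0, 1, 0, 1]))    # relevant at ranks 2 and 4 -> 0.5
print(average_precision([0, 1, 0, 1]))  # (1/2 + 2/4) / 2 = 0.5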

Installation

git clone https://github.com/tomoro-ai/tomoro-evals.git evals_hub
cd evals_hub
uv venv
source .venv/bin/activate
uv pip install -e .

Upgrade / refresh locked dependencies:

uv sync --upgrade

Run tests:

uv run pytest -v

Serve the documentation locally:

uv run mkdocs serve -f docs/mkdocs.yml

Need offline usage patterns? See Offline Usage.


Quickstart (CLI)

Explore base help:

evals-hub --help

Inspect benchmark parameters:

evals-hub run-benchmark --help

Run a reranking benchmark using a provided YAML config:

evals-hub run-benchmark --config reranking_config.yaml

Override selected options inline (CLI > YAML precedence):

evals-hub run-benchmark \
    --task-name reranking \
    --dataset.name nfcorpus \
    --model.checkpoint sentence-transformers/all-MiniLM-L6-v2 \
    --metrics.map map \
    --metrics.mrr mrr \
    --evaluation.top-k 20 \
    --output.results-file benchmark_results/reranking_evaluation_results.json

Result JSON is written to the path given by --output.results-file, making it ideal for diffing in PRs or storing as a build artifact.


Configuration Model

You can drive experiments via:

  1. YAML config: versionable baseline (e.g. reranking_config.yaml)
  2. CLI flags: rapid overrides / automation
  3. Hybrid: YAML + CLI (flags overwrite YAML values)

This layered approach encourages stable experiment templates while enabling controlled parameter sweeps.

Core parameter groups (abridged):

Group      | Examples
-----------|--------------------------------------------------------------------------------------------
Task       | --task-name
Dataset    | --dataset.name, --dataset.split, --dataset.hf-subset
Model      | --model.checkpoint
Evaluation | --evaluation.batch-size, --evaluation.top-k, --evaluation.seed, --evaluation.n-experiments
Output     | --output.results-file

Full authoritative list: evals-hub run-benchmark --help.


Models Snapshot

Evals Hub supports various models across different tasks, providing flexible evaluation options:

By Task Type

Task                      | Model Types                                         | Example Models
--------------------------|-----------------------------------------------------|---------------------------------------------------------------------------------
Retrieval                 | Embedding models (HuggingFace)                      | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, sentence-transformers/all-mpnet-base-v2
Reranking                 | Embedding models (HuggingFace); API-based reranking | sentence-transformers/all-MiniLM-L6-v2, sdadas/mmlw-e5-small, cohere_rerank_3_5 (API)
Classification            | Embedding models (HuggingFace)                      | sentence-transformers/all-MiniLM-L6-v2
NLI                       | Embedding models (HuggingFace)                      | sentence-transformers/all-MiniLM-L6-v2
QA                        | API-based models; custom function models            | API-based: openai_gpt41, gemini_2_5_pro; function-based: custom wrapper function implementing the QAOutput interface
BR (Biological Reasoning) | API-based models                                    | gemini_2_5_pro, openai_gpt41, gemini_2_0_flash_lite
PL (Patent Landscape)     | LLM service via PLAN API                            | gpt-4.1 (configured in plan_service settings)

Configuration Options

Models can be configured in various ways:

  1. Embedding Models (HuggingFace): Specified via model.checkpoint parameter, limited to SentenceTransformer-compatible models.

    model:
      checkpoint: "sentence-transformers/all-MiniLM-L6-v2"
    

  2. API-based Models: For reranking and LLM tasks, configured with API options.

    model:
      checkpoint: "cohere_rerank_3_5"
      reranking_method: "api"
    

  3. Custom Function Models: For QA tasks, a model can be provided via an import path to a callable function (a minimal sketch follows this list).

    model:
      import_path: "your.module.path:function_name"
    

  4. PLAN Service: For Patent Landscape evaluation, configured with LLM service options.

    plan_service:
      url: "http://localhost:8000/analysis/patents"
      llm_service: "openai"
      llm_model: "gpt-4.1"
    

Environment Configuration

API-based models require environment variables:

  • API_KEY: For authentication
  • BASE_URL: For the API endpoint
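
A quick pre-flight check in plain Python (using the variable names listed above) catches a missing credential before a long run:

# Fail fast if the required API credentials are not set in the environment.
import os

missing = [name for name in ("API_KEY", "BASE_URL") if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")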

Dataset Snapshots

Evals Hub uses the Hugging Face Hub for dataset management, implementing a Medallion Architecture for data organization. All datasets are stored privately within the organization on Hugging Face.

Supported Datasets by Task

Task                      | Dataset Name                                       | Split      | Description
--------------------------|----------------------------------------------------|------------|-----------------------------------------------------------------------------------
Reranking                 | fr-reranking-alloprof-s2p_gold, patentmatch_silver | test       | French language reranking dataset and patent matching dataset
Retrieval                 | nfcorpus_gold, BioASQ_12b                          | test/train | Retrieval datasets based on medical abstracts and biomedical questions
QA                        | hle_futurehouse_gold                               | test       | Question answering dataset
Classification            | amazon_counterfactual_gold                         | test       | Classification dataset based on Amazon reviews
NLI                       | xnli_gold                                          | test       | Natural Language Inference dataset
PL (Patent Landscape)     | paecter_gold                                       | test       | Patent landscape dataset
BR (Biological Reasoning) | hle_futurehouse_gold                               | test       | Biological reasoning datasets including healthcare data and scientific citations

Dataset Configuration

Datasets can be configured in benchmark runs using the following YAML structure:

run-benchmark:
  dataset:
    name: "dataset_name_gold"
    split: "test"  # The split to use (test, train, validation)

Data Organization

Datasets follow the Medallion Architecture pattern:

  • Bronze: Raw data in original format
  • Silver: Optional intermediate processing layer
  • Gold: Cleaned and standardized data meeting task specifications

For details on dataset access and structure, see the Datasets documentation.


Example Output Schema (Illustrative)

{
    "task": "reranking",
    "dataset": { "name": "nfcorpus", "split": "test" },
    "model": { "checkpoint": "sentence-transformers/all-MiniLM-L6-v2" },
    "metrics": { "map": 0.3412, "mrr": 0.5123, "ndcg": 0.4789 },
    "config": { "top_k": 20, "batch_size": 16, "seed": 42 },
    "timestamp": "2025-08-18T12:34:56Z"
}
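
Because the results file is plain JSON, comparing two runs takes only a few lines. The sketch below assumes the illustrative schema above (a top-level "metrics" object); adjust the keys to match your actual output.

# diff_results.py -- compare metric values from two benchmark result files.
import json
import sys

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["metrics"]

baseline, candidate = load_metrics(sys.argv[1]), load_metrics(sys.argv[2])
for name in sorted(set(baseline) | set(candidate)):
    old, new = baseline.get(name), candidate.get(name)
    if old is not None and new is not None:
        print(f"{name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")
    else:
        print(f"{name}: {old} -> {new}")

Run it as, for example: python diff_results.py baseline.json candidate.json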

Extending the Hub

Add capability in four lightweight steps:

  1. Implement / register a dataset loader (see Data Loader API).
  2. Provide / adapt a model interface (embedding / classifier) in src/evals_hub/models/.
  3. Add metrics (see Metrics API).
  4. Reference the new parts inside a YAML config and run the CLI.

Because the execution pipeline is task‑agnostic, new metrics or datasets rarely require changes elsewhere.
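
The concrete interfaces live in the Data Loader and Metrics API pages. Purely to illustrate how little code a new metric usually needs, a metric typically boils down to a function from predictions and gold labels to a float; the macro-F1 sketch below is generic Python, not the actual registration API.

# Illustration only: the kind of function a new metric boils down to.
def macro_f1(preds: list[str], golds: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in set(golds) | set(preds):
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall) if (precision + recall) else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0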


Offline & Enterprise Notes

Pre-download datasets and models, then point configs at the local cache. For SSL certificate-chain issues behind corporate proxies (see the Datasets Q&A), set:

export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

More details: see Offline Usage.


Langfuse External Evaluation

Evals Hub can emit rich Langfuse traces during evaluations and then score those traces later via a decoupled external pipeline. This lets you separate model inference from metric experimentation.

Key benefits:

  • Decoupled: Add / change metrics without re‑running embeddings
  • Scalable: Batch trace fetching
  • Observable: Full history & per‑query similarity + relevance vectors
  • Reproducible: Deterministic evaluation over stored outputs

Minimal setup:

  1. Set credentials (env or .env): LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST.
  2. Enable tracing in your config (reranking_config.yaml):
    langfuse:
      enabled: true
      name: reranking_evaluation
    
  3. Run the benchmark (generates traces):
    evals-hub run-benchmark --config reranking_config.yaml
    
  4. Run external scoring (adds MRR@10 & Average Precision scores back onto traces):
    uv run langfuse_trace_evaluation.py
    

Adding a new trace‑level metric: implement function → call it inside evaluate_and_update_trace_scores → post via update_trace_scores (or a new helper) → re‑run only the external script.
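
For illustration, a new trace‑level metric is just a small function over the per‑query data already stored on each trace; the nDCG@10 sketch below is generic Python, and wiring it into evaluate_and_update_trace_scores follows the pattern described in the guide.

# Illustration only: a candidate trace-level metric over a stored relevance vector.
import math

def ndcg_at_10(relevance: list[float]) -> float:
    """Normalized DCG over the top 10 positions of a graded relevance vector."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))
    ideal = dcg(sorted(relevance, reverse=True)[:10])
    return dcg(relevance[:10]) / ideal if ideal > 0 else 0.0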

See full guide: Langfuse External Evaluation.


What Next?

Jump to: Tasks · Metrics · Offline Usage · Evaluator API

Happy evaluating!