
Introduction to Evals Hub

Evals Hub is a unified, scriptable evaluation framework for comparing NLP/LLM model and application performance across diverse task types including Retrieval, Re‑ranking, Classification, Natural Language Inference (NLI), and Question Answering (QA). It targets:

  • Fast local iteration
  • Reproducible benchmarking (deterministic seeding & Langfuse tracing to track exact prompts, models used, and LLM conversations)
  • Transparent metric computation (MAP, MRR, nDCG, Precision, Recall, F1, Accuracy, ...)
  • Easy extensibility (add tasks, datasets, or metrics with minimal boilerplate)

Evals Hub Overview


What Evals Hub Provides

Principle         | What It Means                    | Example
------------------|----------------------------------|---------------------------------------------------
Uniform Interface | Same CLI shape across tasks      | evals-hub run-benchmark --task-name reranking ...
Config + Override | YAML + ad‑hoc CLI overrides      | Tune --evaluation.top-k without editing file
Deterministic     | Seeded runs for reproducibility  | --evaluation.seed 42
Extensible        | Drop‑in dataset / metric modules | Add metric -> reuse pipeline

At a Glance

Task               | Description                                          | Inputs                            | Typical Metrics
-------------------|------------------------------------------------------|-----------------------------------|-----------------------------------------------
Retrieval          | Find the documents most relevant to a user query     | queries, corpus, relevance labels | MAP, MRR, nDCG, Recall, Precision
Reranking          | Re‑order candidate documents by relevance to a query | query, candidate list, relevance  | MAP, MRR, nDCG
Classification     | Assign discrete label(s) to pieces of text           | text, label                       | Accuracy, Precision, Recall, F1 (micro/macro)
NLI                | Predict relation (entail / neutral / contradict)     | premise, hypothesis, label        | Accuracy (+ optional F1)
Question Answering | Generate an answer to a user question                | question, answer                  | Accuracy, Confidence

See: Task Overview → individual task pages.


Metrics Snapshot

Supported metrics (see Metrics pages for formulas):

  • Mean Average Precision (MAP)
  • Mean Reciprocal Rank (MRR)
  • Normalized Discounted Cumulative Gain (nDCG)
  • Precision & Recall
  • F1 (micro & macro)
  • Accuracy
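
For intuition, the ranking metrics reduce to simple list computations. The sketch below (plain Python, illustration only, not the library's implementation) shows single-query reciprocal rank and average precision; MRR and MAP are their means over all queries.

# Illustration only: single-query building blocks behind MRR and MAP.
# `relevance` is a binary list ordered by the model's ranking (1 = relevant).

def reciprocal_rank(relevance: list[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def average_precision(relevance: list[int]) -> float:
    """Mean of precision@k over the positions k that hold a relevant item."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

print(reciprocal_rank([0, 1, 0, 1]))    # relevant at ranks 2 and 4 -> 0.5
print(average_precision([0, 1, 0, 1]))  # (1/2 + 2/4) / 2 = 0.5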

Installation

git clone https://github.com/tomoro-ai/tomoro-evals.git evals_hub
cd evals_hub
uv venv
source .venv/bin/activate
uv pip install -e .

Upgrade / refresh locked dependencies:

uv sync --upgrade

Run tests:

uv run pytest -v

Serve the documentation locally:

uv run mkdocs serve -f docs/mkdocs.yml

Need offline usage patterns? See Offline Usage.


Quickstart (CLI)

Explore base help:

evals-hub --help

Inspect benchmark parameters:

evals-hub run-benchmark --help

Run a reranking benchmark using a provided YAML config:

evals-hub run-benchmark --config reranking_config.yaml

Override selected options inline (CLI > YAML precedence):

evals-hub run-benchmark \
    --task-name reranking \
    --dataset.name nfcorpus \
    --model.checkpoint sentence-transformers/all-MiniLM-L6-v2 \
    --metrics.map map \
    --metrics.mrr mrr \
    --evaluation.top-k 20 \
    --output.results-file benchmark_results/reranking_evaluation_results.json

Result JSON is written to the path given by --output.results-file, making it ideal for diffing in PRs or storing as a build artifact.


Configuration Model

You can drive experiments via:

  1. YAML config: versionable baseline (e.g. reranking_config.yaml)
  2. CLI flags: rapid overrides / automation
  3. Hybrid: YAML + CLI (flags overwrite YAML values)

This layered approach encourages stable experiment templates while enabling controlled parameter sweeps.

Core parameter groups (abridged):

Group      | Examples
-----------|--------------------------------------------------------------------------------------------
Task       | --task-name
Dataset    | --dataset.name, --dataset.split, --dataset.hf-subset
Model      | --model.checkpoint
Evaluation | --evaluation.batch-size, --evaluation.top-k, --evaluation.seed, --evaluation.n-experiments
Output     | --output.results-file

Full authoritative list: evals-hub run-benchmark --help.


Models Snapshot

Evals Hub supports various models across different tasks, providing flexible evaluation options:

By Task Type

Task                      | Model Types                                         | Example Models
--------------------------|-----------------------------------------------------|---------------------------------------------------------------------------------
Retrieval                 | Embedding models (HuggingFace)                      | sentence-transformers/all-MiniLM-L6-v2, sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2, sentence-transformers/all-mpnet-base-v2
Reranking                 | Embedding models (HuggingFace); API-based reranking | sentence-transformers/all-MiniLM-L6-v2, sdadas/mmlw-e5-small, cohere_rerank_3_5 (API)
Classification            | Embedding models (HuggingFace)                      | sentence-transformers/all-MiniLM-L6-v2
NLI                       | Embedding models (HuggingFace)                      | sentence-transformers/all-MiniLM-L6-v2
QA                        | API-based models; custom function models            | API-based: openai_gpt41, gemini_2_5_pro; function-based: custom wrapper function implementing the QAOutput interface
BR (Biological Reasoning) | API-based models                                    | gemini_2_5_pro, openai_gpt41, gemini_2_0_flash_lite
PL (Patent Landscape)     | LLM service via PLAN API                            | gpt-4.1 (configured in plan_service settings)

Configuration Options

Models can be configured in various ways:

  1. Embedding Models (HuggingFace): Specified via model.checkpoint parameter, limited to SentenceTransformer-compatible models.

    model:
      checkpoint: "sentence-transformers/all-MiniLM-L6-v2"
    

  2. API-based Models: For reranking and LLM tasks, configured with API options.

    model:
      checkpoint: "cohere_rerank_3_5"
      reranking_method: "api"
    

  3. Custom Function Models: For QA tasks, a model can be provided via an import path to a callable function (a minimal sketch follows this list).

    model:
      import_path: "your.module.path:function_name"
    

  4. PLAN Service: For Patent Landscape evaluation, configured with LLM service options.

    plan_service:
      url: "http://localhost:8000/analysis/patents"
      llm_service: "openai"
      llm_model: "gpt-4.1"
    

Environment Configuration

API-based models require environment variables:

  • API_KEY: For authentication
  • BASE_URL: For the API endpoint
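
A quick pre-flight check in plain Python (using the variable names listed above) catches a missing credential before a long run:

# Fail fast if the required API credentials are not set in the environment.
import os

missing = [name for name in ("API_KEY", "BASE_URL") if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")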

Dataset Snapshots

Evals Hub uses the Hugging Face Hub for dataset management, implementing a Medallion Architecture for data organization. All datasets are stored privately within the organization on Hugging Face.

Supported Datasets by Task

Task                      | Dataset Name                                       | Split      | Description
--------------------------|----------------------------------------------------|------------|-----------------------------------------------------------------------------------
Reranking                 | fr-reranking-alloprof-s2p_gold, patentmatch_silver | test       | French language reranking dataset and patent matching dataset
Retrieval                 | nfcorpus_gold, BioASQ_12b                          | test/train | Retrieval datasets based on medical abstracts and biomedical questions
QA                        | hle_futurehouse_gold                               | test       | Question answering dataset
Classification            | amazon_counterfactual_gold                         | test       | Classification dataset based on Amazon reviews
NLI                       | xnli_gold                                          | test       | Natural Language Inference dataset
PL (Patent Landscape)     | paecter_gold                                       | test       | Patent landscape dataset
BR (Biological Reasoning) | hle_futurehouse_gold                               | test       | Biological reasoning datasets including healthcare data and scientific citations

Dataset Configuration

Datasets can be configured in benchmark runs using the following YAML structure:

run-benchmark:
  dataset:
    name: "dataset_name_gold"
    split: "test"  # The split to use (test, train, validation)

Data Organization

Datasets follow the Medallion Architecture pattern:

  • Bronze: Raw data in original format
  • Silver: Optional intermediate processing layer
  • Gold: Cleaned and standardized data meeting task specifications

For details on dataset access and structure, see the Datasets documentation.


Example Output Schema (Illustrative)

{
    "task": "reranking",
    "dataset": { "name": "nfcorpus", "split": "test" },
    "model": { "checkpoint": "sentence-transformers/all-MiniLM-L6-v2" },
    "metrics": { "map": 0.3412, "mrr": 0.5123, "ndcg": 0.4789 },
    "config": { "top_k": 20, "batch_size": 16, "seed": 42 },
    "timestamp": "2025-08-18T12:34:56Z"
}
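
Because the results file is plain JSON, comparing two runs takes only a few lines. The sketch below assumes the illustrative schema above (a top-level "metrics" object); adjust the keys to match your actual output.

# diff_results.py -- compare metric values from two benchmark result files.
import json
import sys

def load_metrics(path: str) -> dict:
    with open(path) as f:
        return json.load(f)["metrics"]

baseline, candidate = load_metrics(sys.argv[1]), load_metrics(sys.argv[2])
for name in sorted(set(baseline) | set(candidate)):
    old, new = baseline.get(name), candidate.get(name)
    if old is not None and new is not None:
        print(f"{name}: {old:.4f} -> {new:.4f} ({new - old:+.4f})")
    else:
        print(f"{name}: {old} -> {new}")

Run it as, for example: python diff_results.py baseline.json candidate.json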

Extending the Hub

Add capability in four lightweight steps:

  1. Implement / register a dataset loader (see Data Loader API).
  2. Provide / adapt a model interface (embedding / classifier) in src/evals_hub/models/.
  3. Add metrics (see Metrics API).
  4. Reference the new parts inside a YAML config and run the CLI.

Because the execution pipeline is task‑agnostic, new metrics or datasets rarely require changes elsewhere.
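
The concrete interfaces live in the Data Loader and Metrics API pages. Purely to illustrate how little code a new metric usually needs, a metric typically boils down to a function from predictions and gold labels to a float; the macro-F1 sketch below is generic Python, not the actual registration API.

# Illustration only: the kind of function a new metric boils down to.
def macro_f1(preds: list[str], golds: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in set(golds) | set(preds):
        tp = sum(p == label and g == label for p, g in zip(preds, golds))
        fp = sum(p == label and g != label for p, g in zip(preds, golds))
        fn = sum(p != label and g == label for p, g in zip(preds, golds))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall) if (precision + recall) else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0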


Offline & Enterprise Notes

Pre-download datasets and models, then point configs at the local cache. For SSL certificate-chain issues behind corporate proxies (see the Datasets Q&A), set:

export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

More details: see Offline Usage.


Langfuse External Evaluation

Evals Hub can emit rich Langfuse traces during evaluations and then score those traces later via a decoupled external pipeline. This lets you separate model inference from metric experimentation.

Key benefits:

  • Decoupled: Add / change metrics without re‑running embeddings
  • Scalable: Batch trace fetching
  • Observable: Full history & per‑query similarity + relevance vectors
  • Reproducible: Deterministic evaluation over stored outputs

Minimal setup:

  1. Set credentials (env or .env): LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_HOST.
  2. Enable tracing in your config (reranking_config.yaml):
    langfuse:
      enabled: true
      name: reranking_evaluation
    
  3. Run the benchmark (generates traces):
    evals-hub run-benchmark --config reranking_config.yaml
    
  4. Run external scoring (adds MRR@10 & Average Precision scores back onto traces):
    uv run langfuse_trace_evaluation.py
    

Adding a new trace‑level metric: implement function → call it inside evaluate_and_update_trace_scores → post via update_trace_scores (or a new helper) → re‑run only the external script.
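
For illustration, a new trace‑level metric is just a small function over the per‑query data already stored on each trace; the nDCG@10 sketch below is generic Python, and wiring it into evaluate_and_update_trace_scores follows the pattern described in the guide.

# Illustration only: a candidate trace-level metric over a stored relevance vector.
import math

def ndcg_at_10(relevance: list[float]) -> float:
    """Normalized DCG over the top 10 positions of a graded relevance vector."""
    def dcg(rels):
        return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(rels, start=1))
    ideal = dcg(sorted(relevance, reverse=True)[:10])
    return dcg(relevance[:10]) / ideal if ideal > 0 else 0.0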

See full guide: Langfuse External Evaluation.


What Next?

Jump to: Tasks · Metrics · Offline Usage · Evaluator API

Happy evaluating!