Offline Usage

Benchmarks can be run ad hoc, during development, or as part of a CI/CD process using the uv run evals-hub run-benchmark command:
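For example, supplying a YAML configuration file (configuration is covered in the next section):

uv run evals-hub run-benchmark --config <CONFIG_FILE>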

Configuration

Benchmarks can be configured via the command line, a YAML file, or a mixture of the two.

Note

Command-line options always take precedence over, and will override, YAML file options.

Each benchmark configuration is an 'object' with the following fields (click on field types to see individual options):

| Field | Type | Description | Is Nested? |
| --- | --- | --- | --- |
| task_name | str | Determines which task pipeline should be run | No |
| dataset | DatasetConfig | Determines which dataset is used, along with (optionally) the split and HuggingFace subset | Yes |
| model | ModelConfig | Configuration for the model, including the choice of model and model settings | Yes |
| judge | ModelConfig | Configuration for the judge model (if applicable), including the choice of model and model settings | Yes |
| evaluation | EvaluationConfig | Experiment settings such as batch size, seed, and max concurrency | Yes |
| output | OutputConfig | Choice of where to store the evaluation results | Yes |

CLI Configuration

Every configuration field described in the table above can also be set directly as a command-line option when invoking evals-hub run-benchmark.

YAML Configuration

When configuring a run via a YAML file, nested options (e.g. model.checkpoint) correspond to YAML nesting, i.e.:

run-benchmark:
  task_name: qa
  model:
    checkpoint: ...
  ...
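For a fuller picture, the sketch below lists all of the top-level fields from the table above. Apart from task_name and model.checkpoint, the individual sub-fields are deliberately left as placeholders, since their names are defined by the DatasetConfig, ModelConfig, EvaluationConfig and OutputConfig types:

run-benchmark:
  task_name: qa
  dataset:
    # DatasetConfig: which dataset to load, plus (optionally) the split and HuggingFace subset
    ...
  model:
    # ModelConfig: choice of model and model settings
    checkpoint: ...
  judge:
    # ModelConfig for the judge model; only needed for tasks that use a judge
    ...
  evaluation:
    # EvaluationConfig: batch size, seed, max concurrency, etc.
    ...
  output:
    # OutputConfig: where to store the evaluation results
    ...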

This can then be run via the following:

uv run evals-hub run-benchmark --config <CONFIG_FILE>