Offline Usage

Benchmarks can be run ad hoc, during development, or as part of a CI/CD process using the uv run evals-hub run-benchmark command:
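For example, supplying a YAML configuration file (configuration is covered in the next section):

uv run evals-hub run-benchmark --config <CONFIG_FILE>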

Configuration

Benchmarks can be configured via the command line, a YAML file, or a mixture of the two.

Note

Command-line options always take precedence over, and will override, YAML file options.

Each benchmark configuration is an 'object' with the following fields (click on field types to see individual options):

| Field | Type | Description | Is Nested? |
| --- | --- | --- | --- |
| task_name | str | Determines which task pipeline should be run | No |
| dataset | DatasetConfig | Determines which dataset is used, along with (optionally) the split and HuggingFace subset | Yes |
| model | ModelConfig | Configuration for the model, including the choice of model and model settings | Yes |
| judge | ModelConfig | Configuration for the judge model (if applicable), including the choice of model and model settings | Yes |
| evaluation | EvaluationConfig | Experiment settings such as batch size, seed, and max concurrency | Yes |
| output | OutputConfig | Choice of where to store the evaluation results | Yes |

CLI Configuration

Every configuration field described in the table above can also be set directly as a command-line option when invoking evals-hub run-benchmark.

YAML Configuration

When configuring a run via a YAML file, nested options (e.g. model.checkpoint) correspond to YAML nesting, i.e.:

run-benchmark:
  task_name: qa
  model:
    checkpoint: ...
  ...
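For a fuller picture, the sketch below lists all of the top-level fields from the table above. Apart from task_name and model.checkpoint, the individual sub-fields are deliberately left as placeholders, since their names are defined by the DatasetConfig, ModelConfig, EvaluationConfig and OutputConfig types:

run-benchmark:
  task_name: qa
  dataset:
    # DatasetConfig: which dataset to load, plus (optionally) the split and HuggingFace subset
    ...
  model:
    # ModelConfig: choice of model and model settings
    checkpoint: ...
  judge:
    # ModelConfig for the judge model; only needed for tasks that use a judge
    ...
  evaluation:
    # EvaluationConfig: batch size, seed, max concurrency, etc.
    ...
  output:
    # OutputConfig: where to store the evaluation results
    ...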

This can then be run via the following:

uv run evals-hub run-benchmark --config <CONFIG_FILE>