QA
Question answering (QA) is a common task for language models, in which a model generates an answer to a given question. The goal of the task is to produce the correct answer with an appropriate level of confidence.
Core Process
```mermaid
graph TD
    A[Questions] --> B[Language Model:<br/>Generate Answers]
    G -->|Retrieve Questions/Answers| C[Language Model:<br/>Generate Judgements]
    G -->|Retrieve Judgements| D[Evaluation Metrics]
    E[Ground Truth] --> C
    B -.->|Log Questions/Answers| G[Langfuse:<br/>Ingest new traces]
    C -.->|Log Judgements| G
```
- Questions are sent to the language model.
- The language model generates answers as structured output containing an explanation, the answer, and a confidence score (see the sketch after this list). The outputs, along with the questions, are logged to Langfuse.
- Question-Answer pairs are retrieved from Langfuse, and a language model is used to judge the answers.
- The judgements are logged to Langfuse.
- Judgements are retrieved and metrics are calculated.
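For reference, the structured answer output can be thought of as a small schema with exactly those three fields. The snippet below is a minimal, illustrative sketch using pydantic; the class actually used by the pipeline is `QAOutput` from `evals_hub.evaluator.qa_eval`, shown in the function-based model example further down.

```python
from pydantic import BaseModel


class AnswerSketch(BaseModel):
    """Illustrative stand-in for the pipeline's structured answer output."""

    explanation: str  # the model's reasoning for its answer
    answer: str       # the final answer
    confidence: str   # self-reported confidence score, mirroring the QAOutput example below
```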
Data Schema Specifications
The following schema is expected for the QA evaluations pipeline:
Instances Dataset Schema
| Column | Type | Description | Required |
|---|---|---|---|
| id | string | Unique question identifier | ✓ |
| question | string | Question | ✓ |
| answer_type | string | Answer type (e.g. multipleChoice, exactMatch, etc.) | ✓ |
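For illustration, rows conforming to this schema could look like the following. The values and the Python-dict representation are purely illustrative; the actual storage format depends on how the dataset is loaded.

```python
# Hypothetical instances rows; ids, questions and answer types are made up.
instances = [
    {"id": "q_0001", "question": "What is the capital of France?", "answer_type": "exactMatch"},
    {"id": "q_0002", "question": "Which planet is the Red Planet? (A) Venus (B) Mars (C) Jupiter", "answer_type": "multipleChoice"},
]
```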
Answers Dataset Schema
| Column | Type | Description | Required |
|---|---|---|---|
| id | string | Unique question identifier | ✓ |
| answer | string | Ground truth answer | X |
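The matching ground-truth rows could look like this (again purely illustrative; note that `answer` is not required):

```python
# Hypothetical answers rows matching the instances above; "answer" may be omitted.
answers = [
    {"id": "q_0001", "answer": "Paris"},
    {"id": "q_0002", "answer": "B"},
]
```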
Relevant Metrics
| Metric | Description |
|---|---|
| Accuracy | The percentage of examples for which the model produced the correct answer |
| Calibration Error | A measure of how much the model's stated confidence deviates from the true probability of its answers being correct |
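As a rough illustration of how these metrics can be computed from the judgements, the sketch below calculates accuracy and a binned expected calibration error. The function name, the judgement format (a list of `(is_correct, confidence)` pairs with confidence in [0, 1]), and the bin count are assumptions for the example, not the pipeline's actual implementation.

```python
def compute_metrics(judgements: list[tuple[bool, float]], n_bins: int = 10) -> dict[str, float]:
    """Compute accuracy and a binned expected calibration error (ECE).

    `judgements` is a list of (is_correct, confidence) pairs, with
    confidence expressed as a probability in [0, 1].
    """
    n = len(judgements)
    accuracy = sum(correct for correct, _ in judgements) / n

    # ECE: bucket answers by confidence, then weight the gap between each
    # bucket's mean confidence and its observed accuracy by the bucket size.
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [(c, p) for c, p in judgements if lo < p <= hi or (b == 0 and p == 0.0)]
        if not bucket:
            continue
        bucket_acc = sum(c for c, _ in bucket) / len(bucket)
        bucket_conf = sum(p for _, p in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(bucket_acc - bucket_conf)

    return {"accuracy": accuracy, "calibration_error": ece}
```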
Supported Models
The QA pipeline supports the following model types:
- API-based models
- Custom function-based models
Function-Based Models
Function-based models let users wrap other models in a function for use in the QA evaluation pipeline. These functions must meet the following criteria:
- Must be async - this allows the QA pipeline to generate answers concurrently
- Must take a single string argument as input, representing the question
- Must return a QAOutput instance
View an example function-based model
```python
from evals_hub.evaluator.qa_eval import QAOutput


async def custom_model_wrapper(content: str) -> QAOutput:
    # Wrap any model call here; this stub simply returns a fixed answer.
    return QAOutput(
        explanation="I read the Hitchhiker's Guide to the Galaxy",
        answer="42",
        confidence="100",
    )
```
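Because the wrapper is async, it can be exercised on its own with asyncio. The question below is just an illustration, and the attribute access assumes `QAOutput` exposes its fields as attributes (as the constructor call above suggests):

```python
import asyncio

result = asyncio.run(custom_model_wrapper("What is the answer to life, the universe and everything?"))
print(result.answer, result.confidence)
```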
Examples for Model Types
View an example config file for an API-based model
```yaml
run-benchmark:
  task_name: "qa"
  dataset:
    name: "hle_futurehouse_gold"
    split: "test"
  model:
    checkpoint: 'openai_gpt41_nano'
    system_prompt_path: examples/runner_configurations/qa/prompts/qa_model_system_prompt.txt
  judge:
    checkpoint: "openai_gpt41_nano"
    user_prompt_path: examples/runner_configurations/qa/prompts/qa_judge_user_prompt.txt
  # Additional Settings
  ...
```
View an example config file for a function-based model
```yaml
run-benchmark:
  task_name: "qa"
  dataset:
    name: "hle_futurehouse_gold"
    split: "test"
  model:
    import_path: "evals_hub.examples.qa_application_wrapper:application_wrapper"
  judge:
    checkpoint: "openai_o3_mini"
    user_prompt_path: examples/runner_configurations/qa/prompts/qa_judge_user_prompt.txt
  # Additional Settings
  ...
```