QA

QA (question answering) is a common task for language models, in which a model generates an answer to a given question. The goal of the task is to produce the correct answer with an appropriate level of confidence.

Core Process

graph TD
    A[Questions] --> B[Language Model:<br/>Generate Answers]
    G -->|Retrieve Questions/Answers| C[Language Model:<br/>Generate Judgements]
    G -->|Retrieve Judgements| D[Evaluation Metrics]

    E[Ground Truth] --> C

    B -.->|Log Questions/Answers| G[Langfuse:<br/>Ingest new traces]
    C -.->|Log Judgements| G

  1. Questions are sent to the language model.
  2. The language model generates answers as structured output containing an explanation, the answer, and a confidence score. The questions and answers are logged to Langfuse.
  3. Question-answer pairs are retrieved from Langfuse, and a language model is used to judge the answers against the ground truth.
  4. The judgements are logged to Langfuse.
  5. Judgements are retrieved and metrics are calculated (a minimal sketch of this flow follows the list).
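
This flow can be pictured with a small, self-contained sketch. Everything in it (the stub generate_answer function and the in-memory dictionaries standing in for Langfuse traces and the judge) is hypothetical and only illustrates the steps above; it is not the evals_hub or Langfuse API:

import asyncio

async def generate_answer(question: str) -> dict:
    # Stand-in for the structured model call (explanation / answer / confidence).
    return {"explanation": "stub reasoning", "answer": "42", "confidence": 1.0}

async def main():
    questions = {"q1": "What is 6 x 7?", "q2": "What is 2 + 2?"}
    ground_truth = {"q1": "42", "q2": "4"}

    # Steps 1-2: generate answers concurrently; in the real pipeline the
    # question/answer pairs are logged to Langfuse as traces.
    results = await asyncio.gather(*(generate_answer(q) for q in questions.values()))
    answers = dict(zip(questions, results))

    # Steps 3-4: judge each answer against the ground truth; the judgements
    # are logged back to Langfuse.
    judgements = {qid: answers[qid]["answer"] == ground_truth[qid] for qid in questions}

    # Step 5: retrieve the judgements and compute metrics.
    accuracy = sum(judgements.values()) / len(judgements)
    print(f"accuracy = {accuracy:.2f}")

asyncio.run(main())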

Data Schema Specifications

The following schema is expected for the QA evaluations pipeline:

Instances Dataset Schema

Column       Type    Description                                           Required
id           string  Unique question identifier
question     string  Question
answer_type  string  Answer type (e.g. multipleChoice, exactMatch, etc.)
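
For example, a single instances row could look like the following (values are invented for illustration):

instance = {
    "id": "q-001",
    "question": "What is the answer to life, the universe and everything?",
    "answer_type": "exactMatch",
}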

Answers Dataset Schema

Column  Type    Description                 Required
id      string  Unique question identifier
answer  string  Ground truth answer         X
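
A matching answers row for the instance above might be (again, purely illustrative):

answer = {
    "id": "q-001",
    "answer": "42",
}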

Relevant Metrics

Metric             Description
Accuracy           The percentage of examples for which the model produced the correct answer
Calibration Error  A measure of how much the model's confidence across its answers differs from the true probability of being correct
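
As a rough illustration, both metrics can be computed from a list of judgements, where each judgement carries a correctness flag and a confidence in [0, 1]. This is a generic sketch using a standard binned expected-calibration-error formulation, not necessarily the exact formulas used by the pipeline:

def accuracy(judgements):
    # judgements: list of (correct: bool, confidence: float in [0, 1]) pairs
    return sum(correct for correct, _ in judgements) / len(judgements)

def expected_calibration_error(judgements, n_bins=10):
    # Bucket answers by confidence, then compare the average confidence with
    # the observed accuracy in each bucket, weighted by bucket size.
    bins = [[] for _ in range(n_bins)]
    for correct, confidence in judgements:
        index = min(int(confidence * n_bins), n_bins - 1)
        bins[index].append((correct, confidence))
    total = len(judgements)
    error = 0.0
    for bucket in bins:
        if bucket:
            bucket_accuracy = sum(c for c, _ in bucket) / len(bucket)
            bucket_confidence = sum(conf for _, conf in bucket) / len(bucket)
            error += len(bucket) / total * abs(bucket_accuracy - bucket_confidence)
    return error

judgements = [(True, 0.9), (False, 0.6), (True, 0.8)]
# accuracy is about 0.67, calibration error about 0.30 for this toy input
print(accuracy(judgements), expected_calibration_error(judgements))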

Supported Models

The QA pipeline supports the following model types:

  • API-based models
  • Function-based models (a custom function wrapping another model)

Function-Based Models

Function-based models let users wrap other models in a function for use in the QA evaluation pipeline. Such a function must meet the following criteria:

  • Must be async - this allows the QA pipeline to generate answers concurrently
  • Must take a single string argument as input, representing the question
  • Must return a QAOutput instance
View an example function-based model
from evals_hub.evaluator.qa_eval import QAOutput

async def custom_model_wrapper(content: str) -> QAOutput:
    # `content` is the question text passed in by the QA pipeline.
    # A real wrapper would call an underlying model or application here;
    # this example simply returns a fixed answer.
    return QAOutput(
        explanation="I read the Hitchhiker's Guide to the Galaxy",
        answer="42",
        confidence="100"
    )
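
Assuming QAOutput exposes its fields as attributes, the wrapper can be tried out on its own; the pipeline awaits the function for you, so asyncio.run is only needed for standalone testing:

import asyncio

output = asyncio.run(custom_model_wrapper("What do you get if you multiply six by nine?"))
print(output.answer, output.confidence)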

Examples for Model Types

View an example config file for an API-based model
run-benchmark:
  task_name: "qa"
  dataset:
    name: "hle_futurehouse_gold"
    split: "test"

  model:
    checkpoint: "openai_gpt41_nano"
    system_prompt_path: examples/runner_configurations/qa/prompts/qa_model_system_prompt.txt

  judge:
    checkpoint: "openai_gpt41_nano"
    user_prompt_path: examples/runner_configurations/qa/prompts/qa_judge_user_prompt.txt

  # Additional Settings
  ...
View an example config file for a function-based model
run-benchmark:
  task_name: "qa"
  dataset:
    name: "hle_futurehouse_gold"
    split: "test"

  model:
    import_path: "evals_hub.examples.qa_application_wrapper:application_wrapper"

  judge:
    checkpoint: "openai_o3_mini"
    user_prompt_path: examples/runner_configurations/qa/prompts/qa_judge_user_prompt.txt

  # Additional Settings
  ...