QA
Question answering (QA) is a common task for language models, in which a model generates an answer to a given question. The goal of the task is to produce the correct answer with an appropriate level of confidence.
Core Process
```mermaid
graph TD
    A[Questions] --> B[Language Model:<br/>Generate Answers]
    G -->|Retrieve Questions/Answers| C[Language Model:<br/>Generate Judgements]
    G -->|Retrieve Judgements| D[Evaluation Metrics]
    E[Ground Truth] --> C
    B -.->|Log Questions/Answers| G[Langfuse:<br/>Ingest new traces]
    C -.->|Log Judgements| G
```
- Questions are sent to the language model.
- The language model generates answers as structured output containing an explanation, the answer, and a confidence score (see the sketch after this list). The outputs, along with the questions, are logged to Langfuse.
- Question-Answer pairs are retrieved from Langfuse, and a language model is used to judge the answers.
- The judgements are logged to Langfuse.
- Judgements are retrieved and metrics are calculated.
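For reference, the structured answer output can be thought of as a small schema with exactly those three fields. The snippet below is a minimal, illustrative sketch using pydantic; the class actually used by the pipeline is `QAOutput` from `evals_hub.evaluator.qa_eval`, shown in the function-based model example further down.

```python
from pydantic import BaseModel


class AnswerSketch(BaseModel):
    """Illustrative stand-in for the pipeline's structured answer output."""

    explanation: str  # the model's reasoning for its answer
    answer: str       # the final answer
    confidence: str   # self-reported confidence score, mirroring the QAOutput example below
```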
Data Schema Specifications
The following schema is expected for the QA evaluations pipeline:
Instances Dataset Schema
| Column | Type | Description | Required |
|---|---|---|---|
| id | string | Unique question identifier | ✓ |
| question | string | Question | ✓ |
| answer_type | string | Answer type (e.g. multipleChoice, exactMatch, etc.) | ✓ |
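For illustration, rows conforming to this schema could look like the following. The values and the Python-dict representation are purely illustrative; the actual storage format depends on how the dataset is loaded.

```python
# Hypothetical instances rows; ids, questions and answer types are made up.
instances = [
    {"id": "q_0001", "question": "What is the capital of France?", "answer_type": "exactMatch"},
    {"id": "q_0002", "question": "Which planet is the Red Planet? (A) Venus (B) Mars (C) Jupiter", "answer_type": "multipleChoice"},
]
```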
Answers Dataset Schema
| Column | Type | Description | Required |
|---|---|---|---|
| id | string | Unique question identifier | ✓ |
| answer | string | Ground truth answer | X |
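The matching ground-truth rows could look like this (again purely illustrative; note that `answer` is not required):

```python
# Hypothetical answers rows matching the instances above; "answer" may be omitted.
answers = [
    {"id": "q_0001", "answer": "Paris"},
    {"id": "q_0002", "answer": "B"},
]
```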
Relevant Metrics
| Metric | Description |
|---|---|
| Accuracy | The percentage of examples for which the model produced the correct answer |
| Calibration Error | A measure of how much the model's stated confidence deviates from the true probability of its answers being correct |
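As a rough illustration of how these metrics can be computed from the judgements, the sketch below calculates accuracy and a binned expected calibration error. The function name, the judgement format (a list of `(is_correct, confidence)` pairs with confidence in [0, 1]), and the bin count are assumptions for the example, not the pipeline's actual implementation.

```python
def compute_metrics(judgements: list[tuple[bool, float]], n_bins: int = 10) -> dict[str, float]:
    """Compute accuracy and a binned expected calibration error (ECE).

    `judgements` is a list of (is_correct, confidence) pairs, with
    confidence expressed as a probability in [0, 1].
    """
    n = len(judgements)
    accuracy = sum(correct for correct, _ in judgements) / n

    # ECE: bucket answers by confidence, then weight the gap between each
    # bucket's mean confidence and its observed accuracy by the bucket size.
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        bucket = [(c, p) for c, p in judgements if lo < p <= hi or (b == 0 and p == 0.0)]
        if not bucket:
            continue
        bucket_acc = sum(c for c, _ in bucket) / len(bucket)
        bucket_conf = sum(p for _, p in bucket) / len(bucket)
        ece += (len(bucket) / n) * abs(bucket_acc - bucket_conf)

    return {"accuracy": accuracy, "calibration_error": ece}
```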
Supported Models
The QA pipeline supports the following model types:
- API-based models
- Custom function-based models
Function-Based Models
Function-based models let users wrap other models in a function for use in the QA evaluation pipeline. These functions must meet the following criteria:
- Must be async - this allows the QA pipeline to generate answers concurrently
- Must take a single string argument as input, representing the question
- Must return a QAOutput instance
View an example function-based model
```python
from evals_hub.evaluator.qa_eval import QAOutput


async def custom_model_wrapper(content: str) -> QAOutput:
    # Wrap any model call here; this stub simply returns a fixed answer.
    return QAOutput(
        explanation="I read the Hitchhiker's Guide to the Galaxy",
        answer="42",
        confidence="100",
    )
```
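Because the wrapper is async, it can be exercised on its own with asyncio. The question below is just an illustration, and the attribute access assumes `QAOutput` exposes its fields as attributes (as the constructor call above suggests):

```python
import asyncio

result = asyncio.run(custom_model_wrapper("What is the answer to life, the universe and everything?"))
print(result.answer, result.confidence)
```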
Examples for Model Types
View an example config file for an API-based model
```yaml
run-benchmark:
  task_name: "qa"
  dataset:
    name: "hle_futurehouse_gold"
    split: "test"
  model:
    checkpoint: 'openai_gpt41_nano'
    system_prompt_path: examples/runner_configurations/qa/prompts/qa_model_system_prompt.txt
  judge:
    checkpoint: "openai_gpt41_nano"
    user_prompt_path: examples/runner_configurations/qa/prompts/qa_judge_user_prompt.txt
  # Additional Settings
  ...
```
View an example config file for a function-based model
```yaml
run-benchmark:
  task_name: "qa"
  dataset:
    name: "hle_futurehouse_gold"
    split: "test"
  model:
    import_path: "evals_hub.examples.qa_application_wrapper:application_wrapper"
  judge:
    checkpoint: "openai_o3_mini"
    user_prompt_path: examples/runner_configurations/qa/prompts/qa_judge_user_prompt.txt
  # Additional Settings
  ...
```