Evaluating Langfuse LLM Traces with Evals Hub's External Evaluation Pipeline
This guide demonstrates how to use Evals Hub to evaluate Langfuse traces generated from reranking tasks using an external evaluation pipeline.
Reference: See also the Langfuse External Evaluation Cookbook for more background and alternative examples.
Overview
Evals Hub provides a comprehensive evaluation framework that can:
1. Generate Langfuse traces during evaluation (reranking task)
2. Fetch these traces from Langfuse
3. Apply external evaluation metrics
4. Update the traces with computed scores
Architecture Design
The external evaluation pipeline follows a modular architecture that separates model inference from evaluation:

Figure: Architecture of decoupled trace generation and external evaluation.
Key Benefits of This Architecture:
- Decoupled Evaluation: Model inference and evaluation run independently
- Scalable: External evaluation can process traces in batches
- Flexible: Add new metrics without re-running model inference
- Observable: Full trace history for debugging and analysis
- Reproducible: Deterministic evaluation on stored data
Prerequisites
- Environment Setup: Create a .env file with your Langfuse credentials (or export them in your shell):

  LANGFUSE_PUBLIC_KEY=pk-lf-...
  LANGFUSE_SECRET_KEY=sk-lf-...
  LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted instance

  The .env file must be present in the project root, or the variables must be exported before running scripts.
- Install Dependencies:

  uv venv
  source .venv/bin/activate
  uv pip install -e .
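To confirm the credentials are picked up correctly before running anything, a minimal check can be used. This is a sketch that assumes python-dotenv and the Langfuse Python SDK are installed; the Langfuse client also reads the LANGFUSE_* variables from the environment on its own:

import os
from dotenv import load_dotenv  # assumes python-dotenv is available
from langfuse import Langfuse

load_dotenv()  # loads LANGFUSE_* variables from the .env file in the project root
langfuse = Langfuse(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host=os.environ.get("LANGFUSE_HOST", "https://cloud.langfuse.com"),
)
print("Langfuse credentials valid:", langfuse.auth_check())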
Step 1: Generate Traces with Reranking Evaluation
First, run a reranking evaluation with Langfuse tracing enabled, using the configuration file reranking_config.yaml:
- Ensure langfuse.enabled: true and set langfuse.name: reranking_evaluation (or your preferred name) in your config. The trace name must match the name used in the external evaluation script.
Run the evaluation:
evals-hub run-benchmark --config reranking_config.yaml
This will:
- Load the reranking dataset
- Initialize the Reranker with Langfuse integration
- Generate traces for each query processed
- Store similarity scores and relevance information in Langfuse traces
Step 2: External Evaluation Pipeline
After generating traces, use the external evaluation pipeline to compute additional metrics:
uv run langfuse_trace_evaluation.py
The trace name used in fetch_all_traces(name) must match the name set in your reranking config (langfuse.name).
How the External Evaluation Works
The langfuse_trace_evaluation.py script:
- Fetches Traces: Uses fetch_all_traces to retrieve all traces with the specified name
- Extracts Data: Parses the trace output to get is_relevant and similarity_scores data
- Computes Metrics: Calculates additional metrics using functions from reranking_metrics.py:
  - Mean Reciprocal Rank (MRR@10)
  - Average Precision (AP)
- Updates Traces: Adds the computed scores back to Langfuse using update_trace_scores
Note: The script includes:
- Error Handling: Skips traces with missing or malformed outputs and prints a warning for each skipped trace.
- Langfuse Client Flushing: After updating scores, the script calls langfuse.flush() to ensure all updates are sent to the server.
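A minimal sketch of the guard and the final flush, reusing the names from the script (the exact checks in langfuse_trace_evaluation.py may differ):

for trace in traces:
    output = trace.output
    if not isinstance(output, dict) or "similarity_scores" not in output:
        print(f"Warning: skipping trace {trace.id} (missing or malformed output)")
        continue
    ...  # compute metrics and update the trace scores as shown in the sections below

langfuse.flush()  # score updates are buffered client-side; flush sends them to the server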
Key Components
- src/evals_hub/evaluator/reranking_eval.py: Reranking runner and trace generation
- langfuse_trace_evaluation.py: External evaluation script
- src/evals_hub/langfuse/langfuse_fetcher.py: Trace fetching
- src/evals_hub/langfuse/langfuse_updater.py: Score updates
- src/evals_hub/metrics/reranking_metrics.py: Metric computation
Trace Generation (reranking_eval.py)
from langfuse import observe  # Langfuse SDK v3; in v2: from langfuse.decorators import observe

@observe()
def process_query_instance(self, instance, ...):
    # Process the query and compute similarity scores
    # Store the results in the Langfuse trace
    return {
        "mrr": mrr,
        "ap": ap,
        "is_relevant": is_relevant,
        "similarity_scores": sim_scores.cpu().tolist(),
    }
Trace Fetching (langfuse_fetcher.py)
def fetch_all_traces(name: str):
    # Fetch traces in batches with retry logic
    # Handle pagination and timeouts
    return all_traces
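A possible implementation sketch, assuming the v2-style fetch_traces() helper of the Langfuse Python SDK (retry and timeout handling omitted for brevity; newer SDK versions expose an equivalent low-level API):

from langfuse import Langfuse

langfuse = Langfuse()  # credentials are read from the LANGFUSE_* environment variables

def fetch_all_traces(name: str, batch_size: int = 50):
    all_traces = []
    page = 1
    while True:
        # Page through the trace list, filtered by trace name
        response = langfuse.fetch_traces(name=name, page=page, limit=batch_size)
        all_traces.extend(response.data)
        if len(response.data) < batch_size:
            break  # last page reached
        page += 1
    return all_traces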
External Metrics Calculation
def evaluate_and_update_trace_scores(name: str = "reranking_evaluation", top_k: int = 10):
    traces = fetch_all_traces(name)
    for trace in traces:
        # Extract similarity scores and relevance data
        is_relevant = trace.output["is_relevant"]
        sim_scores = trace.output["similarity_scores"]
        # Compute metrics
        mrr = reciprocal_rank_at_k(is_relevant, sim_scores, k=top_k)
        ap = ap_score(is_relevant, sim_scores)
        # Update trace with new scores
        update_trace_scores(trace.id, mrr, ap)
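The metric helpers referenced above might look roughly like this. These are sketches of the standard MRR@k and AP formulas; the actual implementations in reranking_metrics.py may differ:

def reciprocal_rank_at_k(is_relevant, similarity_scores, k=10):
    # Rank candidates by similarity score (highest first) and find the first relevant hit
    ranked = sorted(zip(similarity_scores, is_relevant), key=lambda pair: pair[0], reverse=True)
    for rank, (_, relevant) in enumerate(ranked[:k], start=1):
        if relevant:
            return 1.0 / rank
    return 0.0

def ap_score(is_relevant, similarity_scores):
    # Average precision over the full ranking induced by the similarity scores
    ranked = sorted(zip(similarity_scores, is_relevant), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0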
Score Updates (langfuse_updater.py)
def update_trace_scores(trace_id: str, mrr: float, ap: float):
    langfuse.create_score(
        trace_id=trace_id,
        name="mrr@10",
        value=mrr,
        comment="Mean Reciprocal Rank at 10"
    )
    langfuse.create_score(
        trace_id=trace_id,
        name="average_precision",
        value=ap,
        comment="Average Precision Score"
    )
Step 3: View Results
After running the external evaluation, you can:
- View in Langfuse UI: Check your Langfuse dashboard to see the updated traces with additional scores
- Analyze Metrics: The external evaluation script prints summary statistics:
Processed 150 traces
Mean MRR@10: 0.4231
Mean AP: 0.4792
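A sketch of how such a summary could be produced, assuming the per-trace scores are collected into mrr_scores and ap_scores lists inside the evaluation loop (variable names are illustrative):

print(f"Processed {len(mrr_scores)} traces")
print(f"Mean MRR@10: {sum(mrr_scores) / len(mrr_scores):.4f}")
print(f"Mean AP: {sum(ap_scores) / len(ap_scores):.4f}")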
Customisation
You can extend this pipeline by:
- Adding New Metrics: Implement additional functions in reranking_metrics.py and call them in the external evaluation script (see the sketch after this list).
- Different Tasks: Adapt the pattern for other evaluation tasks by changing the trace generation and evaluation logic.
- Custom Filters: Modify the trace fetching logic to filter specific subsets.
- Batch Processing: Process traces in batches for large datasets.
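For example, an NDCG@10 helper could be added to reranking_metrics.py and scored alongside MRR and AP. The function below is a hypothetical sketch, not part of the existing codebase:

import math

def ndcg_at_k(is_relevant, similarity_scores, k=10):
    # Hypothetical new metric: normalized discounted cumulative gain at k
    ranked = sorted(zip(similarity_scores, is_relevant), key=lambda pair: pair[0], reverse=True)
    dcg = sum(rel / math.log2(rank + 1) for rank, (_, rel) in enumerate(ranked[:k], start=1))
    ideal = sorted(is_relevant, reverse=True)
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0

The computed value could then be attached to each trace with another langfuse.create_score(...) call, mirroring the MRR and AP updates in update_trace_scores.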
This approach provides a powerful way to enhance the LLM evaluation workflow with comprehensive tracing and flexible metric computation.