Evaluating Langfuse LLM Traces with Evals Hub's External Evaluation Pipeline

This guide demonstrates how to use Evals Hub to evaluate Langfuse traces generated from reranking tasks using an external evaluation pipeline.

Reference: See also the Langfuse External Evaluation Cookbook for more background and alternative examples.

Overview

Evals Hub provides a comprehensive evaluation framework that can:

  1. Generate Langfuse traces during evaluation (reranking task)
  2. Fetch these traces from Langfuse
  3. Apply external evaluation metrics
  4. Update the traces with computed scores

Architecture Design

The external evaluation pipeline follows a modular architecture that separates model inference from evaluation:

Figure: Architecture of decoupled trace generation and external evaluation, in three stages: (1) the application generates traces, (2) Langfuse stores the traces (inputs, outputs, metadata, timestamps), (3) the external evaluation pipeline fetches the traces, applies metrics, and writes scores back.

Key Benefits of This Architecture:

  1. Decoupled Evaluation: Model inference and evaluation run independently
  2. Scalable: External evaluation can process traces in batches
  3. Flexible: Add new metrics without re-running model inference
  4. Observable: Full trace history for debugging and analysis
  5. Reproducible: Deterministic evaluation on stored data

Prerequisites

  1. Environment Setup: Create a .env file with your Langfuse credentials (or export them in your shell):
    LANGFUSE_PUBLIC_KEY=pk-lf-...
    LANGFUSE_SECRET_KEY=sk-lf-...
    LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted instance
    

The .env file must be present in the project root, or the variables must be exported before running scripts.

  2. Install Dependencies:
    uv venv
    source .venv/bin/activate
    uv pip install -e .
    
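To confirm the credentials are picked up before running anything heavier, a quick sanity check can help. This is a minimal sketch, assuming python-dotenv is available for loading the .env file; the check_langfuse_connection.py name is hypothetical and not part of the repository:

# check_langfuse_connection.py (hypothetical helper)
from dotenv import load_dotenv  # assumes python-dotenv is installed
from langfuse import Langfuse

load_dotenv()  # reads the LANGFUSE_* variables from .env in the project root

langfuse = Langfuse()  # picks up credentials from the environment
if langfuse.auth_check():
    print("Langfuse credentials are valid.")
else:
    print("Langfuse authentication failed - check your .env values.")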

Step 1: Generate Traces with Reranking Evaluation

First, run a reranking evaluation with Langfuse tracing enabled using the configuration file:

reranking_config.yaml:

  • Ensure langfuse.enabled: true and set langfuse.name: reranking_evaluation (or your preferred name) in your config. The trace name must match the name used in the external evaluation script.
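A hypothetical excerpt of that section (the rest of reranking_config.yaml is omitted and other keys may differ in your setup):

langfuse:
  enabled: true
  name: reranking_evaluation  # must match the name used by fetch_all_traces in Step 2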

Run the evaluation:

evals-hub run-benchmark --config reranking_config.yaml

This will:

  • Load the reranking dataset
  • Initialize the Reranker with Langfuse integration
  • Generate traces for each query processed
  • Store similarity scores and relevance information in Langfuse traces

Step 2: External Evaluation Pipeline

After generating traces, use the external evaluation pipeline to compute additional metrics:

uv run langfuse_trace_evaluation.py

The trace name used in fetch_all_traces(name) must match the name set in your reranking config (langfuse.name).

How the External Evaluation Works

The langfuse_trace_evaluation.py script:

  1. Fetches Traces: Uses fetch_all_traces to retrieve all traces with the specified name
  2. Extracts Data: Parses the trace output to get is_relevant and similarity_scores data
  3. Computes Metrics: Calculates additional metrics using functions from reranking_metrics.py:
     • Mean Reciprocal Rank (MRR@10)
     • Average Precision (AP)
  4. Updates Traces: Adds the computed scores back to Langfuse using update_trace_scores

Note: The script includes:

  • Error Handling: Skips traces with missing or malformed outputs and prints a warning for each skipped trace.
  • Langfuse Client Flushing: After updating scores, the script calls langfuse.flush() to ensure all updates are sent to the server.
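A minimal sketch of both behaviors, assuming the same traces list and langfuse client used elsewhere in the script:

for trace in traces:
    output = trace.output
    # Error handling: skip traces whose output is missing or malformed
    if not isinstance(output, dict) or "is_relevant" not in output or "similarity_scores" not in output:
        print(f"Warning: skipping trace {trace.id} (missing or malformed output)")
        continue
    # ... compute metrics and call update_trace_scores(trace.id, mrr, ap) ...

# Flush the client so all queued score updates reach the Langfuse server before exit
langfuse.flush()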

Key Components

  • src/evals_hub/evaluator/reranking_eval.py: Reranking runner and trace generation
  • langfuse_trace_evaluation.py: External evaluation script
  • src/evals_hub/langfuse/langfuse_fetcher.py: Trace fetching
  • src/evals_hub/langfuse/langfuse_updater.py: Score updates
  • src/evals_hub/metrics/reranking_metrics.py: Metric computation

Trace Generation (reranking_eval.py)

from langfuse import observe  # SDK v3 import path; SDK v2 uses `from langfuse.decorators import observe`

@observe()
def process_query_instance(self, instance, ...):
    # Process the query and compute similarity scores for the candidates
    # Store results in the Langfuse trace as the observed return value
    return {
        "mrr": mrr,
        "ap": ap,
        "is_relevant": is_relevant,
        "similarity_scores": sim_scores.cpu().tolist(),
    }
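The dictionary returned here becomes the trace output in Langfuse, so its keys (is_relevant, similarity_scores) are the contract that the external evaluation script relies on when it reads trace.output in Step 2.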

Trace Fetching (langfuse_fetcher.py)

def fetch_all_traces(name: str):
    # Fetch traces in batches with retry logic
    # Handle pagination and timeouts
    return all_traces
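A minimal sketch of the fetching loop, assuming the Langfuse SDK's low-level REST wrapper (langfuse.api.trace.list); retry and timeout handling are omitted, and the exact client call may differ across SDK versions:

from langfuse import Langfuse

langfuse = Langfuse()

def fetch_all_traces(name: str, limit: int = 50):
    all_traces = []
    page = 1
    while True:
        # Fetch one page of traces matching the given name
        response = langfuse.api.trace.list(name=name, page=page, limit=limit)
        all_traces.extend(response.data)
        # Stop once a page comes back smaller than the page size
        if len(response.data) < limit:
            break
        page += 1
    return all_traces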

External Metrics Calculation

def evaluate_and_update_trace_scores(name: str = "reranking_evaluation", top_k: int = 10):
    traces = fetch_all_traces(name)
    for trace in traces:
        # Extract similarity scores and relevance data
        is_relevant = trace.output["is_relevant"]
        sim_scores = trace.output["similarity_scores"]

        # Compute metrics
        mrr = reciprocal_rank_at_k(is_relevant, sim_scores, k=top_k)
        ap = ap_score(is_relevant, sim_scores)

        # Update trace with new scores
        update_trace_scores(trace.id, mrr, ap)
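For reference, one plausible implementation of the two metrics; the actual functions in reranking_metrics.py may differ in signature or tie-handling:

def reciprocal_rank_at_k(is_relevant: list[bool], sim_scores: list[float], k: int = 10) -> float:
    # Rank documents by similarity score (highest first) and inspect the top k
    ranked = sorted(zip(sim_scores, is_relevant), key=lambda x: x[0], reverse=True)[:k]
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            return 1.0 / rank  # reciprocal rank of the first relevant document
    return 0.0  # no relevant document in the top k

def ap_score(is_relevant: list[bool], sim_scores: list[float]) -> float:
    # Average Precision: mean of precision@rank over the ranks of relevant documents
    ranked = sorted(zip(sim_scores, is_relevant), key=lambda x: x[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, relevant) in enumerate(ranked, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(len(precisions), 1)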

Score Updates (langfuse_updater.py)

def update_trace_scores(trace_id: str, mrr: float, ap: float):
    langfuse.create_score(
        trace_id=trace_id,
        name="mrr@10",
        value=mrr,
        comment="Mean Reciprocal Rank at 10"
    )
    langfuse.create_score(
        trace_id=trace_id,
        name="average_precision", 
        value=ap,
        comment="Average Precision Score"
    )

Step 3: View Results

After running the external evaluation, you can:

  1. View in Langfuse UI: Check your Langfuse dashboard to see the updated traces with additional scores
  2. Analyze Metrics: The external evaluation script prints summary statistics:
    Processed 150 traces
    Mean MRR@10: 0.4231
    Mean AP:     0.4792
    

Customisation

You can extend this pipeline by:

  1. Adding New Metrics: Implement additional functions in reranking_metrics.py and call them in the external evaluation script (see the sketch after this list).
  2. Different Tasks: Adapt the pattern for other evaluation tasks by changing the trace generation and evaluation logic.
  3. Custom Filters: Modify the trace fetching logic to filter specific subsets.
  4. Batch Processing: Process traces in batches for large datasets.
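As an example of point 1, a hypothetical NDCG@k metric could sit alongside the existing functions; this is only a sketch and not part of the current reranking_metrics.py:

import math

def ndcg_at_k(is_relevant: list[bool], sim_scores: list[float], k: int = 10) -> float:
    # Hypothetical extra metric: Normalized Discounted Cumulative Gain at k (binary relevance)
    ranked = sorted(zip(sim_scores, is_relevant), key=lambda x: x[0], reverse=True)[:k]
    dcg = sum(int(rel) / math.log2(rank + 1) for rank, (_, rel) in enumerate(ranked, start=1))
    ideal = sorted(is_relevant, reverse=True)[:k]
    idcg = sum(int(rel) / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

The new value could then be attached to each trace with an additional langfuse.create_score(...) call, mirroring the existing MRR and AP updates.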

This approach provides a powerful way to enhance the LLM evaluation workflow with comprehensive tracing and flexible metric computation.