The Uncomfortable Truth About LLM Benchmarks
Here’s a fact that might surprise you: GPT-4, Claude 3.5, and Gemini all score nearly identically on traditional fluency metrics. Yet anyone who’s used these models in production knows they behave very differently. Your chatbot hallucinates less with Claude. Your summarization pipeline produces more accurate outputs with GPT-4. Your document extraction works better with a fine-tuned smaller model than any of the giants.
So what’s going on? Why are we still chasing BLEU scores and perplexity when they clearly don’t predict real-world performance?
After working with LLMs in production for the past two years and evaluating everything from document understanding systems to conversational AI, I’ve learned one critical lesson: The metrics that got us here won’t get us where we’re going.
In this guide, I’ll show you:
- Why traditional benchmarks fail for modern LLMs
- What actually matters in 2026 (spoiler: faithfulness and informativeness)
- How to choose between DeepEval, Ragas, and LangSmith
- A practical 5-minute setup to start evaluating properly
- Production monitoring strategies that actually work
Let’s dive in.
Part 1: The Benchmark Trap
Traditional Metrics Were Built for a Different Era
When BLEU was introduced in 2002, it was revolutionary. When perplexity became the standard measure, it made sense. But these metrics were designed for:
- Statistical machine translation (not neural networks)
- Deterministic outputs (not probabilistic generation)
- Narrow tasks (not general-purpose intelligence)
The problem? Modern LLMs broke all these assumptions.
graph LR
A[2002-2015:<br/>Traditional Metrics] --> B[BLEU<br/>ROUGE<br/>Perplexity]
B --> C[Worked for:<br/>• MT Systems<br/>• Deterministic Output<br/>• Narrow Tasks]
D[2020-2026:<br/>Modern LLMs] --> E[Need New Metrics]
E --> F[Faithfulness<br/>Informativeness<br/>Context Relevance]
F --> G[Work for:<br/>• General AI<br/>• Probabilistic Output<br/>• Complex Tasks]
style A fill:#ff6b6b
style D fill:#51cf66
style B fill:#ffd43b
style F fill:#339af0
Why BLEU, ROUGE, and Perplexity Fall Short
Let me show you a concrete example. Here are three model outputs for the same question:
Question: “What is the capital of France?”
Model A Output:
The capital of France is Paris, a beautiful city known for its art,
culture, and the Eiffel Tower.
- BLEU Score: 0.42
- ROUGE-L: 0.38
Model B Output:
Paris is the capital of France.
- BLEU Score: 0.68
- ROUGE-L: 0.71
Model C Output:
The capital city of France is Paris, located in the north-central
part of the country.
- BLEU Score: 0.51
- ROUGE-L: 0.49
Which is “better”? Model B wins on traditional metrics because it’s closest to a reference answer “Paris is the capital of France.” But Model A and C provide more informative responses. In production, informativeness often matters more than exact n-gram matching.
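To see why the shortest answer wins, here is a minimal sketch of ROUGE-L computed from the longest common subsequence (LCS) of tokens. The tokenizer and the function names are assumptions made for illustration, and the scores it produces will not necessarily match the numbers above, which depend on the scorer and tokenization used.

```python
import re

REFERENCE = "Paris is the capital of France."


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    cand, ref = tokenize(candidate), tokenize(reference)
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


outputs = {
    "A": "The capital of France is Paris, a beautiful city known for its art, "
         "culture, and the Eiffel Tower.",
    "B": "Paris is the capital of France.",
    "C": "The capital city of France is Paris, located in the north-central "
         "part of the country.",
}
scores = {name: rouge_l_f1(text, REFERENCE) for name, text in outputs.items()}
for name, score in scores.items():
    print(f"Model {name}: ROUGE-L F1 = {score:.2f}")
```

Every extra word in Models A and C lowers precision, so the metric penalizes exactly the detail a user might find helpful.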
The 2026 Research Finding That Changes Everything
Recent empirical research analyzed 243,337 manual annotations across GPT-4, Claude, and ChatGPT. The key finding:
Fluency is no longer a primary performance differentiator. Informativeness and accuracy are the actual discriminators.
This is huge. It means:
- All modern LLMs are “fluent enough”
- The competition is now on factual accuracy and usefulness
- Traditional fluency metrics (BLEU, ROUGE) measure the wrong thing
Think about it: When was the last time you said “This LLM response is grammatically incorrect”? You didn’t. You said “This is hallucinated” or “This doesn’t answer my question.”
Part 2: What Actually Matters in 2026
Based on production experience and recent research, here are the metrics that actually predict LLM success:
graph TD
A[Modern LLM Evaluation] --> B[1. Faithfulness<br/>Anti-Hallucination]
A --> C[2. Informativeness<br/>Value Metric]
A --> D[3. Context Relevance<br/>Retrieval Quality]
A --> E[4. Answer Accuracy<br/>Ground Truth]
B --> F[Does output match<br/>source context?]
C --> G[How much useful<br/>info is conveyed?]
D --> H[Are retrieved docs<br/>relevant?]
E --> I[Is the answer<br/>factually correct?]
style A fill:#667eea,color:#fff
style B fill:#51cf66
style C fill:#339af0
style D fill:#ff6b6b
style E fill:#ffd43b
1. Faithfulness (The Anti-Hallucination Metric)
Definition: Does the output accurately reflect the source context without adding false information?
Why it matters: A fluent, well-written hallucination is worse than a clunky but accurate response.
Example:
# Context (from a research paper)
context = """
The study found that 67% of participants showed improvement
after 8 weeks of treatment. The control group showed 12% improvement.
"""
# LLM Output 1 (Low Faithfulness)
output_1 = """
The treatment was highly effective, with most patients showing
significant improvement within just 2 weeks.
"""
# Faithfulness Score: 0.3 - Hallucinated timeline (2 weeks vs 8 weeks)
# LLM Output 2 (High Faithfulness)
output_2 = """
67% of treatment group participants improved after 8 weeks,
compared to 12% in the control group.
"""
# Faithfulness Score: 0.95 - Accurate, grounded in context
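As a rough illustration of what "grounded in context" means, the sketch below checks whether every number in an output also appears in the context. This is a crude proxy invented for this example, not how DeepEval or Ragas compute faithfulness (they use an LLM to extract and verify claims), and the function names are hypothetical.

```python
import re


def extract_numbers(text: str) -> set[str]:
    """Collect integer and decimal literals that appear in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def numeric_grounding_score(output: str, context: str) -> float:
    """Fraction of numbers in the output that also appear in the context."""
    claimed = extract_numbers(output)
    if not claimed:
        return 1.0  # no numeric claims to contradict
    supported = claimed & extract_numbers(context)
    return len(supported) / len(claimed)


context = (
    "The study found that 67% of participants showed improvement "
    "after 8 weeks of treatment. The control group showed 12% improvement."
)
output_1 = (
    "The treatment was highly effective, with most patients showing "
    "significant improvement within just 2 weeks."
)
output_2 = (
    "67% of treatment group participants improved after 8 weeks, "
    "compared to 12% in the control group."
)

print(numeric_grounding_score(output_1, context))  # the "2 weeks" claim is unsupported
print(numeric_grounding_score(output_2, context))  # every number is in the context
```

A check like this only catches numeric drift. Unsupported qualitative claims such as "highly effective" still need an LLM-based or human review.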
2. Informativeness (The Value Metric)
Definition: How much useful information does the response convey?
Why it matters: Users want answers, not filler.
Example:
question = "How do I reduce hallucinations in my RAG system?"
# Low Informativeness
response_low = """
To reduce hallucinations, you should improve your system and use
better retrieval methods.
"""
# High Informativeness
response_high = """
To reduce hallucinations in RAG systems:
1. Improve retrieval quality:
- Use hybrid search (keyword + semantic)
- Increase chunk overlap to 20-50 tokens
- Implement reranking with a cross-encoder
2. Constrain generation:
- Add "only use provided context" to system prompt
- Set temperature to 0.1-0.3 for factual tasks
- Use citation mechanisms to link claims to sources
3. Add verification:
- Implement SelfCheckGPT for consistency
- Use an LLM judge to verify faithfulness
- Track hallucination metrics in production
Research shows hybrid RAG can reduce errors by 35-60%.
"""
The second response provides actionable, specific, measurable information. This is what informativeness measures.
3. Context Relevance (The Retrieval Quality Metric)
Definition: For RAG systems, are the retrieved documents actually relevant to answering the question?
Why it matters: Garbage in, garbage out. If your retrieval is poor, your generation will be poor.
Example:
query = "What is the return policy for electronics?"
# Low Relevance Retrieved Context
retrieved_docs = [
"Our electronics department features the latest smartphones...",
"Customer service hours are Monday-Friday 9am-5pm...",
"Electronics require special recycling procedures..."
]
# Context Relevance: 0.2
# High Relevance Retrieved Context
retrieved_docs = [
"Electronics can be returned within 30 days with original packaging...",
"Defective electronics are covered by manufacturer warranty...",
"Return process: 1) Bring item and receipt 2) Visit customer service..."
]
# Context Relevance: 0.9
4. Answer Accuracy (The Ground Truth Metric)
Definition: Is the answer factually correct?
Why it matters: This is the bottom line. A beautiful, fluent, well-structured wrong answer is still wrong.
Measuring it:
# For tasks with ground truth
def accuracy_score(predicted, ground_truth):
    """Simple exact match for closed-ended questions."""
    return 1.0 if predicted.strip().lower() == ground_truth.strip().lower() else 0.0

# For subjective tasks, use LLM-as-judge
def llm_judge_accuracy(question, predicted, reference):
    """Use a strong LLM to judge correctness."""
    prompt = f"""
    Compare the predicted answer to the reference answer.
    Question: {question}
    Predicted: {predicted}
    Reference: {reference}
    Score accuracy from 0-1, where:
    0 = Completely wrong
    0.5 = Partially correct
    1.0 = Fully correct
    Return only the numeric score.
    """
    # call_llm_judge is a placeholder for your own judge-model client
    return call_llm_judge(prompt)
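Judge models do not always return a bare number even when asked to, so it helps to parse the reply defensively. A minimal sketch, assuming the judge returns free text that contains the score somewhere; `parse_judge_score` is a hypothetical helper, not part of any library.

```python
import re
from typing import Optional


def parse_judge_score(raw: str) -> Optional[float]:
    """Return the first number in the reply if it lies in [0, 1], else None."""
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    if match is None:
        return None  # the judge did not return a number at all
    score = float(match.group())
    if not 0.0 <= score <= 1.0:
        return None  # out of range, e.g. the judge used a 0-10 scale
    return score


for reply in ["0.5", "Score: 1.0", "I cannot judge this answer.", "7"]:
    print(repr(reply), "->", parse_judge_score(reply))
```

Returning `None` instead of clamping keeps malformed replies visible, so they can be retried or counted rather than silently turned into a score.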
Part 3: Framework Comparison - DeepEval vs Ragas vs LangSmith
You’ve decided to evaluate properly. Now you need tools. Here’s how the top 3 frameworks compare:
Quick Decision Matrix
| Use Case | Best Framework | Why |
|---|---|---|
| RAG experimentation | Ragas | Lightweight, RAG-focused, quick setup |
| Enterprise testing | DeepEval | CI/CD integration, custom metrics, comprehensive |
| LangChain workflows | LangSmith | Native integration, observability built-in |
| Multi-framework strategy | All three | DeepEval for testing, LangSmith for monitoring, Ragas for quick checks |
graph TB
subgraph DeepEval
D1[Comprehensive Testing]
D2[14+ RAG Metrics]
D3[CI/CD Ready]
D4[Custom Metrics]
D5[Like Pytest]
end
subgraph Ragas
R1[RAG-Focused]
R2[Lightweight]
R3[Quick Setup]
R4[Domain-Specific]
R5[Like Pandas]
end
subgraph LangSmith
L1[Observability]
L2[LangChain Native]
L3[Hosted Platform]
L4[Tracing Built-in]
L5[Enterprise Ready]
end
USE[Your Use Case] --> Q1{Need CI/CD<br/>Integration?}
Q1 -->|Yes| D1
Q1 -->|No| Q2{Using<br/>LangChain?}
Q2 -->|Yes| L1
Q2 -->|No| Q3{Quick<br/>Experiment?}
Q3 -->|Yes| R1
Q3 -->|No| D1
style DeepEval fill:#51cf66
style Ragas fill:#339af0
style LangSmith fill:#ffd43b
style USE fill:#667eea,color:#fff
🎯 Framework Selection Guide
- Starting out? → Ragas (easiest setup)
- Production system? → DeepEval (most comprehensive)
- Already using LangChain? → LangSmith (native integration)
- Enterprise with CI/CD? → DeepEval + LangSmith (best of both)
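The decision flow in the diagram above can be written as a small function. This is only a sketch of that flowchart; the function and parameter names are made up for illustration.

```python
def choose_framework(needs_ci_cd: bool, uses_langchain: bool, quick_experiment: bool) -> str:
    """Follow the flowchart: CI/CD first, then LangChain, then quick experiments."""
    if needs_ci_cd:
        return "DeepEval"
    if uses_langchain:
        return "LangSmith"
    if quick_experiment:
        return "Ragas"
    return "DeepEval"  # default branch of the flowchart


print(choose_framework(needs_ci_cd=False, uses_langchain=True, quick_experiment=False))
```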
DeepEval: The “Pytest for LLMs”
Philosophy: Treat LLM evaluation like unit testing
Best Features:
- 14+ built-in metrics for RAG
- Custom metric creation
- CI/CD integration
- Works with any LLM (OpenAI, Anthropic, local models)
Code Example:
# Installation
# pip install deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
def test_rag_response():
    """Test RAG system response quality"""
    # Define test case
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="Paris is the capital of France, located in the north-central region.",
        retrieval_context=[
            "Paris is the capital and largest city of France.",
            "The city of Paris is located in north-central France."
        ]
    )
    # Define metrics
    faithfulness = FaithfulnessMetric(threshold=0.7)
    relevancy = AnswerRelevancyMetric(threshold=0.7)
    # Assert (fails test if below threshold)
    assert_test(test_case, [faithfulness, relevancy])

# Run with: pytest test_rag.py
Custom Metric Example:
import re

from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class CitationMetric(BaseMetric):
    """Custom metric to check if response includes citations"""

    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def measure(self, test_case: LLMTestCase) -> float:
        # Check if output contains citation markers like [1], [2]
        citations = re.findall(r'\[\d+\]', test_case.actual_output)
        # Calculate score based on citation density (citations per 100 words)
        words = len(test_case.actual_output.split())
        citation_density = len(citations) / (words / 100) if words else 0.0
        self.score = min(citation_density, 1.0)
        self.success = self.score >= self.threshold
        return self.score

    async def a_measure(self, test_case: LLMTestCase) -> float:
        # Nothing here is asynchronous, so reuse the synchronous implementation
        return self.measure(test_case)

    def is_successful(self) -> bool:
        return self.success

    @property
    def __name__(self):
        return "Citation Density"

# Usage
test_case = LLMTestCase(
    input="Explain quantum computing",
    actual_output="Quantum computing uses qubits [1]. Unlike classical bits [2]...",
)
metric = CitationMetric(threshold=0.5)
metric.measure(test_case)
print(f"Citation Score: {metric.score}")
Ragas: The Lightweight RAG Specialist
Philosophy: Purpose-built for RAG pipeline evaluation
Best Features:
- Domain-specific RAG metrics (context precision, faithfulness)
- Quick to set up (like pandas for evaluation)
- Good for experimentation phase
Code Example:
# Installation
# pip install ragas
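from ragas import evaluate
# Note: context_relevancy ships with older Ragas releases and may be missing
# from newer ones, which favor context_precision. Pin your version or swap the metric.
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall
)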
from datasets import Dataset
# Your RAG system outputs
data = {
'question': [
'What is the capital of France?',
'Who wrote Romeo and Juliet?'
],
'answer': [
'Paris is the capital of France.',
'William Shakespeare wrote Romeo and Juliet in the 1590s.'
],
'contexts': [
['Paris is the capital and most populous city of France.'],
['Romeo and Juliet is a tragedy written by William Shakespeare early in his career.']
],
'ground_truth': [
'Paris',
'William Shakespeare'
]
}
dataset = Dataset.from_dict(data)
# Evaluate
results = evaluate(
dataset,
metrics=[
faithfulness,
answer_relevancy,
context_relevancy,
context_recall
]
)
print(results)
# Output:
# {
# 'faithfulness': 0.95,
# 'answer_relevancy': 0.92,
# 'context_relevancy': 0.88,
# 'context_recall': 0.91
# }
LangSmith: The Observability Platform
Philosophy: Evaluate + monitor + debug in one platform
Best Features:
- Built by LangChain team
- Automatic tracing of chain execution
- Hosted platform (no infrastructure)
Code Example:
# Installation
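# pip install langsmith langchain langchain-openai
from langsmith import Client
# Recent LangChain releases moved these out of langchain.chat_models / langchain.prompts
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate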
# Initialize LangSmith
client = Client()
# Create a chain
llm = ChatOpenAI(temperature=0)
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant."),
("user", "{question}")
])
chain = prompt | llm
# Run with automatic tracing
result = chain.invoke(
{"question": "What is the capital of France?"},
config={"run_name": "capital_question"}
)
# Evaluate with LangSmith
from langsmith.evaluation import evaluate
def correctness_evaluator(run, example):
    """Custom evaluator function"""
    predicted = run.outputs["output"]
    expected = example.outputs["answer"]
    return {"score": 1.0 if expected.lower() in predicted.lower() else 0.0}
# Run evaluation on a dataset
evaluate(
lambda x: chain.invoke(x),
data="my_evaluation_dataset",
evaluators=[correctness_evaluator],
experiment_prefix="rag_eval_v1"
)
Part 4: 5-Minute Quick Start with DeepEval
Let’s get you evaluating properly in 5 minutes. I’ll use DeepEval because it’s the easiest to integrate into existing workflows.
Step 1: Install DeepEval
pip install deepeval
Step 2: Set Up Your First Test
Create test_llm.py:
import pytest
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    FaithfulnessMetric,
    AnswerRelevancyMetric,
    HallucinationMetric
)

# Simulate your LLM system
def my_rag_system(question):
    """
    Your RAG system here. This is a mock for demonstration.
    In practice, replace with actual RAG pipeline.
    """
    if "capital" in question.lower() and "france" in question.lower():
        return {
            "answer": "Paris is the capital of France.",
            "context": ["Paris is the capital and most populous city of France."]
        }
    return {"answer": "I don't know.", "context": []}

@pytest.mark.parametrize(
    "question,expected_keywords",
    [
        ("What is the capital of France?", ["Paris", "capital"]),
        ("Who is the president of France?", ["president"]),
    ]
)
def test_rag_quality(question, expected_keywords):
    """Test that RAG responses are faithful and relevant"""
    # Get response from your system
    response = my_rag_system(question)
    # Create test case
    test_case = LLMTestCase(
        input=question,
        actual_output=response["answer"],
        retrieval_context=response["context"],
        context=response["context"]  # HallucinationMetric reads `context`
    )
    # Define metrics with thresholds
    faithfulness = FaithfulnessMetric(
        threshold=0.7,
        model="gpt-3.5-turbo"  # LLM to use for evaluation
    )
    relevancy = AnswerRelevancyMetric(
        threshold=0.7,
        model="gpt-3.5-turbo"
    )
    hallucination = HallucinationMetric(
        threshold=0.3  # Lower is better for hallucination
    )
    # Run evaluation
    assert_test(test_case, [faithfulness, relevancy, hallucination])
    # Check if expected keywords are present
    for keyword in expected_keywords:
        assert keyword.lower() in response["answer"].lower(), \
            f"Expected keyword '{keyword}' not found in response"

# Run with: pytest test_llm.py -v
Step 3: Run Your Tests
# Run all tests
pytest test_llm.py -v
# Run with detailed output
deepeval test run test_llm.py
# Generate HTML report
deepeval test run test_llm.py --output report.html
Expected Output:
test_llm.py::test_rag_quality[What is the capital of France?-keywords0]
Faithfulness Score: 0.95 ✓
Answer Relevancy Score: 0.92 ✓
Hallucination Score: 0.05 ✓
PASSED
test_llm.py::test_rag_quality[Who is the president of France?-keywords1]
Faithfulness Score: 0.45 ✗
Answer Relevancy Score: 0.15 ✗
FAILED - Scores below threshold
Step 4: Integrate into CI/CD
# .github/workflows/llm-tests.yml
name: LLM Evaluation Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: |
          pip install deepeval pytest
          pip install -r requirements.txt
      - name: Run LLM evaluation tests
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          pytest test_llm.py -v --tb=short
      - name: Generate evaluation report
        if: always()
        run: |
          deepeval test run test_llm.py --output evaluation-report.html
      # upload-artifact v3 has been retired, so use v4
      - name: Upload report
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: evaluation-report
          path: evaluation-report.html
Now every PR will automatically evaluate your LLM system!
Part 5: Production Monitoring - Evaluation vs Observability
Setting up evaluation is great for development. But production is different. You need observability.
The Critical Distinction
graph LR
subgraph Evaluation
E1[Pre-Production]
E2[Test Sets]
E3[Metrics:<br/>Faithfulness<br/>Relevancy<br/>Accuracy]
E4[Questions:<br/>Is it good enough?<br/>What's the score?]
end
subgraph Observability
O1[Production]
O2[Live Traffic]
O3[Monitoring:<br/>Latency<br/>Cost<br/>Quality Drift]
O4[Questions:<br/>Why did it fail?<br/>When did it degrade?]
end
DEV[Development] --> E1
E1 --> DEPLOY{Deploy?}
DEPLOY -->|Pass| O1
DEPLOY -->|Fail| E1
O1 --> FEEDBACK[Feedback Loop]
FEEDBACK --> E1
style Evaluation fill:#51cf66
style Observability fill:#339af0
style DEPLOY fill:#ffd43b
📊 Evaluation
Purpose: Measure quality before deployment
- Test sets with ground truth
- Offline metrics
- Pre-production phase
- Question: "Is it good enough?"
🔍 Observability
Purpose: Monitor behavior in production
- Live user traffic
- Real-time metrics
- Post-deployment phase
- Question: "Why did it fail?"
The Silent Failure Problem
Unlike traditional software that crashes with clear error messages, LLMs fail silently:
# Traditional Software
def divide(a, b):
    return a / b

result = divide(10, 0)
# CRASH: ZeroDivisionError - Clear, immediate, debuggable

# LLM Application
def answer_question(question):
    return llm.generate(question)

result = answer_question("What is 2+2?")
# Returns: "2+2 equals 5"
# NO ERROR - Silent failure, plausible but wrong
This is why you need observability - to catch these silent failures.
Production Observability Stack
graph TD
USER[User Query] --> API[LLM API]
API --> TRACE[Tracing Layer<br/>LangSmith/Braintrust]
TRACE --> RETRIEVE[Retrieval]
RETRIEVE --> GENERATE[Generation]
GENERATE --> EVAL[Async Evaluation<br/>DeepEval/Ragas]
TRACE --> METRICS[Metrics Collection]
METRICS --> LAT[Latency Tracking]
METRICS --> COST[Cost Monitoring]
METRICS --> QUAL[Quality Scoring]
LAT --> ALERT{Alert<br/>Threshold?}
COST --> ALERT
QUAL --> ALERT
ALERT -->|Exceeded| NOTIFY[Notify Team]
ALERT -->|Normal| LOG[Log to Dashboard]
EVAL --> FEED[Feedback Loop]
FEED --> IMPROVE[Model Improvement]
style USER fill:#667eea,color:#fff
style ALERT fill:#ff6b6b,color:#fff
style NOTIFY fill:#fa5252,color:#fff
style IMPROVE fill:#51cf66
Here’s what a production-ready stack looks like:
import time

from deepeval.metrics import FaithfulnessMetric
from langsmith import trace  # context manager that records a traced run

class ProductionRAGSystem:
    """RAG system with built-in observability (a sketch).

    retrieve, generate, evaluate_async, alert, and LatencyTracker are
    placeholders for your own implementations.
    """

    def __init__(self):
        self.metrics = {
            'faithfulness': FaithfulnessMetric(threshold=0.7)
        }
        self.cost_tracker = CostTracker()
        self.latency_tracker = LatencyTracker()

    def query(self, question: str, user_id: str):
        """Process query with full observability"""
        # Start trace
        with trace(
            name="rag_query",
            run_type="chain",
            inputs={"question": question},
            metadata={"user_id": user_id}
        ) as run:
            start_time = time.time()
            # Retrieve context
            context = self.retrieve(question)
            # Generate response
            response = self.generate(question, context)
            # Track latency
            latency = time.time() - start_time
            self.latency_tracker.record(latency)
            # Track cost
            cost = self.cost_tracker.calculate(response.tokens)
            # Evaluate quality (async in production)
            quality_score = self.evaluate_async(question, response, context)
            # Log metrics on the traced run
            run.end(outputs={
                "retrieved_chunks": len(context),
                "latency_ms": latency * 1000,
                "cost_usd": cost,
                "quality_score": quality_score,
                "tokens": response.tokens
            })
            # Alert if quality drops
            if quality_score < 0.6:
                self.alert("Low quality response detected", {
                    "question": question,
                    "score": quality_score
                })
            return response

class CostTracker:
    """Track token usage and costs"""

    def __init__(self):
        self.costs = {
            "gpt-4": {"input": 0.03, "output": 0.06},  # per 1K tokens
            "gpt-3.5-turbo": {"input": 0.001, "output": 0.002}
        }

    def calculate(self, tokens, model="gpt-4"):
        """Calculate cost for token usage"""
        input_cost = (tokens.input / 1000) * self.costs[model]["input"]
        output_cost = (tokens.output / 1000) * self.costs[model]["output"]
        total = input_cost + output_cost
        # Log to monitoring system (log_to_datadog is a placeholder)
        self.log_to_datadog(total, tokens)
        return total
Key Metrics to Track in Production
# 1. Latency Distribution
latency_p50_ms = 120  # Median response time
latency_p95_ms = 450  # 95th percentile
latency_p99_ms = 890  # 99th percentile (catch outliers)
# 2. Cost Metrics
daily_cost_usd = 127.34
cost_per_request_usd = 0.023
token_efficiency = 0.78 # Ratio of useful to total tokens
# 3. Quality Metrics
hallucination_rate = 0.03 # 3% of responses
faithfulness_score = 0.89
user_feedback_positive = 0.92
# 4. System Health
error_rate = 0.001 # 0.1% errors
timeout_rate = 0.005 # 0.5% timeouts
cache_hit_rate = 0.67 # 67% served from cache
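Percentiles such as p50, p95, and p99 can be computed directly from raw latency samples. Below is a minimal sketch using the nearest-rank method; the sample data is synthetic, and a production system would more likely rely on its monitoring backend for this.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Synthetic latencies of 1..100 ms, used purely for illustration
latencies_ms = [float(v) for v in range(1, 101)]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p):.0f} ms")
```

Tail percentiles matter because an average can look healthy while one request in a hundred takes several times longer than the median.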
Setting Up Alerts
# alert_config.py
ALERT_THRESHOLDS = {
"quality": {
"faithfulness_score": {
"min": 0.7,
"severity": "high"
},
"hallucination_rate": {
"max": 0.05, # Alert if >5% hallucinations
"severity": "critical"
}
},
"performance": {
"latency_p95": {
"max": 500, # Alert if p95 >500ms
"severity": "medium"
}
},
"cost": {
"hourly_spend": {
"max": 10.0, # Alert if >$10/hour
"severity": "high"
}
}
}
def check_alerts(metrics):
    """Check if metrics exceed thresholds"""
    alerts = []
    if metrics['faithfulness'] < ALERT_THRESHOLDS['quality']['faithfulness_score']['min']:
        alerts.append({
            "type": "quality",
            "message": f"Faithfulness dropped to {metrics['faithfulness']}",
            "severity": "high"
        })
    if metrics['hourly_cost'] > ALERT_THRESHOLDS['cost']['hourly_spend']['max']:
        alerts.append({
            "type": "cost",
            "message": f"Hourly cost ${metrics['hourly_cost']:.2f} exceeds threshold",
            "severity": "high"
        })
    return alerts
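Hand-writing one `if` per metric gets repetitive as the configuration grows. One alternative is to walk the rules generically, as in the sketch below. It uses a flattened version of the threshold config, and `find_breaches` is a hypothetical helper written for this example.

```python
THRESHOLDS = {
    "faithfulness_score": {"min": 0.7, "severity": "high"},
    "hallucination_rate": {"max": 0.05, "severity": "critical"},
    "latency_p95": {"max": 500, "severity": "medium"},
    "hourly_spend": {"max": 10.0, "severity": "high"},
}


def find_breaches(metrics: dict, thresholds: dict) -> list[dict]:
    """Return one entry per metric that falls below its min or above its max."""
    breaches = []
    for name, rule in thresholds.items():
        if name not in metrics:
            continue  # metric not reported in this window
        value = metrics[name]
        too_low = "min" in rule and value < rule["min"]
        too_high = "max" in rule and value > rule["max"]
        if too_low or too_high:
            breaches.append(
                {"metric": name, "value": value, "severity": rule["severity"]}
            )
    return breaches


current = {
    "faithfulness_score": 0.65,
    "hallucination_rate": 0.03,
    "latency_p95": 620,
    "hourly_spend": 4.2,
}
for breach in find_breaches(current, THRESHOLDS):
    print(breach)
```

With this shape, adding a new alert is a one-line config change instead of a new branch in code.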
Part 6: Real-World Case Study
Let me share a real scenario from a document extraction system I worked on.
The Problem
We had a RAG system extracting information from financial documents. Traditional metrics looked great:
- BLEU Score: 0.78 ✓
- ROUGE-L: 0.82 ✓
- Perplexity: 23.4 ✓
But users complained the system was “making things up.”
The Investigation
When we added proper evaluation:
from deepeval.metrics import HallucinationMetric, FaithfulnessMetric
# Test on 100 real user queries
test_cases = load_production_queries(limit=100)
hallucination_metric = HallucinationMetric(threshold=0.3)
faithfulness_metric = FaithfulnessMetric(threshold=0.7)
results = []
for case in test_cases:
    hallucination_score = hallucination_metric.measure(case)
    faithfulness_score = faithfulness_metric.measure(case)
    results.append({
        'hallucination': hallucination_score,
        'faithfulness': faithfulness_score
    })
# Analysis
avg_hallucination = sum(r['hallucination'] for r in results) / len(results)
avg_faithfulness = sum(r['faithfulness'] for r in results) / len(results)
print(f"Hallucination Rate: {avg_hallucination:.2%}") # 23% ❌
print(f"Faithfulness Score: {avg_faithfulness:.2f}") # 0.54 ❌
Findings:
- 23% hallucination rate (should be <5%)
- Faithfulness score of 0.54 (should be >0.7)
- The model was adding “plausible” but incorrect financial figures
The Fix
We implemented a multi-stage solution:
# 1. Improved retrieval with hybrid search
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
# Combine semantic + keyword search
# (docs and embeddings are your own document list and embedding model)
vector_retriever = FAISS.from_documents(docs, embeddings).as_retriever()
keyword_retriever = BM25Retriever.from_documents(docs)
ensemble = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.6, 0.4]  # 60% semantic, 40% keyword
)
# 2. Constrained generation prompt
prompt = """
You are a financial document assistant.
CRITICAL RULES:
1. ONLY use information explicitly stated in the context below
2. If the answer is not in the context, say "Information not found"
3. Include citation numbers [1], [2] for each claim
4. Never infer, estimate, or extrapolate numbers
Context:
{context}
Question: {question}
Answer with citations:
"""
# 3. Post-generation verification
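from deepeval.test_case import LLMTestCase

def verify_response(question, response, context):
    """Verify the response against the context (a list of source strings)"""
    verifier = HallucinationMetric()
    # DeepEval metrics score an LLMTestCase, not raw strings
    test_case = LLMTestCase(input=question, actual_output=response, context=context)
    score = verifier.measure(test_case)
    if score > 0.1:  # >10% hallucination
        return "I cannot provide a confident answer based on the available documents."
    return response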
The Results
After implementing these changes:
| Metric | Before | After | Change |
|---|---|---|---|
| Hallucination Rate | 23% | 3.2% | -86% ✓ |
| Faithfulness Score | 0.54 | 0.91 | +69% ✓ |
| User Satisfaction | 67% | 94% | +40% ✓ |
| False Confidence | 31% | 4% | -87% ✓ |
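The Change column is a relative change, not a difference in percentage points. A quick sketch that reproduces it from the Before and After values in the table:

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from before to after, as a percentage of before."""
    return (after - before) / before * 100


rows = {
    "Hallucination Rate": (23.0, 3.2),
    "Faithfulness Score": (0.54, 0.91),
    "User Satisfaction": (67.0, 94.0),
    "False Confidence": (31.0, 4.0),
}

for metric, (before, after) in rows.items():
    print(f"{metric}: {relative_change_pct(before, after):+.0f}%")
```

For example, hallucinations fell by 19.8 percentage points (23% to 3.2%), which is an 86% relative reduction.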
%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#667eea'}}}%%
graph LR
subgraph Before
B1[Hallucination: 23%]
B2[Faithfulness: 0.54]
B3[User Sat: 67%]
end
subgraph Improvements
I1[Hybrid Search]
I2[Constrained Prompts]
I3[Post-Gen Verification]
end
subgraph After
A1[Hallucination: 3.2%]
A2[Faithfulness: 0.91]
A3[User Sat: 94%]
end
Before --> Improvements
Improvements --> After
style Before fill:#ff6b6b
style After fill:#51cf66
style Improvements fill:#339af0
📈 86% Reduction in Hallucinations
From 23% to 3.2% hallucination rate
User satisfaction jumped from 67% to 94%
Key Lesson: Traditional metrics (BLEU, ROUGE) showed no problem.
Modern evaluation metrics (faithfulness, hallucination) revealed the critical issues.
Part 7: Best Practices and Common Pitfalls
graph TD
START[Start Evaluation] --> Q1{Have<br/>test data?}
Q1 -->|No| CREATE[Create test set<br/>from prod queries]
Q1 -->|Yes| Q2{Production<br/>system?}
Q2 -->|No| DEV[Use DeepEval<br/>for development]
Q2 -->|Yes| Q3{Need<br/>observability?}
Q3 -->|Yes| OBS[Add LangSmith<br/>+ monitoring]
Q3 -->|No| EVAL[Run periodic<br/>evaluations]
CREATE --> Q2
DEV --> Q4{Ready for<br/>production?}
Q4 -->|Yes| OBS
Q4 -->|No| ITERATE[Iterate & improve]
ITERATE --> DEV
OBS --> MONITOR[Continuous<br/>monitoring]
EVAL --> MONITOR
MONITOR --> ALERT{Quality<br/>degradation?}
ALERT -->|Yes| DEBUG[Debug & fix]
ALERT -->|No| GOOD[All good!]
DEBUG --> MONITOR
style START fill:#667eea,color:#fff
style GOOD fill:#51cf66
style ALERT fill:#ffd43b
style DEBUG fill:#ff6b6b,color:#fff
✅ Best Practices
1. Create a Balanced Evaluation Suite
# Don't just test one metric
evaluation_suite = [
FaithfulnessMetric(threshold=0.7), # Anti-hallucination
AnswerRelevancyMetric(threshold=0.7), # Usefulness
ContextualRelevancyMetric(threshold=0.6), # Retrieval quality (DeepEval's class name)
LatencyMetric(max_latency=500), # Performance
CostMetric(max_cost_per_query=0.05) # Economics
]
2. Test on Real User Queries
# Bad: Synthetic test data
test_data = [
"What is X?",
"How does Y work?",
"Explain Z"
]
# Good: Real production queries
test_data = load_production_queries(
date_range="last_30_days",
sample_size=1000,
include_edge_cases=True
)
3. Version Everything
# Track versions for reproducibility
evaluation_run = {
"timestamp": "2026-02-15T10:30:00Z",
"model": "gpt-4-0125-preview",
"prompt_version": "v2.3",
"retrieval_config": {
"chunk_size": 512,
"overlap": 50,
"top_k": 5
},
"metrics_version": "deepeval==0.20.1",
"results": results
}
# Save for traceability
save_evaluation_run(evaluation_run)
4. Automate Regression Testing
# Catch degradation before deployment
def test_no_regression():
    """Ensure new version doesn't degrade performance"""
    # Baseline from production
    baseline_scores = load_baseline_scores()
    # Current version
    current_scores = run_evaluation()
    # Check each metric
    for metric in baseline_scores:
        current = current_scores[metric]
        baseline = baseline_scores[metric]
        # Allow up to 5% degradation before failing
        threshold = baseline * 0.95
        assert current >= threshold, \
            f"{metric} regressed: {current:.3f} < {baseline:.3f}"
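A variant of the same check that reports every regressed metric at once, instead of stopping at the first failed assertion, can make CI output easier to read. This is a sketch with made-up scores; it assumes higher is better for every metric, so rates like hallucination would need to be inverted first.

```python
from typing import Optional


def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.05,
) -> dict[str, tuple[float, Optional[float]]]:
    """Map each regressed metric to (baseline, current); a missing metric counts as regressed."""
    regressions = {}
    for metric, base in baseline.items():
        floor = base * (1 - tolerance)  # lowest acceptable score
        value = current.get(metric)
        if value is None or value < floor:
            regressions[metric] = (base, value)
    return regressions


baseline_scores = {"faithfulness": 0.90, "answer_relevancy": 0.85}
current_scores = {"faithfulness": 0.84, "answer_relevancy": 0.86}

regressions = find_regressions(baseline_scores, current_scores)
print(regressions)
```

In a test, `assert not regressions, regressions` then fails with the full list in the message.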
❌ Common Pitfalls
Pitfall 1: Over-relying on LLM-as-Judge
import random

# Problem: Using GPT-4 to evaluate GPT-4
evaluator_model = "gpt-4"
test_model = "gpt-4"
# Solution: Use different models or add human validation
evaluator_model = "claude-3-opus" # Different family
test_model = "gpt-4"
# And sample for human review
# (flag_for_human_review is a placeholder for your own review queue)
if random.random() < 0.1: # 10% sample
    flag_for_human_review(test_case)
Pitfall 2: Ignoring Cost in Evaluation
# Don't forget: Evaluation itself costs money!
# Bad: Evaluate every response in production
for response in all_responses:
    evaluate(response) # Doubles your API costs!
# Good: Sample + async + batching
for response in all_responses:
    if random.random() < 0.05: # 5% sample
        async_evaluate_batch(response) # Batch for efficiency
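Random sampling works, but it is not reproducible: the same request may be evaluated on one run and skipped on the next. One alternative is to sample deterministically by hashing a request ID. The sketch below assumes each request carries a stable string ID; `should_evaluate` is a hypothetical helper.

```python
import hashlib


def should_evaluate(request_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a bucket in [0, 1)
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate


request_ids = [f"req-{i}" for i in range(10_000)]
sampled = [rid for rid in request_ids if should_evaluate(rid)]
print(f"Sampled {len(sampled)} of {len(request_ids)} requests")
```

Because the decision depends only on the ID, a replayed or re-scored request lands in the same bucket every time, which keeps evaluation results comparable across runs.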
Pitfall 3: Static Thresholds
# Bad: One threshold for all scenarios
faithfulness_threshold = 0.7 # Always
# Good: Context-aware thresholds
def get_threshold(query_type):
    if query_type == "financial_data":
        return 0.95 # High stakes = high threshold
    elif query_type == "general_info":
        return 0.7 # Lower stakes = lower threshold
    elif query_type == "creative_writing":
        return 0.5 # Creativity valued over faithfulness
    return 0.7 # Default for unrecognized query types
Conclusion: The Path Forward
We’ve covered a lot of ground. Here’s your action plan:
gantt
title Your LLM Evaluation Roadmap
dateFormat X
axisFormat %s
section This Week
Install DeepEval :0, 1
Write 3 test cases :1, 2
Run first evaluation :2, 1
section This Month
Setup CI/CD evaluation :3, 3
Add prod monitoring :6, 4
Create alerts :10, 2
section This Quarter
Build custom metrics :12, 6
Establish baselines :18, 4
Create feedback loop :22, 6
📅 This Week
- Install DeepEval: pip install deepeval
- Write 3 test cases for your most critical use cases
- Run evaluation and note metric differences
📊 This Month
- Set up CI/CD to catch regressions early
- Implement monitoring with LangSmith or Braintrust
- Create alerts for quality & cost issues
🎯 This Quarter
- Build custom metrics for domain-specific needs
- Establish baselines for all critical metrics
- Feedback loop: production → evaluation
Key Takeaways
- BLEU/ROUGE are dead for LLM evaluation - Use faithfulness and informativeness
- Evaluation ≠ Observability - You need both
- DeepEval for testing, LangSmith for monitoring - Most teams use multiple tools
- Automate everything - Manual evaluation doesn’t scale
- Your metrics should match your use case - Financial systems need higher faithfulness than creative writing
What’s Next?
In upcoming posts, I’ll cover:
- Part 2: Hallucination Detection Deep Dive (SelfCheckGPT, HHEM, PsiloQA)
- Part 3: RAG Evaluation - The Complete 7-Dimension Framework
- Part 4: Custom Metrics for Domain-Specific Evaluation
- Part 5: Cost Optimization While Maintaining Quality
Resources
Code Repository
All code examples from this post: github.com/akshayuppal/llm-evaluation-guide
This post is part of my “ML in Production” series where I share practical guides for building production-ready ML systems. All ad revenue from this blog supports educational opportunities for underprivileged communities. 💙