LLM Evaluation with Evidently.ai¶
Overview¶
Evidently.ai provides tools for evaluating LLM outputs using descriptors (row-level metrics) and reports. It supports automated prompt optimization and LLM-as-a-Judge patterns for quality assessment.
Quick Reference¶
| Component | Purpose |
|---|---|
| Dataset | Wrapper for evaluation data |
| Descriptor | Row-level score or label |
| Report | Aggregate metrics |
| TextEvals | Text quality metrics |
| LLMJudge | LLM-based evaluation |
| PromptOptimizer | Automated prompt tuning |
Basic Setup¶
import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import TextLength, Sentiment, WordCount
# Sample data
data = [
{"question": "What is Python?", "answer": "Python is a programming language."},
{"question": "Explain AI.", "answer": "AI is artificial intelligence."},
]
df = pd.DataFrame(data)
# Define data structure
definition = DataDefinition(text_columns=["question", "answer"])
# Create Evidently Dataset
eval_dataset = Dataset.from_pandas(df, data_definition=definition)
Text Descriptors¶
Basic Metrics¶
from evidently.descriptors import TextLength, WordCount, Sentiment
# Add descriptors
eval_dataset.add_descriptors(descriptors=[
TextLength(column="answer"),
WordCount(column="answer"),
Sentiment(column="answer")
])
# View results
eval_dataset.as_dataframe()
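Descriptor values are appended as extra columns in the returned DataFrame, so plain pandas works for quick summaries. A minimal sketch; the exact column names depend on the descriptor aliases, so inspect them first:
# Summarize descriptor outputs with pandas
results = eval_dataset.as_dataframe()
print(results.columns.tolist())          # check which columns the descriptors produced
print(results.describe(include="all"))   # quick overview of lengths, word counts, sentiment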
Available Descriptors¶
| Descriptor | Description |
|---|---|
| TextLength | Character count |
| WordCount | Word count |
| Sentiment | Sentiment score (-1 to 1) |
| RegexMatch | Regex pattern matching |
| Contains | Substring presence |
| IsValidJSON | JSON validity check |
| IsValidPython | Python syntax check |
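The validation descriptors plug in the same way as the basic metrics above. A sketch following the column= pattern used earlier on this page; the items= keyword for Contains is an assumption, so check the signature in your Evidently version:
from evidently.descriptors import Contains, IsValidJSON
eval_dataset.add_descriptors(descriptors=[
    Contains(column="answer", items=["Python"]),  # flag answers mentioning "Python" (items= is assumed)
    IsValidJSON(column="answer")                  # check whether the answer parses as JSON
])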
LLM-as-a-Judge¶
Binary Classification¶
import os
from evidently.descriptors import LLMJudge
from evidently.llm import OpenAIProvider
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"
# Configure Ollama as provider
provider = OpenAIProvider(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama",
model=MODEL
)
# Create judge
judge = LLMJudge(
provider=provider,
template="Is this answer helpful? Answer YES or NO.\n\nQuestion: {question}\nAnswer: {answer}",
include_reasoning=True
)
eval_dataset.add_descriptors(descriptors=[judge])
Multi-Class Classification¶
from evidently.descriptors import LLMJudge
judge = LLMJudge(
provider=provider,
template="""Classify this query into one category: BOOKING, CANCELLATION, GENERAL.
Query: {query}
Category:""",
options=["BOOKING", "CANCELLATION", "GENERAL"],
include_reasoning=True
)
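The {query} placeholder means this judge expects a dataset with a query column, plus a ground-truth category column if you later want to optimize against it. A hypothetical query_dataset for that purpose:
# Hypothetical dataset with "query" text and ground-truth "category" labels
query_df = pd.DataFrame([
    {"query": "I want to book a flight to Paris.", "category": "BOOKING"},
    {"query": "Please cancel my reservation.", "category": "CANCELLATION"},
    {"query": "What are your opening hours?", "category": "GENERAL"},
])
query_dataset = Dataset.from_pandas(
    query_df,
    data_definition=DataDefinition(text_columns=["query"])
)
query_dataset.add_descriptors(descriptors=[judge])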
Quality Scoring¶
from evidently.descriptors import LLMJudge
quality_judge = LLMJudge(
provider=provider,
template="""Rate this code review on a scale of 1-5.
Code Review: {review}
Score (1-5):""",
score_range=(1, 5)
)
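The {review} placeholder means this judge needs a dataset with a review column; adding a ground-truth label column lets the optimization example below reuse the same data. A hypothetical review_dataset:
# Hypothetical dataset of code reviews with ground-truth quality labels
review_df = pd.DataFrame([
    {"review": "Clear summary, covers edge cases, suggests missing tests.", "label": "GOOD"},
    {"review": "lgtm", "label": "BAD"},
])
review_dataset = Dataset.from_pandas(
    review_df,
    data_definition=DataDefinition(text_columns=["review"])
)
review_dataset.add_descriptors(descriptors=[quality_judge])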
Prompt Optimization¶
Setup Optimizer¶
import os
from evidently.llm import PromptOptimizer, OpenAIProvider
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"
provider = OpenAIProvider(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama",
model=MODEL
)
optimizer = PromptOptimizer(
provider=provider,
max_iterations=10
)
Binary Classification Optimization¶
# Initial prompt template
initial_prompt = """Classify if this code review is good or bad.
Review: {review}
Answer (GOOD or BAD):"""
# Define judge for evaluation
judge = LLMJudge(
provider=provider,
template=initial_prompt,
options=["GOOD", "BAD"]
)
# Run optimization
best_prompt = optimizer.optimize(
dataset=review_dataset,  # dataset with "review" text and a ground-truth "label" column (see the Quality Scoring sketch above)
initial_template=initial_prompt,
target_column="label", # Ground truth column
judge=judge
)
print("Best prompt found:")
print(best_prompt)
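To confirm the improvement, plug the optimized template back into a judge and re-score the same data; a sketch, assuming best_prompt is a plain template string:
# Re-evaluate with the optimized prompt
optimized_judge = LLMJudge(
    provider=provider,
    template=best_prompt,
    options=["GOOD", "BAD"]
)
review_dataset.add_descriptors(descriptors=[optimized_judge])
print(review_dataset.as_dataframe().head())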
Multi-Class Optimization¶
initial_prompt = """Classify this query.
Query: {query}
Category (BOOKING/CANCELLATION/GENERAL):"""
judge = LLMJudge(
provider=provider,
template=initial_prompt,
options=["BOOKING", "CANCELLATION", "GENERAL"]
)
best_prompt = optimizer.optimize(
dataset=query_dataset,  # dataset with "query" text and a ground-truth "category" column (see the sketch above)
initial_template=initial_prompt,
target_column="category",
judge=judge
)
Reports¶
Generate Report¶
from evidently import Report
from evidently.metrics import TextDescriptorsDriftMetric
report = Report(metrics=[
TextDescriptorsDriftMetric(column_name="answer")
])
report.run(reference_data=reference_dataset, current_data=current_dataset)
report.show()
Save Report¶
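A sketch using the Report export helpers; method names follow the classic Evidently Report API, so adjust to the installed version:
# Persist the report for sharing or CI artifacts
report.save_html("llm_eval_report.html")
report.save_json("llm_eval_report.json")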
Common Patterns¶
Evaluate RAG Quality¶
from evidently.descriptors import LLMJudge, TextLength, Contains
# Relevance judge
relevance_judge = LLMJudge(
provider=provider,
template="""Is this answer relevant to the question?
Question: {question}
Answer: {answer}
Answer YES or NO:"""
)
# Factuality judge
factuality_judge = LLMJudge(
provider=provider,
template="""Is this answer factually accurate based on the context?
Context: {context}
Answer: {answer}
Answer YES or NO:"""
)
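The templates above reference {question}, {answer}, and {context}, so the evaluated dataset needs all three columns; the basic eval_dataset built earlier has no context column. A hypothetical RAG dataset with the required columns:
# Hypothetical RAG data: question, retrieved context, and generated answer
rag_df = pd.DataFrame([{
    "question": "What is Python?",
    "context": "Python is a high-level programming language created by Guido van Rossum.",
    "answer": "Python is a programming language."
}])
rag_dataset = Dataset.from_pandas(
    rag_df,
    data_definition=DataDefinition(text_columns=["question", "context", "answer"])
)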
rag_dataset.add_descriptors([
relevance_judge,
factuality_judge,
TextLength(column="answer")
])
Compare Models¶
# Evaluate model A (run_inference is a placeholder for your own helper that generates answers and returns an Evidently Dataset)
model_a_dataset = run_inference(model_a, test_data)
model_a_dataset.add_descriptors([quality_judge])
# Evaluate model B
model_b_dataset = run_inference(model_b, test_data)
model_b_dataset.add_descriptors([quality_judge])
# Compare ("quality" stands in for whatever column the judge's scores land in)
print("Model A average score:", model_a_dataset.as_dataframe()["quality"].mean())
print("Model B average score:", model_b_dataset.as_dataframe()["quality"].mean())
Troubleshooting¶
Slow Evaluation¶
Symptom: Evaluation takes too long
Fix:
- Reduce the dataset size for initial testing (see the sampling sketch after this list)
- Use a smaller or faster judge model
- Batch requests where possible
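For the first fix, a quick way to shrink the run is to score a random sample before committing to the full dataset; a sketch reusing df, definition, and judge from earlier on this page:
# Evaluate a small random sample first
sample_dataset = Dataset.from_pandas(
    df.sample(n=min(50, len(df)), random_state=42),
    data_definition=definition
)
sample_dataset.add_descriptors(descriptors=[judge])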
Inconsistent Judgments¶
Symptom: LLM judge gives inconsistent scores
Fix:
- Lower the temperature (0.0-0.3)
- Make the prompt more specific
- Add few-shot examples to the prompt (see the sketch after this list)
- Constrain outputs with an explicit options list
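Few-shot examples live directly in the template string; a sketch that extends the binary helpfulness judge with two worked examples:
few_shot_judge = LLMJudge(
    provider=provider,
    template="""Is this answer helpful? Answer YES or NO.
Example 1 - Question: What is Docker? Answer: Docker is a container platform. -> YES
Example 2 - Question: What is Docker? Answer: I don't know. -> NO
Question: {question}
Answer: {answer}""",
    options=["YES", "NO"]
)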
Optimization Not Improving¶
Symptom: Prompt optimization gets stuck or stops improving
Fix:
- Increase max_iterations
- Try different initial prompts
- Check ground truth labels are correct
- Use more training examples
When to Use This Skill¶
Use when:
- Measuring LLM output quality
- Comparing different prompts
- Automating prompt engineering
- Building evaluation pipelines
- Monitoring LLM performance over time
Evaluating Thinking Models¶
For thinking models (e.g., Qwen3-Thinking), evaluate both the quality of the reasoning (thinking) and the quality of the final response:
thinking_quality_judge = LLMJudge(
provider=provider,
template="""Evaluate the quality of reasoning in this response.
Question: {question}
Response: {response}
Score the THINKING quality (1-5):
1 = No reasoning shown
2 = Minimal reasoning
3 = Some step-by-step thinking
4 = Good reasoning with self-questioning
5 = Excellent thorough reasoning
Score:""",
score_range=(1, 5)
)
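The response-quality side follows the same pattern; a sketch of a second judge to attach alongside the thinking-quality judge (the dataset holding {question} and {response} columns is assumed):
# Judge the final answer, separately from the reasoning steps
response_quality_judge = LLMJudge(
    provider=provider,
    template="""Rate the quality of the final answer in this response, ignoring the reasoning steps.
Question: {question}
Response: {response}
Score (1-5):""",
    score_range=(1, 5)
)
# Attach both judges to a dataset with "question" and "response" columns, e.g.:
# thinking_dataset.add_descriptors(descriptors=[thinking_quality_judge, response_quality_judge])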
Cross-References¶
- bazzite-ai-jupyter:langchain - LangChain for LLM calls
- bazzite-ai-jupyter:rag - RAG evaluation patterns
- bazzite-ai-jupyter:sft - Training thinking models
- bazzite-ai-jupyter:inference - Thinking model parsing
- bazzite-ai-ollama:openai - Ollama OpenAI compatibility