
LLM Evaluation with Evidently.ai

Overview

Evidently.ai provides tools for evaluating LLM outputs using descriptors (row-level scores or labels) and reports (aggregate metrics). It also supports LLM-as-a-Judge evaluation and automated prompt optimization.

Quick Reference

Component        Purpose
Dataset          Wrapper for evaluation data
Descriptor       Row-level score or label
Report           Aggregate metrics
TextEvals        Text quality metrics
LLMJudge         LLM-based evaluation
PromptOptimizer  Automated prompt tuning

Basic Setup

import pandas as pd
from evidently import Dataset, DataDefinition
from evidently.descriptors import TextLength, Sentiment, WordCount

# Sample data
data = [
    {"question": "What is Python?", "answer": "Python is a programming language."},
    {"question": "Explain AI.", "answer": "AI is artificial intelligence."},
]

df = pd.DataFrame(data)

# Define data structure
definition = DataDefinition(text_columns=["question", "answer"])

# Create Evidently Dataset
eval_dataset = Dataset.from_pandas(df, data_definition=definition)

Text Descriptors

Basic Metrics

from evidently.descriptors import TextLength, WordCount, Sentiment

# Add descriptors
eval_dataset.add_descriptors(descriptors=[
    TextLength(column="answer"),
    WordCount(column="answer"),
    Sentiment(column="answer")
])

# View results
eval_dataset.as_dataframe()

Available Descriptors

Descriptor     Description
TextLength     Character count
WordCount      Word count
Sentiment      Sentiment score (-1 to 1)
RegexMatch     Regex pattern matching
Contains       Substring presence
IsValidJSON    JSON validity check
IsValidPython  Python syntax check
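
A sketch of the validation-oriented descriptors from the table, continuing the Basic Setup example. The substring and pattern parameter names (items, pattern) are assumptions and may differ between Evidently versions:

from evidently.descriptors import RegexMatch, Contains, IsValidJSON

eval_dataset.add_descriptors(descriptors=[
    Contains(column="answer", items=["Python"]),   # substring presence; `items` is an assumed parameter name
    RegexMatch(column="answer", pattern=r"\.$"),   # answer ends with a period; `pattern` is an assumed parameter name
    IsValidJSON(column="answer"),                  # flags answers that do not parse as JSON
])

eval_dataset.as_dataframe()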

LLM-as-a-Judge

Binary Classification

import os
from evidently.descriptors import LLMJudge
from evidently.llm import OpenAIProvider

OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

# Configure Ollama as provider
provider = OpenAIProvider(
    base_url=f"{OLLAMA_HOST}/v1",
    api_key="ollama",
    model=MODEL
)

# Create judge
judge = LLMJudge(
    provider=provider,
    template="Is this answer helpful? Answer YES or NO.\n\nQuestion: {question}\nAnswer: {answer}",
    include_reasoning=True
)

eval_dataset.add_descriptors(descriptors=[judge])

Multi-Class Classification

from evidently.descriptors import LLMJudge

judge = LLMJudge(
    provider=provider,
    template="""Classify this query into one category: BOOKING, CANCELLATION, GENERAL.

Query: {query}

Category:""",
    options=["BOOKING", "CANCELLATION", "GENERAL"],
    include_reasoning=True
)
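
The {query} placeholder must match a text column in the dataset the judge is attached to. A minimal sketch of a matching dataset (the example queries are illustrative):

import pandas as pd
from evidently import Dataset, DataDefinition

queries = pd.DataFrame([
    {"query": "I want to cancel my flight."},
    {"query": "Can I book a table for two tonight?"},
])

query_dataset = Dataset.from_pandas(
    queries,
    data_definition=DataDefinition(text_columns=["query"]),
)

# Attach the multi-class judge and inspect the assigned categories
query_dataset.add_descriptors(descriptors=[judge])
query_dataset.as_dataframe()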

Quality Scoring

from evidently.descriptors import LLMJudge

quality_judge = LLMJudge(
    provider=provider,
    template="""Rate this code review on a scale of 1-5.

Code Review: {review}

Score (1-5):""",
    score_range=(1, 5)
)
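
Usage follows the same pattern as other descriptors. The dataset must contain the template's {review} column; review_dataset and the result column name "quality" are illustrative and depend on your data and judge configuration:

# review_dataset: an Evidently Dataset with a "review" text column
review_dataset.add_descriptors(descriptors=[quality_judge])

scores = review_dataset.as_dataframe()
print("Average review score:", scores["quality"].mean())  # "quality" is an assumed column name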

Prompt Optimization

Setup Optimizer

import os
from evidently.llm import PromptOptimizer, OpenAIProvider

OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"

provider = OpenAIProvider(
    base_url=f"{OLLAMA_HOST}/v1",
    api_key="ollama",
    model=MODEL
)

optimizer = PromptOptimizer(
    provider=provider,
    max_iterations=10
)

Binary Classification Optimization

# Initial prompt template
initial_prompt = """Classify if this code review is good or bad.

Review: {review}

Answer (GOOD or BAD):"""

# Define judge for evaluation
judge = LLMJudge(
    provider=provider,
    template=initial_prompt,
    options=["GOOD", "BAD"]
)

# Run optimization (eval_dataset must contain the template column "review"
# and a ground-truth "label" column)
best_prompt = optimizer.optimize(
    dataset=eval_dataset,
    initial_template=initial_prompt,
    target_column="label",  # Ground truth column
    judge=judge
)

print("Best prompt found:")
print(best_prompt)

Multi-Class Optimization

initial_prompt = """Classify this query.

Query: {query}

Category (BOOKING/CANCELLATION/GENERAL):"""

judge = LLMJudge(
    provider=provider,
    template=initial_prompt,
    options=["BOOKING", "CANCELLATION", "GENERAL"]
)

# "dataset" must contain the template column "query" and a ground-truth "category" column
best_prompt = optimizer.optimize(
    dataset=dataset,
    initial_template=initial_prompt,
    target_column="category",
    judge=judge
)

Reports

Generate Report

from evidently import Report
from evidently.metrics import TextDescriptorsDriftMetric

report = Report(metrics=[
    TextDescriptorsDriftMetric(column="answer")
])

report.run(reference_data=reference_dataset, current_data=current_dataset)
report.show()
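
Here reference_dataset and current_dataset are two Evidently Datasets built the same way as in Basic Setup, e.g. a baseline batch and a newer batch of outputs (reference_df and current_df are illustrative DataFrames):

reference_dataset = Dataset.from_pandas(
    reference_df,  # e.g. answers from the previous model or prompt version
    data_definition=definition,
)
current_dataset = Dataset.from_pandas(
    current_df,    # e.g. answers from the current model or prompt version
    data_definition=definition,
)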

Save Report

report.save_html("evaluation_report.html")
report.save_json("evaluation_report.json")

Common Patterns

Evaluate RAG Quality

from evidently.descriptors import LLMJudge, TextLength, Contains

# Relevance judge
relevance_judge = LLMJudge(
    provider=provider,
    template="""Is this answer relevant to the question?

Question: {question}
Answer: {answer}

Answer YES or NO:"""
)

# Factuality judge
factuality_judge = LLMJudge(
    provider=provider,
    template="""Is this answer factually accurate based on the context?

Context: {context}
Answer: {answer}

Answer YES or NO:"""
)

# The dataset needs question, answer, and context text columns
eval_dataset.add_descriptors([
    relevance_judge,
    factuality_judge,
    TextLength(column="answer")
])

Compare Models

# Evaluate model A (run_inference is a placeholder for your own generation
# step that returns an Evidently Dataset over the model's outputs)
model_a_dataset = run_inference(model_a, test_data)
model_a_dataset.add_descriptors([quality_judge])

# Evaluate model B
model_b_dataset = run_inference(model_b, test_data)
model_b_dataset.add_descriptors([quality_judge])

# Compare ("quality" is the result column produced by quality_judge;
# the exact column name depends on the judge configuration)
print("Model A average score:", model_a_dataset.as_dataframe()["quality"].mean())
print("Model B average score:", model_b_dataset.as_dataframe()["quality"].mean())

Troubleshooting

Slow Evaluation

Symptom: Evaluation takes too long

Fix:

  • Reduce dataset size for initial testing (see the sketch after this list)
  • Use smaller/faster judge model
  • Batch requests where possible
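
For the first point, sample the DataFrame before wrapping it; the sample size is arbitrary:

# Evaluate a small random subset first, then scale up once the pipeline works
sample_df = df.sample(n=min(len(df), 50), random_state=42)
sample_dataset = Dataset.from_pandas(sample_df, data_definition=definition)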

Inconsistent Judgments

Symptom: LLM judge gives inconsistent scores

Fix:

  • Lower temperature (0.0-0.3)
  • Make prompt more specific
  • Add examples to the prompt (see the sketch after this list)
  • Use structured output options
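
One way to add examples is to extend the judge template with a few labeled cases before the item under evaluation (the example cases below are illustrative):

few_shot_template = """Is this answer helpful? Answer YES or NO.

Example 1:
Question: What is Docker?
Answer: Docker is a platform for packaging applications into containers.
Verdict: YES

Example 2:
Question: What is Docker?
Answer: I don't know.
Verdict: NO

Question: {question}
Answer: {answer}
Verdict:"""

judge = LLMJudge(
    provider=provider,
    template=few_shot_template,
    options=["YES", "NO"],
)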

Optimization Not Improving

Symptom: Prompt optimization stuck

Fix:

  • Increase max_iterations (see the sketch after this list)
  • Try different initial prompts
  • Check ground truth labels are correct
  • Use more training examples
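
Raising the iteration budget is the simplest knob; the value below is arbitrary and increases runtime and cost:

optimizer = PromptOptimizer(
    provider=provider,
    max_iterations=30,  # try more candidate prompts per run
)

best_prompt = optimizer.optimize(
    dataset=eval_dataset,
    initial_template=initial_prompt,
    target_column="label",
    judge=judge,
)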

When to Use This Skill

Use when:

  • Measuring LLM output quality
  • Comparing different prompts
  • Automating prompt engineering
  • Building evaluation pipelines
  • Monitoring LLM performance over time

Evaluating Thinking Models

For thinking models (e.g., Qwen3-Thinking), evaluate both the reasoning (thinking) quality and the final response quality:

thinking_quality_judge = LLMJudge(
    provider=provider,
    template="""Evaluate the quality of reasoning in this response.

Question: {question}
Response: {response}

Score the THINKING quality (1-5):
1 = No reasoning shown
2 = Minimal reasoning
3 = Some step-by-step thinking
4 = Good reasoning with self-questioning
5 = Excellent thorough reasoning

Score:""",
    score_range=(1, 5)
)
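
A complementary judge for the final response quality, following the same pattern (the rubric wording below is an assumption):

response_quality_judge = LLMJudge(
    provider=provider,
    template="""Evaluate the quality of the final answer, ignoring the reasoning steps.

Question: {question}
Response: {response}

Score the ANSWER quality (1-5):
1 = Wrong or missing answer
3 = Partially correct or incomplete
5 = Correct, complete, and clearly stated

Score:""",
    score_range=(1, 5)
)

# The dataset needs question and response text columns
eval_dataset.add_descriptors(descriptors=[thinking_quality_judge, response_quality_judge])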

Cross-References

  • bazzite-ai-jupyter:langchain - LangChain for LLM calls
  • bazzite-ai-jupyter:rag - RAG evaluation patterns
  • bazzite-ai-jupyter:sft - Training thinking models
  • bazzite-ai-jupyter:inference - Thinking model parsing
  • bazzite-ai-ollama:openai - Ollama OpenAI compatibility