Building a No-Vibe LLM Evaluation System: A Practical How-To Guide
Introduction
If you've ever relied on an LLM evaluation process that amounts to a vibe check, scoring outputs with vague metrics and subjective human judgment, you know the frustration: hallucinations slip through, and decisions aren't reproducible. I built a lightweight evaluation layer in pure Python that replaces that guesswork with a structured approach. It scores attribution, specificity, and relevance separately, so unsupported claims can be flagged before they reach production. This guide walks you through building your own version, step by step.

What You Need
- Basic knowledge of Python (functions, lists, dictionaries)
- Python 3.8+ installed on your machine
- A few sample LLM outputs (text strings) and their expected source documents or knowledge base
- Optional: a simple vector database or list of facts for attribution checks
- No external libraries required—pure Python only
Step-by-Step Instructions
Step 1: Define Your Evaluation Criteria
Before writing code, clarify what each metric means in your context:
- Attribution: Does the output cite or rely on provided sources? (Yes/No or a score)
- Specificity: Does the output contain concrete details (numbers, names, dates) rather than vague generalities?
- Relevance: Does the output directly answer the query or stay on topic?
Write these definitions down as clear rules. For example: “An output is attributed if at least 70% of its claims can be traced to a known source.”
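The example rule above can be written down as a small function, which makes the definition testable. This is a minimal sketch: the name `is_attributed` and the `min_fraction` parameter are assumptions for illustration, and the input is assumed to be one boolean per claim.

```python
def is_attributed(claim_flags, min_fraction=0.7):
    """Return True if at least `min_fraction` of claims are traceable.

    claim_flags: list of booleans, one per claim (True = traced to a source).
    An output with no claims is treated as not attributed.
    """
    if not claim_flags:
        return False
    traced = sum(claim_flags)
    return traced / len(claim_flags) >= min_fraction


# 3 of 4 claims traced -> 0.75, which clears the 0.7 bar
print(is_attributed([True, True, True, False]))
# 2 of 4 claims traced -> 0.5, which does not
print(is_attributed([True, True, False, False]))
```

A fractional rule like this is more lenient than requiring every claim to pass, so decide which of the two you want before wiring the checks together.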
Step 2: Build a Function to Parse LLM Output
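Create a Python function that breaks the LLM text into individual claims. This version treats each sentence as one claim and drops fragments of 10 characters or fewer.
import re

def parse_claims(text):
    """Split text into sentence-level claims, dropping very short fragments."""
    # Split on whitespace that follows sentence-ending punctuation
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Ignore fragments of 10 characters or fewer (e.g. "Yes.")
    return [s for s in sentences if len(s) > 10]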
This gives you a list of claim strings to evaluate independently.
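To see what the splitter produces, here is a short usage sketch. The function is repeated so the example runs on its own, and the sample text is made up for illustration.

```python
import re


def parse_claims(text):
    # Same splitter as above: break after ., ! or ? and drop short fragments
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if len(s) > 10]


sample = "The Eiffel Tower is 330 metres tall. Yes. It opened in 1889! Was it popular?"
claims = parse_claims(sample)
for claim in claims:
    print(claim)
# "Yes." is dropped because it is 10 characters or fewer
```

Note that a regex splitter like this also breaks after abbreviations such as "e.g." when they are followed by a space, so expect some odd fragments on real text.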
Step 3: Implement the Attribution Check
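Attribution checks that each claim is backed by a source. You'll need a reference set of facts (e.g., a dictionary of {fact: source_id}). The function below checks whether any known fact appears in the claim as a case-insensitive substring; fuzzy matching can be swapped in later.
def check_attribution(claim, knowledge_base):
    """Return (True, source_id) if a known fact appears in the claim, else (False, None)."""
    claim_lower = claim.lower()
    for fact, source in knowledge_base.items():
        # Case-insensitive substring match against each known fact
        if fact.lower() in claim_lower:
            return True, source
    return False, None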
Return a boolean and the source ID. For a more robust system, use TF-IDF or an embedding model, but pure Python works for a prototype.
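If exact substring matching turns out to be too brittle, one way to stay in pure Python is fuzzy matching with the standard library's `difflib`. The sketch below is an assumption, not part of the original system: `check_attribution_fuzzy` and its `min_ratio` cutoff of 0.8 are illustrative names and values.

```python
from difflib import SequenceMatcher


def check_attribution_fuzzy(claim, knowledge_base, min_ratio=0.8):
    """Return (matched, source_id) for the best fuzzy match at or above min_ratio."""
    claim_norm = claim.lower().strip()
    best_ratio, best_source = 0.0, None
    for fact, source in knowledge_base.items():
        # ratio() is a similarity score between 0.0 and 1.0
        ratio = SequenceMatcher(None, fact.lower(), claim_norm).ratio()
        if ratio > best_ratio:
            best_ratio, best_source = ratio, source
    if best_ratio >= min_ratio:
        return True, best_source
    return False, None


# Hypothetical knowledge base with a single fact
kb = {"The Eiffel Tower is 330 metres tall": "doc-1"}
# Different spelling and trailing punctuation still match
print(check_attribution_fuzzy("The Eiffel Tower is 330 meters tall.", kb))
print(check_attribution_fuzzy("Paris has many cafes and bakeries.", kb))
```

Because the ratio compares whole strings, a claim that is much longer than the fact it contains will score low. This variant suits claims and facts of similar length.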
Step 4: Implement the Specificity Check
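Specificity measures detail. Count numbers and capitalized words that aren't at the start of the sentence, which serve as a rough proxy for proper nouns and named entities. Create a scoring function:
import re

def check_specificity(claim, threshold=2):
    """Return True if the claim contains more than `threshold` concrete details."""
    digits = len(re.findall(r'\d+', claim))
    # Drop the first word so a sentence-initial capital is not counted
    tail = claim.split(maxsplit=1)[1:]
    proper_nouns = len(re.findall(r'\b[A-Z][a-z]+\b', ' '.join(tail)))
    return (digits + proper_nouns) > threshold  # default threshold is arbitrary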
Adjust the threshold based on your domain.

Step 5: Implement the Relevance Check
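Relevance compares the output to the user's query. The simplest approach is keyword overlap: the fraction of query words that also appear in the output. Character n-grams are an alternative.
import re

def check_relevance(output, query, threshold=0.3):
    # Tokenize on word characters so trailing punctuation does not block matches
    query_words = set(re.findall(r'\w+', query.lower()))
    output_words = set(re.findall(r'\w+', output.lower()))
    if not query_words:
        return False  # an empty query cannot be matched (avoids division by zero)
    overlap = len(query_words & output_words) / len(query_words)
    return overlap > threshold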
Again, tune the threshold.
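For the character n-gram option mentioned above, a minimal sketch might look like this. The names `char_ngrams` and `check_relevance_ngrams` and the defaults `n=3` and `threshold=0.3` are assumptions for illustration.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in lower-cased, whitespace-normalised text."""
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + n] for i in range(len(normalised) - n + 1)}


def check_relevance_ngrams(output, query, n=3, threshold=0.3):
    """True if enough of the query's character n-grams also appear in the output."""
    query_grams = char_ngrams(query, n)
    if not query_grams:
        return False  # query shorter than n characters: nothing to compare
    output_grams = char_ngrams(output, n)
    overlap = len(query_grams & output_grams) / len(query_grams)
    return overlap > threshold


query = "eiffel tower height"
print(check_relevance_ngrams("The Eiffel Tower's height is 330 metres.", query))
print(check_relevance_ngrams("Bananas are yellow.", query))
```

Compared with whole-word overlap, n-grams tolerate small differences such as plurals, possessives, and minor typos, at the cost of occasional accidental matches between unrelated words.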
Step 6: Combine into a Decision Layer
Create a single function that takes output, query, and knowledge base, then returns a pass/fail decision and a report.
def evaluate_output(output, query, knowledge_base):
    """Evaluate each claim in the output and return a pass/fail decision with a report."""
    claims = parse_claims(output)
    results = []
    for claim in claims:
        attributed, _source = check_attribution(claim, knowledge_base)
        results.append({
            'claim': claim,
            'attribution': attributed,
            'specificity': check_specificity(claim),
            'relevance': check_relevance(claim, query),
        })
    # Decision: pass only if there is at least one claim and every claim meets all criteria
    passed = bool(results) and all(
        r['attribution'] and r['specificity'] and r['relevance'] for r in results
    )
    return {'passed': passed, 'details': results}
This is the core layer that replaces “vibes” with reproducible decisions.
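A pass/fail flag alone is hard to act on, so it can help to summarize which criteria each failing claim missed. The helper below is a hypothetical addition (`failed_reasons` is not part of the original layer), and the sample report is hand-written in the same shape that evaluate_output returns.

```python
def failed_reasons(report):
    """Map each failing claim to the list of criteria it did not meet."""
    criteria = ("attribution", "specificity", "relevance")
    failures = {}
    for entry in report["details"]:
        missed = [name for name in criteria if not entry[name]]
        if missed:
            failures[entry["claim"]] = missed
    return failures


# Hand-written sample report for illustration
report = {
    "passed": False,
    "details": [
        {"claim": "The tower is 330 metres tall.",
         "attribution": True, "specificity": True, "relevance": True},
        {"claim": "It is widely admired.",
         "attribution": False, "specificity": False, "relevance": True},
    ],
}
print(failed_reasons(report))
# {'It is widely admired.': ['attribution', 'specificity']}
```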
Step 7: Test and Iterate
Run your function on a set of known-good and known-bad examples. Adjust thresholds and criteria until false positives/negatives are minimized. Log every failure to improve your knowledge base and rules. Over time, you can add more sophisticated checks (e.g., contradiction detection) while keeping the same three-pillar architecture.
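One way to track false positives and false negatives is to run the evaluator over labelled examples and count the four possible outcomes. This is a sketch under assumptions: `confusion_counts` is a hypothetical helper, and `toy_evaluator` is a stand-in used only so the example runs by itself. In practice you would pass something like a function that calls evaluate_output and returns its 'passed' value.

```python
def confusion_counts(evaluator, labelled_examples):
    """Count outcomes of `evaluator` against human labels.

    evaluator: callable taking an output string and returning True (pass) or False.
    labelled_examples: list of (output_text, should_pass) pairs.
    """
    counts = {"true_pass": 0, "false_pass": 0, "true_fail": 0, "false_fail": 0}
    for text, should_pass in labelled_examples:
        passed = evaluator(text)
        if passed and should_pass:
            counts["true_pass"] += 1
        elif passed and not should_pass:
            counts["false_pass"] += 1  # false positive: bad output let through
        elif not passed and should_pass:
            counts["false_fail"] += 1  # false negative: good output rejected
        else:
            counts["true_fail"] += 1
    return counts


# Stand-in evaluator for illustration only: passes any text containing a digit
def toy_evaluator(text):
    return any(ch.isdigit() for ch in text)


examples = [
    ("The tower is 330 metres tall.", True),
    ("The tower is very tall.", False),
    ("The tower opened long ago in Paris.", True),
    ("The tower is 9000 metres tall.", False),
]
print(confusion_counts(toy_evaluator, examples))
# {'true_pass': 1, 'false_pass': 1, 'true_fail': 1, 'false_fail': 1}
```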
Tips for Success
- Start small: Don't try to cover every edge case. Build for one use case first, then expand.
- Keep it modular: Each check should be independently testable. You can replace simple functions with ML models later.
- Use threshold tuning: Run a grid search over your test set to find the best cutoff values for specificity and relevance.
- Document your criteria: Write down why you chose each threshold—this makes the system reproducible and debuggable.
- Combine with human review: For high-stakes applications, use this layer as a first filter, then pass borderline cases to a human.
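The threshold-tuning tip can be sketched as a one-dimensional grid search. The example assumes you have already computed a raw score per example (such as the relevance overlap) alongside a human pass/fail label; `grid_search_threshold` and the sample numbers are hypothetical.

```python
def grid_search_threshold(scores, labels, candidates):
    """Return (best_threshold, best_accuracy) for the rule `score > threshold`.

    Ties are resolved in favour of the earliest candidate in the list.
    """
    best_threshold, best_accuracy = None, -1.0
    for threshold in candidates:
        predictions = [score > threshold for score in scores]
        correct = sum(p == label for p, label in zip(predictions, labels))
        accuracy = correct / len(labels)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy


# Hypothetical overlap scores and human pass/fail labels
scores = [0.1, 0.2, 0.45, 0.6, 0.8]
labels = [False, False, True, True, True]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5]
print(grid_search_threshold(scores, labels, candidates))
# (0.2, 1.0)
```

Thresholds tuned this way fit the examples they were tuned on, so check the chosen value against a separate held-out set before relying on it.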