Advent of ML Day 6: Measuring Success
12/6/2024
How do you know if your AI system is working? What metrics should you track? When should you use LLM-as-a-judge? Over the past few days, we've explored various components of AI systems - from tokenizers to embeddings to hybrid search. Today, we'll tackle the crucial but often overlooked topic of evaluation.
I'm thinking about these questions a lot today. I've been building a new retrieval strategy at StackOne this week, which has meant a lot of testing chunking, finding the limits of usable context windows, and building datasets to try to test everything.
The changes feel good - response times are slower but within threshold, and the results look better at first glance. But "feeling good" and "looking better" aren't metrics you can track or improve upon systematically. This is the challenge at the heart of building AI systems: how do you measure success in a meaningful way?
AI Testing
If you're coming from a software engineering background, the way people work in AI and ML might feel quite alien. It often seems like everyone is playing in notebooks rather than working on production systems. That's because the logic flow in these systems is normally not that complex - take some data from somewhere, store some indexes in a DB, and at runtime query them and send the results to an LLM.
However, it's the knowledge and insight to build the right system that is hard to come by. As everyone has been telling me for the past two years, AI demos very well but production is hard. How do you handle adversarial questions? How do you handle hallucinations? How do you handle data drift? What do you do when your user uploads an 8,500-page PDF file of handwritten notes?
Eval Strategy
1. Start with Simple Assertions
The simplest place to start is with basic string matching and assertions. Find test cases where you know exactly what the output should look like. While testing, literally add these assertions to your running code. This helps you:
- Get your prompts in the right space
- Ensure your system fails gracefully
- Build confidence in some basic functionality
For example, if you're building a system to extract dates from text, you might assert that "December 25th, 2024" is correctly identified as a date.
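Concretely, "literally add these assertions" can look like the sketch below. The `extract_dates` function here is just a stand-in (a regex rather than an LLM call) so the example runs on its own; in your system it would wrap your actual pipeline.

```python
import re
from datetime import datetime

# Stand-in for your real LLM-backed extractor; in practice this would call
# your model/pipeline. Returns ISO-formatted date strings.
def extract_dates(text: str) -> list[str]:
    pattern = (
        r"(January|February|March|April|May|June|July|August|September|"
        r"October|November|December) (\d{1,2})(?:st|nd|rd|th)?, (\d{4})"
    )
    out = []
    for month, day, year in re.findall(pattern, text):
        dt = datetime.strptime(f"{month} {day} {year}", "%B %d %Y")
        out.append(dt.date().isoformat())
    return out

# While iterating, assert on cases where the right answer is unambiguous.
dates = extract_dates("The contract was signed on December 25th, 2024.")
assert dates == ["2024-12-25"], f"Expected ['2024-12-25'], got {dates}"

# Graceful failure: junk input should not crash the pipeline.
assert extract_dates("") == [], "Empty input should return an empty list"
```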
2. Measure Retrieval Quality
Retrieval (see Day 3, Day 4 and Day 5) is normally where everything starts falling apart, and it can be the lowest-hanging fruit for improvement. I've definitely been guilty of spending time building end-to-end tests when it was actually the retrieval that was the problem. The standard metrics to track (a minimal sketch of computing them follows this list):
- Precision@K: How many of the top K retrieved documents are relevant?
- Recall@K: What fraction of relevant documents are in the top K?
- Mean Reciprocal Rank (MRR): How highly ranked is the first relevant document?
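Once you have gold relevance judgments, these metrics are only a few lines of Python each. A minimal sketch, where the document IDs and the judged-relevant set are made up for illustration:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none is found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1 / rank
    return 0.0

# Example: doc IDs returned by your retriever vs. a gold relevance judgment.
retrieved = ["doc_7", "doc_2", "doc_9", "doc_4", "doc_1"]
relevant = {"doc_2", "doc_4"}

print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 1.0
print(mrr(retrieved, relevant))                  # 0.5 (first hit at rank 2)
```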
If you're using rerankers (from Day 5), track the average reranker scores over time; a small sketch of this follows the list below. A declining trend might indicate:
- Data drift in your source documents
- Changes in user query patterns
- Issues with your embedding model
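One cheap way to watch for that trend is to log the top reranker score for each query and bucket the averages by week. The timestamps and scores below are made up for illustration; the logging itself would come from your production system.

```python
from collections import defaultdict
from datetime import datetime

# Assumes you log the top reranker score per query as (ISO timestamp, score).
logged_scores = [
    ("2024-11-04T10:12:00", 0.82),
    ("2024-11-12T09:01:00", 0.79),
    ("2024-11-20T14:45:00", 0.71),
    ("2024-12-02T16:30:00", 0.64),
]

# Bucket by ISO week and average, so a slow decline is visible at a glance.
weekly = defaultdict(list)
for ts, score in logged_scores:
    week = datetime.fromisoformat(ts).strftime("%G-W%V")
    weekly[week].append(score)

for week in sorted(weekly):
    scores = weekly[week]
    print(week, round(sum(scores) / len(scores), 3))
```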
3. LLM-as-a-Judge
LLM-as-a-judge isn't a silver bullet. Here are some guidelines:
- Avoid using models from the same family to evaluate each other (e.g., GPT-4o evaluating GPT-4o outputs). They tend to be overly nice to their family.
- Use pairwise comparisons instead of absolute scoring.
- Include clear evaluation criteria in your judge prompts
- Validate judge decisions against human evaluations
Models have no idea what "good" means; you'll have more success with concrete questions like: does this answer contain the three points from this reference answer? Include examples of good and bad answers in your judge prompts, and keep checking the judge's verdicts against human ones.
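Here's a minimal sketch of the pairwise approach: grade two candidate answers against reference points rather than an abstract notion of "good", and randomise the order to reduce position bias. The question, criteria and answers are made up, and the actual call to the judge model is left out.

```python
import random

JUDGE_TEMPLATE = """You are comparing two answers to the same question.

Question: {question}
Reference points the answer must cover: {criteria}

Answer A:
{answer_a}

Answer B:
{answer_b}

Which answer covers more of the reference points and is more faithful to them?
Reply with exactly one letter: A or B."""

def build_pairwise_prompt(question: str, criteria: str, ans_1: str, ans_2: str):
    """Build a pairwise judge prompt; randomise order to reduce position bias."""
    if random.random() < 0.5:
        a, b, mapping = ans_1, ans_2, {"A": "ans_1", "B": "ans_2"}
    else:
        a, b, mapping = ans_2, ans_1, {"A": "ans_2", "B": "ans_1"}
    prompt = JUDGE_TEMPLATE.format(
        question=question, criteria=criteria, answer_a=a, answer_b=b
    )
    return prompt, mapping  # send `prompt` to a judge model from a different family

prompt, mapping = build_pairwise_prompt(
    question="What is our refund window?",
    criteria="30 days; excludes digital goods; refund to original payment method",
    ans_1="Refunds are accepted within 30 days, except for digital goods.",
    ans_2="You can get a refund whenever you like.",
)
print(prompt)
print(mapping)  # map the judge's "A"/"B" verdict back to the original answers
```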
Building Your Dataset
You need three things to start evaluating (one possible record shape is sketched after this list):
- Representative questions/queries
- Gold standard answers or relevance judgments
- A process for collecting more data
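Concretely, each eval case can be as small as one record tying a query to its relevance judgments and a reference answer. The field names below are just one possible shape, not a required schema.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One row of an evaluation dataset (field names are illustrative)."""
    query: str                   # representative user question
    relevant_doc_ids: list[str]  # relevance judgments for retrieval metrics
    reference_answer: str        # gold standard answer for the judge
    source: str = "sme"          # where it came from: "sme", "user", "synthetic"
    tags: list[str] = field(default_factory=list)

case = EvalCase(
    query="What is the notice period for contractors?",
    relevant_doc_ids=["hr_policy_12", "contracts_faq_3"],
    reference_answer="Contractors have a 2-week notice period unless stated otherwise.",
    tags=["hr", "contracts"],
)
```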
You should be able to get the questions from users, your boss, or a previous product. The answers can be harder to come by.
Subject Matter Expert (SME) Approach
- Find the person who knows your domain best, or buckle up, because this is going to be you.
- Send them one question per day via email or Slack and try to get an answer back.
- Often the hardest part of this process is that you end up having to become the SME yourself.
Synthetic Data Generation
- Use larger models to generate test cases based on some examples and domain knowledge (a sketch follows this list)
- Validate a subset with humans
- Use for stress testing and edge cases
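A minimal sketch of the generation side: prompt a larger model to write questions and reference answers from a document chunk, then have a human spot-check a subset. The prompt wording and the leave-policy excerpt are made up, and the actual call to the larger model is left out.

```python
SYNTH_PROMPT = """You are generating evaluation data for a retrieval system.

Given the document excerpt below, write {n} questions a real user might ask
that can only be answered from this excerpt, plus a short reference answer
for each. Reply as a JSON list of {{"question": ..., "reference_answer": ...}}.

Excerpt:
{excerpt}"""

def build_synthetic_prompt(excerpt: str, n: int = 3) -> str:
    """Prompt for a larger model to generate question/answer pairs from a chunk."""
    return SYNTH_PROMPT.format(n=n, excerpt=excerpt)

prompt = build_synthetic_prompt(
    "Employees accrue 25 days of annual leave per year, pro-rated for part-time staff."
)
print(prompt)

# Parse the model's JSON reply (e.g. json.loads) into eval cases, then have a
# human validate a subset before trusting them for stress testing.
```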
Creating a Data Flywheel
The real power comes from creating a continuous improvement cycle:
- Collect user interactions and feedback
- Label and validate examples
- Use validated examples to:
  - Improve retrieval
  - Train better judges
  - Generate synthetic data
- Feed improvements back into production
- Repeat
This flywheel effect not only improves your current system but also opens up possibilities for fine-tuning models and more sophisticated strategies in future.
Practical Tips
- Start Small: Begin with a core set of test cases that represent your most important use cases.
- Log Everything: You can't improve what you don't measure. Log (a minimal logging sketch follows this list):
  - User queries
  - Retrieved documents
  - Generated responses
  - User feedback (explicit and implicit)
- Build for Iteration: Your first evaluation system won't be perfect. Design it so you can easily:
  - Add new test cases
  - Modify evaluation criteria
  - Update gold standard answers
- Make Evaluation Easy: Remove friction from the evaluation process:
  - Build simple tools for annotators
  - Automate what you can
  - Make results easily accessible
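On the logging point: even a flat JSONL file is enough to get the flywheel started. A minimal sketch, where the file path and field names are illustrative rather than any particular tool's format:

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("interactions.jsonl")  # illustrative; use your real log sink

def log_interaction(query: str, retrieved_ids: list[str], response: str,
                    feedback: str | None = None) -> None:
    """Append one interaction as a JSON line so it can be labelled later."""
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved_ids": retrieved_ids,
        "response": response,
        "feedback": feedback,  # explicit (thumbs up/down) or implicit (e.g. copied answer)
    }
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(record) + "\n")

log_interaction(
    query="What is the refund window?",
    retrieved_ids=["faq_12", "policy_3"],
    response="Refunds are accepted within 30 days of purchase.",
    feedback="thumbs_up",
)
```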
Remember, the goal isn't to achieve perfect scores on your metrics - it's to build a system that reliably helps your users. Sometimes a simple system with clear limitations is better than a complex one that fails in unpredictable ways.
Happy day 6!
Matt