The state of evals today
We see AI teams leverage a few common approaches to evals:
- Vibes-based: engineers and PMs remember some interesting test cases and eyeball the results
- Benchmarks and/or black-box evals: MMLU for general tasks, HellaSwag for common-sense reasoning, TruthfulQA for truthfulness, HumanEval for code generation, and many more
- Stitched together manual review: a combination of examples saved in spreadsheets, a script to run through test cases, and humans (engineers/PMs/SMEs) manually checking examples
While these methods can be useful, they all have major limitations:
- Relying on “gut feelings” or manual reviews doesn’t scale.
- General benchmarks aren’t specific enough for the application and are hard to adjust.
This makes it hard for engineering teams to assess product performance, leading to slow development and frustrating issues like:
- Making changes without knowing how they affect users.
- Constantly chasing bugs or regressions.
- Scoring responses manually, one at a time.
- Tracking experiments and examples manually (or not tracking them at all).
Automated evaluations
Automated evaluations are easy to set up and can make an immediate impact on AI development speed. In this section, we will walk through 3 great approaches: LLM evaluators, heuristics, and comparative evals.
LLM evaluators
LLMs are incredibly useful for evaluating responses out-of-the-box, even with minimal prompting. Anything you can ask a human to evaluate, you can (at least partially) encode into an LLM evaluator. Here are some examples:
- Comparing a generated output vs. an expected output – instead of having an engineer scroll through an Excel spreadsheet and manually compare generated responses vs. expected responses, you can use a factuality prompt to compare the two. Many of our customers use this type of test to detect and prevent hallucinations.
prompt: |-
  You are comparing a submitted answer to an expert answer on a given question. Here is the data:
  [BEGIN DATA]
  ************
  [Question]: {{input}}
  ************
  [Expert]: {{expected}}
  ************
  [Submission]: {{output}}
  ************
  [END DATA]
  Compare the factual content of the submitted answer with the expert answer. Ignore any differences in style, grammar, or punctuation.
  The submitted answer may either be a subset or superset of the expert answer, or it may conflict with it. Determine which case applies. Answer the question by selecting one of the following options:
  (A) The submitted answer is a subset of the expert answer and is fully consistent with it.
  (B) The submitted answer is a superset of the expert answer and is fully consistent with it.
  (C) The submitted answer contains all the same details as the expert answer.
  (D) There is a disagreement between the submitted answer and the expert answer.
  (E) The answers differ, but these differences don't matter from the perspective of factuality.
choice_scores:
  "A": 0.4
  "B": 0.6
  "C": 1
  "D": 0
  "E": 1
- Checking whether an output fully addresses a question – if you provide a task and a response, LLMs do a great job of scoring whether the response is relevant and addresses all parts of the task
The above two methods are great places to start.
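As a concrete, deliberately minimal sketch, here is one way to wire a choice-scored template like the factuality prompt above into an automated scorer using the OpenAI Python client. The model name, the `factuality_score` helper, and the abbreviated prompt string are illustrative assumptions, not part of any specific eval framework.

```python
# Minimal sketch: run a choice-scored LLM evaluator like the factuality template above.
from openai import OpenAI

client = OpenAI()

# Abbreviated version of the factuality prompt above; placeholders are filled per test case.
FACTUALITY_PROMPT = """\
You are comparing a submitted answer to an expert answer on a given question.
[Question]: {input}
[Expert]: {expected}
[Submission]: {output}
Compare the factual content of the submitted answer with the expert answer and
answer with a single letter: (A) subset, (B) superset, (C) same details,
(D) disagreement, (E) differences that don't matter for factuality.
"""

CHOICE_SCORES = {"A": 0.4, "B": 0.6, "C": 1.0, "D": 0.0, "E": 1.0}

def factuality_score(question: str, expected: str, output: str) -> float:
    """Ask the grading model for a letter choice, then map it to a numeric score."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable grading model works here
        messages=[{
            "role": "user",
            "content": FACTUALITY_PROMPT.format(
                input=question, expected=expected, output=output
            ),
        }],
        temperature=0,
    )
    text = completion.choices[0].message.content.strip().upper()
    # Pick the first character that matches a known choice; default to 0 otherwise.
    choice = next((c for c in text if c in CHOICE_SCORES), None)
    return CHOICE_SCORES.get(choice, 0.0)
```

The same pattern extends to the second example: swap in a prompt that asks whether the response addresses every part of the task.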
Heuristics
Heuristics are a valuable, objective way to score responses. In our experience, the best heuristics fall into one of two buckets:
- Functional – ensuring the output fulfills a specific functional criteria
- Examples: testing if an output is valid markdown, if generated code is executable, if the model selected a valid option from a list, or how close the output is to a reference string (e.g., Levenshtein distance)
- Subjective – using objective heuristics as a proxy for subjective factors
- Examples: checking if an output exceeds a certain number of words (conciseness), checking if an output contains the word “sorry” (usefulness/tone)
Importantly, to make heuristic scoring as valuable as possible, it should be extremely easy for engineering teams to see updated scores after every change, quickly drill down into interesting examples, and add or tweak heuristics.
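To make this concrete, here is a minimal sketch of a few heuristic scorers using only the Python standard library. The function names and the word-count threshold are illustrative assumptions, and difflib's similarity ratio stands in for a true Levenshtein distance.

```python
# Minimal sketch of heuristic scorers; names and thresholds are illustrative.
import difflib

def similarity(output: str, expected: str) -> float:
    """Functional: closeness to a reference string (difflib ratio as a
    stand-in for Levenshtein distance)."""
    return difflib.SequenceMatcher(None, output, expected).ratio()

def selected_valid_option(output: str, options: list[str]) -> float:
    """Functional: did the model pick an option that actually exists in the list?"""
    return 1.0 if output.strip() in options else 0.0

def conciseness(output: str, max_words: int = 150) -> float:
    """Subjective proxy: penalize outputs that blow past a word budget."""
    return 1.0 if len(output.split()) <= max_words else 0.0

def no_apology(output: str) -> float:
    """Subjective proxy: flag responses that contain 'sorry' (tone/usefulness)."""
    return 0.0 if "sorry" in output.lower() else 1.0
```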
Comparative evals
Comparative evals compare an updated set of responses vs. a previous iteration. This is particularly helpful in understanding whether your application is improving as you make changes. Comparative evals also do not require expected responses, so they can be a great option for very subjective tasks. Here are a few examples:
- Testing whether summarization is improving
prompt: |-
  You are comparing a submitted summary of a given text to an expert summary. Here is the data:
  [BEGIN DATA]
  ************
  [Text]: {{input}}
  ************
  A: {{expected}}
  ************
  B: {{output}}
  ************
  [END DATA]
  Compare summary A with summary B. Ignore any differences in style, grammar, or punctuation.
  Determine which summary better describes the original text.
choice_scores:
  "A": 0
  "B": 1
- Comparing cost, token usage, duration (especially when switching between models)
- Starting with a standard template like battle and tweaking the questions and scores over time to be use-case specific
prompt: |-
  You are comparing responses to the following instructions.
  [Instruction 1]
  {{instructions}}
  [Response 1]
  {{output}}
  [Instruction 2]
  {{instructions}}
  [Response 2]
  {{expected}}
  Is the first response better than the second? You must provide one answer based on your subjective view.
choice_scores:
  "Yes": 1.0
  "No": 0.0
Continuous iteration
While there is no replacement for human review, setting up basic structure around automated evals unlocks the ability for developers to start iterating quickly. The ideal AI dev loop enables teams to immediately understand performance, track experiments over time, identify and drill down into interesting examples, and codify what “good” looks like. This also makes human review time much higher leverage, since you can point reviewers to the most useful examples and continuously incorporate their scores.