
AI Evals: How to Test an LLM App

Arpit Tripathi · June 15, 2026 · 11 min read

Testing an LLM app by vibes breaks at output 200, not output 5. The 3 eval layers, and which to build first.

An AI eval is a repeatable test that scores your LLM app's output so you can tell whether a change made it better or worse; the practice is often called LLM evaluation. A working eval suite has three layers: automated metrics for cheap broad coverage, an LLM-as-a-judge for scoring fuzzy quality at scale, and human review for the cases the first two miss. You do not need all three before your next prompt change. You need the one minimum-viable layer that catches the failure you are most afraid of shipping.

Most teams skip evals entirely and test by vibes: change the prompt, eyeball five outputs, ship. It works, until output 200 silently breaks while output 5 still looks fine. Evals replace the eyeball with a number you can compare across versions. The rest of this guide explains the three layers, which one to build first, and how to keep them honest.

What an AI eval actually is

An eval has three parts: a dataset of inputs (and ideally expected outputs or reference answers), a scorer that turns each output into a number or label, and an aggregate you track over time. Run the same dataset through two versions of your app, compare the aggregate scores, and you have evidence instead of a hunch. OpenAI frames evals as structured tests for measuring model performance despite the nondeterministic nature of AI systems, which is the core problem: the same prompt can return different text on every call, so a single spot-check proves nothing.
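The three parts described above fit in a few lines. This is a minimal sketch, not a framework: `DATASET`, `app_v1`, and `app_v2` are made-up stand-ins for a frozen dataset and two versions of a real app, and exact match is only one possible scorer.

```python
from statistics import mean
from typing import Callable

# Part 1: a frozen dataset of inputs with reference answers (hypothetical rows).
DATASET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "3 * 3", "expected": "9"},
]

def exact_match(output: str, expected: str) -> float:
    """Part 2: a scorer that turns one output into a number."""
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_eval(app: Callable[[str], str], dataset: list[dict]) -> float:
    """Part 3: an aggregate (mean score) you can track across versions."""
    return mean(exact_match(app(row["input"]), row["expected"]) for row in dataset)

# Stand-ins for two versions of the app; a real app would call a model here.
def app_v1(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "paris"}.get(prompt, "")

def app_v2(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "9"}.get(prompt, "")

print(f"v1: {run_eval(app_v1, DATASET):.2f}  v2: {run_eval(app_v2, DATASET):.2f}")
```

Because model output is nondeterministic, a real run would likely sample each input several times and average, rather than trust one call per row.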

The mistake is treating an eval as a single accuracy percentage. Real apps fail in specific ways: a RAG bot invents a citation, a support agent leaks an internal note, a summarizer drops the one number that mattered. Each failure mode wants its own scorer. The eval suite is the collection of those scorers, not one global grade.
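One way to picture "each failure mode wants its own scorer" is a suite of named checks that each report their own pass rate. The sketch below is illustrative only: the citation format `[doc1]`, the `INTERNAL:` marker, and the sample outputs are assumptions invented for the example.

```python
import re
from typing import Callable

Scorer = Callable[[str], bool]

def cites_only(known_ids: set[str]) -> Scorer:
    """Pass when every bracketed citation refers to a known source id."""
    def scorer(output: str) -> bool:
        return set(re.findall(r"\[(\w+)\]", output)) <= known_ids
    return scorer

def excludes(marker: str) -> Scorer:
    """Pass when a forbidden marker (e.g. an internal note tag) is absent."""
    return lambda output: marker not in output

def includes(required: str) -> Scorer:
    """Pass when a required string (e.g. the key number) is present."""
    return lambda output: required in output

def run_suite(outputs: list[str], suite: dict[str, Scorer]) -> dict[str, float]:
    """Return one pass rate per failure mode instead of a single global grade."""
    return {name: sum(s(o) for o in outputs) / len(outputs) for name, s in suite.items()}

suite = {
    "no_invented_citation": cites_only({"doc1", "doc2"}),
    "no_internal_leak": excludes("INTERNAL:"),
    "keeps_key_number": includes("12%"),
}
outputs = [
    "Revenue rose [doc1].",                          # drops the number
    "Revenue rose 12% [doc9].",                      # cites an unknown source
    "INTERNAL: escalate. Revenue rose 12% [doc2].",  # leaks an internal note
]
for name, rate in run_suite(outputs, suite).items():
    print(f"{name}: {rate:.2f}")
```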

The three layers, ranked by what they cost you

Layer 1: automated metrics (deterministic scorers)

Start here, because deterministic scorers are free, instant, and never disagree with themselves. These check the clear-cut stuff: does the output parse as valid JSON, does it match a schema, is it under the length limit, does it contain the required keyword, does the extracted answer equal the reference exactly. Braintrust draws the line cleanly: deterministic scorers handle format validation, length checks, schema compliance, and keyword presence, while judges handle the fuzzier criteria.

If your app has any structured contract (a tool call, a JSON response, a classification label), this layer alone catches a surprising share of regressions and runs on every commit at no cost. It is also the only layer with zero ambiguity: a malformed JSON is malformed, no judgment required, and the same output always earns the same score, so a failing eval points at the code, never at the scorer's mood. That makes Layer 1 the cheapest insurance you can buy against the regressions that are easiest to introduce and most boring to catch by hand.
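The checks named above can be written with the standard library alone. This is a rough sketch: the field names in `SCHEMA` and the sample output are assumptions, and a production suite would probably use a real schema validator rather than this minimal type check.

```python
import json

def is_valid_json(output: str) -> bool:
    """True when the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def has_required_fields(output: str, required: dict[str, type]) -> bool:
    """Minimal schema check: a JSON object whose required fields have the right types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    return all(isinstance(data.get(key), kind) for key, kind in required.items())

def within_length(output: str, max_chars: int) -> bool:
    """True when the output is at or under the character limit."""
    return len(output) <= max_chars

def contains_keyword(output: str, keyword: str) -> bool:
    """Case-insensitive keyword presence."""
    return keyword.lower() in output.lower()

# Hypothetical contract for a classification response.
SCHEMA = {"label": str, "confidence": float}
sample = '{"label": "refund", "confidence": 0.92}'

print(is_valid_json(sample), has_required_fields(sample, SCHEMA), within_length(sample, 200))
```

Every function here is pure: the same output always gets the same verdict, which is what lets this layer run on every commit.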

Layer 2: LLM-as-a-judge (scoring quality at scale)

When the thing you want to measure is fuzzy (is this answer relevant, faithful to the source, helpful, correct, free of bias), a deterministic check cannot express it. This is where you hand the output to a separate LLM with a rubric and ask it to score. LLM-as-a-judge is the default 2026 method for evaluating LLM apps at scale, grading outputs against criteria like answer relevancy, faithfulness, helpfulness, bias, and correctness.
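In code, a judge is a rubric prompt plus a parser for the reply. The sketch below assumes a `call_model` function that takes a prompt string and returns the model's text; that function, the rubric wording, and the 1 to 5 scale are all placeholders, and `fake_judge` stands in for a real API call so the example runs offline.

```python
import re
from typing import Callable, Optional

# Hypothetical rubric for a single criterion (relevancy).
RUBRIC = (
    "You are grading an answer for relevancy.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate how well the answer addresses the question on a 1-5 scale.\n"
    "Reply with a line of the form: SCORE: <integer>"
)

def judge_relevancy(
    question: str, answer: str, call_model: Callable[[str], str]
) -> Optional[int]:
    """Send the rubric to the judge model and parse a 1-5 score from its reply."""
    reply = call_model(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    # None signals an unparseable reply, which is worth counting separately.
    return int(match.group(1)) if match else None

def fake_judge(prompt: str) -> str:
    """Offline stand-in for a real model call."""
    return "The answer stays on topic.\nSCORE: 4"

print(judge_relevancy("How do I reset my password?", "Click 'Forgot password'.", fake_judge))
```

Keeping the rubric to one criterion per judge call tends to make scores easier to validate than a single prompt that grades everything at once.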

The reason this works is that judges agree with humans more than you would expect. A well-prompted GPT-4-class judge agrees with human reviewers about 85% of the time, which is higher than the roughly 81% agreement two human annotators reach with each other on the same task. The judge is not perfect, but it is at least as consistent as the humans it replaces, and it never gets tired on output 5,000.

Insight

Here is the number that ends the build-versus-buy argument: LLM judges reach roughly 80% agreement with human preferences (a separate preference-agreement measure) at about 500x to 5,000x lower cost than manual review. At 10,000 evaluations a month, that is the difference between an API bill of cents per case and a five-figure human annotation contract.

Human review of 100,000 outputs is roughly 52 days of full-time work for one reviewer (assuming about 15 seconds per output), an illustrative figure rather than a measured one; the judge does the same volume in an afternoon for the price of inference. You trade a little accuracy for two orders of magnitude in throughput, and that trade only makes sense because the accuracy you give up is small and the volume you gain is enormous.
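The 52-day figure follows directly from the stated assumption of 15 seconds per output. The 8-hour workday below is an added assumption needed to convert hours into days.

```python
OUTPUTS = 100_000
SECONDS_PER_OUTPUT = 15   # assumption stated in the text
WORKDAY_HOURS = 8         # assumption: one full-time reviewer, 8-hour days

total_hours = OUTPUTS * SECONDS_PER_OUTPUT / 3600
workdays = total_hours / WORKDAY_HOURS

# About 416.7 hours, or about 52.1 eight-hour workdays.
print(round(total_hours, 1), round(workdays, 1))
```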

Layer 3: human review (the cases the others miss)

You never fully retire human review; you concentrate it. Humans handle the cases automated scorers cannot, and, just as important, they produce the ground truth that tells you whether your judge is trustworthy. The 2026 best practice is to validate judge scores against human annotations and keep a human in the loop for edge cases, low-confidence outputs, and any criterion where a bad call is expensive.

The workflow most teams converge on: run deterministic scorers and the judge against production traffic continuously, flag traces where scorers show low confidence or disagree, route those to a human, then turn the human's findings into a new automated scorer or a refined rubric. Human time becomes the input that improves layers 1 and 2 rather than a wall you hit on every release.
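The routing step in that workflow can be sketched as a small triage function. Everything here is an assumption made for illustration: the `Trace` fields, the thresholds, and the idea that the judge exposes a confidence value (in practice it might be derived from repeated runs).

```python
from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: str
    passed_checks: bool      # Layer 1 verdict from deterministic scorers
    judge_score: float       # Layer 2 score in [0, 1]
    judge_confidence: float  # hypothetical confidence signal in [0, 1]

def needs_human(trace: Trace, pass_threshold: float = 0.7, min_confidence: float = 0.6) -> bool:
    """Route to a human when the judge is unsure or the two layers disagree."""
    judge_pass = trace.judge_score >= pass_threshold
    low_confidence = trace.judge_confidence < min_confidence
    layers_disagree = judge_pass != trace.passed_checks
    return low_confidence or layers_disagree

traces = [
    Trace("t1", True, 0.90, 0.90),   # both layers pass, confident: no review
    Trace("t2", True, 0.40, 0.90),   # checks pass, judge fails: disagreement
    Trace("t3", True, 0.80, 0.30),   # judge passes but is unsure
    Trace("t4", False, 0.20, 0.95),  # both layers fail: an automatic fail, no review
]
review_queue = [t.trace_id for t in traces if needs_human(t)]
print(review_queue)
```

Whatever a reviewer concludes about the queued traces is the raw material for the next deterministic check or rubric revision.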

Dimension | Automated metrics | LLM-as-a-judge
Best for | Format, schema, exact match, length, keywords | Relevancy, faithfulness, helpfulness, bias, correctness
Cost per case | Effectively zero | Cents (about 500x to 5,000x cheaper than humans)
Agreement with humans | Exact where defined, blind elsewhere | About 85% for a strong judge, near human-human
Main risk | Cannot see quality, only structure | Drifts or self-contradicts if unvalidated
Speed at 100k outputs | Seconds | An afternoon vs weeks of full-time human work

Which layer to build first

Pick the one failure that would embarrass you in front of a user or break a downstream system, and write the cheapest scorer that would have caught it. If that failure is structural (malformed output, missing field, wrong label), Layer 1 is your minimum-viable eval and you can stop there for now. If it is qualitative (hallucinated facts, off-topic answers, unsafe tone), you need a Layer 2 judge.

  • Failure is structural (bad JSON, schema break, wrong class label): build Layer 1 only. It runs on every commit at no cost.
  • Failure is qualitative (made-up facts, irrelevant or unsafe answers): build a Layer 2 judge with a narrow rubric for that one criterion.
  • Either way, hand-label 100 to 200 examples first. That set is your regression dataset and your judge's answer key.
  • Add the other layers only when a real failure escapes the one you built. What the eval-tool tutorials will not tell you: you do not need a metric encyclopedia, just the one scorer that catches the failure you fear most.
  • Wire whichever layer you build into CI so a prompt change cannot merge without a score.
Pro Tip

Before any prompt or model change, freeze a dataset of 100 to 200 real inputs with the outputs you consider correct. That frozen set is the difference between 'the new prompt feels better' and 'the new prompt scored 0.91 versus 0.84.' Without it, every layer above is guessing.
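A CI gate over the frozen set can be as small as a comparison against the last accepted score. This sketch assumes the scores come from your own eval run; the `tolerance` default is an arbitrary illustration, not a recommended value.

```python
def gate(candidate: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Pass when the candidate score is no more than `tolerance` below the baseline."""
    return candidate >= baseline - tolerance

def exit_code(candidate: float, baseline: float, tolerance: float = 0.01) -> int:
    """Return 0 on pass and 1 on fail, the convention CI systems use for job status."""
    ok = gate(candidate, baseline, tolerance)
    verdict = "PASS" if ok else "FAIL"
    print(f"{verdict}: candidate={candidate:.2f} baseline={baseline:.2f}")
    return 0 if ok else 1

# Using the scores from the text: the new prompt at 0.91 against 0.84 passes.
exit_code(candidate=0.91, baseline=0.84)
```

In a CI script the last line would typically become `sys.exit(exit_code(new_score, stored_baseline))`, so a regression blocks the merge.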

How to keep your judge honest

The single biggest mistake is treating judge scores as ground truth without ever checking them against humans. A judge is a model, so it inherits model failures: it can favor longer answers, prefer the first option in a pairwise comparison, or score inconsistently across runs. The fix is calibration, not faith. The most reliable calibration is direct: have humans label a held-out set and measure how often the judge agrees.

  • Have humans score 100 to 200 examples, then compare those scores to the judge and measure agreement or correlation.
  • If alignment is poor, refine the rubric or judge prompt and re-measure. Do not ship a judge you have not validated.
  • Use chain-of-thought: ask the judge to reason before it scores, a debiasing technique documented in the original LLM-as-a-judge research below.
  • In pairwise comparisons, swap the order of the two answers and average, to cancel the position bias that judges are known to show.
  • Re-validate after you change the judge model. A new model is a new judge with a new answer key.
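Measuring alignment between human labels and judge labels takes only a few lines. The sketch below computes raw agreement plus Cohen's kappa, a chance-corrected agreement statistic that the article does not name; it is added here as one reasonable option. The ten labels are invented for illustration, and a real check would use the 100 to 200 hand-scored examples.

```python
from collections import Counter

def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge gave the same label as the human."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for the amount expected by chance alone."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts, j_counts = Counter(human), Counter(judge)
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Made-up labels for ten examples.
human = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
judge = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "pass", "pass", "fail"]

print(f"agreement={agreement_rate(human, judge):.2f} kappa={cohens_kappa(human, judge):.2f}")
```

Raw agreement can look high when one label dominates the dataset, which is why a chance-corrected number is worth reporting next to it.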

The MT-Bench study that introduced LLM-as-a-judge measured both position bias and the chain-of-thought fix directly: judges tend to prefer the answer shown first, and prompting the judge to reason before scoring improves its consistency. Swapping answer order and averaging is the standard correction for the first problem, which is why it has become routine practice.
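The swap-and-average correction can be sketched as follows. It assumes a pairwise judge that returns a preference in [0, 1] for whichever answer is shown first; that interface, the `TRUE_PREF` table, and the fixed +0.2 bias in `biased_judge` are all invented so the effect is visible offline.

```python
from typing import Callable

# Assumed interface: judge(question, first, second) -> preference for `first` in [0, 1].
PairwiseJudge = Callable[[str, str, str], float]

def debiased_preference(judge: PairwiseJudge, question: str, answer_a: str, answer_b: str) -> float:
    """Score A vs B in both orders and average, so a first-position bonus cancels out."""
    a_shown_first = judge(question, answer_a, answer_b)   # preference for A
    b_shown_first = judge(question, answer_b, answer_a)   # preference for B
    return (a_shown_first + (1.0 - b_shown_first)) / 2

# Toy judge: the "true" preference for A over B is 0.6, plus a +0.2 first-position bonus.
TRUE_PREF = {("A", "B"): 0.6, ("B", "A"): 0.4}

def biased_judge(question: str, first: str, second: str) -> float:
    return min(1.0, TRUE_PREF[(first, second)] + 0.2)

print(round(biased_judge("q", "A", "B"), 2))                      # single order: inflated
print(round(debiased_preference(biased_judge, "q", "A", "B"), 2)) # both orders: bias cancels
```

The cancellation is exact only because the toy bias is a constant offset; with a real judge, averaging reduces position bias rather than guaranteeing its removal.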

For the framework itself, OpenAI's open-source evals project is a reasonable canonical starting point: a Python package and benchmark registry for running evals locally or in CI. Note its lifecycle, though. OpenAI has announced that its hosted Evals product becomes read-only on October 31, 2026 and shuts down on November 30, 2026, so treat the open-source repo and the Evals API as the durable surfaces and avoid building hard dependencies on the hosted UI.

Where memory-grounded apps need evals most

Any app that answers from a store of documents, notes, or past messages lives or dies on faithfulness: does the answer actually follow from the retrieved context, or did the model fill a gap with invention. A faithfulness eval gives the judge the retrieved context plus the answer and asks whether every claim in the answer is supported by that context. That is precisely the criterion a Layer 2 judge scores well and a Layer 1 metric cannot see, which is why retrieval-grounded products lean hardest on the judge layer. A schema check confirms the answer is shaped correctly; only a judge can ask whether the answer is true to the source it claims to cite.
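A faithfulness check along these lines might look like the sketch below. It assumes the answer has already been split into individual claims (that splitting step is not shown and is itself usually a model call), and `call_model` is a placeholder for whatever judge API you use; `stub_judge` is a hard-coded stand-in so the example runs offline.

```python
from typing import Callable

FAITHFULNESS_PROMPT = (
    "Context:\n{context}\n\n"
    "Claim:\n{claim}\n\n"
    "Is the claim fully supported by the context? "
    "Reply with exactly one word: SUPPORTED or UNSUPPORTED."
)

def faithfulness_score(context: str, claims: list[str], call_model: Callable[[str], str]) -> float:
    """Fraction of claims the judge marks as supported by the retrieved context."""
    if not claims:
        raise ValueError("need at least one claim to score")
    supported = 0
    for claim in claims:
        reply = call_model(FAITHFULNESS_PROMPT.format(context=context, claim=claim))
        # "UNSUPPORTED" does not start with "SUPPORTED", so this test is unambiguous.
        if reply.strip().upper().startswith("SUPPORTED"):
            supported += 1
    return supported / len(claims)

def stub_judge(prompt: str) -> str:
    """Offline stand-in: flags any prompt mentioning 2019 as unsupported."""
    return "UNSUPPORTED" if "2019" in prompt else "SUPPORTED"

context = "The lease was signed in March 2021 for 12 months."
claims = ["The lease was signed in March 2021.", "The lease was signed in 2019."]
print(faithfulness_score(context, claims, stub_judge))
```

A score below 1.0 means at least one claim was not grounded in the context, which for a strict product is a fail regardless of how good the rest of the answer is.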

MemX is a consumer AI memory app, an external memory layer over your own documents, photos, and notes across Android, iOS, and WhatsApp. When a system answers from your personal memory, a wrong or invented answer is not a benchmark miss, it is a misrepresentation of your own life, so faithfulness evals carry real weight. MemX is private by architecture: per-user keys, encryption at rest, and an on-device first pass, which also shapes how evaluation runs against personal data rather than a shared public set.

Frequently Asked Questions
01. What are AI evals?

AI evals are repeatable tests that score an LLM app's outputs against a fixed dataset, so you can measure whether a change improved or regressed quality. A full suite uses three layers: automated metrics, an LLM-as-a-judge, and human review.

02. How do I test an LLM app?

Freeze 100 to 200 real inputs with the outputs you consider correct, then score new versions against that set. Use deterministic checks for structure, an LLM judge for fuzzy quality, and human review for edge cases. Wire it into CI.

03. Is LLM-as-a-judge accurate enough to trust?

A well-prompted strong judge agrees with humans about 85% of the time, higher than the roughly 81% rate at which two human annotators agree with each other. It is reliable once you validate its scores against human labels, but not before.

04. How much cheaper is an LLM judge than human review?

Roughly 500x to 5,000x cheaper while reaching about 80% agreement with human preferences. A judge scores 100,000 outputs in an afternoon at cents per case; manual review of the same volume takes weeks of full-time work.

05. Do I still need human reviewers if I use an LLM judge?

Yes, but fewer and more focused. Humans handle cases automated scorers miss and produce the ground truth that calibrates your judge. The best practice keeps a human in the loop for edge cases and low-confidence outputs.

Stop testing by vibes. Pick the one failure you cannot afford to ship, build the cheapest layer that catches it, freeze a small labeled dataset, and put it in front of your next prompt change. Add the other layers when a real failure slips past the first one, and validate any judge against humans before you trust its number.


Written by Arpit Tripathi

Founder of MemX. Ex-Google Staff Tech Lead Manager, ex-AWS Senior SDE (Elastic Block Store). Writes about practical AI on the MemX blog.
