AI Benchmarks Explained: MMLU, GPQA

Every AI model launch ships with a wall of percentages: MMLU, GPQA Diamond, SWE-bench Verified, Arena Elo. Each one measures a different thing, and only some of them still tell you anything useful. The short version, before the details: the most-quoted benchmark on every launch slide, MMLU, can no longer tell the best model from the second best. The numbers that still separate frontier systems are the harder, newer ones built specifically because the old ones got too easy. This guide decodes each score so you can read a launch headline the way a skeptic would.

Here is the map. MMLU tests broad factual knowledge across 57 academic subjects. GPQA Diamond tests graduate-level science reasoning. SWE-bench Verified tests whether a model can fix real software bugs on its own. Arena Elo tests which answer humans prefer in a blind side-by-side vote. Knowing which of these still differentiates the top of the field, and which has flatlined, is the difference between being impressed by a number and understanding it.

The 30-second decoder table

Use this as the cheat sheet; the sections below explain the why.

Benchmark	What it measures	Still differentiates top models?
MMLU	Broad factual knowledge, 57 subjects, multiple choice	No. Top models cluster near the ceiling
GPQA Diamond	Graduate science reasoning, expert-written, Google-proof	Partly. Scores spread wide, but climbing fast
SWE-bench Verified	Fixing real GitHub bugs autonomously with passing tests	Yes, but read with care. Frontier now in the low-to-mid 90s, and contamination is a real concern
Arena Elo	Human preference in blind head-to-head votes	Yes, but it measures preference, not correctness

MMLU: broad knowledge, now saturated

MMLU stands for Massive Multitask Language Understanding. It is a set of roughly 15,900 multiple-choice questions spanning 57 subjects, from abstract algebra and college medicine to international law and moral scenarios. Each question gives four answer choices, so a model that guesses blindly scores about 25 percent. It became the default knowledge benchmark because its breadth makes it hard to game with narrow specialization.

Here is the catch that launch slides skip. The strongest models now score in the high 80s and low 90s on MMLU, which means the benchmark is saturated. When several models all land within a point or two of each other near the top, the score stops separating them. A 91 versus a 92 on MMLU is noise, not a meaningful capability gap. So when you see a model boasting a record MMLU, the honest translation is that it joined a crowd, not that it pulled ahead of one.

Insight

Rule of thumb: if a benchmark's top scores all sit within a few points of each other near 90 percent, that benchmark has stopped doing its job. It cannot tell the best model from the second best.
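
That rule of thumb can be written down as a quick check. This is a minimal sketch, not a standard test: the function name and both thresholds (`near_ceiling`, `max_spread`) are assumptions chosen for illustration, and the scores in the example are invented.

```python
def looks_saturated(top_scores, near_ceiling=88.0, max_spread=3.0):
    """Return True when the leading scores are bunched up near the top.

    top_scores: accuracy percentages of the leading models.
    near_ceiling and max_spread are illustrative thresholds, not standards.
    """
    if len(top_scores) < 2:
        return False  # a single score cannot show bunching
    spread = max(top_scores) - min(top_scores)
    return min(top_scores) >= near_ceiling and spread <= max_spread


# Hypothetical leaderboard slices, invented for illustration only.
print(looks_saturated([91.0, 92.0, 90.5]))  # leaders bunched near the ceiling
print(looks_saturated([62.0, 78.0, 91.0]))  # leaders still spread out
```

Tighten or loosen the thresholds to taste; the point is to look at the spread among the leaders, not at the single highest number.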

Why MMLU-Pro exists

MMLU-Pro is the direct answer to that saturation. It rebuilds the test with harder, more reasoning-heavy questions and expands the answer set from four choices to ten. That single change drops the blind-guessing baseline from 25 percent to 10 percent and triples the number of wrong options (distractors) from three to nine, so a model has to actually reason rather than eliminate obviously wrong options. The result is a wider spread between models: gaps that looked like two points on MMLU stretch to nine points or more on MMLU-Pro, which is exactly what a useful benchmark should do.
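
The arithmetic behind the guessing baseline is short enough to check yourself. A minimal sketch, assuming uniform random guessing and exactly one correct option per question; the function names are made up for this example.

```python
def guess_baseline(num_choices):
    """Expected accuracy (percent) of uniform random guessing."""
    return 100.0 / num_choices


def distractors(num_choices):
    """Wrong options per question, assuming exactly one correct answer."""
    return num_choices - 1


for name, k in [("MMLU", 4), ("MMLU-Pro", 10)]:
    print(f"{name}: {guess_baseline(k):.0f}% by guessing, {distractors(k)} distractors")
```

Four choices give a 25 percent floor with three distractors; ten choices give a 10 percent floor with nine.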

GPQA Diamond: graduate science that resists Google

GPQA stands for Graduate-Level Google-Proof Q&A. It is a set of 448 multiple-choice questions in biology, physics, and chemistry, written by domain experts. Diamond is the hardest curated subset: 198 questions selected for maximum difficulty and quality. The defining feature is in the name. These questions are designed so that a smart non-expert cannot look up the answer.

The human baselines make the difficulty concrete. Validators with PhDs in the matching field reach around 65 percent accuracy, or about 74 percent once you discount mistakes they later flagged themselves. Skilled non-experts do far worse: given over 30 minutes and unrestricted web access, they reach only about 34 percent. That gap is the whole point. Web search does not rescue you, so a high model score signals genuine reasoning rather than retrieval.

This is also what makes GPQA a useful sanity check on the saturation question. When you wonder whether a score is impressive, anchor it to the human numbers rather than to 100 percent. A model at 65 percent on GPQA Diamond is matching the PhD-level experts who validated the questions; a model in the 90s has cleared a bar that domain PhDs miss about a third of the time. That framing, score against human baseline rather than against a perfect 100, is the single most reliable way to judge whether any percentage is good.
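
Here is one way to make that anchoring mechanical. It is a rough sketch under stated assumptions: the expert and non-expert figures are the ones quoted above, the 25 percent random-guess floor assumes four answer choices per question (not stated above), and the band labels and function name are invented for this example.

```python
# Reference points in percent accuracy.
RANDOM_GUESS = 25.0  # assumption: four answer choices per question
NON_EXPERT = 34.0    # skilled non-experts with web access (quoted above)
EXPERT = 65.0        # PhD validators in the matching field (quoted above)


def anchor_score(model_score):
    """Describe a GPQA Diamond score relative to the human baselines.

    Returns (band, points_vs_expert), where points_vs_expert is the
    model score minus the expert baseline.
    """
    if model_score <= RANDOM_GUESS:
        band = "at or below random guessing"
    elif model_score < NON_EXPERT:
        band = "below skilled non-experts"
    elif model_score < EXPERT:
        band = "between non-experts and experts"
    else:
        band = "at or above the expert baseline"
    return band, model_score - EXPERT


# Hypothetical scores, for illustration only.
for score in (30.0, 50.0, 65.0, 92.0):
    band, delta = anchor_score(score)
    print(f"{score:.0f}%: {band} ({delta:+.0f} points vs experts)")
```

Reading a score as "27 points above the expert baseline" says more than reading it as "8 points short of perfect".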

GPQA Diamond is more informative than MMLU right now because model scores still range widely across it, even as the leaders push into the 90s. But the same saturation story is starting here too. When a frontier model clears 90 percent on a test where human experts sit at 65, the benchmark is approaching the end of its useful life, and the field will need the next harder thing.

SWE-bench Verified: can it actually fix the bug?

SWE-bench Verified measures something more concrete than knowledge: whether a model can fix a real software bug on its own. The model is handed an actual GitHub issue from a popular open-source Python project and must produce a code patch that resolves the issue without breaking the existing test suite. Running the tests decides pass or fail, not a judge's opinion. The Verified subset contains 500 tasks, hand-screened by 93 professional developers to remove broken or ambiguous problems from the original SWE-bench.
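
The pass-or-fail rule described above can be sketched in a few lines. This is an illustration of the idea, not the benchmark's actual harness: the function names, the dictionary format, and the test names in the example are all hypothetical.

```python
def is_resolved(target_tests, regression_tests):
    """Decide pass or fail for one task from test outcomes alone.

    target_tests: {test_name: passed} for tests that exercise the reported bug.
    regression_tests: {test_name: passed} for tests that passed before the patch.
    """
    if not target_tests:
        return False  # nothing demonstrates that the issue was fixed
    return all(target_tests.values()) and all(regression_tests.values())


def resolve_rate(task_results):
    """Percentage of tasks resolved, given a list of booleans."""
    return 100.0 * sum(task_results) / len(task_results)


# Hypothetical outcomes for two tasks.
fixed_cleanly = is_resolved({"test_issue_case": True}, {"test_existing": True})
fixed_but_broke = is_resolved({"test_issue_case": True}, {"test_existing": False})
print(fixed_cleanly, fixed_but_broke)
print(resolve_rate([fixed_cleanly, fixed_but_broke]))
```

A patch that fixes the reported bug but breaks an existing test counts as a failure, which is why the headline percentage is stricter than "the model wrote plausible code".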

This is currently the most decision-relevant benchmark for anyone choosing a coding model, because it tests an end-to-end agentic task instead of a single answer. As of mid-2026, frontier models sit in the low-to-mid 90s on it, clearing most of the issues in the set. But the top number deserves real caution. Researchers have flagged that a meaningful share of the underlying test cases have flaws, and that some of these GitHub issues likely leaked into training data, which inflates scores. The same models that look near-perfect here drop sharply on contamination-resistant variants, so read SWE-bench Verified as evidence of strong coding ability, not as proof that four-fifths or nine-tenths of your real bugs would be fixed.

There is a second reason to read the headline number carefully: even the benchmark's own publisher has stepped back from it. OpenAI wrote up why it stopped relying on SWE-bench Verified. When a score climbs into the 90s, the interesting question shifts from who scored highest to whether the test still measures real-world bug fixing at all. The honest read is that SWE-bench Verified is good for ranking the field today but should be paired with harder, contamination-resistant successors before you trust any single percentage.

Pro Tip

If you only check one benchmark before picking a coding assistant, check SWE-bench Verified, not MMLU. Fixing a bug is closer to your real workload than answering trivia. Just weight the gaps between models more than the absolute number.

Arena Elo: what humans actually prefer

Arena Elo, from Chatbot Arena (now LMArena), is the odd one out: it has no fixed question set and no answer key. Users type a prompt, get two anonymous model responses, and pick the better one. Identities are revealed only after the vote, which strips out brand bias. Those millions of blind votes feed an Elo-style rating, computed with a Bradley-Terry model, that estimates how often one model beats another head to head.
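
To see how blind votes can turn into ratings, here is a toy Bradley-Terry fit. Treat it as a sketch of the general technique, not the leaderboard's actual pipeline: it ignores ties, uses a simple iterative update, assumes every model has at least one win and one loss, and the vote counts, the `anchor` value, and the function names are invented for this example.

```python
import math


def fit_bradley_terry(wins, iterations=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[a][b] is how many times model a beat model b.
    Uses the classic iterative (minorization-maximization) update.
    Assumes every model has at least one win and one loss.
    """
    models = list(wins)
    strength = {m: 1.0 for m in models}
    for _ in range(iterations):
        updated = {}
        for a in models:
            total_wins = sum(wins[a].get(b, 0) for b in models if b != a)
            denom = sum(
                (wins[a].get(b, 0) + wins[b].get(a, 0)) / (strength[a] + strength[b])
                for b in models
                if b != a
            )
            updated[a] = total_wins / denom
        norm = sum(updated.values())
        strength = {m: s / norm for m, s in updated.items()}
    return strength


def to_elo_scale(strength, anchor=1000.0):
    """Map strengths to an Elo-like scale (400 points per factor of 10)."""
    mean_log = sum(math.log10(s) for s in strength.values()) / len(strength)
    return {m: anchor + 400.0 * (math.log10(s) - mean_log) for m, s in strength.items()}


# Hypothetical votes: model A beat model B in 3 of 4 blind matchups.
votes = {"A": {"B": 3}, "B": {"A": 1}}
strengths = fit_bradley_terry(votes)
print(strengths)
print(to_elo_scale(strengths))
```

Under the Bradley-Terry model, the chance that one model beats another is its strength divided by the sum of both strengths, so the fitted numbers only carry meaning as gaps between models.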

Read the gaps, not the raw numbers. A 50-point Elo gap means the higher-rated model wins about 57 percent of matchups; a 100-point gap means roughly 64 percent. So a 20 or 30 point lead that looks decisive on a leaderboard is closer to a coin flip than the ranking suggests, which is why a single-point Elo crown rarely means much in practice.
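
You can check those win rates yourself. The sketch below assumes the standard Elo expected-score formula with its conventional 400-point scale; the text above does not name the formula, so treat that as an assumption, and the function name is made up for this example.

```python
def win_probability(elo_gap):
    """Expected win rate of the higher-rated model for a given rating gap.

    Assumes the standard Elo formula with a 400-point scale.
    """
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))


for gap in (1, 20, 30, 50, 100):
    print(f"{gap:>3}-point gap -> {win_probability(gap):.1%} win rate")
```

The 50 and 100 point rows come out at roughly 57 and 64 percent, matching the figures above, while a 20 or 30 point lead lands only a few points above an even split.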

The crucial caveat is what Elo cannot see. It captures preference, not correctness. A model can win votes by being friendlier, more confident, or better formatted while being factually wrong, because a voter skimming two replies rewards the one that reads well, not the one that is right. That style-versus-substance blind spot is exactly why Arena Elo has to be cross-referenced rather than replaced: it is the only one of these four that measures how a model lands with a real human, but it needs a correctness-based benchmark beside it to confirm the preferred answer is also the true one. Treat a high Elo as proof that people enjoy the model, and a high GPQA or SWE-bench as proof that it is actually right.

The honest caveat: no single number wins

Each benchmark has a blind spot, so cross-reference them. MMLU and its peers can be partly contaminated when test questions leak into training data. SWE-bench rewards a specific agentic coding setup and a single language, Python, and carries the contamination problem above. Arena Elo rewards style. None of them captures latency, cost, tool use, long-context reliability, or how a model behaves on your specific data. The pattern across the field is a treadmill: a benchmark gets saturated, a harder one replaces it, and the cycle repeats.

Humanity's Last Exam is the current frontier of that treadmill. It is 2,500 expert-vetted, Google-proof questions across more than a hundred subjects, built explicitly because models cleared 90 percent on MMLU. Top models still score under 55 percent on it as of mid-2026, and they tend to be overconfident when wrong, which is a reminder that a benchmark crossing into the 90s is the signal to start watching the next one.

How to read a launch headline in 30 seconds

Ignore MMLU as a differentiator. If two models both score in the high 80s or 90s, that line tells you nothing about which is better.
Weight GPQA Diamond for science and reasoning, but check whether the leaders have bunched up near the top yet, and judge the score against the 65 percent human-expert baseline rather than against 100.
Trust SWE-bench Verified for relative coding ability, but discount the absolute score for test-set leakage and read the gaps between models.
Treat Arena Elo as a preference signal, not a correctness signal. Read the point gaps, not the raw rating.
Be suspicious of any benchmark missing from a launch. Vendors show the charts where they win.
Match the benchmark to your task. A research assistant, a coding agent, and a chat product each care about different rows.

Benchmarks measure models. They don't measure your memory.

One thing no public benchmark scores is how well a model knows you: your documents, your past conversations, the photo you saved last month, the note you wrote last year. That gap is what an external memory layer fills. MemX is a consumer AI memory app that sits over your own files, photos, notes, and chats across Android, iOS, and WhatsApp, so the assistant you use can recall your context instead of starting cold every session.

MemX is private by architecture: each user's data is isolated under per-user keys, encrypted at rest, with an on-device first pass before anything leaves your phone. Benchmarks like MMLU and GPQA tell you how smart a model is in the abstract. A memory layer determines how useful that intelligence is in your actual life, which no leaderboard captures.

Frequently Asked Questions

01. What does MMLU measure in AI?

MMLU, or Massive Multitask Language Understanding, measures broad factual knowledge across 57 academic subjects using multiple-choice questions with four answer choices. It was the default knowledge benchmark for years, but top models now cluster near the ceiling, so it no longer separates the best systems.

02. What is the difference between MMLU and GPQA?

MMLU tests broad knowledge across 57 subjects and is largely saturated. GPQA Diamond tests graduate-level reasoning in biology, physics, and chemistry with Google-proof questions that experts answer correctly only about 65 percent of the time. GPQA is harder and still spreads model scores more meaningfully.

03. What does SWE-bench Verified actually test?

SWE-bench Verified tests whether a model can autonomously fix a real GitHub bug. The model gets an actual issue from an open-source Python project and must write a patch that resolves it while the existing tests keep passing. It has 500 human-screened tasks, and as of mid-2026 frontier models sit in the low-to-mid 90s, though that top number is partly inflated by test-set contamination.

04. Is a higher Arena Elo score better?

A higher Arena Elo means humans preferred that model's answers in blind head-to-head votes, not that it is more accurate. A 100-point gap means the higher model wins about 64 percent of matchups. It measures preference and style, so pair it with correctness-based benchmarks.

05. Why do AI benchmarks keep getting replaced?

Benchmarks get replaced because models saturate them. Once top systems all score near the ceiling, the test stops differentiating them. MMLU led to MMLU-Pro and harder tests like GPQA Diamond and Humanity's Last Exam, where frontier models still score under 55 percent, leaving room to measure progress.
