AI Explained

Prompt Caching Explained: 3 Providers

Aditya Kumar JhaAditya Kumar JhaLinkedIn·June 20, 2026·12 min read

How prompt caching cuts API input costs on Anthropic, OpenAI, and Gemini, and the one-byte mistake that drops your hit rate to zero.

Prompt caching cuts the input-token cost of a repeated prompt prefix by reusing computation the model already did, with discounts ranging from about 50 percent on OpenAI to roughly 90 percent on a cached read with Anthropic. The catch that almost every vendor page buries: the cached prefix has to be byte-identical, so one changing timestamp, a per-user ID, or a stray trailing newline placed inside that prefix drops your hit rate to zero and you pay full price.

That single rule is why teams enable caching, watch their bill barely move, and conclude the feature does not work. It works. The prompt was just not stable. This guide explains the mechanism, gives a neutral three-provider comparison of Anthropic, OpenAI, and Google Gemini, and shows the exact failure mode that wastes the discount.

Insight

It works. Your prompt just was not stable. One changing byte before the cached boundary and you pay full price.

What prompt caching actually does

Prompt caching stores the model's internal computation for a fixed prefix of your prompt so the next request with the same prefix skips recomputing it. When a transformer processes input tokens, it builds intermediate state, the key-value attention representation (commonly called the KV cache), for every token. If your next call starts with the identical text, the provider can reuse that stored state instead of recomputing it, and bills those reused tokens at a steep discount.

This is prefix caching, sometimes called context caching. It is not the same as semantic caching, which returns a stored response for a similar question. Prompt caching never changes the model output. It only changes how much you pay to feed a repeated prefix into the model.

Insight

The cache key is the prefix, not the whole prompt. Everything from the start of the request up to the cached point must match exactly. Anything after it can change freely without breaking the hit.

The byte-identical prefix rule

A cache hit requires the prefix to match exactly, character for character. Anthropic states cache hits need 100 percent identical prompt segments, including all text and images up to the cached block. OpenAI says cache hits are only possible for exact prefix matches. Gemini treats cached content as a prefix to the prompt. In all three, a single differing byte before the cached boundary means a miss.

What the pricing pages will not tell you: this is where the money quietly leaks. Developers naturally drop dynamic values near the top of a system prompt: the current date, a request ID, a logged-in user's name, an A/B flag, a session token. Each of those changes on every call. Place any of them before the cached boundary and the prefix changes every call, so the cache never hits, and you pay the full input rate while believing caching is on.

Common things that silently break the cache

  • A timestamp or current-date string injected into the system prompt
  • A per-user ID, name, or account tier placed before the shared instructions
  • A trailing newline or whitespace difference introduced by a template change
  • Reordered tool definitions or a changed tool list (Anthropic invalidates the whole cache when tools change)
  • An A/B test flag or feature toggle interpolated near the top of the prompt
  • Non-deterministic JSON serialization where key order shifts between requests
Pro Tip

Structure every prompt as stable-then-dynamic. Put fixed instructions, system rules, retrieved documents, and few-shot examples first. Push everything that changes (the user message, IDs, timestamps) to the very end, after the cached boundary. OpenAI's own guidance is to keep variable content at the end of the prompt.

Anthropic: explicit cache control, deepest read discount

Anthropic uses opt-in caching: you mark which blocks to cache with a cache_control flag, and you pay a write premium the first time. Cache reads cost 0.1x the base input rate, roughly a 90 percent discount. Cache writes cost 1.25x base for the 5-minute TTL and 2x base for the 1-hour TTL. On Claude Opus 4.8 at a 5 dollar base input rate, that is a 6.25 dollar write for 5-minute TTL, a 10 dollar write for 1-hour TTL, and a 0.50 dollar read.

With the 1.25x write and 0.1x read, a single cache hit already pays back the write cost. Note the TTL: the cached entry expires after its window of inactivity, and each read refreshes the timer. If your traffic is bursty enough that gaps exceed the TTL, you pay the write premium repeatedly.

Insight

Verify the live numbers before you commit a budget. Anthropic occasionally revises cache TTL behavior, and base rates shift with model versions. Treat the figures here as the published structure, not a frozen quote.

OpenAI: automatic, no write fee, smaller discount

OpenAI caches automatically with no code changes and no write fee. The provider detects repeated prefixes server-side and applies the discount on its own. Standard cached input tokens are billed at roughly half the normal input rate, a 50 percent discount on the cached portion. Caching activates once the prompt reaches 1024 tokens or longer, and the cached prefix grows in 128-token increments.

The absence of a write fee is the defining trait: you never pay a premium to populate the cache, so there is no break-even calculation. The trade-off is a shallower discount than Anthropic's read rate. OpenAI advertises up to 90 percent lower input token cost and up to 80 percent lower latency as two separate figures, while the per-token discount on the cached segment itself is 50 percent. The latency win can matter as much as cost for chat. Reusing the cached prefix shortens time to first token on long, repeated contexts. Because the caching is automatic, the byte-identical rule still applies invisibly: you get no flag confirming a hit unless you read the cached-token count in the usage field of the response.

Pro Tip

Check the usage object on every OpenAI response. The cached_tokens count tells you whether the prefix actually hit. If it stays at zero on a prompt you expected to reuse, something dynamic is sitting inside your prefix.

Google Gemini: implicit and explicit, plus a storage fee

Gemini splits caching into two modes: implicit and explicit. Implicit caching is on by default for Gemini 2.5 and newer models and applies a discount automatically when a request reuses a cached prefix. Explicit caching is opt-in: you create a cache object with a TTL and reference it, which guarantees the discount but requires a model-specific minimum context size (a few thousand tokens, varying by model) and adds a storage charge.

The discount on cached input reads runs deep, commonly cited in the 75 percent range and reaching further on some model tiers. The distinguishing cost is storage: explicit caching bills a per-token, per-hour fee to keep the cache warm for its TTL, on top of the reduced read rate. Storage figures sit in the rough range of 1 dollar per million tokens per hour for Flash tiers up to several dollars for Pro tiers, so confirm the current number for your exact model on Google's pricing page before relying on it.

That storage fee changes the math. Unlike OpenAI, where caching is free to populate, and unlike Anthropic, where you pay once per write, Gemini explicit caching charges for every hour the entry lives. If you read the cached context only a couple of times per hour, the storage fee can eat most of the saving. High-reuse workloads that hit the same large context many times per hour are where it pays off.

Three providers, side by side

DimensionAnthropicOpenAI / Gemini
How it activatesExplicit: you mark blocks with cache_controlOpenAI automatic on prefix detection; Gemini implicit by default plus optional explicit cache objects
Cached read discount~90% off (read at 0.1x base input)OpenAI ~50% off cached input; Gemini commonly ~75% off, deeper on some tiers
Write / setup fee1.25x base for 5-min TTL, 2x base for 1-hour TTLOpenAI: none; Gemini explicit: per-token per-hour storage fee
Minimum to cacheMinimum cacheable block size applies per modelOpenAI: 1024 tokens; Gemini explicit: model-dependent minimum, roughly 2,048 to 4,096 tokens
Best fitHigh-reuse prompts where the deep read discount dominatesOpenAI for low-reuse or unpredictable traffic (no write fee); Gemini for very large contexts reused many times per hour
Prefix ruleByte-identical prefix requiredByte-identical prefix required (both)

Which model wins for your workload

The deciding variable is your read-to-write ratio: how many times you reuse a cached prefix before it changes or expires. Low-reuse workloads, where you write a prefix and read it only once or twice, favor OpenAI's no-write-fee model, because you never pay a premium that you cannot amortize. High-reuse workloads, where the same prefix is read many times, favor Anthropic's deep read discount, which keeps paying off long after the small write premium is recovered.

Gemini's storage fee adds a time dimension that the other two lack. Its explicit caching rewards keeping a very large context hot and querying it intensely within the hour. If your reuse is spread thinly across a long window, the per-hour storage charge can outweigh the discount.

  • Few reads per cached prefix: OpenAI, because there is no write fee to recover.
  • Many reads per cached prefix, modest context: Anthropic, because the read discount is deepest.
  • Huge context queried many times within an hour: Gemini explicit caching, where storage cost is diluted across many hits.
  • Unpredictable or bursty traffic: prefer automatic caching (OpenAI implicit, Gemini implicit) so you never pay a write you cannot use.

A checklist before you ship caching

  • Move every dynamic value (date, user ID, request ID, flags) to the end of the prompt, after the cached prefix.
  • Freeze tool definitions and their order; reordering tools can invalidate the whole cache.
  • Serialize JSON deterministically so key order never shifts between requests.
  • Strip stray trailing whitespace and newlines from templated prefixes.
  • Read the response usage field and confirm the cached-token count is non-zero on repeated calls.
  • Re-verify each vendor's current rates and TTL behavior before committing a cost forecast.

Where MemX fits

Prompt caching is an API-side cost lever for teams building on these models. It solves a different problem than the memory a person wants from a consumer AI app, which is recall of their own information over time, not a cheaper input bill. MemX is an external AI memory layer for your own documents, photos, and notes across Android, iOS, and WhatsApp, so an assistant can reference what you have saved instead of re-reading it from scratch each session. MemX is private by architecture, with per-user keys, encryption at rest, and an on-device first pass. If your interest in caching is really about an assistant that remembers your context, that is the layer MemX provides.

Frequently asked questions

Frequently Asked Questions
01What is prompt caching and how does it cut cost?

Prompt caching stores the model's computed state for a fixed prompt prefix so repeated requests skip recomputing it. The reused tokens are billed at a discount, from about 50 percent on OpenAI to roughly 90 percent on an Anthropic cache read. Output tokens are unaffected; only the repeated input portion gets cheaper.

02Why is my prompt cache not hitting?

Almost always because something dynamic sits inside the cached prefix. A changing timestamp, per-user ID, A/B flag, reordered tool list, or a stray trailing newline makes the prefix differ on every call. The match must be byte-identical, so move all variable content to the end of the prompt.

03Does prompt caching reduce latency?

Yes. Reusing a cached prefix means the model skips recomputing it, which shortens time to first token on long, repeated contexts. OpenAI cites up to 80 percent lower latency on cached prefixes. The exact gain depends on how much of your prompt is cached and how large that prefix is.

04Anthropic vs OpenAI vs Gemini prompt caching: which is cheapest?

It depends on reuse. OpenAI charges no write fee, which suits low-reuse or bursty traffic. Anthropic has the deepest read discount, best for high-reuse prefixes. Gemini adds a per-hour storage fee that pays off only when a large context is queried many times within the hour.

05Is prompt caching automatic or do I have to enable it?

It varies. OpenAI applies it automatically on prefixes of 1024 tokens or more. Gemini caches implicitly by default and also offers explicit caching. Anthropic requires you to mark blocks explicitly with a cache_control flag and pay a small write premium the first time.

Prompt caching is one of the cleanest cost wins in an LLM stack, but only if the prefix stays still. Get the stable-then-dynamic ordering right, confirm hits in the usage data, and pick the provider whose pricing shape matches how often you reuse a prefix. Then recheck the live rates, because every number here is a published structure that vendors revise between releases.

Read Next

Or try MemX to access 40+ AI models in one place — including Claude Sonnet 4.6 and GPT-5.4 — and get your questions answered today.

Was this article helpful?

Found this useful? Share it with someone who needs it.

Free · iOS, Android & WhatsApp

Stop losing what you save.
Let MemX remember it for you.

Every screenshot, photo, PDF and voice note — captured, encrypted, and instantly searchable. Ask in plain English, get the answer in seconds.

  • Reads text inside images and handwriting
  • Private and encrypted by default
  • Free to start, no credit card

Takes under a minute to set up. Your data stays yours.

Aditya Kumar Jha
Written by
Aditya Kumar JhaLinkedIn

Core software engineer at MemX, where he builds the website, backend, and data systems. Also a published author of six books on Amazon KDP, writing on AI, memory, and behavior.

Keep reading

More guides for AI-powered students.