Prompt / KV Caching

Prompt caching can reduce latency and cost when you send multiple requests that reuse the same prompt context.

What It Is

Large language model requests often repeat the same content: system instructions, tool definitions, examples, documents, repository context, or a long conversation history. With prompt caching, the API can reuse work from a recent request instead of processing the repeated prompt prefix from scratch.

This is sometimes called KV caching because the reused data is the model's derived key/value state for that prompt prefix. You do not need to manage this state yourself.

Availability

Prompt/KV caching is available on some models. In the Model Catalog, a model that lists a price for cached tokens generally supports it.

Cache use is not guaranteed for any individual request. A request may miss the cache because the prompt has not been seen recently, the request does not share a long enough matching prefix with a cached prompt, or the cache entry was evicted to make room for other workloads.

What It Means for You

When a cache hit happens, repeated input tokens may be billed at the cached-token price and the request may start faster. The model output is still generated normally for each request.
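The billing effect can be illustrated with a little arithmetic. The sketch below is only an illustration: the function name `estimate_input_cost` and both prices are hypothetical, and it assumes pricing is quoted per million input tokens. Check the Model Catalog for the actual prices of the model you use.

```python
def estimate_input_cost(
    total_input_tokens: int,
    cached_tokens: int,
    price_per_million: float,
    cached_price_per_million: float,
) -> float:
    """Estimate input cost when some input tokens are billed at the cached price."""
    if not 0 <= cached_tokens <= total_input_tokens:
        raise ValueError("cached_tokens must be between 0 and total_input_tokens")
    uncached_tokens = total_input_tokens - cached_tokens
    return (
        uncached_tokens * price_per_million
        + cached_tokens * cached_price_per_million
    ) / 1_000_000


# Hypothetical prices, for illustration only (not taken from any real catalog).
PRICE = 2.00         # per million uncached input tokens
CACHED_PRICE = 0.50  # per million cached input tokens

# Same 10,000-token prompt, first as a full miss, then with 9,000 tokens cached.
miss = estimate_input_cost(10_000, 0, PRICE, CACHED_PRICE)
hit = estimate_input_cost(10_000, 9_000, PRICE, CACHED_PRICE)
print(f"cache miss: {miss:.4f}, cache hit: {hit:.4f}")
# prints "cache miss: 0.0200, cache hit: 0.0065"
```

Output tokens are not part of this estimate, since output is generated normally whether or not the cache is hit.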

Caching is most useful for workloads with large, repeated context, such as:

  • Coding assistants that repeatedly send repository, file, or conversation context
  • Agents with stable system prompts and tool definitions
  • Document analysis where the same long document is queried multiple times
  • Multi-turn conversations where earlier context is reused across requests

Caching should be treated as an optimization, not as storage or application state. Your application should behave correctly whether a request hits the cache or not.
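One way to follow this advice is to read cache information for monitoring only and never branch application logic on it. The sketch below assumes a hypothetical response shape: a mapping with a `text` field and an optional `usage` mapping containing a `cached_tokens` count. These field names are assumptions for illustration, not the documented response format.

```python
from typing import Any, Mapping, Optional


def cached_token_count(usage: Optional[Mapping[str, Any]]) -> int:
    """Return the cached-token count, or 0 if the (hypothetical) field is absent."""
    if not usage:
        return 0
    value = usage.get("cached_tokens", 0)
    return value if isinstance(value, int) and value >= 0 else 0


def handle_response(response: Mapping[str, Any]) -> str:
    """Return the response text; cache data is logged but never affects the result."""
    cached = cached_token_count(response.get("usage"))
    print(f"cached input tokens: {cached}")  # monitoring only
    return response["text"]


# The same application result is produced on a hit, on a miss, and with no usage data.
hit = handle_response({"text": "Summary ready.", "usage": {"cached_tokens": 4096}})
miss = handle_response({"text": "Summary ready.", "usage": {"cached_tokens": 0}})
unknown = handle_response({"text": "Summary ready."})
assert hit == miss == unknown
```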

Improving Cache Reuse

You cannot force a cache hit, but you can make reuse more likely:

  • Put stable, repeated content first in the request.
  • Put user-specific or per-request content near the end.
  • Keep system prompts, tool definitions, examples, and long context blocks identical across related requests.
  • Send related requests within a short timeframe when possible.
  • Avoid changing timestamps, random IDs, or other volatile text inside the reusable prompt prefix.

In general, prompt/KV cache reuse is more likely when several requests that share the same long prefix arrive close together. There is no guarantee of how long a prompt prefix remains cached.
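The ordering advice above can be sketched as follows. This is a minimal illustration, not an API client: the prompt text, the function names, and the request IDs are hypothetical, and the shared prefix is measured in characters as a rough stand-in for the token-level prefix the cache would actually work with.

```python
# Stable content, kept byte-for-byte identical across related requests.
STABLE_SYSTEM_PROMPT = "You are a code review assistant. Follow the team style guide."
STABLE_REFERENCE = "Style guide: use snake_case for functions and add docstrings."


def build_prompt(user_question: str, request_id: str) -> str:
    """Stable content first, per-request content last."""
    return "\n\n".join([
        STABLE_SYSTEM_PROMPT,
        STABLE_REFERENCE,
        f"Request ID: {request_id}",
        f"Question: {user_question}",
    ])


def build_prompt_volatile_first(user_question: str, request_id: str) -> str:
    """Counter-example: a volatile ID at the top breaks the shared prefix early."""
    return "\n\n".join([
        f"Request ID: {request_id}",
        STABLE_SYSTEM_PROMPT,
        STABLE_REFERENCE,
        f"Question: {user_question}",
    ])


def shared_prefix_length(a: str, b: str) -> int:
    """Count the leading characters that two prompts have in common."""
    count = 0
    for char_a, char_b in zip(a, b):
        if char_a != char_b:
            break
        count += 1
    return count


good = shared_prefix_length(
    build_prompt("Is this loop correct?", "a1"),
    build_prompt("Why does this test fail?", "b2"),
)
bad = shared_prefix_length(
    build_prompt_volatile_first("Is this loop correct?", "a1"),
    build_prompt_volatile_first("Why does this test fail?", "b2"),
)
print(f"stable-first shared prefix: {good} chars; volatile-first: {bad} chars")
```

With stable content first, the two prompts share the entire system prompt and reference block. With the volatile ID first, they diverge almost immediately, so little of the prompt is a candidate for reuse.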

Data Storage

KV cache data is currently stored ephemerally in memory and is not written to disk. We do not store raw prompts as part of the KV cache; the cache contains derived model representations of the prompt prefix.

This behavior may evolve as we continue improving caching capabilities. We will update this page if storage behavior or cache guarantees change.