Prompt caching reduces costs and latency by reusing previously computed context. When you send requests with the same system prompt, tool definitions, or conversation history, cached tokens are charged at a discounted rate.

How It Works

  1. First request: Full prompt is processed and cached by the provider
  2. Subsequent requests: Cached prefix is reused — up to 90% cheaper and 80% faster
Caching works automatically for supported providers. No code changes required for most use cases.

Automatic Caching

For OpenAI-compatible requests, caching is automatic when your prompt exceeds 1,024 tokens. Place static content (system prompt, tool definitions) at the beginning:
from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="kyma-your-api-key"
)

response = client.chat.completions.create(
    model="gemini-2.5-flash",
    messages=[
        # Static content (cached after first request)
        {"role": "system", "content": "Your long system prompt here..."},
        # Dynamic content (never cached)
        {"role": "user", "content": "User's question"}
    ]
)

# Check cache stats
print(response.usage.cached_tokens)  # Tokens read from cache

Best Practices

Structure prompts for caching

Place stable content first, dynamic content last:
1. System instructions (static) ← CACHED
2. Tool definitions (static)    ← CACHED
3. Few-shot examples (static)   ← CACHED
4. Conversation history          ← CACHED (grows incrementally)
5. Current user message          ← NOT CACHED (changes each request)
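As a sketch, the layering above can be expressed as a small helper that assembles messages in cache-friendly order. The helper name and argument layout are hypothetical, not part of any SDK:

```python
def build_messages(system_prompt, tools_text, examples, history, user_message):
    """Assemble messages so the stable prefix stays identical across requests."""
    messages = [
        # 1-3. Static prefix: byte-identical on every request, so it caches.
        {"role": "system",
         "content": system_prompt + "\n\n" + tools_text + "\n\n" + examples},
    ]
    # 4. Conversation history: append-only, so the cached prefix grows.
    messages.extend(history)
    # 5. Current user message: changes each request, never cached.
    messages.append({"role": "user", "content": user_message})
    return messages
```

Only the tail of the list changes between requests, so everything before the final user message remains eligible for caching.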

For coding agents

Coding agents (OpenClaw, Cline, Roo Code) automatically benefit from caching because they send the same system prompt + tool definitions with every request. Typical savings for a 50-request coding session:
  • Without caching: 400K tokens × full price
  • With caching: 47K effective tokens (88% savings)

What to avoid

  • Don’t put timestamps or request IDs in the system prompt — any change to the prefix breaks the cache
  • Don’t reorder tool definitions between requests
  • Do keep the system prompt byte-identical across requests
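A minimal sketch of the first rule: keep dynamic data such as timestamps out of the system prompt and attach it to the user turn instead, so the cached prefix stays identical from request to request (illustrative code, not a Kyma API):

```python
import datetime

# Identical on every request, so it stays cached.
STATIC_SYSTEM = "You are a helpful coding assistant."

def make_request_messages(question):
    # Dynamic data (here, the current time) goes in the user turn,
    # leaving the system prompt byte-identical and the cache prefix intact.
    now = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return [
        {"role": "system", "content": STATIC_SYSTEM},
        {"role": "user", "content": f"[{now}] {question}"},
    ]
```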

Cache Stats in Response

Kyma normalizes cache statistics from all providers into a unified format:
{
  "usage": {
    "prompt_tokens": 5050,
    "completion_tokens": 200,
    "cached_tokens": 5000,
    "cache_write_tokens": 0,
    "cost": 0.000382,
    "cache_discount": 0.002430
  }
}
Field               Description
cached_tokens       Tokens read from cache (90% discounted)
cache_write_tokens  Tokens written to cache on first request
cost                Total cost charged for this request (USD)
cache_discount      Amount saved from caching (USD)
These fields appear in both streaming (final usage chunk) and non-streaming responses.
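A small sketch of how these fields can be consumed, assuming the usage object is a plain dict shaped like the JSON above (the helper function itself is hypothetical):

```python
def summarize_cache_usage(usage):
    """Summarize the normalized usage fields into hit rate, cost, and savings."""
    cached = usage.get("cached_tokens", 0)
    prompt = usage.get("prompt_tokens", 0)
    # Fraction of prompt tokens served from cache for this request.
    hit_rate = cached / prompt if prompt else 0.0
    return {
        "cache_hit_rate": round(hit_rate, 3),
        "cost_usd": usage["cost"],
        "saved_usd": usage.get("cache_discount", 0.0),
    }
```

With the example response above, this reports a cache hit rate of about 99% (5,000 of 5,050 prompt tokens read from cache).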

Supported Models

Check supports_caching in the models endpoint:
curl https://kymaapi.com/v1/models | jq '.data[] | select(.supports_caching == true) | .id'
Currently supported on models served via:
  • Groq — automatic caching for prompts >1,024 tokens
  • Google AI — automatic caching, reduced rates
  • OpenRouter — pass-through provider caching
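The same filter as the curl + jq one-liner can be done in Python. This sketch assumes the /v1/models response is shaped as the jq expression implies: a "data" array of model objects, each with an "id" and an optional "supports_caching" flag:

```python
def caching_model_ids(models_response):
    """Return the ids of models that advertise supports_caching."""
    return [m["id"] for m in models_response["data"] if m.get("supports_caching")]
```

Fetch the endpoint with any HTTP client and pass the decoded JSON to the function.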

Pricing

Cached tokens are charged at 10% of the normal input price (90% discount).
Token Type           Rate                 Example (Llama 3.3 70B)
Input (non-cached)   Full price           $0.797 / 1M tokens
Input (cached)       10% of input price   $0.0797 / 1M tokens
Output               Full price           $1.067 / 1M tokens
Example savings for a 50-request coding session:
System prompt: 5,000 tokens (stable across requests)
User messages: ~500 tokens each (dynamic)

Without caching:
  50 × 5,000 × $0.797/1M = $0.199 (input only)

With caching:
  1 × 5,000 × $0.797/1M (first request) +
  49 × 5,000 × $0.0797/1M (cached) = $0.024

Savings: $0.175 (88% reduction)
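The arithmetic above can be reproduced directly; prices are taken from the table and the 10% cached rate from the start of this section:

```python
# Llama 3.3 70B input price from the table above, in dollars per token.
INPUT_PRICE = 0.797 / 1_000_000
CACHED_PRICE = INPUT_PRICE * 0.10  # cached tokens cost 10% of the input price

requests_count = 50    # requests in the coding session
system_tokens = 5_000  # stable system prompt, cached after the first request

# Without caching: every request pays full price for the system prompt.
without = requests_count * system_tokens * INPUT_PRICE

# With caching: the first request pays full price, the other 49 read from cache.
with_cache = (system_tokens * INPUT_PRICE
              + (requests_count - 1) * system_tokens * CACHED_PRICE)

savings = without - with_cache  # about $0.175, an 88% reduction
```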
The usage.cost field in every response shows the actual amount charged, so you can track savings in real time.