Prompt caching reduces costs and latency by reusing previously computed context. When you send requests with the same system prompt, tool definitions, or conversation history, cached tokens are charged at a discounted rate.
See real-time cache stats across all models on the Rankings page.
How It Works
- First request: Full prompt is processed and cached by the provider
- Subsequent requests: Cached prefix is reused — up to 90% cheaper and 80% faster
Caching works automatically for supported providers. No code changes required for most use cases.
Real-World Impact
Based on production data from the Kyma community (7-day rolling window):
- 22%+ overall cache hit rate across all models
- deepseek-v3 leads with a 56% cache hit rate, driven by heavy agentic usage
- gemini-2.5-flash and gemma-4-31b consistently hit 20-35%
- Coding agents (OpenClaw, Cline, Roo Code) see the highest cache rates due to repeated system prompts
Check the live numbers at kymaapi.com/rankings?tab=cache.
Automatic Caching
For OpenAI-compatible requests, caching is automatic when your prompt exceeds 1,024 tokens. Place static content (system prompt, tool definitions) at the beginning:
from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="ky-your-api-key",
)

response = client.chat.completions.create(
    model="deepseek-v3",
    messages=[
        # Static content (cached after first request)
        {"role": "system", "content": "Your long system prompt here..."},
        # Dynamic content (never cached)
        {"role": "user", "content": "User's question"},
    ],
)

# Check cache stats in response
print(f"Cached: {response.usage.cached_tokens}")
print(f"Cost: ${response.usage.cost}")
print(f"Saved: ${response.usage.cache_discount}")
Best Practices
Structure prompts for caching
Place stable content first, dynamic content last:
1. System instructions (static) ← CACHED
2. Tool definitions (static) ← CACHED
3. Few-shot examples (static) ← CACHED
4. Conversation history ← CACHED (grows incrementally)
5. Current user message ← NOT CACHED (changes each request)
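The ordering above can be sketched as a small helper that assembles the `messages` list with the stable parts first. The name `build_messages` and its parameters are hypothetical and not part of the Kyma API or the OpenAI SDK. Tool definitions are assumed to be passed separately through the request's `tools` parameter, in the same order every time.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(
    system_prompt: str,
    few_shot: List[Message],
    history: List[Message],
    user_message: str,
) -> List[Message]:
    """Assemble messages with stable content first and dynamic content last."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    messages.extend(few_shot)  # static examples, identical on every request
    messages.extend(history)   # only grows by appending, so the prefix stays stable
    messages.append({"role": "user", "content": user_message})  # changes each request
    return messages
```

Because history is only ever appended to, each request shares its full prefix with the previous one, which is what lets the cached portion grow incrementally.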
For coding agents
Coding agents (OpenClaw, Cline, Roo Code, Claude Code) automatically benefit from caching because they send the same system prompt + tool definitions with every request.
Real production example — 50-request coding session with deepseek-v3:
| | Without caching | With caching (56% hit rate) |
|---|---|---|
| Effective input tokens | 250,000 | 110,000 |
| Input cost | $0.203 | $0.049 |
| Savings | - | $0.154 (76%) |
What to avoid
- Don’t put timestamps or request IDs in system prompts; any change to the prefix breaks the cache
- Don’t reorder tool definitions between requests
- Don’t edit the system prompt between requests; keep it identical
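One way to catch these mistakes during development is to fingerprint the static prefix locally and compare it across requests. This is a minimal sketch: `prefix_fingerprint` is a hypothetical helper, and the hash is only a local consistency check, not the key any provider actually uses for its cache.

```python
import hashlib
import json
from typing import Any, Dict, List


def prefix_fingerprint(system_prompt: str, tools: List[Dict[str, Any]]) -> str:
    """Hash the static prefix exactly as ordered; any change alters the digest."""
    # No sort_keys here: reordered tools or fields should show up as a change.
    payload = json.dumps({"system": system_prompt, "tools": tools}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

If the fingerprint differs between two consecutive requests in the same session, something in the system prompt or tool list changed, and the cached prefix is unlikely to be reused.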
Cache Stats in Response
Kyma normalizes cache statistics from all providers into a unified format:
{
  "usage": {
    "prompt_tokens": 5050,
    "completion_tokens": 200,
    "cached_tokens": 5000,
    "cache_write_tokens": 0,
    "cost": 0.000382,
    "cache_discount": 0.002430
  }
}
| Field | Description |
|---|---|
| cached_tokens | Tokens read from cache (90% discounted) |
| cache_write_tokens | Tokens written to cache on first request |
| cost | Total cost charged for this request (USD) |
| cache_discount | Amount saved from caching (USD) |
These fields appear in both streaming (final usage chunk) and non-streaming responses.
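As an illustration, a per-request cache hit ratio can be derived from these fields. The helper name `cache_hit_ratio` is hypothetical, and the sketch assumes `cached_tokens` is counted as part of `prompt_tokens`, which matches the sample above (5,000 of 5,050) but is not stated explicitly.

```python
from typing import Any, Dict


def cache_hit_ratio(usage: Dict[str, Any]) -> float:
    """Fraction of prompt tokens served from cache (0.0 if there were none)."""
    prompt_tokens = usage.get("prompt_tokens", 0)
    if prompt_tokens == 0:
        return 0.0
    # Assumption: cached_tokens is a subset of prompt_tokens.
    return usage.get("cached_tokens", 0) / prompt_tokens


sample_usage = {
    "prompt_tokens": 5050,
    "completion_tokens": 200,
    "cached_tokens": 5000,
    "cache_write_tokens": 0,
}
print(f"Cache hit ratio: {cache_hit_ratio(sample_usage):.1%}")
```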
Tracking Your Savings
Per-request
Every API response includes usage.cost (what you paid) and usage.cache_discount (what you saved). Sum these over your session to track total savings.
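A minimal sketch of that running total follows. `SessionSavings` is a hypothetical helper, and `saved_fraction` assumes that `cost + cache_discount` equals what the request would have cost without caching.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SessionSavings:
    """Accumulates usage.cost and usage.cache_discount across a session."""

    cost: float = 0.0
    cache_discount: float = 0.0

    def add(self, usage: Dict[str, Any]) -> None:
        # Missing fields count as zero so partial usage dicts do not raise.
        self.cost += usage.get("cost", 0.0)
        self.cache_discount += usage.get("cache_discount", 0.0)

    @property
    def saved_fraction(self) -> float:
        # Assumption: cost + cache_discount is the undiscounted price.
        undiscounted = self.cost + self.cache_discount
        return self.cache_discount / undiscounted if undiscounted else 0.0
```

Call `add()` with the `usage` object of each response, converted to a dict (for example via `response.usage.model_dump()` if you use the OpenAI Python SDK), then read `cost`, `cache_discount`, and `saved_fraction` at the end of the session.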
Visit the Cache Stats rankings to see:
- Overall cache hit rate across all Kyma users
- Per-model cache breakdown (cached vs uncached vs output tokens)
- Total community savings in USD
Supported Models
All models on Kyma support prompt caching. Cache effectiveness varies by model — Kyma normalizes the behavior so you always see the same cached_tokens shape and the same 90% discount.
Check which models are actively caching:
curl https://kymaapi.com/v1/models | jq '.data[] | {id, supports_caching}'
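The same filter can be applied in Python once the response has been parsed. This sketch assumes the response shape implied by the jq expression above (a `data` list whose entries carry `id` and `supports_caching`); the helper name `caching_model_ids` is hypothetical.

```python
from typing import Any, Dict, List


def caching_model_ids(models_response: Dict[str, Any]) -> List[str]:
    """Return the ids of models whose supports_caching flag is truthy."""
    return [
        model["id"]
        for model in models_response.get("data", [])
        if model.get("supports_caching")
    ]
```

Pass it the parsed JSON body of the `/v1/models` response, fetched with whichever HTTP client you already use.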
Pricing
Cached tokens are charged at 10% of the normal input price (90% discount).
| Token type | Rate |
|---|---|
| Input (non-cached) | Full price |
| Input (cached) | 10% of input price |
| Output | Full price |
Example — deepseek-v3 pricing:
| Token type | Price per 1M tokens |
|---|---|
| Input (full) | $0.810 |
| Input (cached) | $0.081 |
| Output | $2.295 |
50-request coding session breakdown (idealized: every request after the first reads the system prompt from cache):

System prompt: 5,000 tokens (stable across requests)
User messages: ~500 tokens each (dynamic)

Without caching:
  50 × 5,000 × $0.810/1M = $0.203 (system prompt input only)

With caching:
  1 × 5,000 × $0.810/1M (first request) +
  49 × 5,000 × $0.081/1M (cached) = $0.024

Savings: $0.179 (88% reduction)
The usage.cost and usage.cache_discount fields in every response let you track savings in real-time.
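The 50-request arithmetic can be reproduced with a few lines of Python. This is a sketch using the deepseek-v3 prices quoted above; the constant and function names are illustrative. Like the breakdown, it counts only the system prompt tokens and ignores the ~500-token user messages.

```python
FULL_PRICE = 0.810    # USD per 1M input tokens (non-cached)
CACHED_PRICE = 0.081  # USD per 1M input tokens (cached, 10% of full)
SYSTEM_TOKENS = 5_000
REQUESTS = 50


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a token count at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


# Every request pays full price for the system prompt.
without_cache = input_cost(REQUESTS * SYSTEM_TOKENS, FULL_PRICE)

# First request pays full price; the remaining 49 read the prompt from cache.
with_cache = (
    input_cost(SYSTEM_TOKENS, FULL_PRICE)
    + input_cost((REQUESTS - 1) * SYSTEM_TOKENS, CACHED_PRICE)
)

savings = without_cache - with_cache
print(f"Without caching: ${without_cache:.4f}")
print(f"With caching:    ${with_cache:.4f}")
print(f"Savings:         ${savings:.4f} ({savings / without_cache:.0%})")
```

Changing `REQUESTS` or `SYSTEM_TOKENS` shows how the savings approach the 90% ceiling as sessions get longer, since the one full-price first request matters less and less.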