Prompt caching reduces costs and latency by reusing previously computed context. When you send requests with the same system prompt, tool definitions, or conversation history, cached tokens are charged at a discounted rate.
See real-time cache stats across all models on the Rankings page.

How It Works

  1. First request: Full prompt is processed and cached by the provider
  2. Subsequent requests: Cached prefix is reused — up to 90% cheaper and 80% faster
Caching works automatically for supported providers. No code changes required for most use cases.

Real-World Impact

Based on production data from the Kyma community (7-day rolling window):
  • 22%+ overall cache hit rate across all models
  • deepseek-v3 leads with 56% cache hit rate — heavy agentic usage
  • gemini-2.5-flash and gemma-4-31b consistently hit 20-35%
  • Coding agents (OpenClaw, Cline, Roo Code) see the highest cache rates due to repeated system prompts
Check the live numbers at kymaapi.com/rankings?tab=cache.

Automatic Caching

For OpenAI-compatible requests, caching is automatic when your prompt exceeds 1,024 tokens. Place static content (system prompt, tool definitions) at the beginning:
from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="ky-your-api-key"
)

response = client.chat.completions.create(
    model="deepseek-v3",
    messages=[
        # Static content (cached after first request)
        {"role": "system", "content": "Your long system prompt here..."},
        # Dynamic content (changes every request, so it is not read from cache)
        {"role": "user", "content": "User's question"}
    ]
)

# Check cache stats in response
print(f"Cached: {response.usage.cached_tokens}")
print(f"Cost: ${response.usage.cost}")
print(f"Saved: ${response.usage.cache_discount}")

Best Practices

Structure prompts for caching

Place stable content first, dynamic content last:
1. System instructions (static) ← CACHED
2. Tool definitions (static)    ← CACHED
3. Few-shot examples (static)   ← CACHED
4. Conversation history          ← CACHED (grows incrementally)
5. Current user message          ← NOT CACHED (changes each request)
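The ordering above can be sketched as a small helper. This is only an illustration: `build_messages` and the sample prompts are hypothetical names, not part of the Kyma API, and tool definitions are assumed to be passed separately (for example through the `tools` parameter of an OpenAI-compatible request) in the same order every time.

```python
from typing import Dict, List

Message = Dict[str, str]


def build_messages(
    system_prompt: str,
    few_shot: List[Message],
    history: List[Message],
    user_message: str,
) -> List[Message]:
    """Assemble messages with stable content first and the new turn last."""
    messages: List[Message] = [{"role": "system", "content": system_prompt}]
    messages.extend(few_shot)  # static examples, identical on every request
    messages.extend(history)   # only ever appended to, so the old prefix is unchanged
    messages.append({"role": "user", "content": user_message})
    return messages


# Two consecutive turns share everything except the appended tail.
SYSTEM = "You are a careful coding assistant."
SHOTS = [
    {"role": "user", "content": "Reverse 'abc'."},
    {"role": "assistant", "content": "cba"},
]
turn1 = build_messages(SYSTEM, SHOTS, [], "What is 2 + 2?")
history = turn1[len(SHOTS) + 1:] + [{"role": "assistant", "content": "4"}]
turn2 = build_messages(SYSTEM, SHOTS, history, "And 3 + 3?")

# The whole first request is a prefix of the second one.
assert turn2[: len(turn1)] == turn1
```

Because the second request starts with exactly the messages of the first, the provider has a chance to reuse the earlier prefix; whether it does depends on the provider and the prompt length.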

For coding agents

Coding agents (OpenClaw, Cline, Roo Code, Claude Code) automatically benefit from caching because they send the same system prompt and tool definitions with every request.
Real production example, from a 50-request coding session with deepseek-v3:
                         Without caching   With caching (56% hit rate)
Effective input tokens   250,000           110,000
Input cost               $0.203            $0.049
Savings                                    $0.154 (76%)

What to avoid

  • Don’t put timestamps or request IDs in the system prompt; any change to the prefix breaks the cache
  • Don’t reorder tool definitions between requests
  • Don’t edit the system prompt between requests; keep it identical
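One way to catch these mistakes locally is to hash the static prefix before each request and compare it with the previous hash. The sketch below is an assumption-laden debugging aid: `prefix_fingerprint` and the placeholder tool definitions are hypothetical, and the hash says nothing about how any provider actually keys its cache. It only shows when your own prefix has changed.

```python
import hashlib
import json
from typing import Any, Dict, List


def prefix_fingerprint(system_prompt: str, tools: List[Dict[str, Any]]) -> str:
    """Hash the static prefix; a different hash means the prefix changed."""
    # sort_keys normalizes key order inside each tool but keeps the list order,
    # so reordering the tools still changes the fingerprint.
    payload = json.dumps({"system": system_prompt, "tools": tools}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


TOOLS = [{"name": "read_file"}, {"name": "write_file"}]  # placeholder tool definitions

stable_a = prefix_fingerprint("You are a coding agent.", TOOLS)
stable_b = prefix_fingerprint("You are a coding agent.", TOOLS)
with_timestamp = prefix_fingerprint(
    "You are a coding agent. Now: 2025-01-01T12:00:00Z", TOOLS
)
reordered = prefix_fingerprint("You are a coding agent.", list(reversed(TOOLS)))

print(stable_a == stable_b)        # True: identical prefix
print(stable_a == with_timestamp)  # False: the timestamp changed the prefix
print(stable_a == reordered)       # False: the tool order changed the prefix
```

If a timestamp or request ID is needed, placing it in the final user message keeps the fingerprint stable.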

Cache Stats in Response

Kyma normalizes cache statistics from all providers into a unified format:
{
  "usage": {
    "prompt_tokens": 5050,
    "completion_tokens": 200,
    "cached_tokens": 5000,
    "cache_write_tokens": 0,
    "cost": 0.000382,
    "cache_discount": 0.002430
  }
}
Field                Description
cached_tokens        Tokens read from cache (90% discounted)
cache_write_tokens   Tokens written to cache on first request
cost                 Total cost charged for this request (USD)
cache_discount       Amount saved from caching (USD)
These fields appear in both streaming (final usage chunk) and non-streaming responses.
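For streaming, a minimal sketch of picking up that final usage chunk might look like the following. It assumes chunks follow the OpenAI streaming shape, where `chunk.usage` is `None` on every chunk except the last; `final_usage` is a hypothetical helper and the chunks here are stand-ins built with `SimpleNamespace`, not real API output.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Optional


def final_usage(chunks: Iterable[Any]) -> Optional[Any]:
    """Return the last non-empty `usage` object seen in a stream of chunks."""
    usage = None
    for chunk in chunks:
        chunk_usage = getattr(chunk, "usage", None)
        if chunk_usage is not None:
            usage = chunk_usage
    return usage


# Stand-in chunks for illustration; a real stream would come from
# client.chat.completions.create(..., stream=True).
fake_stream = [
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=None),
    SimpleNamespace(usage=SimpleNamespace(prompt_tokens=5050, cached_tokens=5000)),
]
usage = final_usage(fake_stream)
print(usage.cached_tokens)
```

Some OpenAI-compatible APIs send the usage chunk only when `stream_options={"include_usage": True}` is passed; whether that is needed here is not stated on this page, so treat it as something to verify.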

Tracking Your Savings

Per-request

Every API response includes usage.cost (what you paid) and usage.cache_discount (what you saved). Sum these over your session to track total savings.
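A minimal sketch of that running total, assuming each response's usage is available as a dict with the `cost` and `cache_discount` fields described above. `SessionTotals` is a hypothetical helper, and `savings_rate` assumes that `cost + cache_discount` equals what the session would have cost without caching.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class SessionTotals:
    cost: float = 0.0
    cache_discount: float = 0.0

    def add(self, usage: Mapping[str, Any]) -> None:
        """Accumulate one response's usage; missing fields count as zero."""
        self.cost += float(usage.get("cost", 0.0))
        self.cache_discount += float(usage.get("cache_discount", 0.0))

    @property
    def savings_rate(self) -> float:
        """Share of the undiscounted bill (cost + discount) removed by caching."""
        would_have_paid = self.cost + self.cache_discount
        return self.cache_discount / would_have_paid if would_have_paid else 0.0


totals = SessionTotals()
for _ in range(3):
    # Same figures as the sample usage object in "Cache Stats in Response".
    totals.add({"cost": 0.000382, "cache_discount": 0.002430})

print(f"Paid ${totals.cost:.6f}, saved ${totals.cache_discount:.6f}")
print(f"Savings rate: {totals.savings_rate:.1%}")
```

With the OpenAI Python SDK, `response.usage.model_dump()` should produce a suitable dict, though that depends on the SDK version in use.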

Community-wide

Visit the Cache Stats rankings to see:
  • Overall cache hit rate across all Kyma users
  • Per-model cache breakdown (cached vs uncached vs output tokens)
  • Total community savings in USD

Supported Models

All models on Kyma support prompt caching. Cache effectiveness varies by model — Kyma normalizes the behavior so you always see the same cached_tokens shape and the same 90% discount. Check which models are actively caching:
curl https://kymaapi.com/v1/models | jq '.data[] | {id, supports_caching}'
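The same check can be sketched in Python. This assumes the response has the shape the jq filter implies (a `data` list of objects with `id` and `supports_caching`); `caching_model_ids` and `fetch_models` are hypothetical helper names, and the sample payload is hand-written rather than real API output.

```python
import json
import urllib.request
from typing import Any, Dict, List

MODELS_URL = "https://kymaapi.com/v1/models"  # same endpoint as the curl command


def caching_model_ids(payload: Dict[str, Any]) -> List[str]:
    """Return ids of models whose `supports_caching` flag is truthy."""
    return [m["id"] for m in payload.get("data", []) if m.get("supports_caching")]


def fetch_models(url: str = MODELS_URL) -> Dict[str, Any]:
    """Download and parse the model list (performs a network request)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


# Offline example with a hand-written payload; not real API output.
sample = {
    "data": [
        {"id": "model-a", "supports_caching": True},
        {"id": "model-b", "supports_caching": False},
    ]
}
print(caching_model_ids(sample))
```

Calling `caching_model_ids(fetch_models())` would run the check against the live endpoint.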

Pricing

Cached tokens are charged at 10% of the normal input price (90% discount).
Token type           Rate
Input (non-cached)   Full price
Input (cached)       10% of input price
Output               Full price
Example — deepseek-v3 pricing:
                 Price per 1M tokens
Input (full)     $0.810
Input (cached)   $0.081
Output           $2.295
50-request coding session breakdown:
System prompt: 5,000 tokens (stable across requests)
User messages: ~500 tokens each (dynamic)

Without caching:
  50 × 5,000 × $0.810/1M = $0.203 (input only)

With caching:
  1 × 5,000 × $0.810/1M (first request) +
  49 × 5,000 × $0.081/1M (cached) = $0.024

Savings: $0.179 (88% reduction)
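The breakdown above can be reproduced with a few lines of arithmetic. Like the breakdown, this sketch counts only the 5,000-token system prompt and ignores the dynamic user messages and output tokens; the constant names are illustrative.

```python
INPUT_PRICE = 0.810 / 1_000_000   # USD per input token (deepseek-v3 example pricing)
CACHED_PRICE = 0.081 / 1_000_000  # USD per cached input token (10% of input price)

REQUESTS = 50
SYSTEM_PROMPT_TOKENS = 5_000

# Every request pays full price for the system prompt.
without_cache = REQUESTS * SYSTEM_PROMPT_TOKENS * INPUT_PRICE

# The first request pays full price; the remaining 49 pay the cached rate.
with_cache = (
    1 * SYSTEM_PROMPT_TOKENS * INPUT_PRICE
    + (REQUESTS - 1) * SYSTEM_PROMPT_TOKENS * CACHED_PRICE
)
savings = without_cache - with_cache

print(f"Without caching: ${without_cache:.4f}")
print(f"With caching:    ${with_cache:.4f}")
print(f"Savings:         ${savings:.4f} ({savings / without_cache:.0%})")
```

Changing `REQUESTS` shows how the one full-price first request matters less as a session gets longer.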
The usage.cost and usage.cache_discount fields in every response let you track savings in real time.