Kyma uses two limit types and enforces them per user:
- Text and code endpoints (chat, completions) are rate-based: requests per minute (RPM), tokens per minute (TPM), and a per-model RPM.
- Media endpoints (image, video, audio, realtime) are concurrency-based — only actively running jobs count against your limit. Queued jobs do not consume a slot.
Why concurrency for media: jobs run anywhere from 5 to 600 seconds. Per-minute metering would be either far too coarse (it allows silent burst capacity) or far too strict (a single long render burns your whole quota). Concurrency caps plus balance pre-checks track real load.
Tier ladder
Your tier is determined by lifetime_purchased — the total Stripe-purchased face value of credits across your account’s history. Signup bonuses, referral credits, and promotional grants don’t count.
| Tier | Min deposit | RPM | per-model RPM | TPM | Max tokens / request |
|---|---|---|---|---|---|
| Tier 0 (free) | $0 | 30 | 25 | 200,000 | 200,000 |
| Tier 1 | $5 | 60 | 40 | 500,000 | 500,000 |
| Tier 2 | $50 | 120 | 80 | 2,000,000 | 1,000,000 |
| Tier 3 | $250 | 200 | 150 | 5,000,000 | 3,000,000 |
| Tier 4 | $1,000 | 300 | 200 | 10,000,000 | 10,000,000 |
Tier is recomputed after every successful purchase. There is no waiting period or application: top up $5 and your Tier 1 limits apply as soon as the cached tier refreshes (within 5 minutes).
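The ladder above can be read as a threshold lookup. A minimal sketch, assuming lifetime_purchased is a dollar amount; the names `TIER_THRESHOLDS`, `TEXT_LIMITS`, and `tier_for` are illustrative and not part of the Kyma API:

```python
from bisect import bisect_right

# Minimum lifetime_purchased (USD) for Tier 0..4, copied from the tier table.
TIER_THRESHOLDS = [0, 5, 50, 250, 1000]

# Text limits per tier: (RPM, per-model RPM, TPM, max tokens per request).
TEXT_LIMITS = {
    0: (30, 25, 200_000, 200_000),
    1: (60, 40, 500_000, 500_000),
    2: (120, 80, 2_000_000, 1_000_000),
    3: (200, 150, 5_000_000, 3_000_000),
    4: (300, 200, 10_000_000, 10_000_000),
}


def tier_for(lifetime_purchased: float) -> int:
    """Return the highest tier whose minimum deposit is <= lifetime_purchased."""
    # bisect_right counts how many thresholds are <= the amount.
    return max(bisect_right(TIER_THRESHOLDS, lifetime_purchased) - 1, 0)


rpm, per_model_rpm, tpm, max_tokens = TEXT_LIMITS[tier_for(60.0)]
```

With $60 purchased, the lookup lands on Tier 2, so `rpm` is 120 and `tpm` is 2,000,000.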
Image limits
| Tier | image_concurrent | Queue depth cap |
|---|---|---|
| Tier 0 | 2 | 6 |
| Tier 1 | 10 | 30 |
| Tier 2 | 20 | 60 |
| Tier 3 | 28 | 84 |
| Tier 4 | 38 | (combined, see below) |
Video limits
| Tier | video_concurrent | Queue depth cap |
|---|---|---|
| Tier 0 | 0 (blocked) | 0 |
| Tier 1 | 4 | 12 |
| Tier 2 | 8 | 24 |
| Tier 3 | 10 | 30 |
| Tier 4 | 38 | (combined, see below) |
Tier 0 blocks video generation entirely. Deposit $5 or more (Tier 1) to unlock it.
Tier 4 image and video share a combined pool of 38 concurrent jobs with a queue depth cap of 114. So one Tier 4 account can run all 38 slots as image, all 38 as video, or any mix in between. Lower tiers keep separate per-modality caps.
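The difference between separate per-modality caps and the Tier 4 combined pool can be sketched as a slot check. This is an illustration of the rule described above, not Kyma's implementation; `has_free_slot` and its parameters are hypothetical names:

```python
# Concurrent caps for Tiers 0-3, copied from the image and video tables.
IMAGE_CONCURRENT = {0: 2, 1: 10, 2: 20, 3: 28}
VIDEO_CONCURRENT = {0: 0, 1: 4, 2: 8, 3: 10}
TIER4_COMBINED = 38  # Tier 4 shares one pool across image and video.


def has_free_slot(tier: int, modality: str,
                  running_image: int, running_video: int) -> bool:
    """True if a new job of `modality` could start running right now."""
    if modality not in ("image", "video"):
        raise ValueError(f"unsupported modality: {modality}")
    if tier == 4:
        # Combined pool: both modalities draw from the same 38 slots.
        return running_image + running_video < TIER4_COMBINED
    if modality == "image":
        return running_image < IMAGE_CONCURRENT[tier]
    return running_video < VIDEO_CONCURRENT[tier]
```

On Tier 2, a full video pool does not affect image capacity. On Tier 4, 30 running images leave only 8 slots for either modality.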
Audio limits — per-provider sub-pools
Each audio provider has its own concurrency pool, isolated per tenant. Saturating STT (Groq) does not block TTS (ElevenLabs) or audio understanding (Vertex), and vice versa. No other gateway documents this — Kyma is the only one that enforces sub-pool isolation as a first-class limit, because we route each audio capability through a different upstream with very different pool sizes.
Which models use which sub-pool
| Sub-pool | Models routed through it |
|---|---|
| groq | whisper-v3-turbo (default STT, transcribe alias) |
| openai | gpt-4o-mini-transcribe-2025-12-15 (premium STT, transcribe-quality alias) |
| vertex | gemini-3-flash-audio (audio understanding), plus the STT fallback when Whisper is down |
| elevenlabs | eleven-multilingual-v2, eleven-turbo-v2, eleven-flash-v2, elevenlabs-music, elevenlabs-sfx |
| minimax | minimax-speech-hd, minimax-speech-turbo, minimax-music, minimax-music-pro, voice clone, voice design |
Per-tier concurrent slots
Each row is a per-user, per-sub-pool cap. The Total audio column is the legacy aggregate (the maximum across sub-pools, not their sum) returned by /v1/auth/limits for compatibility.
| Tier | groq | openai | vertex | elevenlabs | minimax | Total audio |
|---|---|---|---|---|---|---|
| Tier 0 | 1 | 1 | 2 | 1 | 1 | 2 |
| Tier 1 | 3 | 2 | 10 | 2 | 3 | 10 |
| Tier 2 | 8 | 4 | 25 | 3 | 8 | 25 |
| Tier 3 | 15 | 8 | 50 | 4 | 14 | 50 |
| Tier 4 | 30 | 18 | 100 | 5 | 20 | 100 |
ElevenLabs sub-pool caps are the tightest by design. The ElevenLabs Pro plan caps Kyma’s upstream pool at 10 concurrent connections, so with a Tier 4 cap of 5, two Tier 4 users at full burst fill the pool. Beyond that, other tenants see upstream 429s forwarded as Kyma 429s with Retry-After. A second ElevenLabs Pro key is on the roadmap to lift the Tier 4 cap to 10.
The openai sub-pool caps reflect OpenAI’s tier-3 STT quota for Kyma’s upstream account. They sit lower than groq because OpenAI’s gpt-4o-mini-transcribe is priced ~4.5× higher per minute. Heavy STT workloads should therefore default to transcribe (Groq) and reserve the transcribe-quality alias for use cases that explicitly need OpenAI’s accuracy on conversational audio or code-switched speech.
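To see which cap applies to a given request, the two tables above can be combined into a lookup. A sketch with the values copied from this page; `MODEL_POOL`, `POOL_CAPS`, `audio_cap`, and `total_audio` are illustrative names, and the "voice clone" and "voice design" features are left out because the page does not give model identifiers for them:

```python
# Model -> sub-pool, copied from the routing table.
MODEL_POOL = {
    "whisper-v3-turbo": "groq",
    "gpt-4o-mini-transcribe-2025-12-15": "openai",
    "gemini-3-flash-audio": "vertex",
    "eleven-multilingual-v2": "elevenlabs",
    "eleven-turbo-v2": "elevenlabs",
    "eleven-flash-v2": "elevenlabs",
    "elevenlabs-music": "elevenlabs",
    "elevenlabs-sfx": "elevenlabs",
    "minimax-speech-hd": "minimax",
    "minimax-speech-turbo": "minimax",
    "minimax-music": "minimax",
    "minimax-music-pro": "minimax",
}

# Sub-pool -> concurrent slots for Tier 0..4 (list index = tier).
POOL_CAPS = {
    "groq": [1, 3, 8, 15, 30],
    "openai": [1, 2, 4, 8, 18],
    "vertex": [2, 10, 25, 50, 100],
    "elevenlabs": [1, 2, 3, 4, 5],
    "minimax": [1, 3, 8, 14, 20],
}


def audio_cap(model: str, tier: int) -> int:
    """Concurrent slots available to one user for this model's sub-pool."""
    return POOL_CAPS[MODEL_POOL[model]][tier]


def total_audio(tier: int) -> int:
    """Legacy aggregate: the maximum cap across sub-pools, not the sum."""
    return max(caps[tier] for caps in POOL_CAPS.values())
```

For example, `audio_cap("eleven-flash-v2", 4)` is 5 even though `total_audio(4)` is 100, which is why the aggregate alone is not enough to size a TTS workload.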
Realtime audio limits
| Limit | Value |
|---|---|
| Concurrent realtime sessions per user | 3 (combined across OpenAI Realtime Translate + Gemini Live) |
| Session token TTL | 5 minutes (WebSocket must connect within this window) |
| Heartbeat interval | 30 seconds |
| Heartbeat timeout | 90 seconds (session reaped, unused minutes refunded) |
| Max session duration | 30 minutes per session |
See Realtime Audio for the full lifecycle.
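The timing values in the table can be sketched as a small state check. This is only an illustration: the function name, the status strings, the exact boundary comparisons, and the choice to measure the heartbeat timeout from the last heartbeat (or from connect, if none was sent yet) are all assumptions rather than documented behavior:

```python
from typing import Optional

SESSION_TOKEN_TTL_S = 5 * 60   # WebSocket must connect within this window
HEARTBEAT_INTERVAL_S = 30      # how often a client would send a heartbeat
HEARTBEAT_TIMEOUT_S = 90       # silence longer than this reaps the session
MAX_SESSION_S = 30 * 60        # hard cap per session


def session_status(now: float, token_issued_at: float,
                   connected_at: Optional[float] = None,
                   last_heartbeat_at: Optional[float] = None) -> str:
    """Classify a realtime session from timestamps given in seconds."""
    if connected_at is None:
        # Not connected yet: only the token TTL matters.
        if now - token_issued_at > SESSION_TOKEN_TTL_S:
            return "token_expired"
        return "awaiting_connect"
    if now - connected_at >= MAX_SESSION_S:
        return "max_duration_reached"
    # Assumption: before the first heartbeat, the clock starts at connect.
    last_seen = connected_at if last_heartbeat_at is None else last_heartbeat_at
    if now - last_seen > HEARTBEAT_TIMEOUT_S:
        return "reaped"
    return "active"
```

A client that sends a heartbeat every `HEARTBEAT_INTERVAL_S` seconds can miss two in a row and still stay inside the 90-second timeout.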
Queue + backpressure
Image and video endpoints have an in-process FIFO queue ahead of the concurrency gate. The queue depth cap is 3× the concurrent cap per modality (e.g. Tier 2 image: 20 concurrent + 60 queue = 80 in-flight per user). Audio and realtime do not queue — at-capacity requests return 429 immediately.
When you exceed limits:
- Active capacity full but queue has room → request queues. The HTTP connection stays open until your request is picked up; no client action needed.
- Queue full too → 429 with error.code: queue_full and Retry-After: <seconds> based on the typical job duration for that modality.
- Audio or realtime concurrency full → 429 with error.code: concurrent_limit_exceeded and Retry-After: <seconds> based on the active sub-pool’s expected drain time.
- Text RPM / TPM exceeded → 429 with error.code: rate_limit_exceeded and Retry-After: <seconds> until the rolling window slides.
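The media cases in this list reduce to one decision per request. A sketch of that decision under the 3× queue rule described above; `admit` and its return strings are hypothetical, and the Tier 0 video block is not modeled because this page does not say which error it returns:

```python
QUEUE_MULTIPLIER = 3  # queue depth cap = 3x the concurrent cap


def admit(modality: str, running: int, queued: int, concurrent_cap: int) -> str:
    """Return "run", "queue", or the error.code of the resulting 429."""
    if running < concurrent_cap:
        return "run"
    if modality in ("image", "video"):
        # Image and video wait in a FIFO queue ahead of the concurrency gate.
        if queued < QUEUE_MULTIPLIER * concurrent_cap:
            return "queue"
        return "queue_full"
    # Audio and realtime never queue: at capacity they fail immediately.
    return "concurrent_limit_exceeded"
```

For a Tier 2 image user (cap 20), the 21st request queues, and a request arriving with 20 running and 60 queued gets `queue_full`.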
429 response shape
{
  "error": {
    "type": "rate_limit_error",
    "code": "concurrent_limit_exceeded",
    "message": "Speech-to-Text pool full (3/3 slots in use). Retry after 4s.",
    "retry_after": 4
  }
}
Retry-After is also set as an HTTP header (in seconds, integer). Clients should prefer the header for parsing; the body field is a convenience for environments where headers are awkward.
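A client can follow that preference by reading the header first and falling back to the body field. A minimal sketch using only the standard library; the function name and the 1-second default are assumptions, not Kyma recommendations:

```python
import json
from typing import Mapping


def retry_delay_seconds(headers: Mapping[str, str], body_text: str,
                        default: float = 1.0) -> float:
    """Seconds to wait after a 429: header first, then body, then default."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return float(int(value))
            except (TypeError, ValueError):
                break  # malformed header: fall through to the body field
    try:
        return float(json.loads(body_text)["error"]["retry_after"])
    except (TypeError, ValueError, KeyError):
        return default
```

Passing the headers as a plain mapping keeps the helper independent of any particular HTTP client.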
Error codes
| error.code | When | Retry-After present? |
|---|---|---|
| rate_limit_exceeded | Text RPM, per-model RPM, or TPM exceeded | Yes (seconds until window slides) |
| concurrent_limit_exceeded | Media or audio sub-pool full | Yes (estimated drain time) |
| queue_full | Image or video queue at depth cap | Yes (estimated queue clear time) |
| too_many_sessions | Realtime session count = 3 | No (end an existing session) |
| insufficient_balance | Balance below request hold | No (top up) |
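The table splits the codes into those worth retrying automatically and those that need a different action first. One way a client might encode that split; the action strings are hypothetical labels for illustration:

```python
# Codes the table marks as carrying Retry-After: waiting is enough.
RETRYABLE_CODES = {"rate_limit_exceeded", "concurrent_limit_exceeded", "queue_full"}


def next_action(error_code: str) -> str:
    """Map an error.code to what the caller should do next."""
    if error_code in RETRYABLE_CODES:
        return "wait_and_retry"
    if error_code == "too_many_sessions":
        return "end_a_session"  # retrying alone will not free a slot
    if error_code == "insufficient_balance":
        return "top_up"
    return "fail"  # unknown code: surface it rather than retry blindly
```

Treating unknown codes as failures keeps a retry loop from spinning on an error that waiting cannot fix.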
On 200 responses for rate-limited endpoints (text path):
- X-Kyma-RateLimit-Tier: your current tier number (0–4)
- X-Kyma-RateLimit-RPM-Remaining: requests remaining in the current minute window
On 429 responses (any path):
- Retry-After: integer seconds. Always present.
- X-Kyma-RateLimit-Code: same value as error.code for easy machine parsing.
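The headers on 200 responses let a text client slow down before it hits a 429. A sketch of reading them; the helper names and the `floor` threshold are assumptions, and only the two header names come from this page:

```python
from typing import Mapping, Optional, Tuple


def rate_limit_state(headers: Mapping[str, str]) -> Tuple[Optional[int], Optional[int]]:
    """Return (tier, rpm_remaining); either is None if its header is absent."""
    lowered = {name.lower(): value for name, value in headers.items()}
    tier = lowered.get("x-kyma-ratelimit-tier")
    remaining = lowered.get("x-kyma-ratelimit-rpm-remaining")
    return (
        int(tier) if tier is not None else None,
        int(remaining) if remaining is not None else None,
    )


def should_pause(headers: Mapping[str, str], floor: int = 1) -> bool:
    """True when fewer than `floor` requests remain in the current window."""
    _, remaining = rate_limit_state(headers)
    return remaining is not None and remaining < floor
```

A caller could check `should_pause(response_headers)` after each request and sleep until the next minute window when it returns True.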
How limits are computed
Your tier is the maximum of:
- Deposit ladder: lifetime_purchased (Stripe purchases only) compared against the minimum-deposit thresholds in the tier table.
- Manual override: a tier_override flag set per-account by Kyma ops for partners and heavy users.
Tier results are cached for 5 minutes. After a top-up, expect new limits to apply within 5 minutes worst case.
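The "maximum of" rule and the cache window can be sketched in a few lines. This assumes tier_override carries a tier number (0–4) when set, which the page does not state; the function names are illustrative:

```python
from typing import Optional

TIER_CACHE_TTL_S = 5 * 60  # tier results are cached for 5 minutes


def effective_tier(deposit_tier: int, tier_override: Optional[int] = None) -> int:
    """Take the higher of the deposit-ladder tier and any manual override."""
    if tier_override is None:
        return deposit_tier
    return max(deposit_tier, tier_override)


def cache_is_fresh(cached_at: float, now: float) -> bool:
    """True while a cached tier result is still within its 5-minute window."""
    return now - cached_at < TIER_CACHE_TTL_S
```

Because the rule is a maximum, an override can raise a tier but never lower one that deposits have already earned.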
Need higher limits?
Two paths:
1. Natural tier upgrade. Deposit more credits to cross the next threshold. The new tier applies within 5 minutes of the purchase; no application needed.
2. Heavy users on a tight ramp. If you’re shipping a product on Kyma and expect to scale beyond Tier 3 limits before you’d naturally cross the $1,000 deposit threshold, email hello@kymaapi.com with:
   - Product name and URL
   - Expected monthly usage (rough is fine)
   - Primary capabilities (audio / text / image / video mix)
Real examples already on this path: watch-cli, echoly, OpenClaw. Tier 4 limits without the $1,000 deposit are available case-by-case. Internal SLA: response within 1 business day.
Note: this is distinct from the future Kyma Partner Program (revshare for app creators who integrate Kyma) — that’s a separate offering with its own page when it launches.