Rate Limits

Kyma uses two limit types and enforces them per user:

Text and code endpoints (chat, completions) are rate-based: requests per minute (RPM), tokens per minute (TPM), and a per-model RPM.
Media endpoints (image, video, audio, realtime) are concurrency-based — only actively running jobs count against your limit. Queued jobs do not consume a slot.

Why concurrency for media: jobs run 5–600 seconds. Per-minute metering would either be far too coarse (silent burst capacity) or far too strict (single long render burns your whole quota). Concurrency caps + balance pre-checks track real load.

Tier ladder

Your tier is determined by lifetime_purchased — the total Stripe-purchased face value of credits across your account’s history. Signup bonuses, referral credits, and promotional grants don’t count.

Tier	Min deposit	RPM	per-model RPM	TPM	Max tokens / request
Tier 0 (free)	$0	30	25	200,000	200,000
Tier 1	$5	60	40	500,000	500,000
Tier 2	$50	120	80	2,000,000	1,000,000
Tier 3	$250	200	150	5,000,000	3,000,000
Tier 4	$1,000	300	200	10,000,000	10,000,000

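Tier is recomputed after every successful purchase. There is no waiting period or application step: top up $5 and your Tier 1 limits apply as soon as the cached tier result refreshes, which takes at most 5 minutes (see How limits are computed).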
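As a minimal sketch of the ladder above, the lookup from lifetime_purchased to a tier can be written as a threshold scan. The names TierLimits, TIERS, and tier_for are illustrative and not part of any Kyma SDK; the numbers are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    tier: int
    min_deposit: int    # USD of Stripe-purchased credits (lifetime)
    rpm: int
    per_model_rpm: int
    tpm: int
    max_tokens: int     # max tokens per request


# Values copied from the tier table; ordered by ascending threshold.
TIERS = [
    TierLimits(0, 0, 30, 25, 200_000, 200_000),
    TierLimits(1, 5, 60, 40, 500_000, 500_000),
    TierLimits(2, 50, 120, 80, 2_000_000, 1_000_000),
    TierLimits(3, 250, 200, 150, 5_000_000, 3_000_000),
    TierLimits(4, 1_000, 300, 200, 10_000_000, 10_000_000),
]


def tier_for(lifetime_purchased: float) -> TierLimits:
    """Return the highest tier whose minimum deposit has been reached."""
    best = TIERS[0]
    for limits in TIERS:
        if lifetime_purchased >= limits.min_deposit:
            best = limits
    return best
```

For example, tier_for(49.99) stays on Tier 1, while tier_for(50) moves to Tier 2.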

Image limits

Tier	`image_concurrent`	Queue depth cap
Tier 0	2	6
Tier 1	10	30
Tier 2	20	60
Tier 3	28	84
Tier 4	38	(combined, see below)

Video limits

Tier	`video_concurrent`	Queue depth cap
Tier 0	0 (blocked)	0
Tier 1	4	12
Tier 2	8	24
Tier 3	10	30
Tier 4	38	(combined, see below)

Tier 0 blocks video generation entirely. Deposit $5 or more to reach Tier 1 and unlock it.

Tier 4 combined media pool

Tier 4 image and video share a combined pool of 38 concurrent jobs with a queue depth cap of 114. So one Tier 4 account can run all 38 slots as image, all 38 as video, or any mix in between. Lower tiers keep separate per-modality caps.
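The difference between separate per-modality caps and the Tier 4 combined pool can be sketched as a slot check. This is only a client-side model under the assumption that the caps behave exactly as tabulated; can_start and the constant names are hypothetical.

```python
# Concurrency caps copied from the image and video tables (Tiers 0-3).
IMAGE_CONCURRENT = {0: 2, 1: 10, 2: 20, 3: 28}
VIDEO_CONCURRENT = {0: 0, 1: 4, 2: 8, 3: 10}
TIER4_COMBINED = 38  # shared by image and video on Tier 4


def can_start(tier: int, running_image: int, running_video: int, modality: str) -> bool:
    """True if a new `modality` job would get an active slot right away."""
    if modality not in ("image", "video"):
        raise ValueError("modality must be 'image' or 'video'")
    if tier == 4:
        # One shared pool: any mix of image and video counts against 38.
        return running_image + running_video < TIER4_COMBINED
    if modality == "image":
        return running_image < IMAGE_CONCURRENT[tier]
    return running_video < VIDEO_CONCURRENT[tier]
```

On Tier 3, a full image pool does not affect video (can_start(3, 28, 0, "video") is True), whereas on Tier 4 the two modalities compete for the same 38 slots.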

Audio limits — per-provider sub-pools

Each audio provider has its own concurrency pool, isolated per tenant. Saturating STT (Groq) does not block TTS (ElevenLabs) or audio understanding (Vertex), and vice versa. No other gateway documents this — Kyma is the only one that enforces sub-pool isolation as a first-class limit, because we route each audio capability through a different upstream with very different pool sizes.

Which models use which sub-pool

Sub-pool	Models routed through it
`groq`	`whisper-v3-turbo` (default STT — `transcribe` alias)
`openai`	`gpt-4o-mini-transcribe-2025-12-15` (premium STT — `transcribe-quality` alias)
`vertex`	`gemini-3-flash-audio` (audio understanding), plus the STT fallback when Whisper is down
`elevenlabs`	`eleven-multilingual-v2`, `eleven-turbo-v2`, `eleven-flash-v2`, `elevenlabs-music`, `elevenlabs-sfx`
`minimax`	`minimax-speech-hd`, `minimax-speech-turbo`, `minimax-music`, `minimax-music-pro`, voice clone, voice design

Per-tier concurrent slots

Each cell is a per-user cap for one sub-pool. The Total audio column is the legacy aggregate (the maximum across sub-pools) that /v1/auth/limits returns for compatibility.

Tier	groq	openai	vertex	elevenlabs	minimax	Total audio
Tier 0	1	1	2	1	1	2
Tier 1	3	2	10	2	3	10
Tier 2	8	4	25	3	8	25
Tier 3	15	8	50	4	14	50
Tier 4	30	18	100	5	20	100
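To make the table concrete, the sketch below encodes the per-sub-pool caps and derives the Total audio column as the maximum across sub-pools. The dictionary layout and function names are illustrative assumptions, and the model map lists only a few of the models from the routing table.

```python
# Per-user caps per sub-pool, copied from the table above.
AUDIO_SLOTS = {
    0: {"groq": 1, "openai": 1, "vertex": 2, "elevenlabs": 1, "minimax": 1},
    1: {"groq": 3, "openai": 2, "vertex": 10, "elevenlabs": 2, "minimax": 3},
    2: {"groq": 8, "openai": 4, "vertex": 25, "elevenlabs": 3, "minimax": 8},
    3: {"groq": 15, "openai": 8, "vertex": 50, "elevenlabs": 4, "minimax": 14},
    4: {"groq": 30, "openai": 18, "vertex": 100, "elevenlabs": 5, "minimax": 20},
}

# A few entries from the routing table (not exhaustive).
SUB_POOL_FOR_MODEL = {
    "whisper-v3-turbo": "groq",
    "gpt-4o-mini-transcribe-2025-12-15": "openai",
    "gemini-3-flash-audio": "vertex",
    "eleven-turbo-v2": "elevenlabs",
    "minimax-speech-hd": "minimax",
}


def total_audio(tier: int) -> int:
    """Legacy aggregate: the largest sub-pool cap for the tier."""
    return max(AUDIO_SLOTS[tier].values())


def slots_for_model(model: str, tier: int) -> int:
    """Concurrent slots available to `model` at `tier`."""
    return AUDIO_SLOTS[tier][SUB_POOL_FOR_MODEL[model]]
```

Because the aggregate is a maximum rather than a sum, it says nothing about the tighter pools: a Tier 4 account reports 100 total but still has only 5 ElevenLabs slots.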

ElevenLabs sub-pool caps are the tightest by design. The ElevenLabs Pro plan caps the upstream pool at 10 concurrent connections, so the Tier 4 cap of 5 lets at most two Tier 4 users run at full burst before other tenants see upstream 429s, which Kyma forwards as 429s with Retry-After. A second ElevenLabs Pro key is on the roadmap to lift the Tier 4 cap to 10.

The openai sub-pool caps reflect OpenAI’s tier-3 STT quota for Kyma’s upstream account. They sit lower than the groq caps because OpenAI’s gpt-4o-mini-transcribe is priced about 4.5× higher per minute, so heavy STT workloads should default to transcribe (Groq) and reserve transcribe-quality for use cases that explicitly need OpenAI’s accuracy on conversational audio or code-switching languages.

Realtime audio limits

Limit	Value
Concurrent realtime sessions per user	3 (combined across OpenAI Realtime Translate + Gemini Live)
Session token TTL	5 minutes (WebSocket must connect within this window)
Heartbeat interval	30 seconds
Heartbeat timeout	90 seconds (session reaped, unused minutes refunded)
Max session duration	30 minutes per session

See Realtime Audio for the full lifecycle.
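The timing rules in the table can be modelled on the client as a small status check. This is a sketch only: the function, its parameters, and the status strings are hypothetical, and the exact boundary behaviour (for example whether a session is reaped at exactly 90 seconds) is an assumption.

```python
from typing import Optional

TOKEN_TTL_S = 5 * 60          # WebSocket must connect within this window
HEARTBEAT_INTERVAL_S = 30     # how often the client should send a heartbeat
HEARTBEAT_TIMEOUT_S = 90      # session reaped after this much silence
MAX_SESSION_S = 30 * 60       # hard cap per session


def session_status(
    now: float,
    token_issued_at: float,
    connected_at: Optional[float] = None,
    last_heartbeat_at: Optional[float] = None,
) -> str:
    """Classify a realtime session from timestamps given in seconds."""
    if connected_at is None:
        if now - token_issued_at > TOKEN_TTL_S:
            return "token_expired"
        return "awaiting_connect"
    if now - connected_at >= MAX_SESSION_S:
        return "max_duration_reached"
    # Before the first heartbeat, measure silence from the connect time.
    last_seen = last_heartbeat_at if last_heartbeat_at is not None else connected_at
    if now - last_seen > HEARTBEAT_TIMEOUT_S:
        return "reaped"
    return "active"
```

With a 30-second interval and a 90-second timeout, a client can miss two consecutive heartbeats before the session is at risk.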

Queue + backpressure

Image and video endpoints have an in-process FIFO queue ahead of the concurrency gate. The queue depth cap is 3× the concurrent cap per modality (e.g. Tier 2 image: 20 concurrent + 60 queue = 80 in-flight per user). Audio and realtime do not queue — at-capacity requests return 429 immediately. When you exceed limits:

Active capacity full but queue has room → request queues. The HTTP connection stays open until your request is picked up; no client action needed.
Queue full too → 429 with error.code: queue_full and Retry-After: <seconds> based on the typical job duration for that modality.
Audio or realtime concurrency full → 429 with error.code: concurrent_limit_exceeded and Retry-After: <seconds> based on the active sub-pool’s expected drain time.
Text RPM / TPM exceeded → 429 with error.code: rate_limit_exceeded and Retry-After: <seconds> until the rolling window slides.
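The image, video, and audio cases above reduce to a short decision function. The sketch below is a simplified model of the described behaviour, not Kyma's implementation: admit is a hypothetical name, and it ignores FIFO ordering details such as a slot freeing up while other requests are still queued.

```python
QUEUE_MULTIPLIER = 3  # queue depth cap is 3x the concurrent cap


def admit(modality: str, running: int, queued: int, concurrent_cap: int) -> str:
    """Return 'run', 'queue', or the 429 error code for a new request."""
    if running < concurrent_cap:
        return "run"
    if modality in ("image", "video"):
        # Image and video get a FIFO queue ahead of the concurrency gate.
        if queued < QUEUE_MULTIPLIER * concurrent_cap:
            return "queue"
        return "queue_full"
    # Audio does not queue: at-capacity requests fail immediately.
    return "concurrent_limit_exceeded"
```

For Tier 2 image (cap 20), the 21st through 80th simultaneous requests queue, and the 81st receives queue_full.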

429 response shape

{
  "error": {
    "type": "rate_limit_error",
    "code": "concurrent_limit_exceeded",
    "message": "Speech-to-Text pool full (3/3 slots in use). Retry after 4s.",
    "retry_after": 4
  }
}

Retry-After is also set as an HTTP header (in seconds, integer). Clients should prefer the header for parsing; the body field is a convenience for environments where headers are awkward.
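A client can follow the header-first advice with a small helper. This sketch assumes headers arrive as a plain mapping of strings and the body as already-parsed JSON; retry_after_seconds is an illustrative name.

```python
from typing import Mapping, Optional


def retry_after_seconds(headers: Mapping[str, str], body: Mapping) -> Optional[int]:
    """Prefer the Retry-After header; fall back to error.retry_after in the body."""
    for name, value in headers.items():
        if name.lower() == "retry-after":  # header names are case-insensitive
            try:
                return int(value)
            except (TypeError, ValueError):
                break  # malformed header: fall through to the body field
    error = body.get("error") or {}
    value = error.get("retry_after")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return int(value)
    return None
```

Returning None when neither source is usable lets the caller pick its own default backoff.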

Error codes

`error.code`	When	`Retry-After` present?
`rate_limit_exceeded`	Text RPM, per-model RPM, or TPM exceeded	Yes (seconds until window slides)
`concurrent_limit_exceeded`	Media or audio sub-pool full	Yes (estimated drain time)
`queue_full`	Image or video queue at depth cap	Yes (estimated queue clear time)
`too_many_sessions`	Realtime session count = 3	No (end an existing session)
`insufficient_balance`	Balance below request hold	No (top up)
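One way to act on this table is to separate codes that clear on their own from codes that need intervention. The sketch below is an assumption about reasonable client behaviour, not a documented Kyma policy; next_action and the action labels are hypothetical.

```python
from typing import Optional, Tuple

# Codes whose table row promises a Retry-After value.
WAIT_AND_RETRY = {"rate_limit_exceeded", "concurrent_limit_exceeded", "queue_full"}


def next_action(code: str, retry_after: Optional[int]) -> Tuple[str, Optional[int]]:
    """Map an error.code to (action, seconds_to_wait)."""
    if code in WAIT_AND_RETRY:
        # Fall back to 1 second if the hint is missing (assumed default).
        return ("sleep_then_retry", retry_after if retry_after is not None else 1)
    if code == "too_many_sessions":
        return ("end_existing_session", None)
    if code == "insufficient_balance":
        return ("top_up", None)
    return ("fail", None)
```

Retrying too_many_sessions or insufficient_balance in a loop would not help, since neither condition resolves by waiting.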

Response headers

On 200 responses for rate-limited endpoints (text path):

X-Kyma-RateLimit-Tier — your current tier number (0–4)
X-Kyma-RateLimit-RPM-Remaining — requests remaining in the current minute window

On 429 responses (any path):

Retry-After — integer seconds. Always present.
X-Kyma-RateLimit-Code — same value as error.code for easy machine parsing.
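The headers listed above can be collected into one structure for logging or client-side throttling. A minimal sketch, assuming the header values are plain integers or strings as described; parse_kyma_headers is an illustrative helper, not an SDK function.

```python
from typing import Mapping, Optional


def parse_kyma_headers(headers: Mapping[str, str]) -> dict:
    """Extract the rate-limit headers; missing or malformed values become None."""
    lowered = {name.lower(): value for name, value in headers.items()}

    def as_int(name: str) -> Optional[int]:
        raw = lowered.get(name)
        try:
            return int(raw) if raw is not None else None
        except (TypeError, ValueError):
            return None

    return {
        "tier": as_int("x-kyma-ratelimit-tier"),
        "rpm_remaining": as_int("x-kyma-ratelimit-rpm-remaining"),
        "retry_after": as_int("retry-after"),
        "code": lowered.get("x-kyma-ratelimit-code"),
    }
```

On a 200 response only tier and rpm_remaining are expected to be set; on a 429, retry_after and code are the relevant fields.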

How limits are computed

Your tier is the maximum of:

Deposit ladder: lifetime_purchased (Stripe purchases only) crossed against the minimum-deposit thresholds in the tier table.
Manual override: a tier_override flag set per-account by Kyma ops for partners and heavy users.

Tier results are cached for 5 minutes. After a top-up, expect new limits to apply within 5 minutes worst case.
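The computation described above (maximum of the deposit tier and a manual override, cached for 5 minutes) might look like the sketch below. It assumes tier_override is an optional tier number and that the cache expires purely on age; the class and function names are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

THRESHOLDS = [0, 5, 50, 250, 1000]  # minimum deposits for Tiers 0-4, in USD
CACHE_TTL_S = 5 * 60


def deposit_tier(lifetime_purchased: float) -> int:
    """Highest tier whose threshold is met (never below Tier 0)."""
    return max(0, sum(1 for t in THRESHOLDS if lifetime_purchased >= t) - 1)


def effective_tier(lifetime_purchased: float, tier_override: Optional[int] = None) -> int:
    """Maximum of the deposit ladder and the manual override, if any."""
    base = deposit_tier(lifetime_purchased)
    return base if tier_override is None else max(base, tier_override)


class CachedTier:
    """Remembers each user's tier for CACHE_TTL_S seconds."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, int]] = {}

    def get(self, user_id: str, lifetime_purchased: float,
            tier_override: Optional[int] = None) -> int:
        now = self._clock()
        hit = self._entries.get(user_id)
        if hit is not None and now - hit[0] < CACHE_TTL_S:
            return hit[1]  # still fresh: a recent top-up is not visible yet
        tier = effective_tier(lifetime_purchased, tier_override)
        self._entries[user_id] = (now, tier)
        return tier
```

Injecting the clock keeps the cache testable and shows why a top-up made just after a lookup can take up to 5 minutes to show up.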

Need higher limits?

Two paths:
1. Natural tier upgrade. Deposit more credits to cross the next threshold. The new tier applies automatically once the cached tier result refreshes (at most 5 minutes); no application is needed.
2. Heavy users on a tight ramp. If you’re shipping a product on Kyma and expect to scale beyond Tier 3 limits before you’d naturally cross the $1,000 deposit threshold, email hello@kymaapi.com with:

Product name and URL
Expected monthly usage (rough is fine)
Primary capabilities (audio / text / image / video mix)

Real examples already on this path: watch-cli, echoly, OpenClaw. Tier 4 limits without the $1,000 deposit are available case by case, and requests get a response within 1 business day (internal SLA). Note: this is distinct from the future Kyma Partner Program (revshare for app creators who integrate Kyma), which is a separate offering that will get its own page when it launches.

