Kyma’s realtime audio endpoint streams live conversation audio between your client and Google’s Gemini Live native-audio model. Use it for voice agents, live translation, interactive tutors, and any product where the user speaks and the model speaks back in real time.
How it works
The realtime endpoint is a two-step flow because of how browser security and Google’s Vertex AI auth model interact:
- POST /v1/live/sessions mints a short-lived session token (5 min TTL) and returns a ws_url.
- WebSocket opens to wss://api.kymaapi.com/v1/live/proxy/{id}?token=… (Kyma terminates this connection and proxies it to the Vertex Live API using a server-side service-account credential).
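The two steps above can be sketched as a single helper. This is a minimal illustration, not Kyma's reference client: the function name openLiveSocket is hypothetical, and the Bearer auth header, the https://api.kymaapi.com REST host, and the idea that ws_url already carries the token are assumptions to verify against your account.

```javascript
// Sketch of the two-step flow. Assumptions, not confirmed by this page:
// - the mint call authenticates with "Authorization: Bearer <API key>"
// - the REST host matches the WebSocket host (api.kymaapi.com)
// - the returned ws_url already carries the session token
const SESSIONS_URL = "https://api.kymaapi.com/v1/live/sessions";

async function openLiveSocket(apiKey, config, deps = {}) {
  // Dependencies are injectable so the flow can also run outside a browser.
  const fetchImpl = deps.fetch ?? globalThis.fetch;
  const WebSocketImpl = deps.WebSocket ?? globalThis.WebSocket;

  // Step 1: mint a short-lived session token over HTTPS.
  const res = await fetchImpl(SESSIONS_URL, {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(config),
  });
  if (!res.ok) {
    throw new Error(`Session mint failed with HTTP ${res.status}`);
  }
  const session = await res.json();

  // Step 2: open the WebSocket promptly; the token expires after 5 minutes.
  const socket = new WebSocketImpl(session.ws_url);
  return { session, socket };
}
```

Where the mint call runs (browser or your own backend) depends on how you keep the API key private; the WebSocket itself only needs the returned ws_url.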
Flow
The client must wait for the BidiGenerateContentSetupComplete frame before sending audio. Frames sent before setup completes are silently dropped.
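One way to respect this ordering is to hold outgoing audio until the setup-complete frame has been seen. The sketch below assumes that frame arrives as JSON text with a top-level setupComplete key and that audio is wrapped in a realtimeInput envelope, as in Google's Live API; both are assumptions about what the proxy forwards, and createSetupGate is a hypothetical helper name.

```javascript
// Buffers audio frames until the server signals that setup is complete.
// Assumptions: frames are JSON text, the setup-complete frame has a
// top-level `setupComplete` key, and audio goes out under `realtimeInput`.
function createSetupGate(socket) {
  let ready = false;
  const pending = [];

  return {
    get ready() {
      return ready;
    },

    // Call from socket.onmessage with the frame text; returns the parsed frame.
    handleMessage(raw) {
      const msg = JSON.parse(raw);
      if (!ready && msg.setupComplete !== undefined) {
        ready = true;
        // Flush everything that was queued while setup was in progress.
        for (const frame of pending.splice(0)) socket.send(frame);
      }
      return msg;
    },

    // Send one base64 PCM16 chunk, or queue it if setup has not completed.
    sendAudio(base64Pcm16) {
      const frame = JSON.stringify({
        realtimeInput: {
          mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64Pcm16 }],
        },
      });
      if (ready) socket.send(frame);
      else pending.push(frame);
    },
  };
}
```

In a browser, binary frames may arrive as a Blob, in which case read the text first (for example with event.data.text()) before calling handleMessage. The queue here is unbounded, so a production client may prefer to delay microphone capture until ready is true.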
Quickstart — browser JS
Play audio response
The model returns PCM16 audio chunks at 24 kHz, base64-encoded inside JSON frames. Decode each chunk to a Float32Array and schedule it into an AudioBufferSourceNode.
Audio format reference
| Direction | Format | Sample rate | Channels | Encoding |
|---|---|---|---|---|
| Client → server | PCM16 (signed 16-bit little-endian) | 16,000 Hz | mono | base64 string in JSON mediaChunks[].data |
| Server → client | PCM16 (signed 16-bit little-endian) | 24,000 Hz | mono | base64 string in JSON inlineData.data |
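The formats in the table translate into a pair of small conversion helpers plus a playback scheduler. This is a sketch under the table's stated formats; the function names are hypothetical, and capturing or resampling microphone input to 16,000 Hz is left out.

```javascript
// Convert Float32 samples in [-1, 1] to base64 PCM16 little-endian,
// the client-to-server format (capture or resample to 16,000 Hz first).
function floatToPcm16Base64(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Asymmetric scaling keeps both -1.0 and +1.0 inside the int16 range.
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

// Decode a base64 PCM16 little-endian chunk (server-to-client) to Float32.
function pcm16Base64ToFloat32(b64) {
  const binary = atob(b64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  const view = new DataView(bytes.buffer);
  const out = new Float32Array(Math.floor(bytes.length / 2));
  for (let i = 0; i < out.length; i++) out[i] = view.getInt16(i * 2, true) / 0x8000;
  return out;
}

// Queue one decoded chunk on a Web Audio context and return the time at
// which the next chunk should start, so chunks play back to back.
function schedulePlayback(audioCtx, samples, startAt) {
  const buffer = audioCtx.createBuffer(1, samples.length, 24000);
  buffer.copyToChannel(samples, 0);
  const node = audioCtx.createBufferSource();
  node.buffer = buffer;
  node.connect(audioCtx.destination);
  const when = Math.max(startAt, audioCtx.currentTime);
  node.start(when);
  return when + buffer.duration;
}
```

A typical playback loop keeps a running nextStart value, initialised to 0, and updates it with nextStart = schedulePlayback(ctx, pcm16Base64ToFloat32(chunk), nextStart) for each chunk received. Empty chunks should be skipped, since createBuffer rejects a zero-length buffer.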
Configuration
The mint request body accepts:
- Model: currently only gemini-2.5-flash-native-audio-preview-12-2025 is supported.
- Language: a BCP-47 short code. One of: ar, bn, de, en, es, fa, fr, hi, id, it, ja, ko, nl, pl, pt, ru, sv, ta, te, th, tr, ur, vi, zh
- Voice: one of 30 prebuilt voices: Achernar, Achird, Algenib, Algieba, Alnilam, Aoede, Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome, Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda, Orus, Pulcherrima, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi
Response shape
200 OK from POST /v1/live/sessions:
session_token is single-use — once consumed by your WebSocket handshake it is invalidated, so a leaked token cannot be replayed. The token TTL is 5 minutes; if you don’t open the WebSocket within that window the session is reaped and you can mint a fresh one.
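The configuration constraints and the 5-minute token window can be checked client-side before any network call. The sketch below is illustrative only: the body field names model, language and voice are hypothetical (this page does not show them), and because it is not stated which fields are required, absent fields are skipped rather than rejected.

```javascript
const SUPPORTED_MODEL = "gemini-2.5-flash-native-audio-preview-12-2025";
const LANGUAGES = new Set(
  "ar bn de en es fa fr hi id it ja ko nl pl pt ru sv ta te th tr ur vi zh".split(" ")
);
const VOICES = new Set([
  "Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede", "Autonoe",
  "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome", "Fenrir", "Gacrux",
  "Iapetus", "Kore", "Laomedeia", "Leda", "Orus", "Pulcherrima", "Puck",
  "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar", "Sulafat", "Umbriel",
  "Vindemiatrix", "Zephyr", "Zubenelgenubi",
]);
const TOKEN_TTL_MS = 5 * 60 * 1000;

// Returns a list of problems; an empty list means the config looks acceptable.
// Field names are hypothetical; only fields that are present get checked.
function validateLiveConfig({ model, language, voice } = {}) {
  const problems = [];
  if (model !== undefined && model !== SUPPORTED_MODEL) {
    problems.push(`unsupported model: ${model}`);
  }
  if (language !== undefined && !LANGUAGES.has(language)) {
    problems.push(`unsupported language: ${language}`);
  }
  if (voice !== undefined && !VOICES.has(voice)) {
    problems.push(`unsupported voice: ${voice}`);
  }
  return problems;
}

// True while a token minted at mintedAtMs is still inside its 5-minute TTL.
function tokenStillValid(mintedAtMs, nowMs = Date.now()) {
  return nowMs - mintedAtMs < TOKEN_TTL_MS;
}
```

Checking tokenStillValid just before opening the WebSocket lets the client mint a fresh session instead of attempting a handshake that would be rejected.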
Pricing
| Item | Rate |
|---|---|
| Active session | $0.0389 / minute |
| Initial collateral hold | $0.20 (placed on session create) |
| Unused minutes | Refunded on session end |
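For budgeting, the rates above can be turned into a rough estimator. This sketch assumes usage is prorated linearly by the second, which this page does not confirm; actual settlement may round differently. Under that assumption a full 30-minute session comes to about $1.17, and the $0.20 hold corresponds to roughly 5.1 minutes of active time.

```javascript
const RATE_PER_MINUTE_USD = 0.0389;
const HOLD_USD = 0.20;
const MAX_SESSION_SECONDS = 1800;

// Estimated charge for a session, assuming linear per-second proration
// (an assumption; the billing granularity is not documented here).
function estimateSessionCostUsd(activeSeconds) {
  const billable = Math.min(Math.max(activeSeconds, 0), MAX_SESSION_SECONDS);
  return (billable / 60) * RATE_PER_MINUTE_USD;
}

// Minutes of active time whose estimated cost equals the initial hold.
function minutesCoveredByHold() {
  return HOLD_USD / RATE_PER_MINUTE_USD;
}
```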
Limits
| Limit | Value | Behavior on breach |
|---|---|---|
| Concurrent sessions per user | 3 (combined across providers) | 429 too_many_sessions |
| Max session duration | 30 minutes (1800s) | Auto-close, unused minutes refunded |
| Session token TTL | 5 minutes | WS handshake rejected; mint a new session |
| Heartbeat interval | 30 seconds | Required to prove liveness |
| Heartbeat timeout | 90 seconds | Session reaped, unused minutes refunded |
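The heartbeat requirement can be met with a simple interval timer. The sketch below posts to the heartbeat route every 30 seconds; the REST host https://api.kymaapi.com and the Bearer auth header are assumptions, and startHeartbeat is a hypothetical helper name.

```javascript
const API_BASE = "https://api.kymaapi.com"; // assumed REST host
const HEARTBEAT_INTERVAL_MS = 30000;

// Starts a 30-second heartbeat for a session and returns a stop function.
// Timer and fetch implementations are injectable for testing.
function startHeartbeat(sessionId, apiKey, deps = {}) {
  const fetchImpl = deps.fetch ?? globalThis.fetch;
  const setIntervalImpl = deps.setInterval ?? globalThis.setInterval;
  const clearIntervalImpl = deps.clearInterval ?? globalThis.clearInterval;
  const url = `${API_BASE}/v1/live/sessions/${encodeURIComponent(sessionId)}/heartbeat`;

  const beat = async () => {
    try {
      const res = await fetchImpl(url, {
        method: "POST",
        headers: { "Authorization": `Bearer ${apiKey}` }, // assumed auth scheme
      });
      if (!res.ok) throw new Error(`Heartbeat failed with HTTP ${res.status}`);
    } catch (err) {
      // Report but keep the timer running; one missed beat is inside the 90 s timeout.
      if (deps.onError) deps.onError(err);
    }
  };

  const timer = setIntervalImpl(beat, HEARTBEAT_INTERVAL_MS);
  return () => clearIntervalImpl(timer);
}
```

Call the returned stop function when the WebSocket closes, then end the session so billing settles. Note that browsers may throttle timers in background tabs, which can matter against a 90-second timeout.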
Errors
POST /v1/live/sessions:
| Status | error.type | error.code | When |
|---|---|---|---|
| 400 | invalid_request | - | Body missing / invalid, unsupported language, unsupported voice |
| 400 | model_not_found | - | Wrong model id |
| 401 | unauthorized | - | Missing API key / session token |
| 402 | billing_error | insufficient_balance | Balance < $0.20 hold |
| 429 | rate_limit_error | too_many_sessions | User already has 3 active sessions |
| 500 | internal_error | - | DB insert or hold RPC failed |
| 503 | service_unavailable | - | Live sessions disabled or Vertex token mint failed |
WebSocket close codes:
| Code | Reason | When |
|---|---|---|
| 1000 | normal | Client / server closed cleanly |
| 1011 | vertex_auth_failed | Vertex SA token mint failed at proxy open |
| 1011 | upstream_construct_failed | Vertex WebSocket constructor threw |
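A client can map these statuses and close codes to a next step. The sketch below assumes the error body has the shape { error: { type, code } }, inferred from the column names, and that the close Reason column is delivered as the WebSocket close reason string. The suggested reactions are this example's own judgment, not Kyma guidance, and both function names are hypothetical.

```javascript
// Map a failed POST /v1/live/sessions response to a suggested next step.
// Assumes a body shaped like { error: { type, code } }.
function classifyMintError(status, body) {
  const code = body?.error?.code;
  if (status === 429 && code === "too_many_sessions") return "end_another_session";
  if (status === 402) return "top_up_balance";
  if (status === 401) return "check_credentials";
  if (status === 400) return "fix_request";
  if (status === 500 || status === 503) return "retry_later";
  return "unknown";
}

// Turn a WebSocket close code and reason into a short log message.
function describeClose(code, reason) {
  if (code === 1000) return "closed cleanly";
  if (code === 1011) return `proxy could not reach Vertex (${reason || "no reason given"})`;
  return `unexpected close code ${code}`;
}
```

In a browser, describeClose would typically be called from socket.onclose with event.code and event.reason.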
Reference
- POST /v1/live/sessions: mint a session token
- POST /v1/live/sessions/{id}/heartbeat: keep session alive (every 30s)
- POST /v1/live/sessions/{id}/end: close session and settle billing
- wss://api.kymaapi.com/v1/live/proxy/{id}?token=…: WebSocket bidirectional audio
See also
- Audio Understand — non-realtime audio Q&A (tone, music, scene)
- Audio Transcriptions — speech-to-text, sync
- Rate Limits — concurrency model + 429 shapes
- Models: gemini-3-flash-audio, the companion model for non-realtime audio understanding