Realtime Audio

Kyma’s realtime audio endpoint streams live conversation audio between your client and Google’s Gemini Live native-audio model. Use it for voice agents, live translation, interactive tutors, and any product where the user speaks and the model speaks back in real time.

# 1. Mint a session
curl -X POST https://api.kymaapi.com/v1/live/sessions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-native-audio-preview-12-2025",
    "config": {
      "speech_config": {
        "language_code": "vi",
        "voice_config": { "prebuilt_voice_config": { "voice_name": "Kore" } }
      }
    }
  }'
# → { "session_token": "...", "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<id>?token=...", ... }

How it works

The realtime endpoint is a two-step flow because of how browser security and the upstream realtime auth model interact:

1. POST /v1/live/sessions mints a short-lived session token (5-minute TTL) and returns a ws_url.
2. The client opens a WebSocket to wss://api.kymaapi.com/v1/live/proxy/{id}?token=… and Kyma terminates this connection, proxying it to the upstream realtime API using server-side credentials.

Why a server-side proxy: the upstream realtime API does not yet expose an ephemeral-token endpoint, and handing a browser a long-lived upstream credential would grant broad access for an hour, which is unsafe. A short-lived session token plus a server-side proxy is the recommended architecture for browser clients. The Google Gemini Live model, voices, and languages are the same as before; the endpoint now scales to 5,000 concurrent sessions (up from a 50-session cap).

Flow

The client must wait for the BidiGenerateContentSetupComplete frame before sending audio. Frames sent before setup completes will be dropped silently.
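One way to honor this ordering is to gate outgoing audio on the setup-complete signal. The sketch below is a minimal illustration, not part of any Kyma SDK: `createSendGate`, `markReady`, and the `maxQueued` cap are hypothetical names, and holding a few early frames instead of discarding them is an assumption about what your app wants.

```javascript
// Hypothetical helper: holds outgoing frames until setup has completed.
function createSendGate(send, maxQueued = 50) {
  let ready = false;
  const queue = [];
  return {
    // Call this from the capture loop instead of calling ws.send directly.
    push(frame) {
      if (ready) {
        send(frame);
        return;
      }
      queue.push(frame);
      // Keep only the newest frames so stale audio does not pile up.
      if (queue.length > maxQueued) queue.shift();
    },
    // Call this once, when the setup-complete frame arrives.
    markReady() {
      ready = true;
      while (queue.length > 0) send(queue.shift());
    },
  };
}
```

In the browser this could be wired as `const gate = createSendGate((f) => ws.send(f))`, with `gate.markReady()` called where the message handler sees `frame.setupComplete`.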

Quickstart — browser JS

Mint the session

const resp = await fetch("https://api.kymaapi.com/v1/live/sessions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${KYMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gemini-2.5-flash-native-audio-preview-12-2025",
    config: {
      speech_config: {
        language_code: "vi",
        voice_config: { prebuilt_voice_config: { voice_name: "Kore" } },
      },
    },
  }),
});
const session = await resp.json();
// session.ws_url, session.session_token, session.heartbeat_url, session.end_url, ...

Open the WebSocket

const ws = new WebSocket(session.ws_url);
ws.binaryType = "arraybuffer";

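ws.addEventListener("message", (event) => {
  // With binaryType "arraybuffer", binary frames arrive as ArrayBuffer and
  // text frames as strings. Decode both before parsing the JSON payload.
  const text = typeof event.data === "string"
    ? event.data
    : new TextDecoder().decode(event.data);
  const frame = JSON.parse(text);
  if (frame.setupComplete) {
    // Ready: safe to start sending audio.
    startCaptureLoop();
    return;
  }
  if (frame.serverContent?.modelTurn?.parts) {
    // PCM16 24kHz audio chunk arrived; see "Play audio response".
    handleAudioChunk(frame.serverContent.modelTurn.parts);
  }
  if (frame.serverContent?.interrupted) {
    // User started speaking over the model, so stop any queued playback.
    stopPlayback();
  }
});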

Capture mic with AudioWorklet

MediaRecorder outputs an Opus/WebM container, not raw PCM, while the realtime API requires PCM16 at 16 kHz mono. Use an AudioWorklet to tap raw Float32 samples from getUserMedia, downsample from 48 kHz to 16 kHz, and quantize to Int16. Open-source browser audio-capture examples (Apache-2.0) cover the full implementation.

Pseudo-code outline:

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 48000 });
await audioContext.audioWorklet.addModule("/pcm-recorder-worklet.js");
const source = audioContext.createMediaStreamSource(stream);
const worklet = new AudioWorkletNode(audioContext, "pcm-recorder");

worklet.port.onmessage = (event) => {
  // event.data is Int16Array at 16kHz from the worklet.
  const base64 = btoa(String.fromCharCode(...new Uint8Array(event.data.buffer)));
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64 }],
    },
  }));
};

source.connect(worklet);
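The conversion that the worklet performs is not shown above. As a rough sketch of that step, assuming an integer rate ratio (48 kHz to 16 kHz is 3:1), the helpers below average each group of input samples and then quantize to Int16. The names `downsample` and `floatToPcm16` are hypothetical, and the averaging is only a crude low-pass filter; a production resampler would use a proper filter.

```javascript
// Hypothetical helper: downsample by an integer ratio, averaging each group.
function downsample(input, inputRate, outputRate) {
  if (inputRate % outputRate !== 0) {
    throw new RangeError("This sketch only supports integer rate ratios");
  }
  const ratio = inputRate / outputRate;
  const out = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < out.length; i++) {
    let sum = 0;
    for (let j = 0; j < ratio; j++) sum += input[i * ratio + j];
    out[i] = sum / ratio; // the mean of the group acts as a simple low-pass
  }
  return out;
}

// Hypothetical helper: clamp to [-1, 1] and scale to signed 16-bit integers.
function floatToPcm16(input) {
  const out = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    out[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return out;
}
```

Inside an AudioWorkletProcessor, `floatToPcm16(downsample(channelData, 48000, 16000))` would yield the Int16Array that the main thread receives in `event.data`.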

Play audio response

The model returns PCM16 24 kHz audio chunks, base64-encoded inside JSON frames. Decode each chunk to a Float32Array and schedule it on an AudioBufferSourceNode:

const playbackContext = new AudioContext({ sampleRate: 24000 });
let nextPlayTime = playbackContext.currentTime;

function handleAudioChunk(parts) {
  for (const part of parts) {
    if (!part.inlineData?.data) continue;
    const bytes = Uint8Array.from(atob(part.inlineData.data), c => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const float32 = Float32Array.from(int16, v => v / 32768);
    const buffer = playbackContext.createBuffer(1, float32.length, 24000);
    buffer.copyToChannel(float32, 0);
    const source = playbackContext.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackContext.destination);
    nextPlayTime = Math.max(nextPlayTime, playbackContext.currentTime);
    source.start(nextPlayTime);
    nextPlayTime += buffer.duration;
  }
}
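The message handler calls `stopPlayback()` on interruption, but its body is not shown. One possible shape, offered as a sketch under assumptions, is to track every scheduled source node so that all of them can be stopped at once. `PlaybackQueue` and its methods are hypothetical names; the only Web Audio members relied on are `start(when)`, `stop()`, and `onended` on AudioBufferSourceNode.

```javascript
// Hypothetical helper: tracks scheduled sources so playback can be cut off.
class PlaybackQueue {
  constructor() {
    this.sources = new Set();
    this.nextPlayTime = 0;
  }

  // Schedule a source back-to-back with the previous one; returns the start time.
  schedule(source, duration, now) {
    const startAt = Math.max(this.nextPlayTime, now);
    source.onended = () => this.sources.delete(source);
    source.start(startAt);
    this.sources.add(source);
    this.nextPlayTime = startAt + duration;
    return startAt;
  }

  // Stop everything that is playing or queued, then reset the schedule.
  stopAll(now) {
    for (const source of this.sources) {
      source.onended = null;
      try {
        source.stop();
      } catch {
        // Ignore nodes that can no longer be stopped.
      }
    }
    this.sources.clear();
    this.nextPlayTime = now;
  }
}
```

With this in place, `handleAudioChunk` could call `queue.schedule(source, buffer.duration, playbackContext.currentTime)`, and `stopPlayback()` could be `queue.stopAll(playbackContext.currentTime)`.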

Heartbeat and end

const heartbeat = setInterval(() => {
  fetch(session.heartbeat_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}, session.heartbeat_interval_ms); // 30000

// On disconnect:
async function endSession() {
  clearInterval(heartbeat);
  ws.close();
  await fetch(session.end_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}

Audio format reference

Direction	Format	Sample rate	Channels	Encoding
Client → server	PCM16 (signed 16-bit little-endian)	16,000 Hz	mono	base64 string in JSON `mediaChunks[].data`
Server → client	PCM16 (signed 16-bit little-endian)	24,000 Hz	mono	base64 string in JSON `inlineData.data`

Note the asymmetric sample rates. The output playback context must be 24 kHz; playing the 24 kHz stream through a 16 kHz context slows the model’s voice to two-thirds speed and lowers its pitch.
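To see why the playback rate matters, it can help to work through the arithmetic. PCM16 mono uses two bytes per sample, so a chunk's duration depends only on its byte length and the rate at which it is played. The helper name `pcm16DurationMs` and the 4,800-byte chunk size are illustrative assumptions; the two rates come from the table above.

```javascript
// Duration of a PCM16 mono chunk: (bytes / 2) samples played at sampleRate Hz.
function pcm16DurationMs(byteLength, sampleRate) {
  const samples = byteLength / 2;
  return (samples * 1000) / sampleRate;
}

// A 4,800-byte server chunk holds 2,400 samples.
const atCorrectRate = pcm16DurationMs(4800, 24000); // 100 ms
const atWrongRate = pcm16DurationMs(4800, 16000); // 150 ms, so 1.5x longer
```

The same samples stretched over 1.5 times the duration are what make the voice sound slow and low when the context rate is wrong.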

Configuration

The mint request body accepts:

model (string, default: "gemini-2.5-flash-native-audio-preview-12-2025")

Which Gemini Live model to run. One of:

gemini-2.5-flash-native-audio-preview-12-2025 (default) — conversational native-audio understanding + generation. Use for voice agents and multilingual chat.
gemini-3.5-live-translate-preview — low-latency speech-to-speech translation across 70+ languages, preserving the speaker’s intonation, pacing, and pitch. Use for live translation.

Omitting model keeps the default native-audio behavior unchanged.

config.speech_config.language_code (string, default: "en")

BCP-47 short code. One of: ar, bn, de, en, es, fa, fr, hi, id, it, ja, ko, nl, pl, pt, ru, sv, ta, te, th, tr, ur, vi, zh

config.speech_config.voice_config.prebuilt_voice_config.voice_name (string, default: "Kore")

One of 30 prebuilt voices: Achernar, Achird, Algenib, Algieba, Alnilam, Aoede, Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome, Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda, Orus, Pulcherrima, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi
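Since an unsupported model, language, or voice is rejected with a 400, it may be convenient to validate on the client before minting. The following is a sketch only: `buildMintBody` and its option names are hypothetical, and the allowed values are copied from the lists in this section, so they would need updating if those lists change.

```javascript
const MODELS = new Set([
  "gemini-2.5-flash-native-audio-preview-12-2025",
  "gemini-3.5-live-translate-preview",
]);
const LANGUAGES = new Set([
  "ar", "bn", "de", "en", "es", "fa", "fr", "hi", "id", "it", "ja", "ko",
  "nl", "pl", "pt", "ru", "sv", "ta", "te", "th", "tr", "ur", "vi", "zh",
]);
const VOICES = new Set([
  "Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede", "Autonoe",
  "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome", "Fenrir",
  "Gacrux", "Iapetus", "Kore", "Laomedeia", "Leda", "Orus", "Pulcherrima",
  "Puck", "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar", "Sulafat",
  "Umbriel", "Vindemiatrix", "Zephyr", "Zubenelgenubi",
]);

// Hypothetical helper: builds the mint request body, using the documented defaults.
function buildMintBody({
  model = "gemini-2.5-flash-native-audio-preview-12-2025",
  languageCode = "en",
  voiceName = "Kore",
} = {}) {
  if (!MODELS.has(model)) throw new RangeError(`Unsupported model: ${model}`);
  if (!LANGUAGES.has(languageCode)) throw new RangeError(`Unsupported language: ${languageCode}`);
  if (!VOICES.has(voiceName)) throw new RangeError(`Unsupported voice: ${voiceName}`);
  return {
    model,
    config: {
      speech_config: {
        language_code: languageCode,
        voice_config: { prebuilt_voice_config: { voice_name: voiceName } },
      },
    },
  };
}
```

The result can be passed to `JSON.stringify` as the body of the POST /v1/live/sessions request.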

config.system_instruction (string)

Currently dropped on the proxy path. If you pass a custom instruction it will be ignored and the default conversational persona is used. Threading this through the token store is a planned follow-up. Contact Kyma if you need it before then.

Response shape

200 OK from POST /v1/live/sessions:

{
  "session_token": "f3c2b1d4e5a6...",
  "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<uuid>?token=<session_token>",
  "expires_at": 1700000000,
  "kyma_session_id": "<uuid>",
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "heartbeat_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/heartbeat",
  "end_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/end",
  "heartbeat_interval_ms": 30000
}

The session_token is single-use: once consumed by your WebSocket handshake it is invalidated, so a leaked token cannot be replayed. The token TTL is 5 minutes; if you don’t open the WebSocket within that window, the session is reaped and you must mint a fresh one.
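A client that mints a session ahead of time may want to check the token is still usable before opening the socket. The sketch below assumes that `expires_at` is a Unix timestamp in seconds, which the sample value suggests but the response description does not state. `canOpenSocket` and the safety margin are hypothetical.

```javascript
// Assumption: session.expires_at is a Unix timestamp in seconds.
// Returns true if the token has more than safetyMarginMs left before expiry.
function canOpenSocket(session, nowMs = Date.now(), safetyMarginMs = 5000) {
  const remainingMs = session.expires_at * 1000 - nowMs;
  return remainingMs > safetyMarginMs;
}
```

If this returns false, minting a new session is likely cheaper than attempting a handshake that will be rejected.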

Pricing

Model	Active session	Initial hold
`gemini-2.5-flash-native-audio-preview-12-2025`	$0.0389 / minute	$0.20
`gemini-3.5-live-translate-preview`	$0.0634 / minute	$0.32

Kyma marks up Google’s wholesale rate by 1.35×, the same markup convention as every other Kyma model. An initial collateral hold (≈5 minutes of usage) is placed on session create and released against settled minutes when the session ends. Sessions are settled minute-by-minute and unused minutes are refunded on session end; for example, if you end after 4m30s the 5th minute is refunded.
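For budgeting, the per-minute rates in the table can be turned into a quick estimate. This is a sketch, not Kyma's billing logic: `estimateCostUsd` is a hypothetical helper, and it takes the number of settled minutes as an input because how a partial minute is counted is decided server-side.

```javascript
// Per-minute rates copied from the pricing table (USD).
const RATE_PER_MINUTE_USD = {
  "gemini-2.5-flash-native-audio-preview-12-2025": 0.0389,
  "gemini-3.5-live-translate-preview": 0.0634,
};

// Hypothetical helper: cost of a given number of settled minutes.
function estimateCostUsd(model, settledMinutes) {
  const rate = RATE_PER_MINUTE_USD[model];
  if (rate === undefined) throw new RangeError(`Unknown model: ${model}`);
  if (!Number.isInteger(settledMinutes) || settledMinutes < 0) {
    throw new RangeError("settledMinutes must be a non-negative integer");
  }
  // Round to 4 decimal places to hide floating-point noise.
  return Number((rate * settledMinutes).toFixed(4));
}
```

Five settled minutes of the native-audio model come to about $0.1945, which is close to the $0.20 initial hold, and a full 30-minute session comes to about $1.167.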

Limits

Limit	Value	Behavior on breach
Concurrent sessions per user	Tier 0 (free): 8. Paid tiers: uncapped (bounded by balance + per-session hold). Combined across providers.	`429 too_many_sessions` (Tier 0 only)
Max session duration	30 minutes (1800s)	Auto-close, unused minutes refunded
Session token TTL	5 minutes	WS handshake rejected; mint a new session
Heartbeat interval	30 seconds	Required to prove liveness
Heartbeat timeout	90 seconds	Session reaped, unused minutes refunded

Need higher concurrency? See the “Need higher limits?” section of Rate Limits.

Errors

POST /v1/live/sessions:

Status	`error.type`	`error.code`	When
`400`	`invalid_request`	—	Body missing / invalid, unsupported language, unsupported voice
`400`	`model_not_found`	—	Wrong model id
`401`	`unauthorized`	—	Missing API key / session token
`402`	`billing_error`	`insufficient_balance`	Balance below the model’s initial hold ($0.20 native-audio / $0.32 translate)
`429`	`rate_limit_error`	`too_many_sessions`	Free-tier (Tier 0) user at the 8-session cap; paid tiers are uncapped
`500`	`internal_error`	—	DB insert or hold RPC failed
`503`	`service_unavailable`	—	Live sessions disabled or upstream token mint failed

WebSocket close codes:

Code	Reason	When
`1000`	normal	Client / server closed cleanly
`1011`	`upstream_auth_failed`	Upstream auth token mint failed at proxy open
`1011`	`upstream_construct_failed`	Upstream WebSocket constructor threw
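A client might map these mint errors to a next step along the following lines. Two things here are assumptions, not documented behavior: the error body shape `{ error: { type, code } }`, inferred from the column names above, and the suggested actions, which are only one reasonable policy. `classifyMintError` is a hypothetical helper.

```javascript
// Hypothetical helper: maps a failed mint response to a suggested next step.
// Assumption: the JSON body looks like { error: { type, code } }.
function classifyMintError(status, body) {
  const type = body?.error?.type ?? "unknown";
  const code = body?.error?.code ?? null;
  let action;
  if (status === 500 || status === 503) {
    action = "retry_with_backoff"; // server-side failure, possibly transient
  } else if (status === 429) {
    action = "wait_for_free_slot"; // end another session or try again later
  } else if (status === 402) {
    action = "top_up_balance";
  } else if (status === 401) {
    action = "check_api_key";
  } else {
    action = "fix_request"; // 400s: invalid body, model, language, or voice
  }
  return { type, code, action };
}
```

For example, after `const resp = await fetch(...)`, a caller could run `classifyMintError(resp.status, await resp.json())` whenever `resp.ok` is false.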

Reference

POST /v1/live/sessions — mint a session token
POST /v1/live/sessions/{id}/heartbeat — keep session alive (every 30s)
POST /v1/live/sessions/{id}/end — close session and settle billing
wss://api.kymaapi.com/v1/live/proxy/{id}?token=… — WebSocket bidirectional audio
