Kyma’s realtime audio endpoint streams live conversation audio between your client and Google’s Gemini Live native-audio model. Use it for voice agents, live translation, interactive tutors, and any product where the user speaks and the model speaks back in real time.
# 1. Mint a session
curl -X POST https://api.kymaapi.com/v1/live/sessions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemini-2.5-flash-native-audio-preview-12-2025",
    "config": {
      "speech_config": {
        "language_code": "vi",
        "voice_config": { "prebuilt_voice_config": { "voice_name": "Kore" } }
      }
    }
  }'
# → { "session_token": "...", "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<id>?token=...", ... }

How it works

The realtime endpoint is a two-step flow because of how browser security and Google’s Vertex AI auth model interact:
  1. POST mints a short-lived session token (5 min TTL) and returns a ws_url.
  2. WebSocket opens to wss://api.kymaapi.com/v1/live/proxy/{id}?token=… — Kyma terminates this connection and proxies it to Vertex Live API using a server-side service-account credential.
Why a server-side proxy: Vertex Live API does not yet expose an ephemeral-token endpoint (js-genai#766, unscheduled). Handing a browser a long-lived Vertex OAuth token would be unsafe, because it grants project-wide Vertex AI access for an hour. The proxy is the architecture Google itself recommends for browser clients (gemini-live-api-examples).

The model, voices, and languages are the same as in the previous AI Studio integration, but the Vertex region unlocks 5000 concurrent sessions per project (versus the 50-session cap that AI Studio enforced).

Flow

The client must wait for the BidiGenerateContentSetupComplete frame before sending audio. Frames sent before setup completes will be dropped silently.
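One way to honor this ordering is to gate outgoing audio behind a flag that flips when the setup frame arrives. The sketch below is illustrative only: `createAudioGate` is a hypothetical helper, and it assumes the setup-complete frame appears as a `setupComplete` key in the parsed JSON, as in the browser quickstart handler.

```javascript
// Hypothetical helper: holds back audio until the server confirms setup.
// By default early audio is dropped locally; buffering is opt-in and bounded.
function createAudioGate(send, { bufferWhileWaiting = false, maxBuffered = 50 } = {}) {
  let ready = false;
  const pending = [];

  return {
    // Call with every parsed server frame.
    onServerFrame(frame) {
      if (!ready && frame && frame.setupComplete) {
        ready = true;
        // Flush anything captured while waiting, oldest first.
        while (pending.length > 0) send(pending.shift());
      }
    },
    // Call with every outgoing message (already serialized to a JSON string).
    sendAudio(message) {
      if (ready) {
        send(message);
        return true;
      }
      if (bufferWhileWaiting) {
        pending.push(message);
        if (pending.length > maxBuffered) pending.shift(); // keep memory bounded
      }
      return false;
    },
    isReady: () => ready,
  };
}
```

In a browser this could be wired as `const gate = createAudioGate((msg) => ws.send(msg));`, with `gate.onServerFrame(frame)` called from the message handler and `gate.sendAudio(json)` called from the capture loop.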

Quickstart — browser JS

1. Mint the session

const resp = await fetch("https://api.kymaapi.com/v1/live/sessions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${KYMA_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "gemini-2.5-flash-native-audio-preview-12-2025",
    config: {
      speech_config: {
        language_code: "vi",
        voice_config: { prebuilt_voice_config: { voice_name: "Kore" } },
      },
    },
  }),
});
const session = await resp.json();
// session.ws_url, session.session_token, session.heartbeat_url, session.end_url, ...

2. Open the WebSocket

const ws = new WebSocket(session.ws_url);
ws.binaryType = "arraybuffer";

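ws.addEventListener("message", (event) => {
  // With binaryType "arraybuffer", binary frames arrive as ArrayBuffer; decode them before parsing.
  const text = typeof event.data === "string" ? event.data : new TextDecoder().decode(event.data);
  const frame = JSON.parse(text);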
  if (frame.setupComplete) {
    // Ready — safe to start sending audio.
    startCaptureLoop();
    return;
  }
  if (frame.serverContent?.modelTurn?.parts) {
    // PCM16 24kHz audio chunk arrived — see Step 4.
    handleAudioChunk(frame.serverContent.modelTurn.parts);
  }
  if (frame.serverContent?.interrupted) {
    // User started speaking over the model — stop any queued playback.
    stopPlayback();
  }
});

3. Capture mic with AudioWorklet

MediaRecorder emits compressed audio (typically Opus in a WebM container), not raw PCM, while Vertex Live requires PCM16 at 16 kHz, mono. Use an AudioWorklet to tap raw Float32 samples from the getUserMedia stream, downsample 48 kHz → 16 kHz, and quantize to Int16. See google-gemini/gemini-live-api-examples for a complete browser implementation (Apache-2.0).
Pseudo-code outline:
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext({ sampleRate: 48000 });
await audioContext.audioWorklet.addModule("/pcm-recorder-worklet.js");
const source = audioContext.createMediaStreamSource(stream);
const worklet = new AudioWorkletNode(audioContext, "pcm-recorder");

worklet.port.onmessage = (event) => {
  // event.data is Int16Array at 16kHz from the worklet.
  const base64 = btoa(String.fromCharCode(...new Uint8Array(event.data.buffer)));
  ws.send(JSON.stringify({
    realtimeInput: {
      mediaChunks: [{ mimeType: "audio/pcm;rate=16000", data: base64 }],
    },
  }));
};

source.connect(worklet);
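The contents of /pcm-recorder-worklet.js are not shown here. As a rough sketch of the conversion it would need to perform, the two functions below downsample Float32 samples by an integer ratio and quantize them to Int16. The names `downsample` and `floatToInt16` are hypothetical, and averaging adjacent samples is a simple stand-in for a proper low-pass filter, chosen for brevity.

```javascript
// Average each group of `ratio` input samples into one output sample.
// Assumes inputRate is an integer multiple of outputRate (48000 / 16000 = 3).
function downsample(input, inputRate, outputRate = 16000) {
  const ratio = inputRate / outputRate;
  if (!Number.isInteger(ratio) || ratio < 1) {
    throw new Error(`unsupported rate conversion ${inputRate} -> ${outputRate}`);
  }
  const output = new Float32Array(Math.floor(input.length / ratio));
  for (let i = 0; i < output.length; i++) {
    let sum = 0;
    for (let j = 0; j < ratio; j++) sum += input[i * ratio + j];
    output[i] = sum / ratio;
  }
  return output;
}

// Clamp to [-1, 1] and scale to signed 16-bit integers.
function floatToInt16(input) {
  const output = new Int16Array(input.length);
  for (let i = 0; i < input.length; i++) {
    const s = Math.max(-1, Math.min(1, input[i]));
    output[i] = s < 0 ? Math.round(s * 32768) : Math.round(s * 32767);
  }
  return output;
}
```

Web Audio delivers 128 frames per render quantum, which is not a multiple of 3, so a worklet built on this sketch would want to accumulate samples (for example, three quanta, 384 frames) before converting; otherwise the trailing samples of every block are discarded.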

4. Play audio response

The model returns PCM16 24 kHz audio chunks, base64-encoded inside JSON frames. Decode each chunk to a Float32Array and schedule it on an AudioBufferSourceNode:
const playbackContext = new AudioContext({ sampleRate: 24000 });
let nextPlayTime = playbackContext.currentTime;

function handleAudioChunk(parts) {
  for (const part of parts) {
    if (!part.inlineData?.data) continue;
    const bytes = Uint8Array.from(atob(part.inlineData.data), c => c.charCodeAt(0));
    const int16 = new Int16Array(bytes.buffer);
    const float32 = Float32Array.from(int16, v => v / 32768);
    const buffer = playbackContext.createBuffer(1, float32.length, 24000);
    buffer.copyToChannel(float32, 0);
    const source = playbackContext.createBufferSource();
    source.buffer = buffer;
    source.connect(playbackContext.destination);
    nextPlayTime = Math.max(nextPlayTime, playbackContext.currentTime);
    source.start(nextPlayTime);
    nextPlayTime += buffer.duration;
  }
}
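
The message handler in step 2 calls stopPlayback(), which is not defined in this walkthrough. A minimal sketch of one way to implement it is to track every scheduled source node so that an interruption can cancel both playing and queued audio. `createPlaybackTracker` is a hypothetical helper, not part of any Kyma SDK; handleAudioChunk would call `schedule(source, buffer.duration)` instead of calling `source.start()` directly.

```javascript
// Hypothetical tracker for scheduled AudioBufferSourceNodes.
// `context` only needs a `currentTime` property, as AudioContext provides.
function createPlaybackTracker(context) {
  const active = new Set();
  let nextPlayTime = 0;

  return {
    // Schedule a source right after the previously queued one.
    schedule(source, duration) {
      nextPlayTime = Math.max(nextPlayTime, context.currentTime);
      source.onended = () => active.delete(source);
      source.start(nextPlayTime);
      active.add(source);
      nextPlayTime += duration;
    },
    // Stop everything that is playing or queued and reset the schedule.
    stopPlayback() {
      for (const source of active) {
        source.onended = null;
        source.stop();
      }
      active.clear();
      nextPlayTime = context.currentTime;
    },
    pendingCount: () => active.size,
  };
}
```

Resetting the schedule to `context.currentTime` means the model's next turn starts immediately instead of waiting out the cancelled audio.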

5. Heartbeat and end

const heartbeat = setInterval(() => {
  fetch(session.heartbeat_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}, session.heartbeat_interval_ms); // 30000

// On disconnect:
async function endSession() {
  clearInterval(heartbeat);
  ws.close();
  await fetch(session.end_url, {
    method: "POST",
    headers: { Authorization: `Bearer ${KYMA_API_KEY}` },
  });
}
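
The interval above ignores failed heartbeats. Since the Limits section lists a 90-second heartbeat timeout, a client may want to notice when it has gone that long without a successful heartbeat and tear the session down itself. The sketch below is one possible approach; `createHeartbeatMonitor` is a hypothetical helper, and it assumes the server measures the timeout from the last successful heartbeat.

```javascript
// Hypothetical liveness tracker: feed it the outcome of every heartbeat POST.
function createHeartbeatMonitor({ onDead, timeoutMs = 90000, now = Date.now }) {
  let lastOk = now();
  let dead = false;

  return {
    // ok = true for a 2xx heartbeat response, false for an error or network failure.
    record(ok) {
      if (dead) return;
      if (ok) {
        lastOk = now();
      } else if (now() - lastOk >= timeoutMs) {
        dead = true;
        onDead(); // fires once; e.g. call endSession() and show a reconnect prompt
      }
    },
    isDead: () => dead,
  };
}
```

Inside the setInterval callback this could be used as `fetch(...).then((r) => monitor.record(r.ok), () => monitor.record(false))`.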

Audio format reference

| Direction | Format | Sample rate | Channels | Encoding |
| --- | --- | --- | --- | --- |
| Client → server | PCM16 (signed 16-bit little-endian) | 16,000 Hz | mono | base64 string in JSON mediaChunks[].data |
| Server → client | PCM16 (signed 16-bit little-endian) | 24,000 Hz | mono | base64 string in JSON inlineData.data |
Note the asymmetric sample rates. Output must be played back as 24 kHz audio; treating those samples as 16 kHz slows playback to two-thirds speed and lowers the pitch of the model’s voice by the same ratio.
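
To make the formats concrete, the arithmetic below shows how chunk sizes follow from the table: PCM16 mono uses 2 bytes per sample, and standard padded base64 expands every 3 bytes to 4 characters. The function names are illustrative, and the 100 ms chunk length is an arbitrary example, not a documented requirement.

```javascript
const BYTES_PER_SAMPLE = 2; // PCM16, one channel

// Raw byte size of a chunk of the given duration.
function pcm16ByteLength(durationMs, sampleRate) {
  return Math.round((sampleRate * durationMs) / 1000) * BYTES_PER_SAMPLE;
}

// Duration of a raw PCM16 mono chunk, in milliseconds.
function pcm16DurationMs(byteLength, sampleRate) {
  return (byteLength * 1000) / (BYTES_PER_SAMPLE * sampleRate);
}

// Length of the padded base64 string that carries `byteLength` bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

console.log(pcm16ByteLength(100, 16000)); // 3200 bytes of mic audio per 100 ms
console.log(pcm16ByteLength(100, 24000)); // 4800 bytes of model audio per 100 ms
console.log(base64Length(3200));          // 4268 characters in the JSON payload
```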

Configuration

The mint request body accepts:
  • model (string, required): Currently only gemini-2.5-flash-native-audio-preview-12-2025 is supported.
  • config.speech_config.language_code (string, default: "en"): BCP-47 short code. One of: ar, bn, de, en, es, fa, fr, hi, id, it, ja, ko, nl, pl, pt, ru, sv, ta, te, th, tr, ur, vi, zh
  • config.speech_config.voice_config.prebuilt_voice_config.voice_name (string, default: "Kore"): One of 30 prebuilt voices: Achernar, Achird, Algenib, Algieba, Alnilam, Aoede, Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome, Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda, Orus, Pulcherrima, Puck, Rasalgethi, Sadachbia, Sadaltager, Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi
  • config.system_instruction (string): Currently dropped on the Vertex proxy path. If you pass a custom instruction it will be ignored and the default conversational persona is used. Threading this through the token store is a planned follow-up. Contact Kyma if you need it before the upstream patch lands.
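
Because an unsupported language or voice is rejected with a 400, it can be convenient to validate the values client-side before minting. The sketch below builds the request body from the lists above; `buildMintBody` and the constant names are hypothetical, and the lists are copied from this page, so they would need updating if the supported set changes.

```javascript
const MODEL = "gemini-2.5-flash-native-audio-preview-12-2025";

const SUPPORTED_LANGUAGES = new Set([
  "ar", "bn", "de", "en", "es", "fa", "fr", "hi", "id", "it", "ja", "ko",
  "nl", "pl", "pt", "ru", "sv", "ta", "te", "th", "tr", "ur", "vi", "zh",
]);

const SUPPORTED_VOICES = new Set([
  "Achernar", "Achird", "Algenib", "Algieba", "Alnilam", "Aoede", "Autonoe",
  "Callirrhoe", "Charon", "Despina", "Enceladus", "Erinome", "Fenrir", "Gacrux",
  "Iapetus", "Kore", "Laomedeia", "Leda", "Orus", "Pulcherrima", "Puck",
  "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar", "Sulafat", "Umbriel",
  "Vindemiatrix", "Zephyr", "Zubenelgenubi",
]);

// Returns a body suitable for JSON.stringify, or throws before any network call.
function buildMintBody({ language = "en", voice = "Kore" } = {}) {
  if (!SUPPORTED_LANGUAGES.has(language)) {
    throw new RangeError(`unsupported language_code: ${language}`);
  }
  if (!SUPPORTED_VOICES.has(voice)) {
    throw new RangeError(`unsupported voice_name: ${voice}`);
  }
  return {
    model: MODEL,
    config: {
      speech_config: {
        language_code: language,
        voice_config: { prebuilt_voice_config: { voice_name: voice } },
      },
    },
  };
}
```

For example, `buildMintBody({ language: "vi" })` produces the same body as the quickstart request, with the voice defaulting to "Kore".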

Response shape

200 OK from POST /v1/live/sessions:
{
  "session_token": "f3c2b1d4e5a6...",
  "ws_url": "wss://api.kymaapi.com/v1/live/proxy/<uuid>?token=<session_token>",
  "expires_at": 1700000000,
  "kyma_session_id": "<uuid>",
  "model": "gemini-2.5-flash-native-audio-preview-12-2025",
  "heartbeat_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/heartbeat",
  "end_url": "https://api.kymaapi.com/v1/live/sessions/<uuid>/end",
  "heartbeat_interval_ms": 30000
}
The session_token is single-use — once consumed by your WebSocket handshake it is invalidated, so a leaked token cannot be replayed. The token TTL is 5 minutes; if you don’t open the WebSocket within that window the session is reaped and you can mint a fresh one.
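
A client that mints a session ahead of time may want to check the token before opening the socket. The sketch below assumes `expires_at` is a Unix timestamp in seconds, which the sample value suggests but this page does not state; `isTokenUsable` and the 5-second safety margin are illustrative choices.

```javascript
// Assumption: session.expires_at is Unix time in seconds.
function msUntilExpiry(session, nowMs = Date.now()) {
  return session.expires_at * 1000 - nowMs;
}

// True when there is still enough time left to complete the WebSocket handshake.
function isTokenUsable(session, nowMs = Date.now(), safetyMarginMs = 5000) {
  return msUntilExpiry(session, nowMs) > safetyMarginMs;
}
```

Because the token is single-use, a dropped connection cannot reuse the old `ws_url` either; in both cases the client mints a new session.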

Pricing

| Item | Rate |
| --- | --- |
| Active session | $0.0389 / minute |
| Initial collateral hold | $0.20 (placed on session create) |
| Unused minutes | Refunded on session end |
Kyma marks up Google’s wholesale rate ($0.0288/min worst case) by 1.35× — same markup convention as every other Kyma model. Sessions are settled minute-by-minute; if you end after 4m30s the 5th minute is refunded.
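
The figures above can be checked with a few lines of arithmetic. This is only a worked example: rounding to four decimal places is an assumption made to match the published rate, and the 30-minute ceiling comes from the maximum session duration listed under Limits.

```javascript
const WHOLESALE_PER_MINUTE = 0.0288; // Google's worst-case rate, per this page
const MARKUP = 1.35;
const HOLD = 0.20;
const MAX_SESSION_MINUTES = 30;

// Round to 4 decimal places to sidestep binary floating-point noise.
const round4 = (x) => Math.round(x * 1e4) / 1e4;

const ratePerMinute = round4(WHOLESALE_PER_MINUTE * MARKUP);
const maxSessionCost = round4(ratePerMinute * MAX_SESSION_MINUTES);
const fullMinutesCoveredByHold = Math.floor(HOLD / ratePerMinute);

console.log(ratePerMinute);            // 0.0389
console.log(maxSessionCost);           // 1.167
console.log(fullMinutesCoveredByHold); // 5
```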

Limits

| Limit | Value | Behavior on breach |
| --- | --- | --- |
| Concurrent sessions per user | 3 (combined across providers) | 429 too_many_sessions |
| Max session duration | 30 minutes (1800s) | Auto-close, unused minutes refunded |
| Session token TTL | 5 minutes | WS handshake rejected; mint a new session |
| Heartbeat interval | 30 seconds | Required to prove liveness |
| Heartbeat timeout | 90 seconds | Session reaped, unused minutes refunded |
Need higher concurrency? See "Need higher limits?" on the Rate Limits page.

Errors

POST /v1/live/sessions:
| Status | error.type / error.code | When |
| --- | --- | --- |
| 400 | invalid_request | Body missing / invalid, unsupported language, unsupported voice |
| 400 | model_not_found | Wrong model id |
| 401 | unauthorized | Missing API key / session token |
| 402 | billing_error / insufficient_balance | Balance < $0.20 hold |
| 429 | rate_limit_error / too_many_sessions | User already has 3 active sessions |
| 500 | internal_error | DB insert or hold RPC failed |
| 503 | service_unavailable | Live sessions disabled or Vertex token mint failed |
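
As a rough illustration of how a client might react to these statuses, the sketch below maps each one to a next step. The action names and the retry policy are assumptions for illustration, not documented Kyma behavior, and the backoff parameters are arbitrary.

```javascript
// Hypothetical client-side policy keyed on the HTTP status of POST /v1/live/sessions.
function classifyMintFailure(status) {
  switch (status) {
    case 400: return { action: "fix_request", retry: false };
    case 401: return { action: "check_credentials", retry: false };
    case 402: return { action: "top_up_balance", retry: false };
    case 429: return { action: "end_an_active_session", retry: true };
    case 500:
    case 503: return { action: "retry_with_backoff", retry: true };
    default: return { action: "unknown", retry: false };
  }
}

// Exponential backoff delay in milliseconds, capped at 30 seconds.
function backoffDelayMs(attempt, baseMs = 500, capMs = 30000) {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```

For a 429, retrying only helps once one of the user's three active sessions has ended, so waiting for an end or timeout is more useful than a tight retry loop.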
WebSocket close codes:
| Code | Reason | When |
| --- | --- | --- |
| 1000 | normal | Client / server closed cleanly |
| 1011 | vertex_auth_failed | Vertex SA token mint failed at proxy open |
| 1011 | upstream_construct_failed | Vertex WebSocket constructor threw |
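
A close handler can branch on these codes. The sketch below assumes the Reason column is delivered in the `reason` field of the browser's CloseEvent, which this page does not confirm; `interpretClose` and the `remint` flag are hypothetical names.

```javascript
// Hypothetical interpretation of a WebSocket CloseEvent ({ code, reason }).
function interpretClose(event) {
  if (event.code === 1000) {
    return { clean: true, remint: false, detail: "normal" };
  }
  if (event.code === 1011) {
    const known = ["vertex_auth_failed", "upstream_construct_failed"];
    const detail = known.includes(event.reason) ? event.reason : "unknown_1011";
    // Both documented 1011 reasons are server-side failures, and the session
    // token is single-use, so recovering means minting a new session.
    return { clean: false, remint: true, detail };
  }
  return { clean: false, remint: false, detail: `unexpected_${event.code}` };
}
```

It could be attached with `ws.addEventListener("close", (event) => { const outcome = interpretClose(event); /* ... */ });`.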

Reference

  • POST /v1/live/sessions — mint a session token
  • POST /v1/live/sessions/{id}/heartbeat — keep session alive (every 30s)
  • POST /v1/live/sessions/{id}/end — close session and settle billing
  • wss://api.kymaapi.com/v1/live/proxy/{id}?token=… — WebSocket bidirectional audio
