Documentation Index

Fetch the complete documentation index at: https://docs.kymaapi.com/llms.txt

Use this file to discover all available pages before exploring further.

Kyma serves two audio models behind two synchronous endpoints, billed per minute of processed audio:
  • Transcription — what was said. OpenAI Whisper API compatible.
  • Audio understanding — everything else. Tone, music, SFX, language, speaker emotion. Custom Kyma endpoint.
Both are sync (no polling), use multipart/form-data upload, and accept mp3 / wav / m4a / ogg / webm / flac up to 25 MB (~30 min of mono 16kHz mp3).
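A minimal preflight check against those limits before you upload; the helper name and structure are illustrative, not part of the API:
import os

ACCEPTED = {".mp3", ".wav", ".m4a", ".ogg", ".webm", ".flac"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB upload limit per the docs above

def preflight(path: str) -> None:
    # Reject unsupported containers and oversized files before burning an upload.
    ext = os.path.splitext(path)[1].lower()
    if ext not in ACCEPTED:
        raise ValueError(f"unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB upload limit")

preflight("meeting.mp3")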

Pick a model

Model                | Endpoint                 | Best for                            | Cost / min | Min billable
whisper-v3-turbo     | /v1/audio/transcriptions | Transcripts, captions, voice agents | $0.0009    | 1 min
gemini-3-flash-audio | /v1/audio/understand     | Tone, music, SFX, mood, language    | $0.000648  | 1 min
Both bill in 1-minute increments, rounded up. A 1-hour file: $0.054 transcribe + $0.039 understand = $0.093 total.
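The rounding rule is easy to encode; a sketch of the billing math, with rates copied from the table (the function itself is illustrative):
import math

RATES = {"transcribe": 0.0009, "audio-understand": 0.000648}  # $ / min, from the table above

def cost(seconds: float, model: str) -> float:
    # Billed in 1-minute increments, rounded up, with a 1-minute minimum.
    minutes = max(1, math.ceil(seconds / 60))
    return minutes * RATES[model]

hour = 3600
print(cost(hour, "transcribe") + cost(hour, "audio-understand"))  # 0.0929 → ~$0.093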

Aliases

Use these in the model field instead of pinning a specific SKU. Aliases auto-track the current best model — when a faster Whisper or a stronger Gemini lands, you don’t change your code.
Alias            | Resolves to
transcribe       | whisper-v3-turbo
audio-understand | gemini-3-flash-audio

whisper-v3-turbo

Speech-to-text. 228x realtime inference. Returns transcripts with per-segment timestamps and detected language. OpenAI Whisper API compatible.
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@meeting.mp3" \
  -F "model=transcribe"
  • Cost: $0.0009 / min
  • Best for: meeting notes, voice agent input, podcast captions
  • Output: text + segments with timestamps
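Because the endpoint is OpenAI Whisper API compatible, the openai Python SDK should work with base_url pointed at Kyma; a sketch, assuming the compatibility extends to the SDK's transcription call:
import os
from openai import OpenAI

# Reuse the stock OpenAI client against Kyma's compatible endpoint.
client = OpenAI(api_key=os.environ["KYMA_API_KEY"], base_url="https://kymaapi.com/v1")

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(model="transcribe", file=f)

print(result.text)  # per-segment timestamps ride along per the output note above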

gemini-3-flash-audio

Audio understanding. Hears tone, music, SFX, language, speaker emotion — the things a transcript drops on the floor. Ask a free-form question, get a natural-language answer.
curl -X POST https://kymaapi.com/v1/audio/understand \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@scene.mp3" \
  -F "model=audio-understand" \
  -F "question=Describe the mood and any background music."
  • Cost: $0.000648 / min
  • Best for: mood/tone analysis, music recognition, scene understanding
  • Output: free-form answer to your question
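This is a custom Kyma endpoint with no SDK shim, so a plain multipart POST does the job; a sketch mirroring the curl call above (the response shape beyond "free-form answer" isn't specified here):
import os
import requests

resp = requests.post(
    "https://kymaapi.com/v1/audio/understand",
    headers={"Authorization": f"Bearer {os.environ['KYMA_API_KEY']}"},
    files={"file": open("scene.mp3", "rb")},
    data={
        "model": "audio-understand",
        "question": "Describe the mood and any background music.",
    },
)
resp.raise_for_status()
print(resp.json())  # free-form answer; exact field names per the API reference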

Use them together — the decomposition trick

A video is just frames + audio. You almost never need a multimodal LLM on the full video — each piece has a fast, near-free tool:
URL ──▶ yt-dlp ──▶ video.mp4 ──┬──▶ ffmpeg ──▶ frames/*.jpg
                               │
                               └──▶ ffmpeg ──▶ audio.mp3 ──┬──▶ /v1/audio/transcriptions
                                                           │     (the words)
                                                           │
                                                           └──▶ /v1/audio/understand
                                                                 (everything else)
watch-cli is an open-source orchestrator built on exactly this pattern — install it, point at any social video URL, and your agent gets back frames + transcript + audio scene.
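A hand-rolled version of the same pipeline is a few subprocess calls; a sketch, assuming yt-dlp and ffmpeg are on PATH (the URL is a placeholder, and the fps and audio settings are just reasonable defaults):
import os
import subprocess
import requests

URL = "https://example.com/watch?v=..."  # placeholder: any social video URL

# 1. Download the video.
subprocess.run(["yt-dlp", "-o", "video.mp4", URL], check=True)

# 2. Split into frames (1 fps) and a mono 16 kHz mp3.
os.makedirs("frames", exist_ok=True)
subprocess.run(["ffmpeg", "-i", "video.mp4", "-vf", "fps=1", "frames/%04d.jpg"], check=True)
subprocess.run(["ffmpeg", "-i", "video.mp4", "-vn", "-ac", "1", "-ar", "16000", "audio.mp3"], check=True)

# 3. Fan the audio out to both endpoints.
headers = {"Authorization": f"Bearer {os.environ['KYMA_API_KEY']}"}
with open("audio.mp3", "rb") as f:
    words = requests.post("https://kymaapi.com/v1/audio/transcriptions",
                          headers=headers, files={"file": f},
                          data={"model": "transcribe"})
with open("audio.mp3", "rb") as f:
    scene = requests.post("https://kymaapi.com/v1/audio/understand",
                          headers=headers, files={"file": f},
                          data={"model": "audio-understand",
                                "question": "Describe the mood, music, and sound effects."})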
Approach                     | Cost / 1-hour video | Time
Multimodal LLM on full video | ~$5                 | 30-60s
Decompose with Kyma audio    | < $0.10             | ~10-15s
The savings compound when an agent watches dozens of videos in a session.

See also