Documentation Index
Fetch the complete documentation index at: https://docs.kymaapi.com/llms.txt
Use this file to discover all available pages before exploring further.
Kyma serves two audio models behind two synchronous endpoints, billed per minute of processed audio:
- Transcription — what was said. OpenAI Whisper API compatible.
- Audio understanding — everything else. Tone, music, SFX, language, speaker emotion. Custom Kyma endpoint.
Both are sync (no polling), use multipart/form-data upload, and accept mp3 / wav / m4a / ogg / webm / flac up to 25 MB (~30 min of mono 16kHz mp3).
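A quick client-side check against these limits avoids burning an upload on a file the API will reject. A minimal sketch, assuming the formats and 25 MB cap listed above (the function name `check_audio_file` is illustrative, not part of any SDK):

```python
import os

# Formats and size cap taken from the docs above; adjust if the limits change.
ALLOWED_EXTS = {".mp3", ".wav", ".m4a", ".ogg", ".webm", ".flac"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB

def check_audio_file(path: str) -> None:
    """Raise ValueError if the file would be rejected before upload."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTS:
        raise ValueError(f"unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB upload limit")
```

For files over the cap, re-encoding to mono 16 kHz mp3 (as the size note above suggests) usually brings ~30 minutes of audio under 25 MB.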
Pick a model
| Model | Endpoint | Best for | Cost / min | Min billable |
|---|---|---|---|---|
| whisper-v3-turbo | /v1/audio/transcriptions | Transcripts, captions, voice agents | $0.0009 | 1 min |
| gemini-3-flash-audio | /v1/audio/understand | Tone, music, SFX, mood, language | $0.000648 | 1 min |
Both bill in 1-minute increments, rounded up. A 1-hour file: $0.054 transcribe + $0.039 understand = $0.093 total.
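The rounding rule above can be sketched as a small estimator, using the per-minute rates from the table (the `cost` helper is illustrative; the example figures round to 3 decimals, so $0.03888 displays as $0.039):

```python
import math

# Per-minute rates from the pricing table above.
RATES = {"whisper-v3-turbo": 0.0009, "gemini-3-flash-audio": 0.000648}

def cost(model: str, seconds: float) -> float:
    """Billable cost: duration rounded up to whole minutes, 1-minute minimum."""
    minutes = max(1, math.ceil(seconds / 60))
    return minutes * RATES[model]
```

A 10-second clip still bills one full minute on either model, per the "Min billable" column.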
Aliases
Use these in the model field instead of pinning a specific SKU. Aliases auto-track the current best model — when a faster Whisper or a stronger Gemini lands, you don’t change your code.
| Alias | Resolves to |
|---|---|
| transcribe | whisper-v3-turbo |
| audio-understand | gemini-3-flash-audio |
whisper-v3-turbo
Speech-to-text. 228x realtime inference. Returns transcripts with per-segment timestamps and detected language. OpenAI Whisper API compatible.
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@meeting.mp3" \
-F "model=transcribe"
- Cost: $0.0009 / min
- Best for: meeting notes, voice agent input, podcast captions
- Output: text + segments with timestamps
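The curl call above translates directly to any HTTP client. A stdlib-only Python sketch, assuming the Whisper-compatible response shape ("text" plus "segments" with timestamps); `encode_multipart` and `transcribe` are illustrative helpers, not part of any Kyma SDK:

```python
import json
import os
import urllib.request
import uuid

def encode_multipart(fields, file_field, filename, file_bytes):
    """Minimal multipart/form-data encoder (stdlib only)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcribe(path, api_key):
    with open(path, "rb") as f:
        body, ctype = encode_multipart(
            {"model": "transcribe"}, "file", os.path.basename(path), f.read()
        )
    req = urllib.request.Request(
        "https://kymaapi.com/v1/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # Whisper-compatible: "text" + "segments"
```

The alias `transcribe` in the `model` field resolves to whisper-v3-turbo per the table above.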
gemini-3-flash-audio
Audio understanding. Hears tone, music, SFX, language, speaker emotion — the things a transcript drops on the floor. Ask a free-form question, get a natural-language answer.
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@scene.mp3" \
-F "model=audio-understand" \
-F "question=Describe the mood and any background music."
- Cost: $0.000648 / min
- Best for: mood/tone analysis, music recognition, scene understanding
- Output: free-form answer to your question
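Because every call bills the full file (1-minute minimum), one speculative cost-saving pattern is to batch several questions into a single free-form prompt rather than re-uploading per question. This is an application-side trick, not an API feature; `batch_questions` is a hypothetical helper:

```python
def batch_questions(questions: list[str]) -> str:
    """Fold several questions into one free-form prompt for a single call."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return "Answer each numbered question about this audio:\n" + numbered
```

Three questions in one call cost the same as one; three separate calls bill the file's minutes three times.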
Use them together — the decomposition trick
A video is just frames + audio. You almost never need a multimodal LLM on the full video — each piece has a fast, near-free tool:
URL ──▶ yt-dlp ──▶ video.mp4 ──┬──▶ ffmpeg ──▶ frames/*.jpg
                               │
                               └──▶ ffmpeg ──▶ audio.mp3 ──┬──▶ /v1/audio/transcriptions
                                                           │    (the words)
                                                           │
                                                           └──▶ /v1/audio/understand
                                                                (everything else)
watch-cli is an open-source orchestrator built on exactly this pattern — install it, point it at any social video URL, and your agent gets back frames + transcript + audio scene.
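The ffmpeg stage of the diagram can be sketched in a few lines. Assumptions not stated in the docs: 1 frame per second for the frames branch, and mono 16 kHz mp3 output (matching the size note earlier); the yt-dlp download step is omitted, and `extract_cmds`/`decompose` are illustrative names:

```python
import subprocess

def extract_cmds(video: str, out_dir: str = "."):
    """ffmpeg invocations for the decomposition above: 1 frame/s + mono 16 kHz mp3."""
    frames = ["ffmpeg", "-i", video, "-vf", "fps=1", f"{out_dir}/frames/%05d.jpg"]
    audio = ["ffmpeg", "-i", video, "-ac", "1", "-ar", "16000", f"{out_dir}/audio.mp3"]
    return frames, audio

def decompose(video: str) -> None:
    for cmd in extract_cmds(video):
        subprocess.run(cmd, check=True)
    # audio.mp3 then goes to /v1/audio/transcriptions and /v1/audio/understand
```

Keeping the audio mono at 16 kHz also keeps a ~30-minute clip under the 25 MB upload cap.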
| Approach | Cost / 1-hour video | Time |
|---|---|---|
| Multimodal LLM on full video | ~$5 | 30-60s |
| Decompose with Kyma audio | < $0.10 | ~10-15s |
The savings compound when an agent watches dozens of videos in a session.
See also