Documentation Index
Fetch the complete documentation index at: https://docs.kymaapi.com/llms.txt
Use this file to discover all available pages before exploring further.
Kyma serves two audio models behind two synchronous endpoints, billed per minute of processed audio:
- Transcription — what was said. OpenAI Whisper API compatible.
- Audio understanding — everything else. Tone, music, SFX, language, speaker emotion. Custom Kyma endpoint.
Both are sync (no polling), use multipart/form-data upload, and accept mp3 / wav / m4a / ogg / webm / flac up to 25 MB (~30 min of mono 16kHz mp3).
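A quick client-side check against these limits avoids burning an upload on a file the API will reject. A minimal sketch, assuming the formats and 25 MB cap listed above (the function name `check_audio_file` is illustrative, not part of any SDK):

```python
import os

# Formats and size cap taken from the docs above; adjust if the limits change.
ALLOWED_EXTS = {".mp3", ".wav", ".m4a", ".ogg", ".webm", ".flac"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB

def check_audio_file(path: str) -> None:
    """Raise ValueError if the file would be rejected before upload."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTS:
        raise ValueError(f"unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError("file exceeds the 25 MB upload limit")
```

For files over the cap, re-encoding to mono 16 kHz mp3 (as the size note above suggests) usually brings ~30 minutes of audio under 25 MB.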
Pick a model
| Model | Endpoint | Best for | Cost / min | Min billable |
|---|---|---|---|---|
| whisper-v3-turbo | /v1/audio/transcriptions | Transcripts, captions, voice agents | $0.0009 | 1 min |
| gemini-3-flash-audio | /v1/audio/understand | Tone, music, SFX, mood, language | $0.000648 | 1 min |
Both bill in 1-minute increments, rounded up. A 1-hour file: $0.054 transcribe + $0.039 understand = $0.093 total.
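The rounding rule above can be sketched as a small estimator, using the per-minute rates from the table (the `cost` helper is illustrative; the example figures round to 3 decimals, so $0.03888 displays as $0.039):

```python
import math

# Per-minute rates from the pricing table above.
RATES = {"whisper-v3-turbo": 0.0009, "gemini-3-flash-audio": 0.000648}

def cost(model: str, seconds: float) -> float:
    """Billable cost: duration rounded up to whole minutes, 1-minute minimum."""
    minutes = max(1, math.ceil(seconds / 60))
    return minutes * RATES[model]
```

A 10-second clip still bills one full minute on either model, per the "Min billable" column.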
Aliases
Use these in the model field instead of pinning a specific SKU. Aliases auto-track the current best model — when a faster Whisper or a stronger Gemini lands, you don’t change your code.
| Alias | Resolves to |
|---|---|
| transcribe | whisper-v3-turbo |
| audio-understand | gemini-3-flash-audio |
whisper-v3-turbo
Speech-to-text. 228x realtime inference. Returns transcripts with per-segment timestamps and detected language. OpenAI Whisper API compatible.
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@meeting.mp3" \
-F "model=transcribe"
- Cost: $0.0009 / min
- Best for: meeting notes, voice agent input, podcast captions
- Output: text + segments with timestamps
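The curl call above translates directly to any HTTP client. A stdlib-only Python sketch, assuming the Whisper-compatible response shape ("text" plus "segments" with timestamps); `encode_multipart` and `transcribe` are illustrative helpers, not part of any Kyma SDK:

```python
import json
import os
import urllib.request
import uuid

def encode_multipart(fields, file_field, filename, file_bytes):
    """Minimal multipart/form-data encoder (stdlib only)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; '
            f'name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    parts.append(
        (f'--{boundary}\r\nContent-Disposition: form-data; '
         f'name="{file_field}"; filename="{filename}"\r\n'
         f'Content-Type: application/octet-stream\r\n\r\n').encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"

def transcribe(path, api_key):
    with open(path, "rb") as f:
        body, ctype = encode_multipart(
            {"model": "transcribe"}, "file", os.path.basename(path), f.read()
        )
    req = urllib.request.Request(
        "https://kymaapi.com/v1/audio/transcriptions",
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": ctype},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # Whisper-compatible: "text" + "segments"
```

The alias `transcribe` in the `model` field resolves to whisper-v3-turbo per the table above.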
gemini-3-flash-audio
Audio understanding. Hears tone, music, SFX, language, speaker emotion — the things a transcript drops on the floor. Ask a free-form question, get a natural-language answer.
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@scene.mp3" \
-F "model=audio-understand" \
-F "question=Describe the mood and any background music."
- Cost: $0.000648 / min
- Best for: mood/tone analysis, music recognition, scene understanding
- Output: free-form answer to your question
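Because every call bills the full file (1-minute minimum), one speculative cost-saving pattern is to batch several questions into a single free-form prompt rather than re-uploading per question. This is an application-side trick, not an API feature; `batch_questions` is a hypothetical helper:

```python
def batch_questions(questions: list[str]) -> str:
    """Fold several questions into one free-form prompt for a single call."""
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    return "Answer each numbered question about this audio:\n" + numbered
```

Three questions in one call cost the same as one; three separate calls bill the file's minutes three times.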
Use them together — the decomposition trick
A video is just frames + audio. You almost never need a multimodal LLM on the full video — each piece has a fast, near-free tool:
URL ──▶ yt-dlp ──▶ video.mp4 ──┬──▶ ffmpeg ──▶ frames/*.jpg
                               │
                               └──▶ ffmpeg ──▶ audio.mp3 ──┬──▶ /v1/audio/transcriptions
                                                           │    (the words)
                                                           │
                                                           └──▶ /v1/audio/understand
                                                                (everything else)
watch-cli is an open-source orchestrator built on exactly this pattern — install it, point it at any social video URL, and your agent gets back frames + transcript + audio scene.
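The ffmpeg stage of the diagram can be sketched in a few lines. Assumptions not stated in the docs: 1 frame per second for the frames branch, and mono 16 kHz mp3 output (matching the size note earlier); the yt-dlp download step is omitted, and `extract_cmds`/`decompose` are illustrative names:

```python
import subprocess

def extract_cmds(video: str, out_dir: str = "."):
    """ffmpeg invocations for the decomposition above: 1 frame/s + mono 16 kHz mp3."""
    frames = ["ffmpeg", "-i", video, "-vf", "fps=1", f"{out_dir}/frames/%05d.jpg"]
    audio = ["ffmpeg", "-i", video, "-ac", "1", "-ar", "16000", f"{out_dir}/audio.mp3"]
    return frames, audio

def decompose(video: str) -> None:
    for cmd in extract_cmds(video):
        subprocess.run(cmd, check=True)
    # audio.mp3 then goes to /v1/audio/transcriptions and /v1/audio/understand
```

Keeping the audio mono at 16 kHz also keeps a ~30-minute clip under the 25 MB upload cap.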
| Approach | Cost / 1-hour video | Time |
|---|---|---|
| Multimodal LLM on full video | ~$5 | 30-60s |
| Decompose with Kyma audio | < $0.10 | ~10-15s |
The savings compound when an agent watches dozens of videos in a session.
See also