Documentation Index
Fetch the complete documentation index at: https://docs.kymaapi.com/llms.txt
Use this file to discover all available pages before exploring further.
Synchronous endpoint for audio understanding beyond transcription. Upload a clip, ask a question — get back a natural-language answer that captures the parts a transcript loses: tone, mood, music style, ambient SFX, speaker emotion, language.
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@clip.mp3" \
-F "model=audio-understand" \
-F "question=Describe the mood, music, and any background sounds."
When to use this vs transcribe
| Question | Endpoint |
|---|---|
| What words are spoken? | /v1/audio/transcriptions |
| Is the speaker angry, calm, tired? | this endpoint |
| What kind of music is playing? | this endpoint |
| Are there sirens, applause, traffic in the background? | this endpoint |
| What language is being spoken? | either |
A good rule of thumb: if the answer can be reconstructed from transcript text, use transcribe. If it depends on how something sounds, use audio-understand.
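The rule of thumb above can be sketched as a small client-side router. This is a hypothetical helper, not part of the Kyma API; the cue list is illustrative only.

```python
# Hypothetical router for the rule of thumb above: if the answer depends
# on how the audio *sounds*, use audio-understand; otherwise transcribe.
# The cue list is illustrative, not part of the Kyma API.
SOUND_CUES = ("mood", "music", "tone", "emotion", "background", "angry", "tempo", "sfx")

def pick_endpoint(question: str) -> str:
    """Return the endpoint path suggested by the rule of thumb."""
    q = question.lower()
    if any(cue in q for cue in SOUND_CUES):
        return "/v1/audio/understand"
    return "/v1/audio/transcriptions"
```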
Request
multipart/form-data upload.
file
file
required
Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB inline (~30 minutes of mono 16kHz mp3).
question
string
required
Free-form text question about the audio. Be specific about what you want: "What is the mood?" is fine, but "Describe the music style, BPM, and any background SFX" gets more useful answers.
model
string
default:"audio-understand"
Either the alias audio-understand (recommended) or a pinned SKU like gemini-3-flash-audio. See Audio models.
duration_sec
number
optional
Duration hint in seconds. When supplied, billing rounds up from this exact value. When omitted, Kyma estimates duration from file size (assuming 32 kbps mp3), which can over-estimate for high-bitrate inputs. Pass the real duration when you have it; ffprobe gives it to you in one line.
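The size-based fallback described above can be sketched as follows, assuming Kyma treats the upload as 32 kbps mp3 (4000 bytes per second). The function name is ours, not a Kyma API.

```python
# Sketch of the size-based duration fallback, under the stated assumption
# that the estimator treats every file as 32 kbps mp3 (4000 bytes/sec).
def estimate_duration_sec(file_bytes: int, assumed_kbps: int = 32) -> float:
    bytes_per_second = assumed_kbps * 1000 / 8  # 32 kbps -> 4000 B/s
    return file_bytes / bytes_per_second
```

A 1 MB clip encoded at 128 kbps actually runs about a minute, but the 32 kbps assumption estimates 250 seconds, roughly a 4x over-estimate; that is why passing the real duration matters for high-bitrate files.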
Response
200 OK with the answer text and a Kyma billing block.
{
"answer": "The speaker is reciting a line in a deep, slow voice with theatrical gravitas. There is a low ambient drone underneath but no music. The mood is somber and resolute — a deity's vow rather than a casual remark.",
"model": "gemini-3-flash-audio",
"billing": {
"duration_sec": 5,
"billable_minutes": 1,
"cost_usd": 0.000648,
"balance_usd": 41.468,
"duration_source": "caller_hint"
}
}
answer
The model's answer to your question.
model
The Kyma model SKU that served the request.
billing.duration_source
Either caller_hint (you passed duration_sec) or size_estimate (Kyma estimated duration from file bytes).
billing.cost_usd
Final cost charged for this request.
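A minimal sketch of reading the billing block client-side, using the sample response above (the answer text is elided here):

```python
import json

# Parse the sample 200 response from above and pull out the billing fields.
sample = """
{
  "answer": "...",
  "model": "gemini-3-flash-audio",
  "billing": {
    "duration_sec": 5,
    "billable_minutes": 1,
    "cost_usd": 0.000648,
    "balance_usd": 41.468,
    "duration_source": "caller_hint"
  }
}
"""
resp = json.loads(sample)
billing = resp["billing"]
# caller_hint means duration_sec was supplied; 5 s rounds up to 1 minute.
print(billing["duration_source"], billing["billable_minutes"])
```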
Pricing
| Model | Per minute |
|---|---|
| gemini-3-flash-audio | $0.000648 |
Billed per minute, rounded up: a 30-second clip costs 1 minute = $0.000648, and a 1-hour file costs about $0.039.
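The rounding rule above amounts to the following, with the per-minute rate taken from the pricing table:

```python
import math

PRICE_PER_MINUTE_USD = 0.000648  # gemini-3-flash-audio, from the table above

def cost_usd(duration_sec: float) -> float:
    """Round duration up to whole minutes, then charge per minute."""
    billable_minutes = math.ceil(duration_sec / 60)
    return billable_minutes * PRICE_PER_MINUTE_USD
```

So a 30-second clip bills as 1 minute ($0.000648) and a 1-hour file as 60 minutes (about $0.039).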
Errors
| Status | error.code | When |
|---|---|---|
| 400 | invalid_request | Missing file or question field |
| 400 | not_an_audio_model | model is not an audio-understanding SKU |
| 401 | auth_error | Missing or invalid API key |
| 402 | billing_error | Insufficient credits |
| 404 | not_enabled | Audio gate not enabled on this account |
| 413 | invalid_request | File > 25 MB |
| 502 | provider_error | Upstream call failed |
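One way to triage these statuses client-side is sketched below. The action labels and the retry policy are our assumptions, not Kyma guidance; only the status codes come from the table.

```python
# Hypothetical triage of the status codes in the error table above.
# Action labels are ours; treating 502 as retryable is an assumption.
def triage(status: int) -> str:
    if status in (400, 404, 413):
        return "fix_request"     # bad field, audio gate not enabled, or file > 25 MB
    if status == 401:
        return "fix_auth"        # check KYMA_API_KEY
    if status == 402:
        return "top_up_credits"  # insufficient balance
    if status == 502:
        return "retry"           # upstream provider failed; safe to retry
    return "unknown"
```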
Examples
Mood and music brief for a video clip
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@scene.mp3" \
-F "model=audio-understand" \
-F "question=Describe the mood, dominant instruments, tempo, and any background sounds."
Speaker emotion check
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@call-recording.mp3" \
-F "model=audio-understand" \
-F "question=Is the speaker frustrated, neutral, or pleased? One word, then a one-sentence justification."
Pass exact duration for accurate billing
DURATION=$(ffprobe -v error -show_entries format=duration -of default=nw=1:nk=1 clip.mp3 | awk '{ printf "%d", ($1 + 1) }')
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@clip.mp3" \
-F "model=audio-understand" \
-F "question=What style of music is this?" \
-F "duration_sec=$DURATION"
See also
- Audio Transcriptions - speech-to-text
- Audio models - SKUs behind the audio-understand alias
- watch-cli - open-source CLI that pairs this endpoint with transcribe to give agents full audio understanding from any social video URL