Audio Understand
POST /v1/audio/understand


Synchronous endpoint for audio understanding beyond transcription. Upload a clip and ask a question to get back a natural-language answer that captures what a transcript loses: tone, mood, music style, ambient SFX, speaker emotion, language.
curl -X POST https://kymaapi.com/v1/audio/understand \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@clip.mp3" \
  -F "model=audio-understand" \
  -F "question=Describe the mood, music, and any background sounds."

When to use this vs transcribe

| Question | Endpoint |
| --- | --- |
| What words are spoken? | /v1/audio/transcriptions |
| Is the speaker angry, calm, tired? | this endpoint |
| What kind of music is playing? | this endpoint |
| Are there sirens, applause, traffic in the background? | this endpoint |
| What language is being spoken? | either |

A good rule of thumb: if the answer can be reconstructed from transcript text, use transcribe. If it depends on how something sounds, use audio-understand.

Request

multipart/form-data upload.
file (file, required)
Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB inline (~30 minutes of mono 16 kHz mp3).

question (string, required)
Free-form text question about the audio. Be specific about what you want: “What is the mood?” is fine, but “Describe the music style, BPM, and any background SFX” gets more useful answers.

model (string, default: "audio-understand")
Either the alias audio-understand (recommended) or a pinned SKU like gemini-3-flash-audio. See Audio models.

duration_sec (number, optional)
Duration hint in seconds. When supplied, billing is computed from this value, rounded up to whole minutes. When omitted, Kyma estimates duration from file size (assuming a 32 kbps mp3), which can over-estimate for high-bitrate inputs. Pass the real duration when you have it; ffprobe reports it in one line.
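
To make the over-estimate concrete, here is a rough sketch of the size-based arithmetic. It assumes "32 kbps" means 32,000 bits per second and that the estimate rounds up to whole seconds. The helper name estimate_duration_sec is hypothetical, and Kyma's actual estimator may differ.

```shell
# Hypothetical helper: approximate the duration inferred from file size alone,
# assuming a 32 kbps stream (taken here as 32,000 bits per second).
estimate_duration_sec() {
  bytes=$1
  # seconds = bits / bitrate, rounded up to a whole second
  echo $(( (bytes * 8 + 31999) / 32000 ))
}

# A 5,000,000-byte mp3 encoded at 128 kbps lasts about 312 seconds,
# but a 32 kbps assumption stretches it to four times that.
estimate_duration_sec 5000000   # prints 1250
```

Under these assumptions the same file would be billed as 21 minutes rather than 6, which is why passing duration_sec is worthwhile for high-bitrate uploads.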

Response

200 OK with the answer text and a Kyma billing block.
{
  "answer": "The speaker is reciting a line in a deep, slow voice with theatrical gravitas. There is a low ambient drone underneath but no music. The mood is somber and resolute — a deity's vow rather than a casual remark.",
  "model": "gemini-3-flash-audio",
  "billing": {
    "duration_sec": 5,
    "billable_minutes": 1,
    "cost_usd": 0.000648,
    "balance_usd": 41.468,
    "duration_source": "caller_hint"
  }
}
answer (string)
The model’s answer to your question.

model (string)
The Kyma model SKU that served the request.

billing.duration_source (string)
Either caller_hint (you passed duration_sec) or size_estimate (Kyma estimated from file bytes).

billing.cost_usd (number)
Final cost charged for this request.

Pricing

| Model | Per minute |
| --- | --- |
| gemini-3-flash-audio | $0.000648 |

Billed per minute, rounded up (a 30-second clip costs 1 minute = $0.000648). 1-hour file: about $0.039.
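
The rounding rule can be sketched as a small cost estimator. This is an illustration only: the helper names are hypothetical, the rate is copied from the pricing table above, and the billing.cost_usd field in the response remains the authoritative figure.

```shell
# Per-minute rate for gemini-3-flash-audio, copied from the pricing table.
RATE_USD_PER_MIN=0.000648

# Round a duration in seconds up to whole billable minutes.
billable_minutes() {
  awk -v d="$1" 'BEGIN { m = int(d / 60); if (m * 60 < d) m++; print m }'
}

# Estimated cost in USD for a clip of the given duration in seconds.
estimate_cost_usd() {
  awk -v m="$(billable_minutes "$1")" -v r="$RATE_USD_PER_MIN" \
    'BEGIN { printf "%.6f\n", m * r }'
}

estimate_cost_usd 30     # prints 0.000648
estimate_cost_usd 3600   # prints 0.038880
```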

Errors

| Status | error.code | When |
| --- | --- | --- |
| 400 | invalid_request | Missing file or question field |
| 400 | not_an_audio_model | model is not an audio-understanding SKU |
| 401 | auth_error | Missing or invalid API key |
| 402 | billing_error | Insufficient credits |
| 404 | not_enabled | Audio gate not enabled on this account |
| 413 | invalid_request | File > 25 MB |
| 502 | provider_error | Upstream call failed |
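
One way a client might act on these statuses is sketched below. Treating only 502 as retryable is an assumption and not documented behavior. The function names and the response.json output path are hypothetical.

```shell
# Hypothetical helper: map an HTTP status from this endpoint to an action.
# Assumption: only 502 (upstream failure) is worth retrying; the 4xx codes
# in the table need a change to the request or the account first.
classify_status() {
  case "$1" in
    200) echo "ok" ;;
    502) echo "retry" ;;
    400|401|402|404|413) echo "fail" ;;
    *) echo "unknown" ;;
  esac
}

# Sketch of a call that captures the status code separately from the body.
# The body is written to response.json; --form-string sends the question
# literally, even if it starts with "@" or "<".
understand_audio() {
  file=$1
  question=$2
  status=$(curl -s -o response.json -w '%{http_code}' \
    -X POST https://kymaapi.com/v1/audio/understand \
    -H "Authorization: Bearer $KYMA_API_KEY" \
    -F "file=@$file" \
    --form-string "question=$question")
  classify_status "$status"
}
```

A caller could loop on understand_audio while it prints "retry", with a delay between attempts, and stop on any other result.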

Examples

Mood and music brief for a video clip

curl -X POST https://kymaapi.com/v1/audio/understand \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@scene.mp3" \
  -F "model=audio-understand" \
  -F "question=Describe the mood, dominant instruments, tempo, and any background sounds."

Speaker emotion check

curl -X POST https://kymaapi.com/v1/audio/understand \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@call-recording.mp3" \
  -F "model=audio-understand" \
  -F "question=Is the speaker frustrated, neutral, or pleased? One word, then a one-sentence justification."

Pass exact duration for accurate billing

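# Round the ffprobe duration up to whole seconds (an exact 60.0 stays 60, not 61).
DURATION=$(ffprobe -v error -show_entries format=duration -of default=nw=1:nk=1 clip.mp3 | awk '{ d = int($1); if (d < $1) d++; printf "%d", d }')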

curl -X POST https://kymaapi.com/v1/audio/understand \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@clip.mp3" \
  -F "model=audio-understand" \
  -F "question=What style of music is this?" \
  -F "duration_sec=$DURATION"

See also

  • Audio Transcriptions - speech-to-text
  • Audio models - SKUs behind the audio-understand alias
  • watch-cli - open-source CLI that pairs this endpoint with transcribe to give agents full audio understanding from any social video URL