Documentation Index
Fetch the complete documentation index at: https://docs.kymaapi.com/llms.txt
Use this file to discover all available pages before exploring further.
Synchronous endpoint for audio understanding beyond transcription. Upload a clip, ask a question — get back a natural-language answer that captures the parts a transcript loses: tone, mood, music style, ambient SFX, speaker emotion, language.
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@clip.mp3" \
-F "model=audio-understand" \
-F "question=Describe the mood, music, and any background sounds."
When to use this vs transcribe
| Question | Endpoint |
|---|---|
| What words are spoken? | /v1/audio/transcriptions |
| Is the speaker angry, calm, tired? | this endpoint |
| What kind of music is playing? | this endpoint |
| Are there sirens, applause, traffic in the background? | this endpoint |
| What language is being spoken? | either |
A good rule of thumb: if the answer can be reconstructed from transcript text, use transcribe. If it depends on how something sounds, use audio-understand.
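The rule of thumb above can be sketched as a small client-side router. This is a hypothetical helper, not part of the Kyma API; the cue list is illustrative only.

```python
# Hypothetical router for the rule of thumb above: if the answer depends
# on how the audio *sounds*, use audio-understand; otherwise transcribe.
# The cue list is illustrative, not part of the Kyma API.
SOUND_CUES = ("mood", "music", "tone", "emotion", "background", "angry", "tempo", "sfx")

def pick_endpoint(question: str) -> str:
    """Return the endpoint path suggested by the rule of thumb."""
    q = question.lower()
    if any(cue in q for cue in SOUND_CUES):
        return "/v1/audio/understand"
    return "/v1/audio/transcriptions"
```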
Request
multipart/form-data upload.
file
file
required
Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB inline (~30 minutes of mono 16kHz mp3).
question
string
required
Free-form text question about the audio. Be specific about what you want: "What is the mood?" is fine, but "Describe the music style, BPM, and any background SFX" gets more useful answers.
model
string
default:"audio-understand"
Either the alias audio-understand (recommended) or a pinned SKU like gemini-3-flash-audio. See Audio models.
duration_sec
number
optional
Duration hint in seconds. When supplied, billing rounds up from this exact value. When omitted, Kyma estimates duration from file size (assuming 32 kbps mp3), which can over-estimate for high-bitrate inputs. Pass the real duration when you have it; ffprobe gives it to you in one line.
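The size-based fallback described above can be sketched as follows, assuming Kyma treats the upload as 32 kbps mp3 (4000 bytes per second). The function name is ours, not a Kyma API.

```python
# Sketch of the size-based duration fallback, under the stated assumption
# that the estimator treats every file as 32 kbps mp3 (4000 bytes/sec).
def estimate_duration_sec(file_bytes: int, assumed_kbps: int = 32) -> float:
    bytes_per_second = assumed_kbps * 1000 / 8  # 32 kbps -> 4000 B/s
    return file_bytes / bytes_per_second
```

A 1 MB clip encoded at 128 kbps actually runs about a minute, but the 32 kbps assumption estimates 250 seconds, roughly a 4x over-estimate; that is why passing the real duration matters for high-bitrate files.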
Response
200 OK with the answer text and a Kyma billing block.
{
"answer": "The speaker is reciting a line in a deep, slow voice with theatrical gravitas. There is a low ambient drone underneath but no music. The mood is somber and resolute — a deity's vow rather than a casual remark.",
"model": "gemini-3-flash-audio",
"billing": {
"duration_sec": 5,
"billable_minutes": 1,
"cost_usd": 0.000648,
"balance_usd": 41.468,
"duration_source": "caller_hint"
}
}
answer
The model's answer to your question.
model
The Kyma model SKU that served the request.
billing.duration_source
Either caller_hint (you passed duration_sec) or size_estimate (Kyma estimated duration from file bytes).
billing.cost_usd
Final cost charged for this request.
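A minimal sketch of reading the billing block client-side, using the sample response above (the answer text is elided here):

```python
import json

# Parse the sample 200 response from above and pull out the billing fields.
sample = """
{
  "answer": "...",
  "model": "gemini-3-flash-audio",
  "billing": {
    "duration_sec": 5,
    "billable_minutes": 1,
    "cost_usd": 0.000648,
    "balance_usd": 41.468,
    "duration_source": "caller_hint"
  }
}
"""
resp = json.loads(sample)
billing = resp["billing"]
# caller_hint means duration_sec was supplied; 5 s rounds up to 1 minute.
print(billing["duration_source"], billing["billable_minutes"])
```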
Pricing
| Model | Per minute |
|---|---|
| gemini-3-flash-audio | $0.000648 |
Billed per minute, rounded up: a 30-second clip costs 1 minute = $0.000648, and a 1-hour file costs about $0.039.
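The rounding rule above amounts to the following, with the per-minute rate taken from the pricing table:

```python
import math

PRICE_PER_MINUTE_USD = 0.000648  # gemini-3-flash-audio, from the table above

def cost_usd(duration_sec: float) -> float:
    """Round duration up to whole minutes, then charge per minute."""
    billable_minutes = math.ceil(duration_sec / 60)
    return billable_minutes * PRICE_PER_MINUTE_USD
```

So a 30-second clip bills as 1 minute ($0.000648) and a 1-hour file as 60 minutes (about $0.039).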
Errors
| Status | error.code | When |
|---|---|---|
| 400 | invalid_request | Missing file or question field |
| 400 | not_an_audio_model | model is not an audio-understanding SKU |
| 401 | auth_error | Missing or invalid API key |
| 402 | billing_error | Insufficient credits |
| 404 | not_enabled | Audio gate not enabled on this account |
| 413 | invalid_request | File > 25 MB |
| 502 | provider_error | Upstream call failed |
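One way to triage these statuses client-side is sketched below. The action labels and the retry policy are our assumptions, not Kyma guidance; only the status codes come from the table.

```python
# Hypothetical triage of the status codes in the error table above.
# Action labels are ours; treating 502 as retryable is an assumption.
def triage(status: int) -> str:
    if status in (400, 404, 413):
        return "fix_request"     # bad field, audio gate not enabled, or file > 25 MB
    if status == 401:
        return "fix_auth"        # check KYMA_API_KEY
    if status == 402:
        return "top_up_credits"  # insufficient balance
    if status == 502:
        return "retry"           # upstream provider failed; safe to retry
    return "unknown"
```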
Examples
Mood and music brief for a video clip
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@scene.mp3" \
-F "model=audio-understand" \
-F "question=Describe the mood, dominant instruments, tempo, and any background sounds."
Speaker emotion check
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@call-recording.mp3" \
-F "model=audio-understand" \
-F "question=Is the speaker frustrated, neutral, or pleased? One word, then a one-sentence justification."
Pass exact duration for accurate billing
DURATION=$(ffprobe -v error -show_entries format=duration -of default=nw=1:nk=1 clip.mp3 | awk '{ printf "%d", ($1 + 1) }')
curl -X POST https://kymaapi.com/v1/audio/understand \
-H "Authorization: Bearer $KYMA_API_KEY" \
-F "file=@clip.mp3" \
-F "model=audio-understand" \
-F "question=What style of music is this?" \
-F "duration_sec=$DURATION"
See also
- Audio Transcriptions - speech-to-text
- Audio models - SKUs behind the audio-understand alias
- watch-cli - open-source CLI that pairs this endpoint with transcribe to give agents full audio understanding from any social video URL