Overview

gemini-3-flash-audio is Google’s Gemini 3 Flash Preview tuned for audio understanding, not transcription. It listens to a clip and answers free-form questions about how it sounds: speaker emotion, music style, ambient SFX, language, mood, pacing. These are the parts of audio that a transcript loses. Pair it with whisper-v3-turbo for full audio context on any clip.

Specs

Field              Value
Model ID           gemini-3-flash-audio
Creator            Google
Best for           Audio scene Q&A, mood/emotion, music recognition
Max file size      25 MB
Max duration       ~30 min inline payload
Input modalities   Audio
Output modalities  Text
Pricing mode       Per minute
Min billable       1 minute (rounded up)
Release stage      Preview
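
A minimal sketch of a pre-upload size check against the 25 MB limit above. The page does not say whether "25 MB" is decimal or binary, so the 25 MiB default here is an assumption, and within_size_limit is a hypothetical helper name, not part of the API.

```shell
# Assumption: the 25 MB limit is interpreted as 25 MiB (25 * 1024 * 1024 bytes).
MAX_BYTES=$((25 * 1024 * 1024))

# within_size_limit FILE [LIMIT_BYTES]
# Returns 0 if FILE is at or under the limit, 1 if it is larger,
# and 2 if FILE is not a regular file.
within_size_limit() {
  local file="$1" limit="${2:-$MAX_BYTES}" size
  [ -f "$file" ] || return 2
  # wc -c pads its output on some platforms, so strip whitespace.
  size="$(wc -c < "$file" | tr -d '[:space:]')"
  [ "$size" -le "$limit" ]
}
```

Usage: `within_size_limit scene.mp3 || echo "scene.mp3 is too large or missing" >&2`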

Pricing

Item            Cost
Per minute      $0.000648
1-hour file     $0.039
30-second clip  $0.000648 (rounds up to 1 min)
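
The pricing arithmetic can be sketched as a small estimator. It assumes every clip is rounded up to whole minutes with a one-minute floor, which matches the 30-second row but is not spelled out for longer clips. The function names are hypothetical.

```shell
# Rate taken from the pricing table above (USD per billable minute).
RATE_PER_MIN="0.000648"

# billable_minutes SECONDS
# Rounds whole seconds up to the next full minute, with a floor of 1.
# Assumption: rounding applies to every clip, not only sub-minute ones.
billable_minutes() {
  local secs="$1" mins
  mins=$(( (secs + 59) / 60 ))
  if [ "$mins" -lt 1 ]; then
    mins=1
  fi
  echo "$mins"
}

# estimate_cost_usd SECONDS
# Prints the estimated cost in USD with six decimal places.
estimate_cost_usd() {
  local mins
  mins="$(billable_minutes "$1")"
  awk -v m="$mins" -v r="$RATE_PER_MIN" 'BEGIN { printf "%.6f\n", m * r }'
}
```

For a one-hour file this gives 60 x 0.000648 = 0.03888, which the table rounds to $0.039.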

Use this when

  • The question is about how something sounds, not what was said: mood, emotion, music style, ambient SFX.
  • You need to know what language is being spoken (and want a reasoned answer, not just a code).
  • You’re triaging audio for a video pipeline and want a one-sentence scene description per clip.
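
For the triage case, one possible shape is a loop that asks the same one-sentence question for each clip. This is a sketch only: describe_clips is a hypothetical helper, the question wording is illustrative, and the raw response is printed as-is because its field names are not documented on this page.

```shell
# describe_clips FILE...
# Sends each clip to the audio understanding endpoint with the same
# question and prints the raw response under a per-file header.
describe_clips() {
  local clip
  for clip in "$@"; do
    printf '== %s ==\n' "$clip"
    # --form-string sends the question literally, so characters such as
    # ';' or a leading '@' are not interpreted by curl.
    curl -sS -X POST https://kymaapi.com/v1/audio/understand \
      -H "Authorization: Bearer $KYMA_API_KEY" \
      -F "file=@${clip}" \
      -F "model=gemini-3-flash-audio" \
      --form-string "question=Describe this audio scene in one sentence."
    printf '\n'
  done
}
```

Usage: `describe_clips clips/*.mp3`. Each clip is billed separately, so the one-minute minimum applies per file.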

Pick something else when

  • You only need the words: use whisper-v3-turbo — it’s cheaper and faster.
  • You need real-time audio (sub-100ms latency) — Gemini 3 Flash audio is fast but not real-time.

Example

curl -X POST https://kymaapi.com/v1/audio/understand \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@scene.mp3" \
  -F "model=gemini-3-flash-audio" \
  -F "question=Describe the mood, dominant instruments, tempo, and any background sounds."

The endpoint is synchronous: the response includes the model’s answer plus a billing block. See the endpoint reference for all parameters.

Tips

  • Be specific in the question. “What’s the mood?” works, but “Is the speaker frustrated, neutral, or pleased? One word, then a one-sentence justification” gets a more useful answer.
  • Pass duration_sec when you have it (ffprobe gives it to you in one line). This avoids over-charging on high-bitrate inputs, where size-based estimation runs long.
  • Audio scene first, transcript second. A 1-sentence audio summary often gives an agent enough context to decide whether the transcript is even worth fetching.
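
The duration_sec tip might look like the sketch below. It assumes duration_sec is sent as a multipart form field like the other parameters in the example request, and that a fractional value is accepted. Check the endpoint reference for both. The helper names audio_duration_sec and ask_audio are hypothetical.

```shell
# audio_duration_sec FILE
# Prints the container duration in seconds (for example 12.345000).
audio_duration_sec() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# ask_audio FILE QUESTION
# Measures the clip, then sends it with an explicit duration_sec.
ask_audio() {
  local file="$1" question="$2" duration
  duration="$(audio_duration_sec "$file")" || return 1
  # Assumption: duration_sec is a form field; the page names the
  # parameter but does not show how it is passed.
  curl -sS -X POST https://kymaapi.com/v1/audio/understand \
    -H "Authorization: Bearer $KYMA_API_KEY" \
    -F "file=@${file}" \
    -F "model=gemini-3-flash-audio" \
    --form-string "question=${question}" \
    -F "duration_sec=${duration}"
}
```

Usage: `ask_audio scene.mp3 "Is the speaker frustrated, neutral, or pleased? One word, then a one-sentence justification."`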

Aliases that resolve here

  • audio-understand — the canonical alias for “ask a question about audio”. Resolves to this SKU today.
If you want stable behavior across alias changes, pin gemini-3-flash-audio directly.

See also