Gemini 3 Flash (Audio)

Overview

gemini-3-flash-audio is Google’s Gemini 3 Flash Preview tuned for audio understanding rather than transcription. It listens to a clip and answers free-form questions about how it sounds: speaker emotion, music style, ambient sound effects, language, mood, pacing. This is the part of the audio that a transcript loses. Pair it with whisper-v3-turbo to cover both what was said and how it sounded.

Specs

Field	Value
Model ID	`gemini-3-flash-audio`
Creator	Google
Best for	Audio scene Q&A, mood/emotion, music recognition
Max file size	25 MB
Max duration	~30 min inline payload
Input modalities	Audio
Output modalities	Text
Pricing mode	Per minute
Min billable	1 minute (rounded up)
Release stage	Preview

Pricing

Usage	Cost
Per minute	$0.000648
1-hour file	$0.039
30-second clip	$0.000648 (rounds up to 1 min)
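
The figures above can be reproduced with a little arithmetic. The sketch below is an estimate only: it assumes durations are rounded up to whole minutes with a 1-minute minimum, which is one reading of the "Min billable" row and is not confirmed for clips longer than a minute. `estimate_cost` and `RATE_PER_MIN` are illustrative names, not part of the API.

```shell
# Per-minute rate from the Pricing table.
RATE_PER_MIN="0.000648"

# Estimate the charge for a clip of $1 seconds.
# ASSUMPTION: partial minutes round up, and the minimum charge is 1 minute.
estimate_cost() {
  awk -v s="$1" -v r="$RATE_PER_MIN" 'BEGIN {
    m = int(s / 60)
    if (m * 60 < s) m++      # round a partial minute up
    if (m < 1) m = 1         # minimum billable duration
    printf "%.6f\n", m * r
  }'
}

estimate_cost 30     # 30-second clip -> 0.000648
estimate_cost 3600   # 1-hour file    -> 0.038880
```

The 1-hour result, $0.03888, is the $0.039 shown in the table after rounding.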

Use this when

The question is about how something sounds, not what was said: mood, emotion, music style, ambient SFX.
You need to know what language is being spoken (and want a reasoned answer, not just a code).
You’re triaging audio for a video pipeline and want a one-sentence scene description per clip.

Pick something else when

You only need the words: use whisper-v3-turbo — it’s cheaper and faster.
You need real-time audio (sub-100ms latency) — Gemini 3 Flash audio is fast but not real-time.

Example

curl -X POST https://kymaapi.com/v1/audio/understand \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@scene.mp3" \
  -F "model=gemini-3-flash-audio" \
  -F "question=Describe the mood, dominant instruments, tempo, and any background sounds."

The endpoint is synchronous: the response includes the model’s answer plus a billing block. See the endpoint reference for all parameters.
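
Because each call returns its answer directly, triaging a batch of clips can be a plain loop with no polling. The following is a minimal sketch that reuses the request from the example above; `triage_clips` is a hypothetical helper name, and the raw response is printed as-is because the response field names are not documented on this page.

```shell
# Ask the same one-sentence question of every .mp3 in a directory,
# one request at a time, printing each raw response under the file name.
triage_clips() {
  local dir="$1" clip
  for clip in "$dir"/*.mp3; do
    [ -e "$clip" ] || continue   # skip when the glob matches nothing
    printf '== %s ==\n' "$clip"
    curl -sS -X POST https://kymaapi.com/v1/audio/understand \
      -H "Authorization: Bearer $KYMA_API_KEY" \
      -F "file=@$clip" \
      -F "model=gemini-3-flash-audio" \
      -F "question=Describe this clip in one sentence: mood, music, and background sounds."
    printf '\n'
  done
}
```

Call it as `triage_clips ./clips`. Every clip is billed at least the 1-minute minimum, so very short clips cost the same as a full minute.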

Tips

Be specific in the question. “What’s the mood?” works, but “Is the speaker frustrated, neutral, or pleased? One word, then a one-sentence justification” gets a more useful answer.
Pass duration_sec when you have it (ffprobe gives it to you in one line). This avoids overcharging on high-bitrate inputs, where size-based estimation overestimates the duration.
Audio scene first, transcript second. A 1-sentence audio summary often gives an agent enough context to decide whether the transcript is even worth fetching.
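
The duration_sec tip can be sketched as follows. The ffprobe invocation is standard FFmpeg usage; sending duration_sec as a multipart form field alongside file, model, and question is an assumption, so check the endpoint reference for where it actually goes. `clip_duration` and `ask_about_audio` are hypothetical helper names.

```shell
# Print the clip duration in seconds using ffprobe (ships with FFmpeg).
clip_duration() {
  ffprobe -v error -show_entries format=duration \
    -of default=noprint_wrappers=1:nokey=1 "$1"
}

# Ask a question about a clip and send the measured duration with it.
# ASSUMPTION: duration_sec is accepted as a form field; this page names
# the parameter but does not say how it is passed.
ask_about_audio() {
  local file="$1" question="$2" duration
  duration="$(clip_duration "$file")" || return 1
  curl -sS -X POST https://kymaapi.com/v1/audio/understand \
    -H "Authorization: Bearer $KYMA_API_KEY" \
    -F "file=@$file" \
    -F "model=gemini-3-flash-audio" \
    -F "duration_sec=$duration" \
    -F "question=$question"
}
```

For example: `ask_about_audio scene.mp3 "Is the speaker frustrated, neutral, or pleased? One word, then a one-sentence justification."`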

Aliases that resolve here

audio-understand — the canonical alias for “ask a question about audio”. Resolves to this SKU today.

If you want stable behavior across alias changes, pin gemini-3-flash-audio directly.
