Audio Transcriptions

POST /v1/audio/transcriptions


Synchronous endpoint. Upload an audio file, get the transcript back in one call. Compatible with the OpenAI Whisper API — drop-in replacement for https://api.openai.com/v1/audio/transcriptions.
curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@meeting.mp3" \
  -F "model=transcribe"

Request

The request body is a multipart/form-data upload with the following fields.

file (file, required)
Audio file. Supports mp3, wav, m4a, ogg, webm, flac. Max 25 MB. ~30 minutes of mono 16kHz mp3 fits comfortably.

model (string, default: "transcribe")
Either the alias transcribe (recommended; it auto-tracks the current best ASR model) or a pinned SKU like whisper-v3-turbo. See Audio models.

language (string, optional)
ISO-639-1 code (e.g. en, vi, ja). Whisper auto-detects the language when this is omitted; supplying it improves accuracy on short clips.

response_format (string, default: "verbose_json")
One of: json, verbose_json, text, srt, vtt. The JSON formats embed a billing block in the response body. text returns the bare transcript, and srt / vtt return subtitle files; for those three, billing rides on X-Kyma-* response headers so the body stays a clean transcript or subtitle file.

temperature (number, default: 0)
Sampling temperature, 0–1. The default of 0 is deterministic.

prompt (string, optional)
Priming text. Use it to nudge the model toward known proper nouns, acronyms, or domain vocabulary in your audio.
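Assembled in code, the fields above map onto a plain form dict; a minimal sketch (the `build_form` helper and the file name in the comment are illustrative, not part of the API):

```python
def build_form(model="transcribe", language=None, response_format=None,
               temperature=None, prompt=None):
    """Assemble the multipart form fields documented above; None means omit."""
    fields = {"model": model}
    optional = {"language": language, "response_format": response_format,
                "temperature": temperature, "prompt": prompt}
    # curl sends every -F value as a string, so coerce numbers the same way
    fields.update({k: str(v) for k, v in optional.items() if v is not None})
    return fields

# With e.g. the third-party requests library (hypothetical file name):
#   requests.post("https://kymaapi.com/v1/audio/transcriptions",
#                 headers={"Authorization": f"Bearer {KYMA_API_KEY}"},
#                 files={"file": open("meeting.mp3", "rb")},
#                 data=build_form(language="en", temperature=0))
print(build_form(language="en", temperature=0))
```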

Response

200 OK with the transcript and a Kyma billing block.
{
  "task": "transcribe",
  "language": "English",
  "duration": 5.03,
  "text": "For too long, I have watched mortals suffer.",
  "segments": [
    {
      "id": 0,
      "start": 0,
      "end": 4.74,
      "text": "For too long, I have watched mortals suffer.",
      "tokens": [50365, 1171, 886, 938, 11, 286, 362, 6337, 6599, 1124, 9753, 13, 50602],
      "temperature": 0,
      "avg_logprob": -0.20,
      "compression_ratio": 0.85,
      "no_speech_prob": 0.0
    }
  ],
  "model": "whisper-v3-turbo",
  "billing": {
    "duration_sec": 5.03,
    "billable_minutes": 1,
    "cost_usd": 0.0009,
    "balance_usd": 41.469
  }
}
text (string)
The full transcript.

language (string)
Detected language (full name, e.g. "English").

duration (number)
Audio duration in seconds (decoded from the file, not estimated).

segments (array)
Per-segment timestamps and text. Only present when response_format is verbose_json.

model (string)
The Kyma model SKU that served the request.

billing.billable_minutes (number)
Minutes charged. Audio is billed in 1-minute increments, rounded up.

billing.cost_usd (number)
Final cost charged for this request.

billing.balance_usd (number)
Remaining balance after this charge.
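Because segments carries start/end times in seconds, you can derive subtitle output from verbose_json yourself; a sketch using only the field names shown in the sample response above:

```python
def srt_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list) -> str:
    """Build a SubRip file from verbose_json segments (start/end/text)."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{srt_timestamp(seg['start'])} --> "
                      f"{srt_timestamp(seg['end'])}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)

print(segments_to_srt([{"start": 0, "end": 4.74,
                        "text": "For too long, I have watched mortals suffer."}]))
```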

Non-JSON formats

When response_format is text, srt, or vtt, the body is a plain transcript or subtitle file (no JSON envelope) and billing comes back on response headers:
Header                     Meaning
X-Kyma-Model               The model SKU that served the request
X-Kyma-Duration-Sec        Detected audio duration in seconds
X-Kyma-Billable-Minutes    Minutes charged
X-Kyma-Cost-USD            Final cost in USD
X-Kyma-Balance-USD         Remaining account balance
srt returns a SubRip subtitle file (application/x-subrip; charset=utf-8); vtt returns a WebVTT file (text/vtt; charset=utf-8). Both are built from the same per-segment timestamps verbose_json exposes, so the timing matches across formats.
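A small helper can lift the billing info out of those headers; this sketch assumes only the header names listed above (the `billing_from_headers` name is illustrative):

```python
from typing import Mapping

def billing_from_headers(headers: Mapping[str, str]) -> dict:
    """Parse the X-Kyma-* billing headers that non-JSON formats return."""
    return {
        "model": headers.get("X-Kyma-Model"),
        "duration_sec": float(headers["X-Kyma-Duration-Sec"]),
        "billable_minutes": int(headers["X-Kyma-Billable-Minutes"]),
        "cost_usd": float(headers["X-Kyma-Cost-USD"]),
        "balance_usd": float(headers["X-Kyma-Balance-USD"]),
    }

# Sample values taken from the verbose_json example above:
print(billing_from_headers({
    "X-Kyma-Model": "whisper-v3-turbo",
    "X-Kyma-Duration-Sec": "5.03",
    "X-Kyma-Billable-Minutes": "1",
    "X-Kyma-Cost-USD": "0.0009",
    "X-Kyma-Balance-USD": "41.469",
}))
```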

Pricing

Model               Per minute
whisper-v3-turbo    $0.0009

Billed per minute, rounded up: a 5-second clip bills as 1 minute ($0.0009), and a 1-hour file costs $0.054.
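The rounding rule is easy to sanity-check in code; a sketch of the arithmetic only (the rate is copied from the table above, not fetched from the API):

```python
import math

PRICE_PER_MINUTE_USD = 0.0009  # whisper-v3-turbo, from the pricing table

def estimate_cost_usd(duration_sec: float) -> float:
    """Per-minute billing, rounded up to the next whole minute."""
    billable_minutes = math.ceil(duration_sec / 60)
    return round(billable_minutes * PRICE_PER_MINUTE_USD, 6)

print(estimate_cost_usd(5))     # 5-second clip bills as 1 minute
print(estimate_cost_usd(3600))  # 1-hour file
```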

Errors

Status  error.code                 When
400     invalid_request            Missing file field, or not multipart/form-data
400     not_a_transcription_model  model is not a transcription SKU
401     auth_error                 Missing or invalid API key
402     billing_error              Insufficient credits
404     not_enabled                Audio gate not enabled on this account
413     invalid_request            File > 25 MB
502     provider_error             Upstream transcription failed
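One reasonable way to branch on these statuses client-side; the grouping below is a suggestion, not a policy the API mandates:

```python
def next_action(status: int) -> str:
    """Map a Kyma error status to a rough client-side action."""
    if status == 502:
        return "retry"        # provider_error: upstream hiccup, safe to retry
    if status in (401, 402, 404):
        return "fix_account"  # key, credits, or audio gate problem
    if status in (400, 413):
        return "fix_request"  # malformed form, wrong model, or file too large
    return "unexpected"

print(next_action(502), next_action(402), next_action(413))
```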

Examples

Pin a specific model

curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@interview.mp3" \
  -F "model=whisper-v3-turbo" \
  -F "response_format=verbose_json" \
  -F "language=en"

Just the transcript text

curl -X POST https://kymaapi.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $KYMA_API_KEY" \
  -F "file=@clip.mp3" \
  -F "model=transcribe" \
  -F "response_format=text"
Returns the bare transcript without segments or metadata.

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="kyma-...",
)

with open("meeting.mp3", "rb") as f:
    result = client.audio.transcriptions.create(
        model="transcribe",
        file=f,
    )

print(result.text)

See also

  • Audio Understand - the rest of the audio scene (tone, music, mood)
  • Audio models - SKUs behind the transcribe alias
  • watch-cli - an open-source CLI that uses these endpoints to give any agent eyes and ears for social video