Data Extraction & Structured Output

Best Model for This

Model	Why	Cost per 1K extractions
`qwen-3-32b`	Fast, accurate JSON, low cost	~$0.40
`deepseek-v3`	Best for complex nested schemas	~$0.75
`llama-3.3-70b`	Good balance of speed + accuracy	~$1.00

Costs assume ~400 tokens input + ~200 tokens output per extraction.

Quick Start

# Python example (requires the openai and pydantic packages)
import json
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="ky-your-api-key"
)

class Invoice(BaseModel):
    vendor: str
    amount: float
    currency: str
    date: str
    line_items: list[str]

SYSTEM = ('Extract invoice data as JSON. Schema: {"vendor": str, "amount": float, '
          '"currency": str, "date": "YYYY-MM-DD", "line_items": [str]}. JSON only.')

def extract_invoice(text: str) -> Invoice:
    """Send raw invoice text to the model and return a validated Invoice."""
    resp = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[{"role": "system", "content": SYSTEM}, {"role": "user", "content": text}],
        response_format={"type": "json_object"},  # request JSON-only output
        temperature=0,
    )
    # Pydantic raises ValidationError if a field is missing or has the wrong type.
    return Invoice(**json.loads(resp.choices[0].message.content))

raw = "Invoice from Acme Corp, March 15 2025. 10x Widget A @ $5, 2x Widget B @ $12.50. Total $75 USD."
invoice = extract_invoice(raw)
print(f"Vendor: {invoice.vendor}, Amount: {invoice.amount} {invoice.currency}")

// JavaScript example (Node.js, ES modules; requires the openai and zod packages)
import OpenAI from "openai";
import { z } from "zod";

const client = new OpenAI({
  baseURL: "https://kymaapi.com/v1",
  apiKey: "ky-your-api-key",
});

const InvoiceSchema = z.object({
  vendor: z.string(),
  amount: z.number(),
  currency: z.string(),
  date: z.string(),
  line_items: z.array(z.string()),
});

const SYSTEM = 'Extract invoice data as JSON. Schema: {"vendor": string, "amount": number, ' +
               '"currency": string, "date": "YYYY-MM-DD", "line_items": string[]}. JSON only.';

// Send raw invoice text to the model and return an object validated by Zod.
async function extractInvoice(text) {
  const resp = await client.chat.completions.create({
    model: "qwen-3-32b",
    messages: [{ role: "system", content: SYSTEM }, { role: "user", content: text }],
    response_format: { type: "json_object" }, // request JSON-only output
    temperature: 0,
  });
  // parse() throws a ZodError if a field is missing or has the wrong type.
  return InvoiceSchema.parse(JSON.parse(resp.choices[0].message.content));
}

const raw = "Invoice from Acme Corp, March 15 2025. 10x Widget A @ $5, 2x Widget B @ $12.50. Total $75 USD.";
const invoice = await extractInvoice(raw);
console.log(`Vendor: ${invoice.vendor}, Amount: ${invoice.amount} ${invoice.currency}`);

Tips & Best Practices

Always set temperature=0: extraction should be repeatable, not creative. Higher temperatures introduce variation in field names and values.
Always validate output with Pydantic or Zod: models occasionally omit optional fields or format dates differently from what the schema asks for.
Provide an example in the system prompt: a single input paired with its expected output often improves accuracy noticeably on complex schemas.
Use response_format: {"type": "json_object"} to get JSON-parseable output with no markdown wrapping or prose before the JSON. It does not enforce your schema, so validation is still required.
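The tip about putting an example in the system prompt can be sketched as follows. This is a minimal illustration, not the only way to do it: the example invoice (`EXAMPLE_INPUT`, `EXAMPLE_OUTPUT`) and the helper name `build_messages` are hypothetical, and the instruction text simply reuses the Quick Start prompt.

```python
import json

# Hypothetical worked example; replace it with a document typical of your own data.
EXAMPLE_INPUT = "Invoice from Globex Ltd, 2 January 2025. 3x Cable @ $4. Total $12 USD."
EXAMPLE_OUTPUT = {
    "vendor": "Globex Ltd",
    "amount": 12.0,
    "currency": "USD",
    "date": "2025-01-02",
    "line_items": ["3x Cable @ $4"],
}

# Same instructions as the Quick Start system prompt.
BASE_INSTRUCTIONS = (
    'Extract invoice data as JSON. Schema: {"vendor": str, "amount": float, '
    '"currency": str, "date": "YYYY-MM-DD", "line_items": [str]}. JSON only.'
)


def build_messages(text: str) -> list[dict]:
    """Return chat messages whose system prompt embeds one input/output example."""
    system = (
        f"{BASE_INSTRUCTIONS}\n\n"
        f"Example input:\n{EXAMPLE_INPUT}\n\n"
        f"Example output:\n{json.dumps(EXAMPLE_OUTPUT)}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": text},
    ]


messages = build_messages("Invoice from Acme Corp, March 15 2025. Total $75 USD.")
```

The resulting list can be passed as `messages=` to `client.chat.completions.create` in place of the two-message list used in the Quick Start. Serializing the example with `json.dumps` keeps the demonstrated output in exactly the format the model is expected to return.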

Cost Estimate

Volume	Model	Monthly cost (30 days)
10K extractions/day	`qwen-3-32b`	~$120/month
10K extractions/day	`deepseek-v3`	~$225/month
100K extractions/day	`qwen-3-32b`	~$1,200/month

Assumes ~400 tokens input + ~200 tokens output per extraction. Token count scales with document length.
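For workloads not listed here, a per-1K-extractions price can be scaled to a monthly figure with a few lines of arithmetic. The sketch below is a rough estimator, not a billing formula: the function name `monthly_cost`, the 30-day month, and the linear scaling by token count are all assumptions (input and output tokens are usually priced differently, so scaling the blended figure is only an approximation).

```python
BASELINE_TOKENS = 600  # ~400 input + ~200 output, the assumption behind the per-1K prices


def monthly_cost(
    cost_per_1k: float,
    extractions_per_day: int,
    tokens_per_extraction: int = BASELINE_TOKENS,
    days: int = 30,
) -> float:
    """Estimate monthly spend in USD from a per-1K-extractions price."""
    # Longer documents use more tokens; scale the price linearly (approximation).
    scale = tokens_per_extraction / BASELINE_TOKENS
    return cost_per_1k * (extractions_per_day / 1000) * days * scale


# llama-3.3-70b at ~$1.00 per 1K extractions, 5,000 extractions per day
print(monthly_cost(1.00, 5_000))  # 150.0

# Same workload with documents twice as long (~1,200 tokens per extraction)
print(monthly_cost(1.00, 5_000, tokens_per_extraction=1_200))  # 300.0
```

Measuring the real token counts on a sample of your own documents and passing the average as `tokens_per_extraction` gives a more realistic figure than the 600-token default.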

Next Steps

Structured Outputs — enforced JSON schema reference
Error Handling — handle malformed JSON and retries
Prompt Caching — cache the system prompt when processing batches
