
Best Model for This

| Model | Why | Cost per query |
|---|---|---|
| gemini-2.5-flash | 1M context — load entire codebases | ~$0.009 |
| qwen-3-32b | Fast synthesis for short contexts | ~$0.004 |
| deepseek-v3 | Best reasoning over complex documents | ~$0.007 |

Costs assume ~2K tokens of context + ~300 tokens of output per query.

Quick Start

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="ky-your-api-key"
)

def retrieve_context(query: str) -> list[str]:
    # Your retrieval logic: vector DB, keyword search, etc.
    return [
        "Kyma API exposes its active models through the /v1/models endpoint.",
        "All models use the OpenAI-compatible /v1/chat/completions endpoint.",
        "Gemini 2.5 Flash provides 1M context for large-document RAG.",
    ]

def rag_answer(question: str) -> str:
    chunks = retrieve_context(question)
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))

    response = client.chat.completions.create(
        model="gemini-2.5-flash",
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided context. "
                           "Cite sources with [1], [2] etc. "
                           "If the answer isn't in the context, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

print(rag_answer("What models does Kyma API support?"))
```
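The `retrieve_context` stub above returns hardcoded strings. As a minimal stand-in for a real retriever, a keyword-overlap scorer works for small corpora — a sketch only (function name and scoring are illustrative, not a Kyma API feature); swap in a vector DB for production:

```python
def keyword_retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Rank documents by how many query words they share (toy retrieval)."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

docs = [
    "Gemini 2.5 Flash provides 1M context for large-document RAG.",
    "Streaming reduces perceived latency.",
]
print(keyword_retrieve("What context window does Gemini offer?", docs, top_k=1))
```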

Tips & Best Practices

  • Use gemini-2.5-flash for large documents — its 1M token context window can hold an entire codebase or book, so for smaller corpora you can skip chunking and pass the whole document.
  • Enable prompt caching for repeated context — if the same document is queried multiple times, caching cuts input cost by 90%. See Prompt Caching.
  • Be explicit about citation style — asking the model to cite [1], [2] reduces hallucination and makes answers verifiable.
  • Instruct the model to say “I don’t know” — without this, models will confabulate answers from training data even when context is insufficient.
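Prefix caching typically only reuses tokens that appear identically at the start of the prompt, so structure messages with the large, repeated document first and the per-query question last. A sketch of that ordering (exact caching behavior is described on the Prompt Caching page; this helper is illustrative):

```python
def build_messages(document: str, question: str) -> list[dict]:
    """Build a chat payload with a cache-friendly stable prefix."""
    return [
        {
            # Stable prefix: identical across queries over the same
            # document, so a prefix cache can reuse these tokens.
            "role": "system",
            "content": "Answer using only the provided context. "
                       "If the answer isn't in the context, say so.\n\n"
                       f"Context:\n{document}",
        },
        # Only this suffix changes from query to query.
        {"role": "user", "content": f"Question: {question}"},
    ]
```

Pass the result straight to `client.chat.completions.create(model=..., messages=build_messages(doc, q))`.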

Cost Estimate

| Volume | Context size | Model | Monthly cost |
|---|---|---|---|
| 1K queries/day | 2K tokens | qwen-3-32b | ~$4/month |
| 1K queries/day | 10K tokens | gemini-2.5-flash | ~$18/month |
| 1K queries/day | 50K tokens | gemini-2.5-flash | ~$65/month |
Large context = most of your cost. Use prompt caching if the same document appears in many queries.
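To estimate your own spend, multiply per-query token counts by per-token prices. A sketch — the prices in the example call are placeholders, not Kyma API's actual rates (check the Models page for current pricing):

```python
def monthly_cost(queries_per_day: int, context_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate monthly spend in USD; prices are USD per 1M tokens."""
    per_query = (context_tokens * input_price_per_m
                 + output_tokens * output_price_per_m) / 1_000_000
    return per_query * queries_per_day * 30

# Hypothetical prices: $0.30/M input, $2.50/M output.
print(monthly_cost(1_000, 2_000, 300, 0.30, 2.50))  # 40.5
```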

Next Steps

  • Prompt Caching — up to 90% discount on repeated context
  • Streaming — stream RAG answers for faster perceived latency
  • Models — compare context windows across all active models