
Best Model for This

| Model | Why | Cost per query |
|---|---|---|
| gemini-2.5-flash | 1M context — load entire codebases | ~$0.009 |
| qwen-3-32b | Fast synthesis for short contexts | ~$0.004 |
| deepseek-v3 | Best reasoning over complex documents | ~$0.007 |

Costs assume ~2K tokens of context + ~300 tokens of output per query.

Quick Start

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://kymaapi.com/v1",
    api_key="ky-your-api-key"
)

def retrieve_context(query: str) -> list[str]:
    # Your retrieval logic: vector DB, keyword search, etc.
    return [
        "Kyma API exposes its active models through the /v1/models endpoint.",
        "All models use the OpenAI-compatible /v1/chat/completions endpoint.",
        "Gemini 2.5 Flash provides 1M context for large-document RAG.",
    ]

def rag_answer(question: str) -> str:
    chunks = retrieve_context(question)
    context = "\n\n".join(f"[{i+1}] {c}" for i, c in enumerate(chunks))

    response = client.chat.completions.create(
        model="gemini-2.5-flash",
        messages=[
            {
                "role": "system",
                "content": "Answer using only the provided context. "
                           "Cite sources with [1], [2] etc. "
                           "If the answer isn't in the context, say so."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

print(rag_answer("What models does Kyma API support?"))
```
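The `retrieve_context` stub above returns hardcoded strings. As a minimal stand-in for a real retriever, a keyword-overlap scorer works for small corpora — a sketch only (function name and scoring are illustrative, not a Kyma API feature); swap in a vector DB for production:

```python
def keyword_retrieve(query: str, documents: list[str], top_k: int = 3) -> list[str]:
    """Rank documents by how many query words they share (toy retrieval)."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

docs = [
    "Gemini 2.5 Flash provides 1M context for large-document RAG.",
    "Streaming reduces perceived latency.",
]
print(keyword_retrieve("What context window does Gemini offer?", docs, top_k=1))
```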

Tips & Best Practices

  • Use gemini-2.5-flash for large documents — its 1M token context window can hold an entire codebase or book, so for smaller corpora you can skip chunking and pass the whole document.
  • Enable prompt caching for repeated context — if the same document is queried multiple times, caching cuts input cost by 90%. See Prompt Caching.
  • Be explicit about citation style — asking the model to cite [1], [2] reduces hallucination and makes answers verifiable.
  • Instruct the model to say “I don’t know” — without this, models will confabulate answers from training data even when context is insufficient.
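Prefix caching typically only reuses tokens that appear identically at the start of the prompt, so structure messages with the large, repeated document first and the per-query question last. A sketch of that ordering (exact caching behavior is described on the Prompt Caching page; this helper is illustrative):

```python
def build_messages(document: str, question: str) -> list[dict]:
    """Build a chat payload with a cache-friendly stable prefix."""
    return [
        {
            # Stable prefix: identical across queries over the same
            # document, so a prefix cache can reuse these tokens.
            "role": "system",
            "content": "Answer using only the provided context. "
                       "If the answer isn't in the context, say so.\n\n"
                       f"Context:\n{document}",
        },
        # Only this suffix changes from query to query.
        {"role": "user", "content": f"Question: {question}"},
    ]
```

Pass the result straight to `client.chat.completions.create(model=..., messages=build_messages(doc, q))`.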

Cost Estimate

| Volume | Context size | Model | Monthly cost |
|---|---|---|---|
| 1K queries/day | 2K tokens | qwen-3-32b | ~$4/month |
| 1K queries/day | 10K tokens | gemini-2.5-flash | ~$18/month |
| 1K queries/day | 50K tokens | gemini-2.5-flash | ~$65/month |
Large context = most of your cost. Use prompt caching if the same document appears in many queries.
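To estimate your own spend, multiply per-query token counts by per-token prices. A sketch — the prices in the example call are placeholders, not Kyma API's actual rates (check the Models page for current pricing):

```python
def monthly_cost(queries_per_day: int, context_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimate monthly spend in USD; prices are USD per 1M tokens."""
    per_query = (context_tokens * input_price_per_m
                 + output_tokens * output_price_per_m) / 1_000_000
    return per_query * queries_per_day * 30

# Hypothetical prices: $0.30/M input, $2.50/M output.
print(monthly_cost(1_000, 2_000, 300, 0.30, 2.50))  # 40.5
```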

Next Steps

  • Prompt Caching — up to 90% discount on repeated context
  • Streaming — stream RAG answers for faster perceived latency
  • Models — compare context windows across all active models