Model Backends

Model backends adapt provider APIs into one standard Krisis interface, so the same clinical task can run across OpenAI, Anthropic, Grok, and Gemini and return comparable result objects.

Supported Providers

Provider	Backend	Default model	Install extra
OpenAI	`OpenAIBackend`	`gpt-5.5`	`krisis[openai]`
Anthropic	`AnthropicBackend`	`claude-opus-4-7`	`krisis[anthropic]`
Grok	`GrokBackend`	`grok-4.3`	`krisis[grok]`
Google Gemini	`GeminiBackend`	`gemini-3-pro-preview`	`krisis[gemini]`

Install only the provider clients you need:

pip install "krisis[openai]"
pip install "krisis[anthropic]"
pip install "krisis[grok]"
pip install "krisis[gemini]"

Or install all provider extras:

pip install "krisis[all]"

Backend Contract

Every backend returns the same response shape:

BackendResponse(
    prediction=...,
    abstained=...,
    confidence=...,
    raw_response=...,
    prompt=...,
    prompt_mode=...,
    input_tokens=...,
    output_tokens=...,
    total_tokens=...,
)

Field meanings:

Field	Meaning
`prediction`	Parsed model prediction, or `None` when the model abstained
`abstained`	Whether the model declined to answer
`confidence`	Model-reported confidence between 0 and 1
`raw_response`	Raw provider text or JSON for auditability
`prompt`	Prompt/messages with patient data redacted
`prompt_mode`	`single` for one-row calls, `batch` for batched calls
`input_tokens`	Provider-reported input token count when available
`output_tokens`	Provider-reported output token count when available
`total_tokens`	Combined token count when available
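
The field table above can be read as a small set of consistency rules. The following is a minimal sketch of those rules, not the Krisis class itself: `BackendResponseSketch` and `check_response` are hypothetical names, and the field types are assumptions inferred from the field meanings.

```python
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class BackendResponseSketch:
    """Hypothetical stand-in mirroring the documented response shape."""
    prediction: Optional[Any]        # None when the model abstained
    abstained: bool
    confidence: Optional[float]      # expected to lie in [0, 1]
    raw_response: str                # kept verbatim for auditability
    prompt: Any                      # prompt/messages, patient data redacted
    prompt_mode: str                 # "single" or "batch"
    input_tokens: Optional[int] = None   # None when the provider reports nothing
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


def check_response(resp: BackendResponseSketch) -> List[str]:
    """Return contract violations; an empty list means the response is consistent."""
    problems = []
    if resp.abstained and resp.prediction is not None:
        problems.append("abstained responses should carry prediction=None")
    if resp.confidence is not None and not 0.0 <= resp.confidence <= 1.0:
        problems.append("confidence must lie between 0 and 1")
    if resp.prompt_mode not in ("single", "batch"):
        problems.append("prompt_mode must be 'single' or 'batch'")
    return problems
```

A check like this can be useful in tests for a custom backend, where a mismatch between `abstained` and `prediction` would otherwise surface only in downstream metrics.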

API Keys

You can pass provider keys directly:

from krisis.backends.openai import OpenAIBackend

backend = OpenAIBackend(api_key="YOUR_OPENAI_KEY")

Provider examples:

from krisis.backends.anthropic import AnthropicBackend
from krisis.backends.gemini import GeminiBackend
from krisis.backends.grok import GrokBackend

anthropic_backend = AnthropicBackend(api_key="YOUR_ANTHROPIC_KEY")
gemini_backend = GeminiBackend(api_key="YOUR_GEMINI_KEY")
grok_backend = GrokBackend(api_key="YOUR_XAI_KEY")

The example runner supports the generic API_KEY environment variable and provider-native environment variables where available.

Do not commit API keys

Keep API keys in environment variables, local secret managers, or CI secrets. Never put provider keys in benchmark scripts committed to GitHub.
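
One way to keep keys out of committed scripts is to resolve them from the environment at startup. The sketch below is illustrative only: `resolve_api_key` is a hypothetical helper, the provider-native variable names follow common provider SDK conventions and are assumptions about what Krisis reads, and the lookup order (provider-native first, then the generic `API_KEY`) is also an assumption.

```python
import os
from typing import Mapping, Optional

# Assumed provider-native variable names; check each provider's SDK docs.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "grok": "XAI_API_KEY",
    "gemini": "GEMINI_API_KEY",
}


def resolve_api_key(provider: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Look up a key in the environment instead of hard-coding it in a script."""
    if provider not in PROVIDER_ENV_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    env = os.environ if env is None else env
    # Assumed order: provider-native variable first, generic API_KEY second.
    for name in (PROVIDER_ENV_VARS[provider], "API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        f"no API key found; set {PROVIDER_ENV_VARS[provider]} or API_KEY"
    )
```

The resolved value can then be passed through the documented constructor argument, for example `OpenAIBackend(api_key=resolve_api_key("openai"))`.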

Batching And Concurrency

Krisis can evaluate records in provider batches:

from krisis.benchmark import Benchmark

result = Benchmark(
    suite,
    backend,
    batch_size=8,
    max_concurrency=2,
).run()

batch_size and max_concurrency control different things, and max_output_tokens caps output separately:

Setting	Meaning
`batch_size`	How many patient records are sent in one provider call
`max_concurrency`	How many provider calls may run at the same time
`max_output_tokens`	Per-row output-token cap passed to the provider backend

Example: with batch_size=8 and max_concurrency=2, up to 16 records (8 records × 2 concurrent calls) can be in flight at once, split across two API calls.
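
The same arithmetic generalizes to any run size. The sketch below is a planning aid, not part of Krisis: `plan_calls` is a hypothetical helper, and the "waves" figure assumes calls complete in lockstep, so it is only a rough lower bound.

```python
import math


def plan_calls(n_records: int, batch_size: int, max_concurrency: int) -> dict:
    """Estimate provider calls and peak in-flight records for a run."""
    if min(n_records, batch_size, max_concurrency) < 1:
        raise ValueError("all arguments must be at least 1")
    # Each call carries up to batch_size records.
    provider_calls = math.ceil(n_records / batch_size)
    # At most max_concurrency calls run at once, and never more than exist.
    concurrent_calls = min(max_concurrency, provider_calls)
    peak_in_flight = min(n_records, batch_size * concurrent_calls)
    # Lower bound on sequential rounds if calls finish in lockstep.
    min_waves = math.ceil(provider_calls / max_concurrency)
    return {
        "provider_calls": provider_calls,
        "peak_records_in_flight": peak_in_flight,
        "min_waves": min_waves,
    }


plan = plan_calls(n_records=100, batch_size=8, max_concurrency=2)
print(plan)  # 13 calls, 16 records in flight at peak, at least 7 waves
```

Comparing `provider_calls` against a provider's requests-per-minute limit is a quick sanity check before raising either setting.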

Start conservative

Use batch_size=8 and max_concurrency=1 or 2 for first runs. Increase only after checking provider rate limits and structured JSON reliability.

Empty provider responses

If a provider returns an empty response or truncated JSON, increase the example runner's --max-output-tokens value. For larger frontier models, also try a smaller --batch-size so each call has less JSON to produce.

OpenAI token cap

The OpenAI backend defaults to max_completion_tokens=1024 per row. This is higher than the visible JSON usually needs because OpenAI reasoning models may consume part of the completion budget before emitting the final JSON.

Structured Output

Backends request structured JSON when the provider supports it. The expected single-row response is:

{
  "abstained": false,
  "confidence": 0.82,
  "prediction": 0
}

Batched responses use:

{
  "results": [
    {
      "id": "case_0",
      "abstained": false,
      "confidence": 0.82,
      "prediction": 0
    }
  ]
}

If a batched response is malformed, Krisis can shrink the batch and retry smaller groups before falling back to single-row evaluation.
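
The shrink-and-retry behavior can be sketched as a recursive split. This is an illustration under stated assumptions, not the Krisis implementation: `evaluate_with_shrinking`, `MalformedBatchError`, and the two callables are hypothetical, and halving the batch is just one possible shrinking strategy.

```python
from typing import Any, Callable, Dict, Sequence


class MalformedBatchError(Exception):
    """Raised (in this sketch) when a batched response cannot be used."""


def evaluate_with_shrinking(
    records: Sequence[dict],
    call_batch: Callable[[Sequence[dict]], Dict[str, Any]],
    call_single: Callable[[dict], Any],
) -> Dict[str, Any]:
    """Evaluate records in one batch, halving on failure, down to single rows."""
    if not records:
        return {}
    if len(records) == 1:
        # Smallest unit: fall back to single-row evaluation.
        record = records[0]
        return {record["id"]: call_single(record)}
    try:
        results = call_batch(records)
        # Treat missing or unexpected ids as a malformed batch.
        if set(results) != {r["id"] for r in records}:
            raise MalformedBatchError("result ids do not match request ids")
        return results
    except MalformedBatchError:
        mid = len(records) // 2
        merged = evaluate_with_shrinking(records[:mid], call_batch, call_single)
        merged.update(evaluate_with_shrinking(records[mid:], call_batch, call_single))
        return merged
```

Matching results back to requests by `id`, as in the batched JSON shape above, is what makes the partial results safe to merge after a split.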

Failure Handling

Backend hardening includes:

retry with backoff for transient provider failures
parsing safeguards for markdown-wrapped or malformed JSON
recursive batch shrinking when batched JSON fails
single-row fallback for difficult cases
actionable empty-response errors when output-token caps are too low
raw response preservation for debugging
token usage aggregation in benchmark execution metadata
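
As an illustration of the parsing safeguards and empty-response errors listed above, the sketch below tolerates markdown-wrapped JSON and surrounding chatter. `parse_model_json` is a hypothetical helper, and the recovery order (fenced block, then whole text, then outermost braces) is an assumption rather than Krisis's actual parser.

```python
import json
import re
from typing import Any, Optional

FENCE = "`" * 3  # markdown code fence, built indirectly to keep this block readable
_FENCED = re.compile(
    FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL | re.IGNORECASE
)


def parse_model_json(text: str) -> Optional[Any]:
    """Parse provider output as JSON; return None if nothing usable is found."""
    if not text or not text.strip():
        # Actionable error: empty output often means the token cap was hit.
        raise ValueError(
            "empty provider response; consider raising the output-token cap"
        )
    candidate = text.strip()
    match = _FENCED.search(candidate)
    if match:
        candidate = match.group(1)  # keep only the fenced payload
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        pass
    # Last resort: try the span between the first "{" and the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(candidate[start:end + 1])
        except json.JSONDecodeError:
            return None
    return None
```

Returning `None` for unrecoverable text, instead of raising, lets a caller decide whether to retry, shrink the batch, or record the raw response for debugging.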

Batching reduces calls, not reasoning time

Large frontier models may still take time to reason over batched clinical cases. Batching usually reduces HTTP overhead and rate-limit pressure, but it does not shorten the model's own reasoning time.

Choosing A Model

For fast smoke tests, use a lightweight provider model and a small number of synthetic records. For publishable comparisons, use fixed seeds, fixed task settings, and the same batch/concurrency settings across models where provider limits allow.

Always report:

provider
model ID
task
feature set
batch size
max concurrency
output-token cap
elapsed seconds
token total
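
The checklist above can be enforced mechanically before results are shared. This is a minimal sketch with hypothetical names: `REQUIRED_REPORT_FIELDS` and `missing_report_fields` are not part of Krisis, and the key names are one possible spelling of the items in the list.

```python
from typing import List, Mapping

# One key per item in the reporting checklist (names are assumptions).
REQUIRED_REPORT_FIELDS = (
    "provider",
    "model_id",
    "task",
    "feature_set",
    "batch_size",
    "max_concurrency",
    "max_output_tokens",
    "elapsed_seconds",
    "total_tokens",
)


def missing_report_fields(report: Mapping[str, object]) -> List[str]:
    """Return checklist items that are absent or blank in a run report."""
    missing = []
    for field in REQUIRED_REPORT_FIELDS:
        value = report.get(field)
        if value is None or value == "":
            missing.append(field)
    return missing


# Settings taken from the examples on this page; measured values not yet filled in.
draft = {
    "provider": "openai",
    "model_id": "gpt-5.5",
    "batch_size": 8,
    "max_concurrency": 2,
    "max_output_tokens": 1024,
}
print(missing_report_fields(draft))
```

Running the check on a draft report shows which items still need to be filled in from the benchmark's execution metadata.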