Model Backends

Model backends adapt each provider's API to one standard Krisis interface. As a result, the same clinical task can run across OpenAI, Anthropic, Grok, and Gemini and return comparable result objects.

Supported Providers

| Provider | Backend | Default model | Install extra |
| --- | --- | --- | --- |
| OpenAI | OpenAIBackend | gpt-5.5 | krisis[openai] |
| Anthropic | AnthropicBackend | claude-opus-4-7 | krisis[anthropic] |
| Grok | GrokBackend | grok-4.3 | krisis[grok] |
| Google Gemini | GeminiBackend | gemini-3-pro-preview | krisis[gemini] |

Install only the provider clients you need:

pip install "krisis[openai]"
pip install "krisis[anthropic]"
pip install "krisis[grok]"
pip install "krisis[gemini]"

Or install all provider extras:

pip install "krisis[all]"

Backend Contract

Every backend returns the same response shape:

BackendResponse(
    prediction=...,
    abstained=...,
    confidence=...,
    raw_response=...,
    prompt=...,
    prompt_mode=...,
    input_tokens=...,
    output_tokens=...,
    total_tokens=...,
)

Field meanings:

| Field | Meaning |
| --- | --- |
| prediction | Parsed model prediction, or None when the model abstained |
| abstained | Whether the model declined to answer |
| confidence | Model-reported confidence between 0 and 1 |
| raw_response | Raw provider text or JSON for auditability |
| prompt | Prompt/messages with patient data redacted |
| prompt_mode | single for one-row calls, batch for batched calls |
| input_tokens | Provider-reported input token count when available |
| output_tokens | Provider-reported output token count when available |
| total_tokens | Combined token count when available |
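To make the field table concrete, here is a minimal sketch of the response shape as a plain dataclass, plus a consistency check derived from the field meanings. `BackendResponseSketch` and `check_response` are hypothetical names used for illustration; the field types are assumptions, and this is not the library's actual class definition.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class BackendResponseSketch:
    """Hypothetical stand-in for BackendResponse; field types are assumptions."""

    prediction: Optional[Any]
    abstained: bool
    confidence: Optional[float]
    raw_response: Any
    prompt: Any
    prompt_mode: str
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


def check_response(resp: BackendResponseSketch) -> list:
    """Return a list of inconsistencies; an empty list means none were found."""
    problems = []
    # The table says prediction is None when the model abstained.
    if resp.abstained and resp.prediction is not None:
        problems.append("abstained responses should have prediction=None")
    # Confidence is documented as a value between 0 and 1.
    if resp.confidence is not None and not 0.0 <= resp.confidence <= 1.0:
        problems.append("confidence should lie between 0 and 1")
    # prompt_mode is documented as either single or batch.
    if resp.prompt_mode not in ("single", "batch"):
        problems.append("prompt_mode should be 'single' or 'batch'")
    return problems
```

A check like this can be useful in tests that compare providers, since it only relies on the documented meanings rather than on any provider-specific detail.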

API Keys

You can pass provider keys directly:

from krisis.backends.openai import OpenAIBackend

backend = OpenAIBackend(api_key="YOUR_OPENAI_KEY")

The other providers follow the same pattern:

from krisis.backends.anthropic import AnthropicBackend
from krisis.backends.gemini import GeminiBackend
from krisis.backends.grok import GrokBackend

anthropic_backend = AnthropicBackend(api_key="YOUR_ANTHROPIC_KEY")
gemini_backend = GeminiBackend(api_key="YOUR_GEMINI_KEY")
grok_backend = GrokBackend(api_key="YOUR_XAI_KEY")

The example runner supports the generic API_KEY environment variable and provider-native environment variables where available.

Do not commit API keys

Keep API keys in environment variables, local secret managers, or CI secrets. Never put provider keys in benchmark scripts committed to GitHub.
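One way to follow this advice is to read the key from the environment at run time and pass it to the backend constructor. The helper below is a minimal sketch: `resolve_api_key` is a hypothetical function, and the lookup order (provider-native variable first, then the generic `API_KEY`) is an assumption rather than documented Krisis behavior. `OPENAI_API_KEY` is the variable name used by OpenAI's own SDK.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(provider_env_var: str,
                    env: Optional[Mapping[str, str]] = None) -> str:
    """Return the first non-empty key found in the environment.

    Assumed precedence: the provider-native variable, then generic API_KEY.
    """
    env = os.environ if env is None else env
    for name in (provider_env_var, "API_KEY"):
        value = env.get(name, "").strip()
        if value:
            return value
    raise RuntimeError(
        f"No API key found; set {provider_env_var} or API_KEY in the environment."
    )


# Usage (requires krisis[openai] to be installed):
# from krisis.backends.openai import OpenAIBackend
# backend = OpenAIBackend(api_key=resolve_api_key("OPENAI_API_KEY"))
```

Failing early with a clear message keeps a missing key from surfacing later as an opaque provider authentication error.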

Batching And Concurrency

Krisis can evaluate records in provider batches:

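from krisis.benchmark import Benchmark

# `suite` and `backend` are assumed to have been created already,
# for example a task suite and one of the provider backends above.
result = Benchmark(
    suite,
    backend,
    batch_size=8,        # records per provider call
    max_concurrency=2,   # provider calls allowed in parallel
).run()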

batch_size and max_concurrency control different things, and max_output_tokens caps output separately:

| Setting | Meaning |
| --- | --- |
| batch_size | How many patient records are sent in one provider call |
| max_concurrency | How many provider calls may run at the same time |
| max_output_tokens | Per-row output-token cap passed to the provider backend |

Example: batch_size=8 and max_concurrency=2 allows up to 16 records in flight at once, as two concurrent API calls of eight records each.
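The arithmetic behind that example can be sketched as a small planning helper. `plan_calls` is a hypothetical function, not part of Krisis, and the "rounds" figure assumes calls take roughly equal time, so treat it as a rough lower bound.

```python
import math


def plan_calls(n_records: int, batch_size: int, max_concurrency: int) -> dict:
    """Estimate call counts for a run; all inputs must be positive integers."""
    if min(n_records, batch_size, max_concurrency) < 1:
        raise ValueError("n_records, batch_size and max_concurrency must be >= 1")
    # Each provider call carries up to batch_size records.
    total_calls = math.ceil(n_records / batch_size)
    return {
        "total_calls": total_calls,
        # At most max_concurrency calls run at once, each with batch_size records.
        "max_records_in_flight": min(n_records, batch_size * max_concurrency),
        # Lower bound on sequential rounds if every call took the same time.
        "min_sequential_rounds": math.ceil(total_calls / max_concurrency),
    }


print(plan_calls(100, batch_size=8, max_concurrency=2))
```

For 100 records this gives 13 provider calls, at most 16 records in flight, and at least 7 sequential rounds.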

Start conservative

Use batch_size=8 and max_concurrency=1 or 2 for first runs. Increase only after checking provider rate limits and structured JSON reliability.

Empty provider responses

If a provider returns an empty response or truncated JSON, increase the example runner's --max-output-tokens value. For larger frontier models, also try a smaller --batch-size so each call has less JSON to produce.

OpenAI token cap

The OpenAI backend defaults to max_completion_tokens=1024 per row. This is higher than the visible JSON usually needs because OpenAI reasoning models may consume part of the completion budget before emitting the final JSON.

Structured Output

Backends request structured JSON when the provider supports it. The expected single-row response is:

{
  "abstained": false,
  "confidence": 0.82,
  "prediction": 0
}

Batched responses use:

{
  "results": [
    {
      "id": "case_0",
      "abstained": false,
      "confidence": 0.82,
      "prediction": 0
    }
  ]
}
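As an illustration of how a batched response in this shape might be validated, the sketch below strips an optional markdown code fence, parses the JSON, and checks that every requested id came back. `strip_markdown_fence` and `parse_batch` are hypothetical helpers written for this example; Krisis's own parser may behave differently.

```python
import json

FENCE = "`" * 3  # three backticks, built programmatically for readability here


def strip_markdown_fence(text: str) -> str:
    """Remove a surrounding markdown code fence, if present."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        text = text[len(FENCE):-len(FENCE)]
        # Drop an optional language tag such as "json" on the opening line.
        first_line, _, rest = text.partition("\n")
        if first_line.strip() == "" or first_line.strip().isalpha():
            text = rest
    return text.strip()


def parse_batch(raw: str, expected_ids: list) -> dict:
    """Parse a batched response into {id: result}; raise ValueError if unusable."""
    try:
        payload = json.loads(strip_markdown_fence(raw))
        results = {item["id"]: item for item in payload["results"]}
    except (KeyError, TypeError) as exc:
        raise ValueError(f"unexpected response structure: {exc!r}") from exc
    # json.JSONDecodeError is already a ValueError, so it propagates as-is.
    missing = [case_id for case_id in expected_ids if case_id not in results]
    if missing:
        raise ValueError(f"missing results for ids: {missing}")
    return results
```

Raising one exception type for every failure mode keeps the caller's retry logic simple: any `ValueError` means the batch response could not be used.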

If a batched response is malformed, Krisis can shrink the batch and retry smaller groups before falling back to single-row evaluation.
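The shrink-and-retry idea can be sketched as a short recursive function. This version halves the failing batch each time, which is an assumption; the source does not say how Krisis chooses the smaller groups. `evaluate_with_shrinking`, `call_batch`, and `call_single` are hypothetical names, and a malformed batch is assumed to surface as a `ValueError`.

```python
from typing import Any, Callable, List, Sequence


def evaluate_with_shrinking(
    records: Sequence[Any],
    call_batch: Callable[[Sequence[Any]], List[Any]],
    call_single: Callable[[Any], Any],
) -> List[Any]:
    """Evaluate records in one batch, halving the batch whenever it fails."""
    if not records:
        return []
    if len(records) == 1:
        # Single-row fallback: no batch left to shrink.
        return [call_single(records[0])]
    try:
        return call_batch(records)
    except ValueError:
        # Malformed batch output: split in half and retry each half,
        # preserving the original record order in the combined result.
        mid = len(records) // 2
        left = evaluate_with_shrinking(records[:mid], call_batch, call_single)
        right = evaluate_with_shrinking(records[mid:], call_batch, call_single)
        return left + right
```

With halving, one problematic record in a batch of eight costs a handful of extra calls rather than eight single-row calls, while the healthy records still benefit from batching.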

Failure Handling

Backend hardening includes:

  • retry with backoff for transient provider failures
  • parsing safeguards for markdown-wrapped or malformed JSON
  • recursive batch shrinking when batched JSON fails
  • single-row fallback for difficult cases
  • actionable empty-response errors when output-token caps are too low
  • raw response preservation for debugging
  • token usage aggregation in benchmark execution metadata
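The first item in the list, retry with backoff, is commonly implemented along these lines. This is a generic sketch of exponential backoff with jitter, not Krisis's implementation; `TransientProviderError`, `call_with_backoff`, and the default delays are all assumptions chosen for illustration.

```python
import random
import time
from typing import Any, Callable


class TransientProviderError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_with_backoff(
    fn: Callable[[], Any],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Call fn, retrying transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientProviderError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% jitter so concurrent callers do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
```

Passing `sleep` as a parameter makes the retry schedule easy to test without actually waiting.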

Batching reduces calls, not reasoning time

Large frontier models may still take time to reason over batched clinical cases. Batching usually reduces HTTP overhead and rate-limit pressure, but it does not shorten the model's own reasoning time.

Choosing A Model

For fast smoke tests, use a lightweight provider model and a small number of synthetic records. For publishable comparisons, use fixed seeds, fixed task settings, and the same batch and concurrency settings across models where provider limits allow.

Always report:

  • provider
  • model ID
  • task
  • feature set
  • batch size
  • max concurrency
  • output-token cap
  • elapsed seconds
  • token total
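One way to keep these items together is to record them in a single structure per run and write it out as JSON. The sketch below is illustrative only: `RunReport` and `to_json_line` are hypothetical names, and the values in the usage example are placeholders, not measured results.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunReport:
    """Hypothetical record of the settings and totals listed above."""

    provider: str
    model_id: str
    task: str
    feature_set: str
    batch_size: int
    max_concurrency: int
    max_output_tokens: int
    elapsed_seconds: float
    total_tokens: int


def to_json_line(report: RunReport) -> str:
    """Serialize one report as a single JSON line with stable key order."""
    return json.dumps(asdict(report), sort_keys=True)


# Placeholder values for illustration only.
example = RunReport(
    provider="openai",
    model_id="gpt-5.5",
    task="example_task",
    feature_set="example_features",
    batch_size=8,
    max_concurrency=2,
    max_output_tokens=1024,
    elapsed_seconds=0.0,
    total_tokens=0,
)
print(to_json_line(example))
```

Writing one JSON line per run makes it straightforward to append results from several models to the same file and compare them later.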