Backend

The backend page defines the reusable interface that provider-specific backends implement.

Provider implementations live in the guide

The API reference focuses on the common backend contract. Provider-specific usage for OpenAI, Anthropic, Grok, and Gemini is documented under Framework Guide -> Model Backends.

Backend Base Classes

base

krisis/backends/base.py

Abstract interface for LLM providers. Benchmark calls evaluate_batch() over PatientRecord chunks; backends own prompting, inference, and raw text capture.

BackendResponse dataclass

Structured output from one evaluated row.

BaseBackend

Bases: ABC

Provider-agnostic contract for clinical benchmark inference.

name abstractmethod property

name: str

Short identifier for logging and BenchmarkResult (e.g. 'openai').

evaluate abstractmethod

evaluate(
    record: PatientRecord, task: Task
) -> BackendResponse

Run the model on one patient row.

Implementations should preserve the full model text in raw_response for qualitative review; abstained must be True when the model declines to commit to a prediction.

evaluate_batch

evaluate_batch(
    records: list[PatientRecord], task: Task
) -> list[BackendResponse]

Run the model on a batch of patient rows.

Backends can override this for provider-native batched prompts. The default keeps compatibility by looping over evaluate().
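The contract above can be sketched in a few lines. This is a minimal illustration, not the source of krisis/backends/base.py: the two-field BackendResponse is a hypothetical subset of the real dataclass, PatientRecord and Task are replaced by Any, and EchoBackend is an invented toy class that never calls a provider.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class BackendResponse:
    # Hypothetical subset of fields; the real dataclass carries more.
    raw_response: str
    abstained: bool = False


class BaseBackend(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        """Short identifier for logging, e.g. 'openai'."""

    @abstractmethod
    def evaluate(self, record: Any, task: Any) -> BackendResponse:
        """Run the model on one patient row."""

    def evaluate_batch(self, records: list, task: Any) -> list:
        # Assumed default: one evaluate() call per row, in input order.
        return [self.evaluate(record, task) for record in records]


class EchoBackend(BaseBackend):
    """Toy backend for illustration only; it echoes its inputs."""

    @property
    def name(self) -> str:
        return "echo"

    def evaluate(self, record: Any, task: Any) -> BackendResponse:
        return BackendResponse(raw_response=f"{task}:{record}")


responses = EchoBackend().evaluate_batch(["row-1", "row-2"], task="triage")
```

A subclass that only implements name and evaluate() inherits a working evaluate_batch(); a provider with native batching would override evaluate_batch() and still return one BackendResponse per input row.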

Provider Backend Controls

Concrete provider backends follow the same practical pattern even though each provider SDK has slightly different parameter names.

| Control | OpenAI | Anthropic | Grok | Gemini | Purpose |
| --- | --- | --- | --- | --- | --- |
| model | yes | yes | yes | yes | Provider model ID |
| temperature | yes | yes | yes | yes | Sampling temperature. 0.0 or None is recommended for evals |
| token cap | max_completion_tokens | max_tokens | max_tokens | max_output_tokens | Caps generated output tokens |
| api_key | yes | yes | yes | yes | Direct provider key override |
| client | yes | yes | yes | yes | Prebuilt client for testing or custom setup |
| max_retries | yes | yes | yes | yes | Number of retries after transient failures |
| retry_base_seconds | yes | yes | yes | yes | Initial exponential-backoff delay |
| retry_max_seconds | yes | yes | yes | yes | Maximum exponential-backoff delay |

Default token caps are intentionally conservative for most providers. OpenAI defaults to max_completion_tokens=1024 per row because larger reasoning models can spend part of the completion budget before producing the visible JSON.

Example:

backend = OpenAIBackend(
    model="provider-model-id",
    api_key="YOUR_API_KEY",
    temperature=0.0,
    max_completion_tokens=1024,
    max_retries=2,
    retry_base_seconds=0.5,
    retry_max_seconds=8.0,
)

Provider naming differs

max_completion_tokens is the OpenAI name. Anthropic and Grok use max_tokens; Gemini uses max_output_tokens. The example CLI exposes one provider-agnostic flag, --max-output-tokens, and maps it to the correct backend setting.
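One way to picture that mapping is a small lookup from provider to keyword name. This is a sketch under assumptions: the provider keys other than 'openai' and the helper token_cap_kwargs are hypothetical and are not taken from the example CLI.

```python
# Keyword names come from the table above; the dictionary keys are assumed
# provider identifiers, not confirmed Krisis names.
TOKEN_CAP_PARAM = {
    "openai": "max_completion_tokens",
    "anthropic": "max_tokens",
    "grok": "max_tokens",
    "gemini": "max_output_tokens",
}


def token_cap_kwargs(provider: str, max_output_tokens: int) -> dict:
    """Translate one provider-agnostic cap into a backend keyword argument."""
    try:
        param = TOKEN_CAP_PARAM[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    return {param: max_output_tokens}


openai_kwargs = token_cap_kwargs("openai", 1024)
gemini_kwargs = token_cap_kwargs("gemini", 1024)
```

The resulting dictionary could then be unpacked into the matching backend constructor alongside the shared controls.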

Actual provider classes

Provider-specific setup examples live in Framework Guide -> Model Backends. The API Reference keeps the focus on the shared backend shape.

Retry Behavior

Krisis retries transient provider failures, including common timeout, connection, rate-limit, overloaded, and 5xx-style errors.

The retry controls are:

| Parameter | Default | Meaning |
| --- | --- | --- |
| max_retries | 2 | Number of retries after the first failed attempt |
| retry_base_seconds | 0.5 | Initial backoff delay |
| retry_max_seconds | 8.0 | Maximum backoff delay |

max_retries=2 means up to three total attempts:

  1. first attempt
  2. first retry
  3. second retry

Retry delays use exponential backoff:

delay = min(retry_max_seconds, retry_base_seconds * (2 ** attempt))

Small jitter is added to reduce synchronized retry spikes.
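To make the formula concrete, the sketch below evaluates it with the default controls. It assumes attempt is zero-based (0 for the first retry), which the page does not state, and the jitter_fraction parameter is a hypothetical stand-in for whatever jitter Krisis actually applies.

```python
import random


def backoff_delay(
    attempt: int,
    retry_base_seconds: float = 0.5,
    retry_max_seconds: float = 8.0,
    jitter_fraction: float = 0.0,
) -> float:
    """Capped exponential delay, plus optional proportional jitter."""
    delay = min(retry_max_seconds, retry_base_seconds * (2 ** attempt))
    # Hypothetical jitter model: add up to jitter_fraction of the delay.
    return delay + random.uniform(0.0, jitter_fraction * delay)


# With jitter disabled the schedule is deterministic.
delays = [backoff_delay(attempt) for attempt in range(6)]
```

Under these assumptions the delays double from 0.5 s until they reach the 8.0 s cap at the fifth value and stay there. With the default max_retries=2 only the first two values would be used.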

Batched Evaluation

Backends expose two methods:

| Method | Meaning |
| --- | --- |
| evaluate(record, task) | Evaluate one patient row |
| evaluate_batch(records, task) | Evaluate a batch of patient rows |

The base implementation of evaluate_batch loops over evaluate. Provider backends can override it to send one prompt containing multiple patient rows and return one response array.

Benchmark is responsible for deciding batch size and concurrency. Backends are responsible for turning one batch into BackendResponse objects.
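The split of responsibilities can be illustrated from the caller's side. The chunked helper below is hypothetical and only shows how a caller might slice rows into batches before handing each one to evaluate_batch(); it is not the Benchmark implementation and leaves concurrency out.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Stand-in row identifiers; real callers would pass PatientRecord objects.
batches = list(chunked(["r1", "r2", "r3", "r4", "r5"], batch_size=2))
```

Each yielded batch would be passed to backend.evaluate_batch(batch, task), and the backend returns one BackendResponse per row in that batch.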

Token Usage

BackendResponse includes prompt and usage audit fields:

  • prompt
  • prompt_mode
  • input_tokens
  • output_tokens
  • total_tokens

When a provider exposes usage metadata, Krisis records it per row and aggregates it into BenchmarkResult.extras.token_total. The full JSON output also preserves a redacted prompt template for each row, so provider behavior can be reviewed alongside the instructions and output shape the model received.
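A rough sketch of the aggregation step follows. RowUsage mirrors only the three token fields listed above, and the fallback rules (use total_tokens when present, otherwise input plus output, otherwise skip the row) are assumptions for illustration rather than documented Krisis behavior.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RowUsage:
    # Hypothetical per-row record holding only the token fields.
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


def sum_total_tokens(rows: Iterable[RowUsage]) -> int:
    """Sum tokens across rows, skipping rows without usage metadata."""
    total = 0
    for row in rows:
        if row.total_tokens is not None:
            total += row.total_tokens
        elif row.input_tokens is not None and row.output_tokens is not None:
            total += row.input_tokens + row.output_tokens
    return total


rows = [
    RowUsage(input_tokens=100, output_tokens=20, total_tokens=120),
    RowUsage(input_tokens=80, output_tokens=10),  # provider omitted the total
    RowUsage(),  # provider exposed no usage metadata
]
token_total = sum_total_tokens(rows)
```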

Shared Usage Helpers

usage

krisis/backends/usage.py

Token usage helpers for provider responses.

TokenUsage dataclass

Input/output token counts from one provider response.

usage_from_openai_compatible_response

usage_from_openai_compatible_response(
    response: Any,
) -> TokenUsage

Extract token usage from OpenAI-compatible response objects.

usage_from_anthropic_response

usage_from_anthropic_response(response: Any) -> TokenUsage

Extract token usage from Anthropic Messages API response objects.

usage_from_gemini_response

usage_from_gemini_response(response: Any) -> TokenUsage

Extract token usage from Google Gemini response objects.
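The general shape of such a helper can be sketched for the OpenAI-compatible case, where chat completion responses carry a usage object with prompt_tokens, completion_tokens, and total_tokens. The TokenUsage fields and the function name below are assumptions for illustration; the actual helpers in krisis/backends/usage.py may differ.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Optional


@dataclass
class TokenUsage:
    # Assumed field names; the page only says "input/output token counts".
    input_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    total_tokens: Optional[int] = None


def usage_from_openai_style(response: Any) -> TokenUsage:
    """Read usage defensively so missing metadata never raises."""
    usage = getattr(response, "usage", None)
    if usage is None:
        return TokenUsage()
    return TokenUsage(
        input_tokens=getattr(usage, "prompt_tokens", None),
        output_tokens=getattr(usage, "completion_tokens", None),
        total_tokens=getattr(usage, "total_tokens", None),
    )


# A stand-in response object; no provider SDK is needed for this sketch.
fake_response = SimpleNamespace(
    usage=SimpleNamespace(prompt_tokens=100, completion_tokens=20, total_tokens=120)
)
parsed = usage_from_openai_style(fake_response)
missing = usage_from_openai_style(SimpleNamespace())
```

Reading attributes with getattr and a None default keeps the helper usable with test doubles and with responses that omit usage metadata.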

Retry Helpers

retry

krisis/backends/retry.py

Small retry helper for transient provider API failures.

is_retryable_exception

is_retryable_exception(exc: BaseException) -> bool

Return True for common transient OpenAI/Anthropic SDK failures.

call_with_retries

call_with_retries(
    operation: Callable[[], T],
    *,
    max_retries: int,
    base_delay_seconds: float,
    max_delay_seconds: float,
) -> T

Run operation with exponential backoff for transient provider errors.

max_retries=2 means up to three total attempts.
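A helper with this signature could look roughly like the sketch below. It is not the Krisis source: is_transient is a hypothetical stand-in for is_retryable_exception that only recognizes built-in timeout and connection errors, and the 10% jitter is an arbitrary choice.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def is_transient(exc: BaseException) -> bool:
    # Hypothetical predicate; real SDK errors would need their own checks.
    return isinstance(exc, (TimeoutError, ConnectionError))


def call_with_retries_sketch(
    operation: Callable[[], T],
    *,
    max_retries: int,
    base_delay_seconds: float,
    max_delay_seconds: float,
) -> T:
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception as exc:
            # Give up on the last attempt or on non-transient errors.
            if attempt == max_retries or not is_transient(exc):
                raise
            delay = min(max_delay_seconds, base_delay_seconds * (2 ** attempt))
            time.sleep(delay + random.uniform(0.0, 0.1 * delay))
    raise AssertionError("unreachable")


calls = []


def flaky() -> str:
    """Fail twice with a simulated timeout, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


# Zero delays keep the demonstration instant.
result = call_with_retries_sketch(
    flaky, max_retries=2, base_delay_seconds=0.0, max_delay_seconds=0.0
)
```

Here the operation succeeds on the third attempt, which is the last one allowed by max_retries=2; a third failure would have been re-raised to the caller.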