Core Concepts¶

Krisis is built from a small set of composable contracts. Understanding these contracts makes it much easier to add suites, backends, or metrics.

PatientRecord¶

PatientRecord is the unit of evaluation.

PatientRecord(
    features={"sc": 2.4, "hemo": 10.1, "htn": 1},
    label=0,
    metadata={"ckd_stage": 3, "egfr": 42.1, "should_abstain": False},
)

Field	Visible to model?	Purpose
`features`	Yes	Clinical input shown to the LLM
`label`	No	Ground truth used for scoring
`metadata`	No	Extra scoring/context fields such as stage, eGFR, or deferral labels

Why metadata is separate

Deferral labels such as should_abstain are intentionally hidden from the model. Metrics use them later to check whether the model deferred when it should have.

Suite¶

A suite prepares a clinical benchmark.

For CKD, the suite:

loads a local UCI CKD CSV
validates the CSV schema
imputes missing values
engineers eGFR, sex, and CKD stage
splits real held-out records
optionally generates synthetic records
returns PatientRecord rows

The public suite contract is:

records = suite.load()
summary = suite.describe()

Backend¶

A backend adapts a model provider to Krisis.

Each backend receives a PatientRecord and returns:

BackendResponse(
    prediction=...,
    abstained=...,
    confidence=...,
    raw_response=...,
    prompt=...,
    prompt_mode=...,
    input_tokens=...,
    output_tokens=...,
    total_tokens=...,
)

prompt preserves the provider-facing instructions with patient data redacted. prompt_mode is single for one-row calls and batch for batched calls. This helps compare model behavior when one provider follows batched JSON instructions better than another, without duplicating clinical row data inside every result.

This standard shape lets the same benchmark compare models routed through the same API backend.

Benchmark¶

Benchmark is the execution engine.

It handles:

loading records from a suite
batching records into provider calls
running batches concurrently
retrying transient failures
shrinking malformed batches when needed
collecting EvaluationResult rows
computing metrics
returning a BenchmarkResult

Batch fallback

Some models occasionally fail to return valid JSON for a batch. Krisis recursively shrinks the batch and eventually falls back to single-row evaluation, preserving benchmark progress while still using batching when it works.

In practice, this has shown up with openai/gpt-5.5 when the OpenAI max_completion_tokens cap is too low: the provider can return an empty or incomplete JSON response before the model finishes. Krisis now uses a higher OpenAI default.

Metric¶

Metrics operate on EvaluationResult rows. They do not call model APIs and do not know which provider produced the outputs.

This separation keeps scoring reproducible and backend-agnostic.

Report¶

BenchmarkResult stores all outputs. Report helpers serialize it as:

compact text
full JSON
metrics-only JSON

Use metrics-only JSON when creating model comparison plots.