Core Concepts
Krisis is built from a small set of composable contracts. Understanding these contracts makes it much easier to add suites, backends, or metrics.
PatientRecord
PatientRecord is the unit of evaluation.
PatientRecord(
    features={"sc": 2.4, "hemo": 10.1, "htn": 1},
    label=0,
    metadata={"ckd_stage": 3, "egfr": 42.1, "should_abstain": False},
)
| Field | Visible to model? | Purpose |
|---|---|---|
| features | Yes | Clinical input shown to the LLM |
| label | No | Ground truth used for scoring |
| metadata | No | Extra scoring/context fields such as stage, eGFR, or deferral labels |
Why metadata is separate
Deferral labels such as should_abstain are intentionally hidden from the
model. Metrics use them later to check whether the model deferred when it
should have.
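The separation can be illustrated with a minimal sketch. The dataclass below is a hypothetical stand-in that only mirrors the field names shown above, and `model_visible_payload` and `deferral_was_correct` are illustrative helper names, not part of the Krisis API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class PatientRecord:
    """Hypothetical stand-in; field names follow the example above."""
    features: dict[str, Any]
    label: int
    metadata: dict[str, Any] = field(default_factory=dict)


def model_visible_payload(record: PatientRecord) -> dict[str, Any]:
    """Return a copy of the features only; label and metadata stay hidden."""
    return dict(record.features)


def deferral_was_correct(record: PatientRecord, abstained: bool) -> bool:
    """Scoring-side check that reads the hidden deferral label."""
    return abstained == bool(record.metadata.get("should_abstain", False))


record = PatientRecord(
    features={"sc": 2.4, "hemo": 10.1, "htn": 1},
    label=0,
    metadata={"ckd_stage": 3, "egfr": 42.1, "should_abstain": False},
)
payload = model_visible_payload(record)
```

Only `payload` would reach a prompt builder in this sketch, while `deferral_was_correct` runs after the model has answered.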
Suite
A suite prepares a clinical benchmark.
For CKD, the suite:
- loads a local UCI CKD CSV
- validates the CSV schema
- imputes missing values
- engineers eGFR, sex, and CKD stage
- splits off real held-out records
- optionally generates synthetic records
- returns PatientRecord rows
The public suite contract is:
records = suite.load()
summary = suite.describe()
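As a rough sketch of what an object satisfying this contract might look like, the toy suite below keeps its rows in memory. The `Suite` protocol, the `ToySuite` class, and the dictionary returned by `describe()` are assumptions made for illustration; the source only names the two methods, not their return types.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol


@dataclass
class PatientRecord:
    """Hypothetical stand-in mirroring the fields described earlier."""
    features: dict[str, Any]
    label: int
    metadata: dict[str, Any] = field(default_factory=dict)


class Suite(Protocol):
    """Assumed shape of the two-method contract."""
    def load(self) -> list[PatientRecord]: ...
    def describe(self) -> dict[str, Any]: ...


class ToySuite:
    """In-memory suite; a real suite would read, validate, and impute a CSV."""

    def __init__(self, rows: list[dict[str, Any]]) -> None:
        self._rows = rows

    def load(self) -> list[PatientRecord]:
        records = []
        for row in self._rows:
            # Everything except the label column becomes a model-visible feature.
            features = {k: v for k, v in row.items() if k != "label"}
            records.append(PatientRecord(features=features, label=int(row["label"])))
        return records

    def describe(self) -> dict[str, Any]:
        records = self.load()
        return {
            "n_records": len(records),
            "n_positive": sum(r.label for r in records),
        }


suite: Suite = ToySuite([
    {"sc": 2.4, "hemo": 10.1, "label": 1},
    {"sc": 0.9, "hemo": 14.2, "label": 0},
])
records = suite.load()
summary = suite.describe()
```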
Backend
A backend adapts a model provider to Krisis.
Each backend receives a PatientRecord and returns:
BackendResponse(
    prediction=...,
    abstained=...,
    confidence=...,
    raw_response=...,
    prompt=...,
    prompt_mode=...,
    input_tokens=...,
    output_tokens=...,
    total_tokens=...,
)
prompt preserves the provider-facing instructions with patient data redacted,
so clinical row data is not duplicated inside every result.
prompt_mode is single for one-row calls and batch for batched calls. Recording
it makes it possible to compare model behavior when one provider follows
batched JSON instructions better than another.
This standard shape lets the same benchmark compare OpenAI, Anthropic, Grok, and Gemini.
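To make the response shape concrete, here is a minimal offline sketch. The field names come from the listing above, but the field types, the `predict` method name, the `<REDACTED>` placeholder, and the `ThresholdBackend` rule are all assumptions for illustration; the threshold is arbitrary and carries no clinical meaning.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Protocol


@dataclass
class PatientRecord:
    """Hypothetical stand-in; the backend below reads only `features`."""
    features: dict[str, Any]
    label: int
    metadata: dict[str, Any] = field(default_factory=dict)


@dataclass
class BackendResponse:
    # Field names follow the listing above; the types are assumptions.
    prediction: Optional[int]
    abstained: bool
    confidence: Optional[float]
    raw_response: str
    prompt: str
    prompt_mode: str  # "single" or "batch"
    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0


class Backend(Protocol):
    """Assumed adapter interface; the method name is hypothetical."""
    def predict(self, record: PatientRecord) -> BackendResponse: ...


class ThresholdBackend:
    """Offline toy backend: a fixed rule instead of a provider call."""

    PROMPT = "Classify CKD as 0 or 1 for this patient: <REDACTED>"

    def predict(self, record: PatientRecord) -> BackendResponse:
        sc = record.features.get("sc")
        if sc is None:
            # No usable input: defer instead of guessing.
            return BackendResponse(
                prediction=None, abstained=True, confidence=None,
                raw_response="abstain", prompt=self.PROMPT, prompt_mode="single",
            )
        prediction = int(sc > 1.2)  # arbitrary illustrative cut-off
        return BackendResponse(
            prediction=prediction, abstained=False, confidence=1.0,
            raw_response=str(prediction), prompt=self.PROMPT, prompt_mode="single",
        )


backend: Backend = ThresholdBackend()
response = backend.predict(PatientRecord(features={"sc": 2.4, "hemo": 10.1}, label=1))
```

Because every backend in this sketch returns the same dataclass, downstream code never needs to branch on the provider.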
Benchmark
Benchmark is the execution engine.
It handles:
- loading records from a suite
- batching records into provider calls
- running batches concurrently
- retrying transient failures
- shrinking malformed batches when needed
- collecting EvaluationResult rows
- computing metrics
- returning a BenchmarkResult
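The batching, concurrency, and retry steps above can be sketched with the standard library alone. This is not the Krisis engine: `chunk`, `call_with_retry`, and `run_batches` are hypothetical names, and `ConnectionError` stands in for whatever a real provider client raises on a transient failure.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunk(items: Sequence[T], size: int) -> list[list[T]]:
    """Split items into consecutive batches of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def call_with_retry(fn: Callable[[list[T]], list[R]], batch: list[T],
                    attempts: int = 3) -> list[R]:
    """Call fn(batch), retrying on ConnectionError up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn(batch)
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last failure
    raise AssertionError("unreachable")


def run_batches(items: Sequence[T], call_batch: Callable[[list[T]], list[R]],
                batch_size: int = 4, max_workers: int = 2,
                attempts: int = 3) -> list[R]:
    """Run batches concurrently and flatten results in input order."""
    batches = chunk(items, batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order the batches were submitted.
        per_batch = pool.map(
            lambda batch: call_with_retry(call_batch, batch, attempts), batches
        )
        return [result for results in per_batch for result in results]
```

A production engine would typically add backoff between retries and rate limiting, both omitted here to keep the sketch short.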
Batch fallback
Some models occasionally fail to return valid JSON for a batch. Krisis recursively shrinks the batch and eventually falls back to single-row evaluation, preserving benchmark progress while still using batching when it works.
In practice, this has shown up with gpt-5.5 when the OpenAI
max_completion_tokens cap is too low: the provider can return an empty or
incomplete JSON response before the model finishes. Krisis now uses a higher
OpenAI default.
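A minimal sketch of the fallback idea follows. It assumes the batch is halved on each failure, which is one plausible way to "shrink"; the source does not say how Krisis chooses the smaller size. `call_batch` and `call_single` are hypothetical callables standing in for provider calls.

```python
import json
from typing import Any, Callable, Sequence


def evaluate_with_fallback(
    rows: Sequence[Any],
    call_batch: Callable[[Sequence[Any]], str],
    call_single: Callable[[Any], Any],
) -> list[Any]:
    """Try a batched call; on malformed JSON, halve the batch and recurse."""
    if not rows:
        return []
    if len(rows) == 1:
        # Base case: single-row evaluation.
        return [call_single(rows[0])]
    try:
        parsed = json.loads(call_batch(rows))
        if not isinstance(parsed, list) or len(parsed) != len(rows):
            raise ValueError("batch response does not match the input rows")
        return parsed
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        mid = len(rows) // 2
        return (
            evaluate_with_fallback(rows[:mid], call_batch, call_single)
            + evaluate_with_fallback(rows[mid:], call_batch, call_single)
        )
```

Results come back in input order, and a batch that parses correctly is never re-run, so work already completed is preserved.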
Metric
Metrics operate on EvaluationResult rows. They do not call model APIs and do
not know which provider produced the outputs.
This separation keeps scoring reproducible and backend-agnostic.
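As an illustration, metrics in this style can be written as pure functions over result rows. The `EvaluationResult` fields below are assumptions chosen for the sketch (the source does not list them), and `accuracy_on_answered` and `deferral_recall` are hypothetical metric names.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class EvaluationResult:
    """Hypothetical row shape; real Krisis rows may carry other fields."""
    label: int
    prediction: Optional[int]
    abstained: bool
    should_abstain: bool


def accuracy_on_answered(results: Sequence[EvaluationResult]) -> Optional[float]:
    """Fraction of correct predictions among rows where the model answered."""
    answered = [r for r in results if not r.abstained]
    if not answered:
        return None
    return sum(r.prediction == r.label for r in answered) / len(answered)


def deferral_recall(results: Sequence[EvaluationResult]) -> Optional[float]:
    """Fraction of rows needing deferral where the model actually deferred."""
    needed = [r for r in results if r.should_abstain]
    if not needed:
        return None
    return sum(r.abstained for r in needed) / len(needed)
```

Neither function touches a network client or a provider name, so re-running them on stored rows gives the same numbers every time.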
Report
BenchmarkResult stores all outputs. Report helpers serialize it as:
- compact text
- full JSON
- metrics-only JSON
Use metrics-only JSON when creating model comparison plots.
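The three output forms might look roughly like the sketch below. The `BenchmarkResult` fields and the helper names `to_text`, `to_full_json`, and `to_metrics_json` are hypothetical; they only illustrate why a metrics-only payload is convenient for plotting.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class BenchmarkResult:
    """Hypothetical container; the real class may store more than this."""
    model: str
    metrics: dict[str, float]
    rows: list[dict[str, Any]] = field(default_factory=list)


def to_text(result: BenchmarkResult) -> str:
    """Compact one-line summary with metrics sorted by name."""
    parts = ", ".join(f"{k}={v:.3f}" for k, v in sorted(result.metrics.items()))
    return f"{result.model}: {parts}"


def to_full_json(result: BenchmarkResult) -> str:
    """Everything, including per-row outputs."""
    return json.dumps(asdict(result), sort_keys=True)


def to_metrics_json(result: BenchmarkResult) -> str:
    """Only the model name and aggregate metrics; small enough to plot from."""
    return json.dumps({"model": result.model, "metrics": result.metrics}, sort_keys=True)


result = BenchmarkResult(
    model="demo-model",  # placeholder name, not a real provider model
    metrics={"accuracy": 0.75, "deferral_recall": 0.5},
    rows=[{"prediction": 1, "abstained": False}],
)
summary_line = to_text(result)
```

A plotting script can then load one metrics-only file per model and compare the values without parsing per-row outputs.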