Framework Guide

The framework guide explains how Krisis is organized as a clinical evaluation framework: what data it accepts, how suites turn data into tasks, how models are called, what metrics mean, and how results should be reported.

Current scope

Krisis v0.2 has one implemented clinical domain: Chronic Kidney Disease (CKD). Diabetes and hypertension are planned domains and are not implemented yet.

Code examples are CKD-first

Most example snippets use CKDSuite because it is the only suite available in v0.2. Treat those snippets as examples of the Krisis framework pattern, not as proof that every clinical domain is already implemented.

What To Read First

| Page | Use it when you want to understand |
| --- | --- |
| Datasets | supported source data, dataset policy, and why Krisis does not bundle clinical CSV files |
| Suites | how raw rows become benchmark-ready `PatientRecord` objects |
| Metrics | how correctness, abstention, deferral, and calibration are scored |
| Model Backends | how OpenRouter routes OpenAI, Anthropic, Grok, Gemini, and other models through one interface |
| Outputs | text reports, full JSON, metrics-only JSON, and plotting fields |
| Research Status | what Krisis v0.2 does and does not claim |

Framework Shape

flowchart TD
    A["Dataset"] --> B["Suite"]
    B --> C["PatientRecord objects"]
    C --> D["Benchmark"]
    D --> E["Model backend"]
    E --> F["Evaluation results"]
    F --> G["Metrics and reports"]

Krisis separates these pieces deliberately:

- datasets define the source and scope of clinical data
- suites define task labels, preprocessing, and benchmark metadata
- backends define how LLM providers are called
- metrics define how answers, abstentions, confidence, and deferrals are scored
- reports define how results are inspected, exported, and compared

That separation is what lets the same CKD task run against multiple frontier LLM providers while producing a standard result object.
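The separation described above can be illustrated with a small, self-contained sketch. It is not the Krisis API: apart from the name `PatientRecord`, every class, field, and function here (`ToySuite`, `ThresholdBackend`, `Prediction`, `run_benchmark`, `score`, the `egfr` feature) is a hypothetical stand-in, and the backend is a toy rule rather than an LLM call so the example runs offline.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class PatientRecord:
    """Stand-in for a benchmark-ready record; these fields are assumptions."""
    record_id: str
    features: dict
    label: str


@dataclass(frozen=True)
class Prediction:
    """Hypothetical standard result object shared by every backend."""
    record_id: str
    answer: Optional[str]  # None means the backend abstained
    confidence: float


class Backend(Protocol):
    """Anything with a predict() method can be benchmarked."""
    def predict(self, record: PatientRecord) -> Prediction: ...


class ToySuite:
    """Suite role: turn raw rows into PatientRecord objects."""

    def build(self, rows: Iterable[dict]) -> list[PatientRecord]:
        return [
            PatientRecord(
                record_id=str(i),
                features={k: v for k, v in row.items() if k != "label"},
                label=row["label"],
            )
            for i, row in enumerate(rows)
        ]


class ThresholdBackend:
    """Backend role: a toy rule for illustration only, not a clinical criterion."""

    def predict(self, record: PatientRecord) -> Prediction:
        egfr = record.features.get("egfr")
        if egfr is None:
            return Prediction(record.record_id, None, 0.0)  # abstain
        answer = "ckd" if egfr < 60 else "no_ckd"
        return Prediction(record.record_id, answer, 0.9)


def run_benchmark(records: list[PatientRecord], backend: Backend) -> list[Prediction]:
    """Benchmark role: call the backend once per record."""
    return [backend.predict(r) for r in records]


def score(records: list[PatientRecord], predictions: list[Prediction]) -> dict:
    """Metrics role: abstention rate plus accuracy on answered records."""
    labels = {r.record_id: r.label for r in records}
    answered = [p for p in predictions if p.answer is not None]
    correct = sum(p.answer == labels[p.record_id] for p in answered)
    return {
        "n": len(predictions),
        "abstention_rate": (len(predictions) - len(answered)) / len(predictions),
        "accuracy_on_answered": correct / len(answered) if answered else None,
    }


if __name__ == "__main__":
    rows = [{"egfr": 45, "label": "ckd"}, {"egfr": None, "label": "ckd"}]
    records = ToySuite().build(rows)
    print(score(records, run_benchmark(records, ThresholdBackend())))
```

Because `run_benchmark` depends only on the `Backend` protocol, swapping in a different backend leaves the suite and the scoring code unchanged, which is the point of keeping the pieces separate.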

Recommended Path

For a first full run:

1. Read the CKD dataset card.
2. Read the CKD suite guide.
3. Choose a backend from Model Backends.
4. Run a benchmark with conservative batching.
5. Use Outputs to export metrics-only JSON for comparison plots.
6. Interpret results using the caveats in Research Status.
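The batching and export steps above could look roughly like the sketch below. It reads "conservative batching" as small batches with a pause between them, which is an interpretation rather than something this guide defines. The helper names (`batched`, `run_in_batches`, `metrics_only`) and the result fields (`suite`, `model`, `metrics`, `per_record`) are assumptions for illustration; the Outputs page is the authority on the real schema.

```python
import json
import time
from typing import Callable, Iterator, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be >= 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_in_batches(
    items: Sequence[T],
    call: Callable[[T], R],
    batch_size: int = 5,
    pause_s: float = 1.0,
) -> list:
    """Apply `call` to every item, sleeping between batches to limit request rate."""
    results = []
    for i, batch in enumerate(batched(items, batch_size)):
        if i > 0:
            time.sleep(pause_s)  # pause before every batch except the first
        results.extend(call(item) for item in batch)
    return results


def metrics_only(full_result: dict) -> dict:
    """Keep identifying fields and aggregate metrics; drop per-record detail."""
    return {
        "suite": full_result["suite"],
        "model": full_result["model"],
        "metrics": full_result["metrics"],
    }


if __name__ == "__main__":
    # Hypothetical full result; the field names are assumptions.
    full = {
        "suite": "CKDSuite",
        "model": "example-model",
        "metrics": {"accuracy_on_answered": 0.5, "abstention_rate": 0.25},
        "per_record": [{"record_id": "0", "answer": "ckd"}],
    }
    print(json.dumps(metrics_only(full), indent=2, sort_keys=True))
```

Keeping the metrics-only export free of per-record content makes the file small and easier to share for comparison plots, while the full JSON stays available for inspection.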