Framework Guide
The framework guide explains how Krisis is organized as a clinical evaluation framework: what data it accepts, how suites turn data into tasks, how models are called, what metrics mean, and how results should be reported.
Current scope
Krisis v0.1 has one implemented clinical domain: Chronic Kidney Disease (CKD). Diabetes and hypertension are listed as planned domains, but they are not implemented yet.
Code examples are CKD-first
Most example snippets use CKDSuite because it is the only suite available
in v0.1. Treat those snippets as examples of the Krisis framework pattern,
not as proof that every clinical domain is already implemented.
What To Read First
| Page | Use it when you want to understand |
|---|---|
| Datasets | supported source data, dataset policy, and why Krisis does not bundle clinical CSV files |
| Suites | how raw rows become benchmark-ready PatientRecord objects |
| Metrics | how correctness, abstention, deferral, and calibration are scored |
| Model Backends | how OpenAI, Anthropic, Grok, and Gemini are plugged into the same interface |
| Outputs | text reports, full JSON, metrics-only JSON, and plotting fields |
| Research Status | what Krisis v0.1 does and does not claim |
Framework Shape
flowchart TD
A["Dataset"] --> B["Suite"]
B --> C["PatientRecord objects"]
C --> D["Benchmark"]
D --> E["Model backend"]
E --> F["Evaluation results"]
F --> G["Metrics and reports"]
Krisis separates these pieces deliberately:
- datasets define the source and scope of clinical data
- suites define task labels, preprocessing, and benchmark metadata
- backends define how LLM providers are called
- metrics define how answers, abstentions, confidence, and deferrals are scored
- reports define how results are inspected, exported, and compared
That separation is what lets the same CKD task run against multiple frontier LLM providers while producing a standard result object.
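The separation can be illustrated with a small, self-contained sketch. It does not import Krisis and is not the Krisis API: `ToySuite`, `ThresholdBackend`, `Prediction`, `run_benchmark`, the record fields, and the metric names are all hypothetical stand-ins, and the eGFR cut-off is a toy rule chosen only to make the example run, not clinical guidance. The point is only that the suite, the backend, and the scoring step can each be swapped without touching the others.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class PatientRecord:
    """Illustrative stand-in; the real Krisis PatientRecord may differ."""
    patient_id: str
    features: dict
    label: str


@dataclass(frozen=True)
class Prediction:
    patient_id: str
    answer: Optional[str]  # None models an abstention


class Backend(Protocol):
    """Any object with a matching predict method works as a backend here."""
    def predict(self, record: PatientRecord) -> Prediction: ...


class ToySuite:
    """Hypothetical suite: turns raw rows into labelled records."""
    def build(self, rows: list) -> list:
        return [
            PatientRecord(
                patient_id=str(row["id"]),
                features={"egfr": float(row["egfr"])},
                label="ckd" if row["ckd"] else "no_ckd",
            )
            for row in rows
        ]


class ThresholdBackend:
    """Stub standing in for an LLM provider call (toy rule, not clinical)."""
    def predict(self, record: PatientRecord) -> Prediction:
        egfr = record.features["egfr"]
        if 55 <= egfr <= 65:
            # Abstain when the value is close to the toy cut-off.
            return Prediction(record.patient_id, None)
        return Prediction(record.patient_id, "ckd" if egfr < 60 else "no_ckd")


def run_benchmark(records: list, backend: Backend) -> dict:
    """Call the backend on every record and score answers and abstentions."""
    predictions = [backend.predict(r) for r in records]
    answered = [(r, p) for r, p in zip(records, predictions) if p.answer is not None]
    correct = sum(1 for r, p in answered if p.answer == r.label)
    return {
        "n": len(records),
        "abstention_rate": 1 - len(answered) / len(records),
        "accuracy_when_answered": correct / len(answered) if answered else 0.0,
    }


rows = [
    {"id": 1, "egfr": 30, "ckd": True},
    {"id": 2, "egfr": 90, "ckd": False},
    {"id": 3, "egfr": 60, "ckd": True},
    {"id": 4, "egfr": 45, "ckd": False},
]
metrics = run_benchmark(ToySuite().build(rows), ThresholdBackend())
print(metrics)
```

Because `run_benchmark` depends only on the `Backend` protocol, a second stub with a different `predict` rule could be scored on the same records without changing the suite or the metric code.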
Recommended Path
For a first full run:
1. Read the CKD dataset card.
2. Read the CKD suite guide.
3. Choose a backend from Model Backends.
4. Run a benchmark with conservative batching.
5. Use Outputs to export metrics-only JSON for comparison plots.
6. Interpret results using the caveats in Research Status.
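As a rough illustration of the batching and export steps, the sketch below shows one way "conservative batching" and a metrics-only export could look in plain Python. It is an assumption-laden example, not Krisis code: `batched`, `run_in_batches`, `metrics_only_json`, the `batch_size` and `pause_seconds` parameters, and the `{"metrics": ...}` layout are hypothetical, and the actual options and JSON schema are described on the Model Backends and Outputs pages.

```python
import json
import time
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def run_in_batches(
    items: Sequence,
    call: Callable,
    batch_size: int = 5,
    pause_seconds: float = 0.0,
) -> List:
    """Apply call to every item, pausing between batches to limit request rate."""
    results = []
    for index, batch in enumerate(batched(items, batch_size)):
        if index > 0 and pause_seconds > 0:
            time.sleep(pause_seconds)
        results.extend(call(item) for item in batch)
    return results


def metrics_only_json(result: dict) -> str:
    """Keep only the metrics from a full result (hypothetical layout)."""
    return json.dumps({"metrics": result["metrics"]}, indent=2, sort_keys=True)


# Usage with a stand-in for a provider call.
outputs = run_in_batches([1, 2, 3, 4, 5], lambda x: x * 2, batch_size=2)
full_result = {"metrics": {"accuracy": 0.8}, "records": outputs}
exported = metrics_only_json(full_result)
```

A small batch size with a pause between batches trades run time for fewer rate-limit errors, which is usually the safer choice for a first full run.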