Datasets

Datasets define the evidence boundary of a Krisis benchmark. They determine what clinical variables are available, which labels are trustworthy, which task claims are valid, and where synthetic stress tests begin.

Dataset cards are part of the benchmark

LLM evaluation on clinical tasks is only meaningful when the dataset boundary is explicit. Krisis documents source, schema, task support, and limitations for every suite.

Available Datasets

Dataset	Status	Used by	Supported tasks
CKD Dataset	Available in v0.1	`CKDSuite`	detection, staging, synthetic progression
Diabetes	Coming soon	Not available yet	Not available yet
Hypertension	Coming soon	Not available yet	Not available yet

Dataset Policy

Krisis does not bundle clinical datasets in the Python package.

This is intentional:

clinical datasets can have distribution and licensing constraints
benchmark users should know exactly which source file was used
local CSV paths make experiments reproducible without hiding the data source
package installs stay lightweight

For CKD, users download the UCI dataset separately and pass a local path:

from krisis.data.ckd.suite import CKDSuite

suite = CKDSuite(data_path="datasets/ckd/ckd_full.csv")
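Because the source file lives outside the package, it can help to record exactly which file a run used. The sketch below is one way to do that with the standard library; `fingerprint_dataset` is a hypothetical helper, not part of Krisis, and it assumes the dataset is a single local file.

```python
import hashlib
from pathlib import Path


def fingerprint_dataset(path: str) -> dict:
    """Return the resolved path, size, and SHA-256 digest of a local dataset file.

    Hypothetical helper for experiment logs; not a Krisis API.
    """
    file_path = Path(path)
    if not file_path.is_file():
        raise FileNotFoundError(f"Dataset file not found: {file_path}")

    digest = hashlib.sha256()
    with file_path.open("rb") as handle:
        # Read in 1 MiB chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)

    return {
        "path": str(file_path.resolve()),
        "bytes": file_path.stat().st_size,
        "sha256": digest.hexdigest(),
    }
```

Storing this dictionary next to the benchmark results lets a reader confirm later that two runs used the same CSV.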

What A Dataset Card Should Answer

Each Krisis dataset page should make these points explicit:

Question	Why it matters
Where does the data come from?	Establishes provenance and reproducibility
What schema is expected?	Prevents silent misuse of incompatible clinical tables
Which tasks are supported?	Avoids claiming more than the source data can support
Which fields are engineered?	Separates source observations from benchmark-derived fields
Which labels are real vs synthetic?	Keeps detection, staging, and progression claims honest
What are the limitations?	Prevents clinical overclaiming
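The questions above can also be captured as a small structured record, so that an incomplete card is easy to detect. This is a minimal sketch: `DatasetCard` and its field names are assumptions made for illustration, not a Krisis class, and the example values are taken only from this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCard:
    """Hypothetical container for the answers a dataset card should give."""

    name: str
    source: str = ""                 # where the data comes from
    expected_columns: tuple = ()     # schema the suite expects
    supported_tasks: tuple = ()      # tasks the source data can support
    engineered_fields: tuple = ()    # benchmark-derived fields
    synthetic_labels: tuple = ()     # labels that are not real observations
    limitations: tuple = ()          # known limits on what can be claimed

    def missing_answers(self) -> list:
        """Return the names of required answers that are still empty."""
        required = ("source", "expected_columns", "supported_tasks", "limitations")
        return [name for name in required if not getattr(self, name)]


# Values below come from this page; the schema is deliberately left unanswered.
ckd_card = DatasetCard(
    name="CKD Dataset",
    source="UCI CKD dataset, downloaded separately by the user",
    supported_tasks=("detection", "staging", "synthetic progression"),
    engineered_fields=("eGFR",),
    synthetic_labels=("progression", "should_abstain"),
    limitations=("source dataset is cross-sectional",),
)
print(ckd_card.missing_answers())
```

Here the card reports `expected_columns` as unanswered, which is the kind of gap a card review should surface before the dataset is used.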

Real Data And Synthetic Stress Cases

Krisis can use real rows and synthetic stress cases in the same evaluation set. For CKD v0.1:

the real rows come from the UCI CKD dataset
n_synthetic controls how many fully synthetic patient rows are generated from the training split
the staging task is derived from engineered eGFR
the progression task is synthetic because the source dataset is cross-sectional
should_abstain labels are benchmark metadata, not patient-ground-truth labels
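To illustrate what "derived from engineered eGFR" can mean, the sketch below maps an eGFR value to a KDIGO GFR category. This page does not state the staging rule Krisis actually uses, so the thresholds here are the standard KDIGO categories used as an assumption, and `gfr_category` is a hypothetical name.

```python
import math

# KDIGO GFR categories as (lower bound in mL/min/1.73 m^2, label), highest first.
# Assumption: Krisis may use different cut-offs or labels.
_GFR_CATEGORIES = (
    (90.0, "G1"),
    (60.0, "G2"),
    (45.0, "G3a"),
    (30.0, "G3b"),
    (15.0, "G4"),
    (0.0, "G5"),
)


def gfr_category(egfr: float) -> str:
    """Return the GFR category whose lower bound is the highest one not above egfr."""
    if egfr is None or not math.isfinite(egfr) or egfr < 0:
        raise ValueError(f"eGFR must be a finite, non-negative number, got {egfr!r}")
    for lower_bound, label in _GFR_CATEGORIES:
        if egfr >= lower_bound:
            return label
    raise AssertionError("unreachable: the last lower bound is 0.0")
```

Because the stage is computed from an engineered field rather than read from the source file, it is a benchmark-derived label and should be described that way in reports.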

Synthetic does not mean clinical validation

Synthetic rows and synthetic progression labels are useful for stress testing LLM behavior, especially abstention and deferral. They should not be described as real longitudinal clinical outcomes.

Synthetic rows are not UCI patients

Synthetic benchmark rows are generated from learned per-stage feature distributions and clinical bounds. They are useful for controlled stress testing, but they are not real patient records and should be reported separately from held-out UCI rows.
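Reporting the two groups separately can be as simple as splitting a metric by row origin. The sketch below assumes each evaluated record is a dictionary with `is_synthetic`, `label`, and `prediction` keys; these names are hypothetical and may not match the records Krisis produces.

```python
from collections import defaultdict


def accuracy_by_origin(records):
    """Return accuracy computed separately for real and synthetic rows.

    Each record is assumed to be a dict with 'is_synthetic', 'label',
    and 'prediction' keys (hypothetical field names).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        origin = "synthetic" if record["is_synthetic"] else "real"
        totals[origin] += 1
        correct[origin] += int(record["prediction"] == record["label"])
    # Only origins that actually occur appear in the result, so no division by zero.
    return {origin: correct[origin] / totals[origin] for origin in totals}
```

A single pooled accuracy would hide whether a model does well on held-out UCI rows or only on generated stress cases, which is why the split is kept in the output.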

Validation Before Evaluation

Dataset validation happens before preprocessing. The CKD suite checks that the CSV has the expected shape, columns, labels, and value conventions before it creates benchmark records.

Strong validation helps catch:

missing required columns
unexpected columns from a different dataset
duplicated column names
non-numeric values in numeric fields
unsupported categorical values
missing or duplicated IDs
target labels that do not include both expected classes
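The checks listed above can be sketched with the standard library alone. This is not the CKD suite's validator: `validate_csv`, its parameters, and the choice to treat an empty cell as a missing value are all assumptions made for illustration.

```python
import csv
import io
import math
from collections import Counter


def _is_number(value):
    """True if value parses as a finite float."""
    try:
        return math.isfinite(float(value))
    except ValueError:
        return False


def validate_csv(text, *, id_col, target_col, target_classes, numeric_cols, categorical_cols):
    """Return a list of problems found in CSV text; an empty list means it passed.

    categorical_cols maps a column name to its set of allowed values.
    Empty cells are treated as missing and skipped by the value checks.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []

    expected = {id_col, target_col, *numeric_cols, *categorical_cols}
    duplicated = sorted(name for name, n in Counter(header).items() if n > 1)
    if duplicated:
        problems.append(f"duplicated columns: {duplicated}")
    missing = sorted(expected - set(header))
    if missing:
        problems.append(f"missing columns: {missing}")
    unexpected = sorted(set(header) - expected)
    if unexpected:
        problems.append(f"unexpected columns: {unexpected}")
    if problems:
        return problems  # row checks are unreliable until the columns are right

    index = {name: position for position, name in enumerate(header)}
    ids, seen_labels = [], set()
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        ids.append(row[index[id_col]].strip())
        seen_labels.add(row[index[target_col]].strip())
        for col in numeric_cols:
            value = row[index[col]].strip()
            if value and not _is_number(value):
                problems.append(f"line {line_no}: non-numeric value {value!r} in {col}")
        for col, allowed in categorical_cols.items():
            value = row[index[col]].strip()
            if value and value not in allowed:
                problems.append(f"line {line_no}: unsupported value {value!r} in {col}")

    if any(not row_id for row_id in ids):
        problems.append("missing IDs")
    duplicate_ids = sorted(i for i, n in Counter(ids).items() if i and n > 1)
    if duplicate_ids:
        problems.append(f"duplicated IDs: {duplicate_ids}")
    absent = sorted(set(target_classes) - seen_labels)
    if absent:
        problems.append(f"target classes never observed: {absent}")
    return problems
```

Returning every problem at once, rather than raising on the first, makes it easier to fix a file that came from a different dataset or export.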

Adding Future Datasets

New datasets should be added through new suites, not forced into the CKD suite. That keeps each dataset's units, labels, and clinical assumptions explicit.

A future dataset integration should include:

a dataset card
a suite-specific validator
preprocessing rules
feature engineering rules
task definitions
limitation notes
tests for supported schema and task behavior
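One way to keep these pieces separate per dataset is a small base class that each new suite fills in. The sketch below is an assumption about how such a suite could be organized: `DatasetSuite`, its method names, and the toy subclass are hypothetical and do not describe how `CKDSuite` is implemented.

```python
from abc import ABC, abstractmethod


class DatasetSuite(ABC):
    """Hypothetical base class: one subclass per dataset, never shared schemas."""

    name = "unnamed"
    supported_tasks = ()

    @abstractmethod
    def validate(self, rows):
        """Return a list of problems; an empty list means the rows are usable."""

    @abstractmethod
    def preprocess(self, rows):
        """Return cleaned rows (types, units, missing-value handling)."""

    @abstractmethod
    def engineer_features(self, rows):
        """Return rows with benchmark-derived fields added."""

    def build_records(self, rows, task):
        # Refuse tasks the dataset cannot support, then validate before preprocessing.
        if task not in self.supported_tasks:
            raise ValueError(f"{self.name} does not support task {task!r}")
        problems = self.validate(rows)
        if problems:
            raise ValueError(f"{self.name} validation failed: {problems}")
        return self.engineer_features(self.preprocess(rows))


class ToySuite(DatasetSuite):
    """Toy example with a single made-up column, only to show the flow."""

    name = "toy"
    supported_tasks = ("detection",)

    def validate(self, rows):
        return [f"row {i}: missing 'value'" for i, row in enumerate(rows) if "value" not in row]

    def preprocess(self, rows):
        return [{**row, "value": float(row["value"])} for row in rows]

    def engineer_features(self, rows):
        return [{**row, "value_doubled": row["value"] * 2} for row in rows]
```

With this layout, a call such as `ToySuite().build_records(rows, "staging")` fails immediately, because the toy suite declares only the detection task.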