Reports

May 19, 2026

CKDSuite Benchmark

Last updated: June 3, 2026 - see update log.

This report evaluates seven LLM backends, all routed through OpenRouter, on three Krisis CKDSuite tasks for chronic kidney disease (CKD): detection, staging, and synthetic progression. Detection asks whether CKD is present. Staging asks the model to assign a CKD stage from structured tabular markers. Progression asks the model to classify a synthetic two-visit trajectory as worsening, improving, or stable. Every task allows abstention on cases marked as ambiguous or unsafe to answer. All three tasks use the same 160-row evaluation setup: 80 held-out UCI CKD records and 80 synthetic stress-test records generated from the training split.
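To make the setup concrete, here is a minimal Python sketch of the three label spaces, the abstention option, and the 80 + 80 row composition. It is an illustration under stated assumptions, not the benchmark's actual harness: the `ABSTAIN` token, the detection and staging label strings, the `EVAL_ROWS` keys, and the `score` rule (abstaining counts as correct only when the row's gold label is also an abstention) are all hypothetical. Only the progression labels and the row counts come from the description above.

```python
# Hypothetical token a model returns when it declines to answer.
ABSTAIN = "abstain"

# Label space per task. The progression labels are taken from the report;
# the detection and staging strings are illustrative assumptions.
TASK_LABELS = {
    "detection": {"ckd", "no_ckd"},
    "staging": {"stage_1", "stage_2", "stage_3", "stage_4", "stage_5"},
    "progression": {"worsening", "improving", "stable"},
}

# Row composition shared by all three tasks (counts from the report,
# key names assumed).
EVAL_ROWS = {"uci_holdout": 80, "synthetic_stress": 80}


def score(task, gold, predicted):
    """Fraction of rows where the prediction equals the gold label.

    Assumed rule: a prediction outside the task's label space (plus
    ABSTAIN) is counted as wrong, and abstaining is correct only when
    the gold label is ABSTAIN.
    """
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    allowed = TASK_LABELS[task] | {ABSTAIN}
    hits = sum(1 for g, p in zip(gold, predicted) if p in allowed and p == g)
    return hits / len(gold)


# Toy example: two matches, one wrong label, one out-of-vocabulary label.
gold = ["worsening", "stable", ABSTAIN, "improving"]
pred = ["worsening", "improving", ABSTAIN, "recovering"]
print(sum(EVAL_ROWS.values()))           # 160
print(score("progression", gold, pred))  # 0.5
```

Treating abstention as one more label keeps the scoring path identical across the three tasks. A real harness may well score abstentions separately from ordinary errors.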