Krisis

Clinical evaluation framework for testing LLM safety behavior in medical reasoning.

Pronounced kree-sis.

Krisis evaluates not only whether an LLM is correct, but whether it knows when to abstain, defer, or express uncertainty in high-stakes clinical tasks.

Research software, not clinical software

Krisis is for evaluation and research. It is not a medical device and must not be used to diagnose, treat, or triage patients.

Why Krisis Exists

Krisis grew out of Cady AI, an earlier chronic kidney disease (CKD) detection chatbot presented at a national AI hackathon. Cady AI used a model trained on the UCI Chronic Kidney Disease dataset to predict CKD versus not-CKD, return class probabilities, and identify which lab results pushed risk upward.

That project exposed the next safety question: as LLMs become more fluent in clinical reasoning, can they recognize cases where they should not confidently answer?

Krisis turns that question into a reusable evaluation framework: a human-in-the-loop style of system for checking whether LLMs can defer, abstain, and express uncertainty before their outputs are trusted.

What Krisis Provides

| Layer | What it does |
| --- | --- |
| Suites | Converts clinical datasets into benchmark-ready `PatientRecord` objects |
| Backends | Adapts OpenAI, Anthropic, Grok, and Gemini into one response shape |
| Benchmark runner | Runs batched/concurrent evaluations with retries and fallbacks |
| Metrics | Scores accuracy, calibration, abstention, coverage, and deferral behavior |
| Reports | Emits text, full JSON, and metrics-only JSON outputs |
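To make the Metrics row more concrete, here is a minimal sketch of three common selective-prediction scores: coverage, accuracy over answered cases, and expected calibration error. The `Result` class and all function names are hypothetical illustrations and are not the Krisis API; the actual metric definitions in Krisis may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Result:
    """Hypothetical per-case result; not the Krisis EvaluationResult type."""
    label: str                 # ground-truth class
    prediction: Optional[str]  # None means the model abstained or deferred
    confidence: float = 0.0    # model-reported probability for its prediction


def coverage(results: Sequence[Result]) -> float:
    """Fraction of cases the model chose to answer."""
    if not results:
        return 0.0
    return sum(r.prediction is not None for r in results) / len(results)


def selective_accuracy(results: Sequence[Result]) -> float:
    """Accuracy computed only over the answered cases."""
    answered = [r for r in results if r.prediction is not None]
    if not answered:
        return 0.0
    return sum(r.prediction == r.label for r in answered) / len(answered)


def expected_calibration_error(results: Sequence[Result], n_bins: int = 10) -> float:
    """Weighted mean gap between confidence and accuracy across equal-width bins."""
    answered = [r for r in results if r.prediction is not None]
    if not answered:
        return 0.0
    bins: List[List[Result]] = [[] for _ in range(n_bins)]
    for r in answered:
        # Clamp so that confidence 1.0 (or out-of-range values) lands in a valid bin.
        idx = min(max(int(r.confidence * n_bins), 0), n_bins - 1)
        bins[idx].append(r)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        acc = sum(r.prediction == r.label for r in b) / len(b)
        conf = sum(r.confidence for r in b) / len(b)
        ece += len(b) / len(answered) * abs(acc - conf)
    return ece
```

Reporting coverage next to accuracy matters because a model can raise its accuracy on answered cases simply by abstaining more often; the two numbers only make sense together.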

Current Scope

Krisis v0.1 includes one implemented clinical suite: Chronic Kidney Disease (CKD), based on the UCI CKD dataset.

Supported CKD tasks:

- detection: CKD vs not CKD
- staging: CKD stage classification
- progression: synthetic progression stress test

Progression is synthetic

The UCI CKD dataset is cross-sectional, not longitudinal. Krisis creates synthetic two-visit trajectories for progression stress testing. These are useful for evaluating reasoning and deferral behavior, but they are not real longitudinal CKD outcomes.
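How Krisis builds its synthetic trajectories is not described on this page. As a rough sketch of the general idea only, the example below derives a second visit by perturbing a baseline record. The `Visit` class, its field names, the `synthesize_followup` function, and the scaling ranges are all assumptions made for illustration; the ranges are arbitrary and are not clinical estimates.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Visit:
    """Hypothetical single-visit snapshot; field names are illustrative."""
    serum_creatinine: float  # mg/dL
    hemoglobin: float        # g/dL


def synthesize_followup(baseline: Visit, worsening: bool, rng: random.Random) -> Visit:
    """Create a synthetic second visit by scaling the baseline labs.

    The scaling ranges are arbitrary illustrative values, not clinical estimates.
    """
    if worsening:
        creat_factor = rng.uniform(1.2, 1.6)    # creatinine rises
        hgb_factor = rng.uniform(0.85, 0.95)    # hemoglobin falls
    else:
        creat_factor = rng.uniform(0.95, 1.05)  # roughly stable
        hgb_factor = rng.uniform(0.97, 1.03)
    return Visit(
        serum_creatinine=round(baseline.serum_creatinine * creat_factor, 2),
        hemoglobin=round(baseline.hemoglobin * hgb_factor, 1),
    )


rng = random.Random(0)  # fixed seed so the synthetic data is reproducible
baseline = Visit(serum_creatinine=1.8, hemoglobin=11.2)
trajectory = (baseline, synthesize_followup(baseline, worsening=True, rng=rng))
```

Because the second visit is generated from the first, the "outcome" is known by construction. That makes such pairs suitable for checking whether a model reasons about change and defers appropriately, and unsuitable as evidence about real disease progression.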

Typical Workflow

flowchart TD
    A["Local UCI CKD CSV"] --> B["CKDSuite"]
    B --> C["PatientRecord rows"]
    C --> D["Benchmark"]
    D --> E["Model Backend"]
    E --> F["EvaluationResult rows"]
    F --> G["Metrics"]
    G --> H["Text / JSON reports"]
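The flow above can be sketched in plain Python to show how the pieces hand data to one another. Every name here (`Record`, `Response`, `Backend`, `StubBackend`, `run_benchmark`, `summarize`) is a hypothetical stand-in and not the Krisis API, and the creatinine threshold in the stub is a toy rule with no clinical meaning.

```python
import json
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass
class Record:
    """Stand-in for a benchmark-ready patient row (hypothetical fields)."""
    record_id: str
    features: Dict[str, Optional[float]]
    label: str


@dataclass
class Response:
    """One response shape that every backend adapter returns (hypothetical)."""
    answer: Optional[str]  # None signals abstention or deferral
    confidence: float


class Backend(Protocol):
    def complete(self, record: Record) -> Response: ...


class StubBackend:
    """Toy backend: abstains when a key lab value is missing."""

    def complete(self, record: Record) -> Response:
        sc = record.features.get("serum_creatinine")
        if sc is None:
            return Response(answer=None, confidence=0.0)
        # Toy threshold for illustration only; not a clinical rule.
        return Response(answer="ckd" if sc > 1.3 else "notckd", confidence=0.7)


def run_benchmark(records: List[Record], backend: Backend, max_retries: int = 2) -> List[dict]:
    """Query the backend once per record, retrying on errors."""
    rows = []
    for rec in records:
        # Fallback used if every attempt raises: treat the case as unanswered.
        response = Response(answer=None, confidence=0.0)
        for _ in range(max_retries + 1):
            try:
                response = backend.complete(rec)
                break
            except Exception:
                continue
        rows.append({
            "id": rec.record_id,
            "label": rec.label,
            "answer": response.answer,
            "abstained": response.answer is None,
            "correct": response.answer == rec.label,
        })
    return rows


def summarize(rows: List[dict]) -> dict:
    """Collapse per-case rows into counts suitable for a JSON report."""
    return {
        "n": len(rows),
        "answered": sum(not r["abstained"] for r in rows),
        "abstained": sum(r["abstained"] for r in rows),
        "correct": sum(r["correct"] for r in rows),
    }


records = [
    Record("p1", {"serum_creatinine": 4.0}, "ckd"),
    Record("p2", {"serum_creatinine": None}, "ckd"),
    Record("p3", {"serum_creatinine": 0.9}, "notckd"),
]
rows = run_benchmark(records, StubBackend())
report = json.dumps(summarize(rows), indent=2)
```

Keeping abstention as an explicit `None` answer, instead of a special string, lets downstream scoring separate "answered wrongly" from "declined to answer" without parsing model text.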

Next Steps

- Start with Getting Started to run a benchmark.
- Read Core Concepts to understand the framework design.
- Use API Reference when integrating Krisis in code.