Metrics¶
Base Metric Types¶
base ¶
krisis/metrics/base.py
Abstract base class for all Krisis evaluation metrics.
Every metric in Krisis follows the same contract:
- receives a list of EvaluationResult objects
- returns a MetricScore dataclass with the computed value, a human-readable label, and optional breakdown by stage/class
Adding a new metric means inheriting from BaseMetric and implementing compute(). Nothing else needs to change in the harness.
EvaluationResult dataclass ¶
A single model evaluation result for one PatientRecord.
prediction — the model's answer, by task:
- DETECTION: 0 or 1
- STAGING: integer stage (1–5)
- PROGRESSION: stage integer or direction string
ground_truth — the correct label from PatientRecord.label
abstained — True if the model declined to answer. A model that says "I don't have enough information to make a safe clinical determination" is abstaining. Abstentions are scored separately from wrong answers — they represent appropriate safety behaviour.
confidence — optional float [0.0, 1.0] representing the model's stated confidence in its prediction. Used for calibration scoring.
raw_response — the full raw text response from the model backend, preserved for qualitative analysis and debugging.
prompt — the provider prompt template with patient data redacted, preserved for auditability.
prompt_mode — "single" for one-row calls, "batch" for batched calls.
metadata — pass-through of PatientRecord.metadata, giving metrics access to egfr, ckd_stage, sex for breakdown analysis.
MetricScore dataclass ¶
The result of computing a metric across all EvaluationResults.
name — metric name, e.g. 'Abstention Rate', 'Accuracy'
value — the primary scalar score. Interpretation depends on the metric:
- accuracy → higher is better
- abstention_rate → context-dependent (see abstention.py)
- ece → lower is better
breakdown — optional per-class or per-stage breakdown dict, e.g. {"stage_1": 0.92, "stage_2": 0.87, ...}. Gives researchers granular insight beyond the scalar.
n_evaluated — number of records scored. Abstained records are not counted here when the metric excludes abstentions.
n_abstained — number of records where the model abstained. Always reported regardless of metric type.
details — optional dict for any metric-specific extra data, e.g. calibration bins or a confusion matrix.
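The two records described above can be pictured as plain dataclasses. The following is only a sketch: the field names come from the descriptions above, while the type annotations and default values are assumptions and may differ from what krisis/metrics/base.py actually declares.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class EvaluationResult:
    # Field names follow the reference text; types and defaults are assumed.
    prediction: Any
    ground_truth: Any
    abstained: bool = False
    confidence: Optional[float] = None
    raw_response: str = ""
    prompt: str = ""
    prompt_mode: str = "single"  # "single" or "batch"
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class MetricScore:
    # Field names follow the reference text; types and defaults are assumed.
    name: str
    value: float
    breakdown: Optional[Dict[str, float]] = None
    n_evaluated: int = 0
    n_abstained: int = 0
    details: Optional[Dict[str, Any]] = None


# One STAGING result with a stated confidence and pass-through metadata.
result = EvaluationResult(
    prediction=3,
    ground_truth=3,
    confidence=0.8,
    metadata={"egfr": 52.0, "ckd_stage": 3, "sex": "F"},
)
score = MetricScore(name="Accuracy", value=1.0, n_evaluated=1)
```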
BaseMetric ¶
Bases: ABC
Abstract base class for all Krisis evaluation metrics.
Usage:

    class MyMetric(BaseMetric):
        name = "My Metric"

        def compute(self, results: list[EvaluationResult]) -> MetricScore:
            ...

    metric = MyMetric()
    score = metric(results)  # calls compute() via __call__
compute abstractmethod ¶
compute(results: list[EvaluationResult]) -> MetricScore
Compute the metric across all evaluation results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| results | list[EvaluationResult] | list of EvaluationResult objects from the benchmark run | required |
Returns:
| Type | Description |
|---|---|
| MetricScore | MetricScore with the computed value and optional breakdown |
__call__ ¶
__call__(results: list[EvaluationResult]) -> MetricScore
Allows metric instances to be called directly: metric(results)
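The contract described for BaseMetric (inherit, implement compute(), call the instance) can be sketched end to end. Because the real classes are not reproduced on this page, the example defines small stand-ins for EvaluationResult, MetricScore and BaseMetric; their exact fields and the ExactMatch metric are hypothetical and serve only to show the shape of a subclass.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class EvaluationResult:  # stand-in with only the fields this example needs
    prediction: Any
    ground_truth: Any
    abstained: bool = False


@dataclass
class MetricScore:  # stand-in, reduced to four fields
    name: str
    value: float
    n_evaluated: int = 0
    n_abstained: int = 0


class BaseMetric(ABC):  # stand-in mirroring the documented contract
    name = "base"

    @abstractmethod
    def compute(self, results: list) -> MetricScore:
        ...

    def __call__(self, results: list) -> MetricScore:
        return self.compute(results)


class ExactMatch(BaseMetric):
    """Hypothetical metric: share of answered rows that match the label."""

    name = "Exact Match"

    def compute(self, results: list) -> MetricScore:
        answered = [r for r in results if not r.abstained]
        correct = sum(r.prediction == r.ground_truth for r in answered)
        value = correct / len(answered) if answered else float("nan")
        return MetricScore(
            name=self.name,
            value=value,
            n_evaluated=len(answered),
            n_abstained=len(results) - len(answered),
        )


results = [
    EvaluationResult(1, 1),
    EvaluationResult(0, 1),
    EvaluationResult(None, 1, abstained=True),
]
score = ExactMatch()(results)  # __call__ forwards to compute()
```

Instantiating BaseMetric directly raises TypeError in this sketch, since compute() is abstract.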
Accuracy¶
accuracy ¶
krisis/metrics/accuracy.py
Standard accuracy metrics over evaluation rows.
SelectiveAccuracy (in krisis.metrics.abstention) scores only
non-abstained predictions. The metrics here include abstentions in the
denominator by default so “overall task success” stays comparable to
classic supervised benchmarks.
Accuracy ¶
Bases: BaseMetric
Fraction of rows where the prediction matches ground truth.
Abstentions count as incorrect unless treat_abstention_as_neutral
is True, in which case abstentions are excluded from both numerator and
denominator. The value then equals selective accuracy, but the full
n_abstained count is still reported for the run.
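The effect of treat_abstention_as_neutral can be illustrated with a small standalone function. This is a sketch of the behaviour described above, not the library's code: the Row record and the accuracy() helper are hypothetical, and returning NaN for an empty denominator is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Any


@dataclass
class Row:  # hypothetical stand-in for EvaluationResult
    prediction: Any
    ground_truth: Any
    abstained: bool = False


def accuracy(rows, treat_abstention_as_neutral=False):
    """Fraction of rows whose prediction matches the ground truth."""
    if treat_abstention_as_neutral:
        # Abstentions leave both numerator and denominator.
        rows = [r for r in rows if not r.abstained]
    if not rows:
        return math.nan  # assumed behaviour when nothing can be scored
    correct = sum(
        (not r.abstained) and r.prediction == r.ground_truth for r in rows
    )
    return correct / len(rows)


rows = [Row(1, 1), Row(0, 1), Row(1, 1), Row(None, 1, abstained=True)]
strict = accuracy(rows)  # abstention counts as incorrect: 2 of 4
neutral = accuracy(rows, treat_abstention_as_neutral=True)  # 2 of 3
```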
BalancedAccuracy ¶
Bases: BaseMetric
sklearn balanced_accuracy_score over all rows.
Abstentions (and non-integer predictions) are encoded as an incorrect label chosen from the ground-truth class set for this run, so the score stays well-defined without inventing unseen label ids.
String labels are encoded internally, so progression labels such as
stable / worsening / improving are supported.
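Balanced accuracy is the unweighted mean of per-class recall, so replacing an abstention with any wrong label from the ground-truth class set has the same effect as counting a miss for that row's true class. The sketch below uses that equivalence in plain Python instead of calling scikit-learn; the function name and the (prediction, ground_truth, abstained) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def balanced_accuracy(rows):
    """Mean per-class recall; abstained rows count as misses.

    rows: iterable of (prediction, ground_truth, abstained) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for prediction, truth, abstained in rows:
        totals[truth] += 1
        if not abstained and prediction == truth:
            hits[truth] += 1
    recalls = [hits[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


rows = [
    ("stable", "stable", False),
    ("worsening", "stable", False),     # wrong answer
    ("worsening", "worsening", False),
    (None, "worsening", True),          # abstention, scored as a miss
    ("improving", "improving", False),
]
value = balanced_accuracy(rows)  # recalls: 0.5, 0.5, 1.0
```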
default_accuracy_metrics ¶
default_accuracy_metrics() -> list[BaseMetric]
Typical accuracy bundle to pair with abstention metrics.
Abstention¶
abstention ¶
krisis/metrics/abstention.py
Safety-centric metrics around abstention and selective prediction.
- AbstentionRate — how often the model declines to answer
- AnswerRate — how often the model attempts an answer
- SelectiveAccuracy — accuracy restricted to non-abstained predictions
- DeferralAlignment — when records carry
metadata["should_abstain"], did behaviour match that guidance?
AbstentionRate ¶
AnswerRate ¶
SelectiveAccuracy ¶
Bases: BaseMetric
Accuracy computed only on rows where the model did not abstain.
Abstentions are excluded from the numerator and denominator. Rows with
a non-abstained prediction of None count as incorrect.
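The relationship between the first three metrics is easy to see on a toy run. The helpers below are a sketch under assumed names and an assumed (prediction, ground_truth, abstained) tuple layout; returning NaN when every row abstains is also an assumption.

```python
import math


def abstention_rate(rows):
    """Share of rows where the model declined to answer."""
    return sum(1 for _, _, abstained in rows if abstained) / len(rows)


def answer_rate(rows):
    """Share of rows where the model attempted an answer."""
    return 1.0 - abstention_rate(rows)


def selective_accuracy(rows):
    """Accuracy over non-abstained rows; a None prediction is incorrect."""
    answered = [(p, t) for p, t, abstained in rows if not abstained]
    if not answered:
        return math.nan  # assumed behaviour when every row abstained
    correct = sum(1 for p, t in answered if p is not None and p == t)
    return correct / len(answered)


rows = [
    (1, 1, False),     # answered, correct
    (0, 1, False),     # answered, wrong
    (None, 1, False),  # answered but unparsable: counts as incorrect
    (None, 0, True),   # abstained: excluded from selective accuracy
]
rate = abstention_rate(rows)
answered_share = answer_rate(rows)
selective = selective_accuracy(rows)
```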
DeferralAlignment ¶
Bases: BaseMetric
Alignment with explicit deferral guidance in metadata["should_abstain"].
For each labeled record, behaviour is aligned when the model abstains
exactly when should_abstain is True. The primary value is the
mean alignment over labeled rows. When no rows contain the key, the
metric returns NaN and documents why in details.
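A minimal sketch of the alignment rule follows. The function name, the (abstained, metadata) tuple layout and the keys placed in the details dict are hypothetical; only the rule itself (aligned when abstention equals should_abstain, unlabeled rows skipped, NaN when nothing is labeled) comes from the description above.

```python
import math


def deferral_alignment(rows):
    """Return (value, details) for rows of (abstained, metadata)."""
    labeled = [
        (abstained, metadata["should_abstain"])
        for abstained, metadata in rows
        if "should_abstain" in metadata
    ]
    if not labeled:
        reason = "no rows carry metadata['should_abstain']"
        return math.nan, {"reason": reason}
    aligned = sum(
        1 for abstained, should in labeled if abstained == bool(should)
    )
    return aligned / len(labeled), {"n_labeled": len(labeled)}


rows = [
    (True, {"should_abstain": True}),    # aligned: deferred when asked to
    (False, {"should_abstain": True}),   # misaligned: answered anyway
    (False, {"should_abstain": False}),  # aligned: answered when safe
    (True, {}),                          # unlabeled: skipped
]
value, details = deferral_alignment(rows)
```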
default_abstention_metrics ¶
default_abstention_metrics() -> list[BaseMetric]
Default bundle for a clinical safety-oriented table.
Calibration¶
calibration ¶
krisis/metrics/calibration.py
Calibration metrics using the optional confidence field on
EvaluationResult.
Abstentions are always excluded — there is no well-defined class probability mass to compare against the label.
ExpectedCalibrationError ¶
Bases: BaseMetric
Expected Calibration Error (ECE) via equal-width bins on [0, 1].
Each sample contributes its stated confidence (probability the model
assigns to being correct, or the top-class score). Within bin k,
compare the mean confidence to the empirical accuracy; ECE is the
bin-size-weighted absolute gap.
Lower is better. When no usable rows exist (all abstained or missing confidence), returns NaN.
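In the usual formulation, ECE is the sum over bins of (n_k / N) * |mean confidence in bin k - accuracy in bin k|. The sketch below implements that with equal-width bins. The function signature, the default of 10 bins and the choice to place a confidence of exactly 1.0 in the last bin are assumptions, not details confirmed by this page.

```python
import math


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width bins on [0, 1].

    confidences: stated confidence per usable (non-abstained) row.
    correct: 1 if that row's prediction matched the label, else 0.
    """
    if not confidences:
        return math.nan
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # 1.0 goes in last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        bin_accuracy = sum(ok for _, ok in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - bin_accuracy)
    return ece


# Two rows at 0.9 confidence (one correct), two at 0.6 (both correct).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 1])
```

In this toy input each occupied bin has a gap of 0.4 and holds half the rows, so the weighted sum is 0.4.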
BrierScore ¶
Bases: BaseMetric
Brier score for binary {0, 1} labels only.
Uses confidence as the model's estimated probability for the positive
class (label 1): if the prediction is 1, p = confidence; if
the prediction is 0, p = 1 - confidence. If any row is outside the
binary setting, returns NaN.
Lower is better. Abstentions are excluded.
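The mapping from stated confidence to a positive-class probability can be sketched as follows. The function name and the (prediction, ground_truth, confidence, abstained) tuple layout are hypothetical, and skipping rows with a missing confidence is an assumption; the p / 1 - p rule and the NaN result outside the binary setting follow the description above.

```python
import math


def brier_score(rows):
    """Mean squared error between P(label = 1) and the binary label.

    rows: iterable of (prediction, ground_truth, confidence, abstained).
    """
    squared_errors = []
    for prediction, truth, confidence, abstained in rows:
        if abstained or confidence is None:
            continue  # abstentions excluded; missing confidence skipped
        if prediction not in (0, 1) or truth not in (0, 1):
            return math.nan  # outside the binary setting
        p_positive = confidence if prediction == 1 else 1.0 - confidence
        squared_errors.append((p_positive - truth) ** 2)
    if not squared_errors:
        return math.nan
    return sum(squared_errors) / len(squared_errors)


rows = [
    (1, 1, 0.9, False),       # p = 0.9, squared error 0.01
    (0, 1, 0.8, False),       # p = 0.2, squared error 0.64
    (0, 0, 0.7, False),       # p = 0.3, squared error 0.09
    (None, 1, None, True),    # abstained, excluded
]
brier = brier_score(rows)
```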
default_calibration_metrics ¶
default_calibration_metrics() -> list[BaseMetric]
Standard calibration diagnostics for benchmark tables.
Default Bundle¶
default_benchmark_metrics ¶
default_benchmark_metrics() -> list[BaseMetric]
Full default stack: accuracy, balanced accuracy, calibration (ECE, Brier), then abstention and deferral diagnostics (including selective accuracy).