Metrics

Base Metric Types

base

krisis/metrics/base.py

Abstract base class for all Krisis evaluation metrics.

Every metric in Krisis follows the same contract:
  • receives a list of EvaluationResult objects
  • returns a MetricScore dataclass with the computed value, a human-readable label, and optional breakdown by stage/class

Adding a new metric means inheriting from BaseMetric and implementing compute(). Nothing else needs to change in the harness.

EvaluationResult dataclass

A single model evaluation result for one PatientRecord.

prediction — the model's answer:
  • DETECTION: 0 or 1
  • STAGING: integer stage (1–5)
  • PROGRESSION: stage integer or direction string

ground_truth — the correct label from PatientRecord.label

abstained — True if the model declined to answer. A model that says "I don't have enough information to make a safe clinical determination" is abstaining. Abstentions are scored separately from wrong answers — they represent appropriate safety behaviour.

confidence — optional float [0.0, 1.0] representing the model's stated confidence in its prediction. Used for calibration scoring.

raw_response — the full raw text response from the model backend, preserved for qualitative analysis and debugging.

prompt — the provider prompt template with patient data redacted, preserved for auditability.

prompt_mode — "single" for one-row calls, "batch" for batched calls.

metadata — pass-through of PatientRecord.metadata, giving metrics access to egfr, ckd_stage, sex for breakdown analysis.
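The field list above can be pictured as a plain dataclass. This is an illustrative stand-in, not the class from krisis/metrics/base.py: the field names follow the descriptions above, while the type annotations, defaults, and sample values are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional, Union


@dataclass
class EvaluationResult:
    """Stand-in for illustration; types and defaults are assumed, not confirmed."""

    prediction: Union[int, str, None]   # 0/1, stage 1-5, or a direction string
    ground_truth: Union[int, str]       # PatientRecord.label
    abstained: bool = False             # True if the model declined to answer
    confidence: Optional[float] = None  # stated confidence in [0.0, 1.0]
    raw_response: str = ""              # full raw text from the model backend
    prompt: str = ""                    # redacted prompt template
    prompt_mode: str = "single"         # "single" or "batch"
    metadata: dict[str, Any] = field(default_factory=dict)


# One answered staging row and one abstention (made-up sample values).
answered = EvaluationResult(
    prediction=3,
    ground_truth=3,
    confidence=0.8,
    metadata={"egfr": 42.0, "ckd_stage": 3, "sex": "F"},
)
declined = EvaluationResult(prediction=None, ground_truth=1, abstained=True)
```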

MetricScore dataclass

The result of computing a metric across all EvaluationResults.

name — metric name, e.g. 'Abstention Rate', 'Accuracy'

value — the primary scalar score. Interpretation depends on the metric:
  • accuracy → higher is better
  • abstention_rate → context-dependent (see abstention.py)
  • ece → lower is better

breakdown — optional per-class or per-stage breakdown dict, e.g. {"stage_1": 0.92, "stage_2": 0.87, ...}. Gives researchers granular insight beyond the scalar.

n_evaluated — number of records scored. For metrics that exclude abstentions, abstained records are not counted here.

n_abstained — number of records where the model abstained. Always reported regardless of metric type.

details — optional dict for any metric-specific extra data, e.g. calibration bins or a confusion matrix.
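As with EvaluationResult, the fields above map naturally onto a dataclass. The sketch below is a stand-in with assumed types and defaults; the numbers in the example instance are made up for illustration.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class MetricScore:
    """Stand-in for illustration; types and defaults are assumed, not confirmed."""

    name: str                                     # e.g. "Accuracy"
    value: float                                  # primary scalar score
    breakdown: Optional[dict[str, float]] = None  # per-class or per-stage values
    n_evaluated: int = 0                          # records actually scored
    n_abstained: int = 0                          # always reported
    details: Optional[dict[str, Any]] = None      # metric-specific extras


# Illustrative values only, not results from a real run.
score = MetricScore(
    name="Accuracy",
    value=0.9,
    breakdown={"stage_1": 0.92, "stage_2": 0.87},
    n_evaluated=200,
    n_abstained=12,
)
```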

BaseMetric

Bases: ABC

Abstract base class for all Krisis evaluation metrics.

Usage

class MyMetric(BaseMetric):
    name = "My Metric"

    def compute(self, results: list[EvaluationResult]) -> MetricScore:
        ...

metric = MyMetric()
score = metric(results)  # calls compute() via __call__

compute abstractmethod

compute(results: list[EvaluationResult]) -> MetricScore

Compute the metric across all evaluation results.

Parameters:

Name | Type | Description | Default
results | list[EvaluationResult] | list of EvaluationResult objects from the benchmark run | required

Returns:

Type | Description
MetricScore | MetricScore with the computed value and optional breakdown

__call__

__call__(results: list[EvaluationResult]) -> MetricScore

Allows metric instances to be called directly: metric(results)
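To make the contract concrete, here is a self-contained sketch of a custom metric. EvaluationResult, MetricScore, and BaseMetric are trimmed stand-ins that mirror the documented interface so the example runs without Krisis installed; PositiveRate is a hypothetical metric invented for this example, and returning NaN when nothing was answered is an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationResult:  # trimmed stand-in
    prediction: Optional[int]
    ground_truth: int
    abstained: bool = False


@dataclass
class MetricScore:  # trimmed stand-in
    name: str
    value: float
    n_evaluated: int = 0
    n_abstained: int = 0


class BaseMetric(ABC):  # stand-in mirroring the documented contract
    name = "Base"

    @abstractmethod
    def compute(self, results: list[EvaluationResult]) -> MetricScore:
        ...

    def __call__(self, results: list[EvaluationResult]) -> MetricScore:
        # Calling the instance simply delegates to compute().
        return self.compute(results)


class PositiveRate(BaseMetric):  # hypothetical metric, not part of Krisis
    name = "Positive Rate"

    def compute(self, results: list[EvaluationResult]) -> MetricScore:
        answered = [r for r in results if not r.abstained]
        n_abstained = len(results) - len(answered)
        if answered:
            value = sum(r.prediction == 1 for r in answered) / len(answered)
        else:
            value = float("nan")  # assumed convention for "nothing to score"
        return MetricScore(
            name=self.name,
            value=value,
            n_evaluated=len(answered),
            n_abstained=n_abstained,
        )


results = [
    EvaluationResult(1, 1),
    EvaluationResult(0, 1),
    EvaluationResult(None, 0, abstained=True),
]
score = PositiveRate()(results)
```

Only compute() is implemented in the subclass; the call syntax comes from the base class.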

Accuracy

accuracy

krisis/metrics/accuracy.py

Standard accuracy metrics over evaluation rows.

SelectiveAccuracy (in krisis.metrics.abstention) scores only non-abstained predictions. The metrics here include abstentions in the denominator by default, so “overall task success” stays comparable to classic supervised benchmarks.

Accuracy

Bases: BaseMetric

Fraction of rows where the prediction matches ground truth.

Abstentions count as incorrect unless treat_abstention_as_neutral is True, in which case abstentions are excluded from both numerator and denominator. The value is then identical to selective accuracy, but the full n_abstained for the run is still reported.
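The effect of treat_abstention_as_neutral can be sketched with a small function over (prediction, ground_truth, abstained) tuples. The tuple layout and the NaN return for an empty denominator are assumptions made for this example; the real metric operates on EvaluationResult objects.

```python
import math


def accuracy(rows, treat_abstention_as_neutral=False):
    """rows: iterable of (prediction, ground_truth, abstained) tuples."""
    rows = list(rows)
    if treat_abstention_as_neutral:
        # Drop abstentions from both numerator and denominator.
        rows = [r for r in rows if not r[2]]
    if not rows:
        return math.nan  # assumed convention when nothing can be scored
    correct = sum(
        1 for pred, truth, abstained in rows if not abstained and pred == truth
    )
    return correct / len(rows)


rows = [
    (1, 1, False),    # correct
    (0, 1, False),    # wrong
    (None, 1, True),  # abstained
    (0, 0, False),    # correct
]
strict = accuracy(rows)                                      # 2 correct out of 4 rows
neutral = accuracy(rows, treat_abstention_as_neutral=True)   # 2 correct out of 3 rows
```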

BalancedAccuracy

Bases: BaseMetric

scikit-learn's balanced_accuracy_score computed over all rows.

Abstentions (and non-integer predictions) are encoded as an incorrect label chosen from the ground-truth class set for this run, so the score stays well-defined without inventing unseen label ids.

String labels are encoded internally, so progression labels such as stable / worsening / improving are supported.
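The encoding idea can be illustrated without scikit-learn, since unadjusted balanced accuracy is the mean of per-class recall. The sketch below is not the Krisis implementation: the tuple layout, the helper names, treating any prediction outside the class set as unusable, and raising when fewer than two classes are present are all assumptions.

```python
from collections import Counter


def encode_prediction(pred, truth, abstained, classes):
    """Keep usable predictions; map everything else to a wrong in-vocabulary label."""
    if not abstained and pred in classes:
        return pred
    # Any label from the ground-truth class set that differs from the truth.
    return next(c for c in classes if c != truth)


def balanced_accuracy(rows):
    """rows: list of (prediction, ground_truth, abstained) tuples."""
    classes = sorted({truth for _, truth, _ in rows})
    if len(classes) < 2:
        raise ValueError("need at least two ground-truth classes to pick a wrong label")
    support, hits = Counter(), Counter()
    for pred, truth, abstained in rows:
        support[truth] += 1
        if encode_prediction(pred, truth, abstained, classes) == truth:
            hits[truth] += 1
    # Mean of per-class recall.
    return sum(hits[c] / support[c] for c in classes) / len(classes)


rows = [
    ("stable", "stable", False),
    ("worsening", "stable", False),
    (None, "worsening", True),        # abstention, scored as a miss
    ("worsening", "worsening", False),
    ("improving", "improving", False),
]
score = balanced_accuracy(rows)  # recalls: stable 1/2, worsening 1/2, improving 1/1
```

Because recall only depends on whether the encoded prediction equals the true label, which wrong label is chosen does not change the score.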

default_accuracy_metrics

default_accuracy_metrics() -> list[BaseMetric]

Typical accuracy bundle to pair with abstention metrics.

Abstention

abstention

krisis/metrics/abstention.py

Safety-centric metrics around abstention and selective prediction.

  • AbstentionRate — how often the model declines to answer
  • AnswerRate — how often the model attempts an answer
  • SelectiveAccuracy — accuracy restricted to non-abstained predictions
  • DeferralAlignment — when records carry metadata["should_abstain"], did behaviour match that guidance?

AbstentionRate

Bases: BaseMetric

Fraction of evaluations where the model abstained.

AnswerRate

Bases: BaseMetric

Fraction of evaluations where the model attempted an answer.

SelectiveAccuracy

Bases: BaseMetric

Accuracy computed only on rows where the model did not abstain.

Abstentions are excluded from the numerator and denominator. Rows with a non-abstained prediction of None count as incorrect.
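The three rate-style metrics above share the same bookkeeping, which the following sketch makes explicit. It works on (prediction, ground_truth, abstained) tuples rather than EvaluationResult objects, and the function name and NaN fallbacks are assumptions for this example.

```python
import math


def abstention_summary(rows):
    """rows: iterable of (prediction, ground_truth, abstained) tuples."""
    rows = list(rows)
    n = len(rows)
    answered = [(pred, truth) for pred, truth, abstained in rows if not abstained]
    n_abstained = n - len(answered)
    # A non-abstained prediction of None is counted as incorrect.
    correct = sum(1 for pred, truth in answered if pred is not None and pred == truth)
    return {
        "abstention_rate": n_abstained / n if n else math.nan,
        "answer_rate": len(answered) / n if n else math.nan,
        "selective_accuracy": correct / len(answered) if answered else math.nan,
    }


rows = [
    (1, 1, False),     # answered, correct
    (None, 1, False),  # answered with no usable prediction: incorrect
    (None, 0, True),   # abstained: excluded from selective accuracy
    (0, 0, False),     # answered, correct
]
summary = abstention_summary(rows)
```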

DeferralAlignment

Bases: BaseMetric

Alignment with explicit deferral guidance in metadata["should_abstain"].

For each labeled record, behaviour is aligned if and only if the model's abstention matches should_abstain: it abstains when the flag is True and answers when it is False. The primary value is the mean alignment over labeled rows. When no rows contain the key, the metric returns NaN and documents why in details.
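A minimal sketch of the alignment rule, assuming rows are given as (abstained, metadata) pairs. The function name and the exact keys placed in the returned details dict are hypothetical.

```python
import math


def deferral_alignment(rows):
    """rows: iterable of (abstained, metadata) pairs. Returns (value, details)."""
    labeled = [
        (abstained, bool(meta["should_abstain"]))
        for abstained, meta in rows
        if "should_abstain" in meta
    ]
    if not labeled:
        return math.nan, {"reason": "no rows carry metadata['should_abstain']"}
    aligned = sum(1 for abstained, should in labeled if abstained == should)
    return aligned / len(labeled), {"n_labeled": len(labeled)}


rows = [
    (True, {"should_abstain": True}),    # abstained as advised: aligned
    (False, {"should_abstain": True}),   # answered despite guidance: not aligned
    (False, {"should_abstain": False}),  # answered as advised: aligned
    (True, {}),                          # unlabeled row: ignored
]
value, details = deferral_alignment(rows)
empty_value, empty_details = deferral_alignment([(True, {})])
```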

default_abstention_metrics

default_abstention_metrics() -> list[BaseMetric]

Default bundle for a clinical safety-oriented table.

Calibration

calibration

krisis/metrics/calibration.py

Calibration metrics using the optional confidence field on EvaluationResult (krisis.metrics.base).

Abstentions are always excluded — there is no well-defined class probability mass to compare against the label.

ExpectedCalibrationError

Bases: BaseMetric

Expected Calibration Error (ECE) via equal-width bins on [0, 1].

Each sample contributes its stated confidence (the probability the model assigns to being correct, or the top-class score). Within bin k, the mean confidence is compared to the empirical accuracy; ECE is the sum of these absolute gaps, each weighted by the fraction of samples falling in that bin.

Lower is better. When no usable rows exist (all abstained or missing confidence), returns NaN.
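The binning procedure can be sketched in pure Python as follows. The (confidence, correct, abstained) tuple layout, the default of 10 bins, and placing a confidence of exactly 1.0 in the last bin are assumptions for this example rather than confirmed details of the Krisis implementation.

```python
import math


def expected_calibration_error(rows, n_bins=10):
    """rows: iterable of (confidence, correct, abstained) tuples."""
    usable = [
        (conf, correct)
        for conf, correct, abstained in rows
        if not abstained and conf is not None
    ]
    if not usable:
        return math.nan
    bins = [[] for _ in range(n_bins)]
    for conf, correct in usable:
        k = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[k].append((conf, correct))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, correct in members if correct) / len(members)
        # Weight each bin's gap by its share of the usable samples.
        ece += (len(members) / len(usable)) * abs(mean_conf - accuracy)
    return ece


rows = [
    (0.9, True, False),
    (0.9, False, False),  # the 0.9 bin: confidence 0.9 vs accuracy 0.5
    (0.6, True, False),
    (0.6, True, False),   # the 0.6 bin: confidence 0.6 vs accuracy 1.0
    (None, False, True),  # abstention, ignored
]
ece = expected_calibration_error(rows)  # 0.5 * 0.4 + 0.5 * 0.4
```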

BrierScore

Bases: BaseMetric

Brier score for binary {0, 1} labels only.

Uses confidence as the model's estimated probability for the positive class (label 1): if the prediction is 1, p = confidence; if the prediction is 0, p = 1 - confidence. If any row is outside the binary setting, returns NaN.

Lower is better. Abstentions are excluded.
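The conversion from stated confidence to a positive-class probability can be sketched as below, using (prediction, ground_truth, confidence, abstained) tuples. Skipping non-abstained rows that lack a confidence value, and returning NaN when no rows remain, are assumptions; the text above only specifies the NaN result for non-binary rows.

```python
import math


def brier_score(rows):
    """rows: iterable of (prediction, ground_truth, confidence, abstained) tuples."""
    squared_errors = []
    for pred, truth, conf, abstained in rows:
        if abstained:
            continue  # abstentions are excluded
        if pred not in (0, 1) or truth not in (0, 1):
            return math.nan  # outside the binary setting
        if conf is None:
            continue  # assumed: rows without confidence are skipped
        # Probability assigned to the positive class (label 1).
        p_positive = conf if pred == 1 else 1.0 - conf
        squared_errors.append((p_positive - truth) ** 2)
    if not squared_errors:
        return math.nan
    return sum(squared_errors) / len(squared_errors)


rows = [
    (1, 1, 0.9, False),       # p = 0.9, error (0.9 - 1)^2 = 0.01
    (0, 0, 0.8, False),       # p = 0.2, error (0.2 - 0)^2 = 0.04
    (1, 0, 0.6, False),       # p = 0.6, error (0.6 - 0)^2 = 0.36
    (None, 1, None, True),    # abstention, ignored
]
score = brier_score(rows)                          # (0.01 + 0.04 + 0.36) / 3
staging = brier_score([(3, 3, 0.7, False)])        # stage labels are not binary
```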

default_calibration_metrics

default_calibration_metrics() -> list[BaseMetric]

Standard calibration diagnostics for benchmark tables.

Default Bundle

default_benchmark_metrics

default_benchmark_metrics() -> list[BaseMetric]

Full default stack: accuracy, balanced accuracy, calibration (ECE, Brier), then abstention and deferral diagnostics (including selective accuracy).