Skip to content

Benchmark

Benchmark is the execution layer. It receives a suite, a backend, and a metric bundle, then produces a BenchmarkResult.

Constructor Controls

Parameter Default What it controls
suite required Data suite that produces PatientRecord rows
backend required Model backend used for inference
metrics None Optional custom metric list. None uses the default Krisis metric bundle
batch_size 8 Number of patient records sent to the backend in one provider call
max_concurrency 1 Number of backend batches allowed to run in parallel

Example:

result = Benchmark(
    suite,
    backend,
    batch_size=8,
    max_concurrency=2,
).run()

Batch Size vs Concurrency

batch_size and max_concurrency are separate controls.

Setting Example Meaning
batch_size=8 8 records One API call asks the model to evaluate 8 patient rows
max_concurrency=2 2 calls Krisis may run two batch calls at the same time

With batch_size=8 and max_concurrency=2, up to 16 records can be in flight. This reduces HTTP overhead and can improve throughput, but provider rate limits still apply.

Start conservative

Use batch_size=8 and max_concurrency=1 or 2 for first runs. Increase after checking provider rate limits and structured JSON reliability.

When a model returns empty JSON

Empty responses usually mean the provider returned no text or the output-token cap was too low for the requested batch. In the example CLI, raise --max-output-tokens; for direct backend use, raise the provider's token cap (max_completion_tokens, max_tokens, or max_output_tokens).

Batched JSON Fallback

Krisis asks backends to evaluate batches via backend.evaluate_batch(...).

If a provider returns malformed batched JSON, Benchmark does not immediately fail the whole run. It recursively splits the batch into smaller chunks. If the batch is already size 1, it falls back to backend.evaluate(...).

This protects long benchmark runs from one weak batched response while still using batching whenever the provider follows the requested format.

Retries are configured on backends

Benchmark controls batching and concurrency. Provider failure retries are configured on the backend with max_retries, retry_base_seconds, and retry_max_seconds.

Execution Metadata

Benchmark.run() stores operational details in BenchmarkResult.extras:

Field Meaning
batch_size Configured batch size
max_concurrency Configured concurrency
n_input_records Total patient records evaluated
n_api_batches Number of planned provider batches
elapsed_seconds Wall-clock runtime
records_per_second Evaluation throughput
input_tokens Total input tokens when provider usage is available
output_tokens Total output tokens when provider usage is available
token_total Total input + output tokens when provider usage is available
prompt_capture Where full prompt text is stored in full JSON
prompt_data_policy Whether patient data are included or redacted from captured prompts
prompt_modes Prompt invocation modes observed, such as single or batch
n_prompts_captured Number of result rows with prompt text
prompt_templates_count Number of unique redacted prompt templates used
prompt_templates Deduplicated redacted prompt templates used in the run

These fields appear in text reports and JSON reports.

Redacted prompt templates are stored in extras.prompt_templates. Row-level EvaluationResult.prompt also stores the redacted template used for that row.

Benchmark

Run a full evaluation pass.

With metrics=None, runs :func:krisis.metrics.default_benchmark_metrics (overall accuracy, balanced accuracy, ECE, Brier score, selective accuracy, abstention rate, and deferral alignment when should_abstain metadata is present).

Typical usage::

from krisis.benchmark import Benchmark
from krisis.results.report import format_report

suite = MyClinicalSuite(...)
backend = MyModelBackend(...)
run = Benchmark(
    suite,
    backend,
    batch_size=8,
    max_concurrency=2,
).run()
print(format_report(run))

Parameters:

Name Type Description Default
suite BaseDataSuite

data suite that produces PatientRecord rows.

required
backend BaseBackend

model backend used for inference.

required
metrics Sequence[BaseMetric] | None

optional custom metrics. When omitted, Krisis uses the default benchmark metric bundle.

None
batch_size int

number of patient records sent to the backend in one batch.

8
max_concurrency int

maximum number of backend batches evaluated in parallel.

1

run

run(
    records: list[PatientRecord] | None = None,
    *,
    suite_description: dict[str, Any] | None = None,
) -> BenchmarkResult

Execute the benchmark.

Parameters:

Name Type Description Default
records list[PatientRecord] | None

optional pre-built patient rows. When omitted, rows are produced via suite.load() (which may touch disk).

None
suite_description dict[str, Any] | None

optional override for reporting metadata. When omitted, suite.describe() is used after loading data.

None