Benchmark¶

Benchmark is the execution layer. It receives a suite, a backend, and a metric bundle, then produces a BenchmarkResult.

Constructor Controls¶

Parameter	Default	What it controls
`suite`	required	Data suite that produces `PatientRecord` rows
`backend`	required	Model backend used for inference
`metrics`	`None`	Optional custom metric list. `None` uses the default Krisis metric bundle
`batch_size`	`8`	Number of patient records sent to the backend in one provider call
`max_concurrency`	`1`	Number of backend batches allowed to run in parallel

Example:

result = Benchmark(
    suite,
    backend,
    batch_size=8,
    max_concurrency=2,
).run()

Batch Size vs Concurrency¶

batch_size and max_concurrency are separate controls.

Setting	Example	Meaning
`batch_size=8`	8 records	One API call asks the model to evaluate 8 patient rows
`max_concurrency=2`	2 calls	Krisis may run two batch calls at the same time

With batch_size=8 and max_concurrency=2, up to 16 records can be in flight. This reduces HTTP overhead and can improve throughput, but provider rate limits still apply.

Start conservative

Use batch_size=8 and max_concurrency=1 or 2 for first runs. Increase after checking provider rate limits and structured JSON reliability.

When a model returns empty JSON

Empty responses usually mean the provider returned no text or the output-token cap was too low for the requested batch. In the example CLI, raise --max-output-tokens; for direct backend use, raise the provider's token cap (max_completion_tokens, max_tokens, or max_output_tokens).

Batched JSON Fallback¶

Krisis asks backends to evaluate batches via backend.evaluate_batch(...).

If a provider returns malformed batched JSON, Benchmark does not immediately fail the whole run. It recursively splits the batch into smaller chunks. If the batch is already size 1, it falls back to backend.evaluate(...).

This protects long benchmark runs from one weak batched response while still using batching whenever the provider follows the requested format.

Retries are configured on backends

Benchmark controls batching and concurrency. Provider failure retries are configured on the backend with max_retries, retry_base_seconds, and retry_max_seconds.

Execution Metadata¶

Benchmark.run() stores operational details in BenchmarkResult.extras:

Field	Meaning
`batch_size`	Configured batch size
`max_concurrency`	Configured concurrency
`n_input_records`	Total patient records evaluated
`n_api_batches`	Number of planned provider batches
`elapsed_seconds`	Wall-clock runtime
`records_per_second`	Evaluation throughput
`input_tokens`	Total input tokens when provider usage is available
`output_tokens`	Total output tokens when provider usage is available
`token_total`	Total input + output tokens when provider usage is available
`prompt_capture`	Where full prompt text is stored in full JSON
`prompt_data_policy`	Whether patient data are included or redacted from captured prompts
`prompt_modes`	Prompt invocation modes observed, such as `single` or `batch`
`n_prompts_captured`	Number of result rows with prompt text
`prompt_templates_count`	Number of unique redacted prompt templates used
`prompt_templates`	Deduplicated redacted prompt templates used in the run

These fields appear in text reports and JSON reports.

Redacted prompt templates are stored in extras.prompt_templates. Row-level EvaluationResult.prompt also stores the redacted template used for that row.

Benchmark ¶

Run a full evaluation pass.

With metrics=None, runs :func:krisis.metrics.default_benchmark_metrics (overall accuracy, balanced accuracy, ECE, Brier score, selective accuracy, abstention rate, and deferral alignment when should_abstain metadata is present).

Typical usage::

from krisis.benchmark import Benchmark
from krisis.results.report import format_report

suite = MyClinicalSuite(...)
backend = MyModelBackend(...)
run = Benchmark(
    suite,
    backend,
    batch_size=8,
    max_concurrency=2,
).run()
print(format_report(run))

Parameters:

Name	Type	Description	Default
`suite`	`BaseDataSuite`	data suite that produces `PatientRecord` rows.	required
`backend`	`BaseBackend`	model backend used for inference.	required
`metrics`	`Sequence[BaseMetric] \| None`	optional custom metrics. When omitted, Krisis uses the default benchmark metric bundle.	`None`
`batch_size`	`int`	number of patient records sent to the backend in one batch.	`8`
`max_concurrency`	`int`	maximum number of backend batches evaluated in parallel.	`1`

run ¶

run(
    records: list[PatientRecord] | None = None,
    *,
    suite_description: dict[str, Any] | None = None,
) -> BenchmarkResult

Execute the benchmark.

Parameters:

Name	Type	Description	Default
`records`	`list[PatientRecord] \| None`	optional pre-built patient rows. When omitted, rows are produced via `suite.load()` (which may touch disk).	`None`
`suite_description`	`dict[str, Any] \| None`	optional override for reporting metadata. When omitted, `suite.describe()` is used after loading data.	`None`