Benchmark
Benchmark is the execution layer. It receives a suite, a backend, and a metric
bundle, then produces a BenchmarkResult.
Constructor Controls
| Parameter | Default | What it controls |
|---|---|---|
| suite | required | Data suite that produces PatientRecord rows |
| backend | required | Model backend used for inference |
| metrics | None | Optional custom metric list. None uses the default Krisis metric bundle |
| batch_size | 8 | Number of patient records sent to the backend in one provider call |
| max_concurrency | 1 | Number of backend batches allowed to run in parallel |
Example:

    result = Benchmark(
        suite,
        backend,
        batch_size=8,
        max_concurrency=2,
    ).run()

Batch Size vs Concurrency
batch_size and max_concurrency are separate controls.
| Setting | Example | Meaning |
|---|---|---|
| batch_size=8 | 8 records | One API call asks the model to evaluate 8 patient rows |
| max_concurrency=2 | 2 calls | Krisis may run two batch calls at the same time |
With batch_size=8 and max_concurrency=2, up to 16 records can be in flight.
This reduces HTTP overhead and can improve throughput, but provider rate limits
still apply.
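
The arithmetic behind the two controls can be sketched as follows. This is an illustration only, not Krisis code: `plan_batches` is a hypothetical helper, and it assumes batches are formed by slicing the records in order, so the planned batch count is the record count divided by `batch_size`, rounded up.

```python
import math


def plan_batches(n_records: int, batch_size: int, max_concurrency: int) -> dict[str, int]:
    """Hypothetical helper: estimate batch count and peak in-flight records."""
    if batch_size < 1 or max_concurrency < 1:
        raise ValueError("batch_size and max_concurrency must be >= 1")
    # Assumption: records are sliced in order, so the last batch may be short.
    n_api_batches = math.ceil(n_records / batch_size)
    # At most max_concurrency batches run at once, each holding batch_size records.
    max_in_flight = min(n_records, batch_size * max_concurrency)
    return {"n_api_batches": n_api_batches, "max_records_in_flight": max_in_flight}


plan = plan_batches(n_records=100, batch_size=8, max_concurrency=2)
print(plan)  # {'n_api_batches': 13, 'max_records_in_flight': 16}
```

Raising `batch_size` lowers the number of provider calls, while raising `max_concurrency` only changes how many of those calls overlap in time.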
Start conservative
Use batch_size=8 and max_concurrency=1 or 2 for first runs. Increase
after checking provider rate limits and structured JSON reliability.
When a model returns empty JSON
Empty responses usually mean the provider returned no text or the
output-token cap was too low for the requested batch. In the example CLI,
raise --max-output-tokens; for direct backend use, raise the provider's
token cap (max_completion_tokens, max_tokens, or max_output_tokens).
Batched JSON Fallback
Krisis asks backends to evaluate batches via backend.evaluate_batch(...).
If a provider returns malformed batched JSON, Benchmark does not immediately
fail the whole run. It recursively splits the batch into smaller chunks. If the
batch is already size 1, it falls back to backend.evaluate(...).
This protects long benchmark runs from one weak batched response while still using batching whenever the provider follows the requested format.
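
The fallback described above can be sketched roughly as below. This is a minimal illustration, not the Krisis implementation: `MalformedBatchError` and `evaluate_with_fallback` are hypothetical names, splitting in halves is an assumption (the source only says "smaller chunks"), and the two callables stand in for `backend.evaluate_batch(...)` and `backend.evaluate(...)`.

```python
from typing import Any, Callable, Sequence


class MalformedBatchError(Exception):
    """Hypothetical error: the provider returned unparseable batched JSON."""


def evaluate_with_fallback(
    records: Sequence[Any],
    evaluate_batch: Callable[[Sequence[Any]], Sequence[Any]],
    evaluate: Callable[[Any], Any],
) -> list[Any]:
    """Try the whole batch; on malformed output, split and retry the halves."""
    if not records:
        return []
    try:
        return list(evaluate_batch(records))
    except MalformedBatchError:
        if len(records) == 1:
            # Batch of one: fall back to the single-record path.
            return [evaluate(records[0])]
        mid = len(records) // 2  # assumption: split into two halves
        left = evaluate_with_fallback(records[:mid], evaluate_batch, evaluate)
        right = evaluate_with_fallback(records[mid:], evaluate_batch, evaluate)
        return left + right  # input order is preserved


# Toy backend that only handles batches of two records or fewer.
def flaky_batch(batch: Sequence[str]) -> list[str]:
    if len(batch) > 2:
        raise MalformedBatchError("could not parse batched JSON")
    return [r.upper() for r in batch]


def single(record: str) -> str:
    return record.upper()


results = evaluate_with_fallback(["a", "b", "c", "d", "e"], flaky_batch, single)
print(results)  # ['A', 'B', 'C', 'D', 'E']
```

In this sketch a single bad batch costs a few extra calls instead of the whole run, and the single-record path is used only when a batch of one still fails.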
Retries are configured on backends
Benchmark controls batching and concurrency. Provider failure retries are
configured on the backend with max_retries, retry_base_seconds, and
retry_max_seconds.
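
The source does not say how these three settings combine. One common reading, shown here purely as an assumption, is capped exponential backoff: the wait starts at `retry_base_seconds`, doubles on each attempt, and never exceeds `retry_max_seconds`. `retry_delays` is a hypothetical helper, not a Krisis function.

```python
def retry_delays(
    max_retries: int,
    retry_base_seconds: float,
    retry_max_seconds: float,
) -> list[float]:
    """Hypothetical schedule: base * 2**attempt, capped at retry_max_seconds."""
    return [
        min(retry_max_seconds, retry_base_seconds * (2 ** attempt))
        for attempt in range(max_retries)
    ]


print(retry_delays(max_retries=5, retry_base_seconds=1.0, retry_max_seconds=10.0))
# [1.0, 2.0, 4.0, 8.0, 10.0]
```

Check the backend's own documentation for the actual schedule before relying on these numbers.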
Execution Metadata
Benchmark.run() stores operational details in BenchmarkResult.extras:
| Field | Meaning |
|---|---|
| batch_size | Configured batch size |
| max_concurrency | Configured concurrency |
| n_input_records | Total patient records evaluated |
| n_api_batches | Number of planned provider batches |
| elapsed_seconds | Wall-clock runtime |
| records_per_second | Evaluation throughput |
| input_tokens | Total input tokens when provider usage is available |
| output_tokens | Total output tokens when provider usage is available |
| token_total | Total input + output tokens when provider usage is available |
| prompt_capture | Where full prompt text is stored in full JSON |
| prompt_data_policy | Whether patient data are included or redacted from captured prompts |
| prompt_modes | Prompt invocation modes observed, such as single or batch |
| n_prompts_captured | Number of result rows with prompt text |
| prompt_templates_count | Number of unique redacted prompt templates used |
| prompt_templates | Deduplicated redacted prompt templates used in the run |
These fields appear in text reports and JSON reports.
Redacted prompt templates are stored in extras.prompt_templates. Row-level
EvaluationResult.prompt also stores the redacted template used for that row.
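
As a rough illustration of how the throughput and token fields relate, the sketch below works on a plain dictionary that mimics a few of the fields listed above. It assumes `extras` behaves like a mapping, that `records_per_second` is `n_input_records` divided by `elapsed_seconds`, and that token fields are simply absent when the provider reports no usage. The values and the `summarize_extras` helper are invented for the example.

```python
from typing import Any, Mapping, Optional


def summarize_extras(extras: Mapping[str, Any]) -> dict[str, Optional[float]]:
    """Hypothetical helper: derive throughput and token totals from extras."""
    elapsed = extras.get("elapsed_seconds") or 0.0
    n_records = extras.get("n_input_records", 0)
    # Assumption: throughput is records divided by wall-clock seconds.
    throughput = n_records / elapsed if elapsed > 0 else None

    input_tokens = extras.get("input_tokens")
    output_tokens = extras.get("output_tokens")
    # Token totals are only meaningful when the provider reported usage.
    if input_tokens is None or output_tokens is None:
        token_total = None
    else:
        token_total = input_tokens + output_tokens
    return {"records_per_second": throughput, "token_total": token_total}


# Stand-in for BenchmarkResult.extras with made-up numbers.
example_extras = {
    "n_input_records": 100,
    "elapsed_seconds": 20.0,
    "input_tokens": 12000,
    "output_tokens": 3000,
}
summary = summarize_extras(example_extras)
print(summary)  # {'records_per_second': 5.0, 'token_total': 15000}
```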
Benchmark
Run a full evaluation pass.
With metrics=None, runs krisis.metrics.default_benchmark_metrics
(overall accuracy, balanced accuracy, ECE, Brier score, selective accuracy,
abstention rate, and deferral alignment when should_abstain metadata is present).
Typical usage:

    from krisis.benchmark import Benchmark
    from krisis.results.report import format_report

    suite = MyClinicalSuite(...)
    backend = MyModelBackend(...)
    run = Benchmark(
        suite,
        backend,
        batch_size=8,
        max_concurrency=2,
    ).run()
    print(format_report(run))

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| suite | BaseDataSuite | data suite that produces | required |
| backend | BaseBackend | model backend used for inference. | required |
| metrics | Sequence[BaseMetric] \| None | optional custom metrics. When omitted, Krisis uses the default benchmark metric bundle. | None |
| batch_size | int | number of patient records sent to the backend in one batch. | 8 |
| max_concurrency | int | maximum number of backend batches evaluated in parallel. | 1 |
run

    run(
        records: list[PatientRecord] | None = None,
        *,
        suite_description: dict[str, Any] | None = None,
    ) -> BenchmarkResult

Execute the benchmark.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| records | list[PatientRecord] \| None | optional pre-built patient rows. When omitted, rows are produced via | None |
| suite_description | dict[str, Any] \| None | optional override for reporting metadata. When omitted, | None |