Results¶

Benchmark runs return a BenchmarkResult, which contains row-level evaluation results, metric scores, suite metadata, the backend name, and execution metadata.

Execution Extras¶

Operational fields live in BenchmarkResult.extras and are included in text and JSON reports:

| Field | Meaning |
| --- | --- |
| `batch_size` | Configured batch size |
| `max_concurrency` | Configured maximum number of backend batches run in parallel |
| `n_input_records` | Total records evaluated |
| `n_api_batches` | Number of planned API batches |
| `elapsed_seconds` | Total wall-clock runtime, in seconds |
| `records_per_second` | Throughput, in records per second |
| `input_tokens` | Total input tokens, when available |
| `output_tokens` | Total output tokens, when available |
| `token_total` | Total token usage, when available |
| `prompt_capture` | Where the full prompt text is stored in the full JSON |
| `prompt_data_policy` | Whether patient data are included in or redacted from captured prompts |
| `prompt_modes` | Prompt invocation modes observed, such as `single` or `batch` |
| `n_prompts_captured` | Number of result rows with prompt text |
| `prompt_templates_count` | Number of unique redacted prompt templates used |
| `prompt_templates` | Deduplicated redacted prompt templates used in the run |

Example:

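# `suite` and `backend` are assumed to be constructed beforehand.
result = Benchmark(suite, backend, batch_size=8, max_concurrency=2).run()

print(result.extras["elapsed_seconds"])  # total wall-clock runtime
print(result.extras["token_total"])  # total token usage, when available
print(result.extras["prompt_templates"][0]["prompt"])  # first deduplicated redacted template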
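Because the token fields are only populated "when available", derived statistics are safer when they tolerate missing values. The sketch below works on a plain dictionary shaped like the fields in the table; the helper name `tokens_per_record` and the sample values are hypothetical, and it assumes an unavailable field is either absent or `None`.

```python
from typing import Any, Mapping, Optional


def tokens_per_record(extras: Mapping[str, Any]) -> Optional[float]:
    """Average total tokens per input record, or None if usage is unavailable."""
    total = extras.get("token_total")
    n_records = extras.get("n_input_records")
    # Treat a missing/None token total or a zero record count as "no answer".
    if total is None or not n_records:
        return None
    return total / n_records


# Hypothetical values; only the key names come from the table of extras.
example_extras = {
    "batch_size": 8,
    "max_concurrency": 2,
    "n_input_records": 40,
    "n_api_batches": 5,
    "elapsed_seconds": 20.0,
    "records_per_second": 2.0,
    "token_total": 12000,
}

print(tokens_per_record(example_extras))  # 300.0
print(tokens_per_record({"n_input_records": 40}))  # None
```

With a real run, the same helper could presumably be called as `tokens_per_record(result.extras)`.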

Full JSON includes row-level prompt, prompt_mode, and raw_response fields so that prompt design and provider response quality can be audited together. The prompt field redacts patient data and preserves the reusable instructions and output format. Aggregate JSON also includes the deduplicated templates in extras.prompt_templates.
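One way such an audit might look is sketched below. The payload is hypothetical: the top-level `"results"` key and all row values are assumptions made for illustration, and only the `prompt`, `prompt_mode`, and `raw_response` field names come from the description above. Check the actual JSON layout before relying on this.

```python
import json
from collections import Counter

# Hypothetical payload: the "results" key and the row values are assumptions.
full_json = json.dumps(
    {
        "results": [
            {"prompt": "Rate urgency 1-5. Case: [REDACTED]", "prompt_mode": "single", "raw_response": "3"},
            {"prompt": "Rate urgency 1-5. Case: [REDACTED]", "prompt_mode": "batch", "raw_response": ""},
            {"prompt": "Rate urgency 1-5. Case: [REDACTED]", "prompt_mode": "batch", "raw_response": "4"},
        ]
    }
)

rows = json.loads(full_json)["results"]

# Count how often each invocation mode was used.
mode_counts = Counter(row["prompt_mode"] for row in rows)

# Flag rows whose provider response is empty, so they can be inspected by hand.
empty_rows = [i for i, row in enumerate(rows) if not row["raw_response"].strip()]

print(dict(mode_counts))  # {'single': 1, 'batch': 2}
print(empty_rows)  # [1]
```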

Result Container¶

BenchmarkResult `dataclass` ¶

All artefacts produced by Benchmark.run().

to_dict ¶

to_dict(*, include_results: bool = True) -> dict[str, Any]

Return a JSON-safe dictionary representation of the run.

include_results=False keeps only suite metadata and aggregate scores, which is useful for compact logs.
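The effect of the keyword-only `include_results` flag can be illustrated with a toy stand-in. `ToyResult` below is hypothetical and is not the library's `BenchmarkResult`; its field names are assumptions chosen only to show the pattern of dropping row-level data for compact output.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToyResult:
    """Hypothetical stand-in for BenchmarkResult; not the library's class."""

    suite: str
    scores: dict[str, float]
    results: list[dict[str, Any]] = field(default_factory=list)

    def to_dict(self, *, include_results: bool = True) -> dict[str, Any]:
        # Suite metadata and aggregate scores are always present.
        data: dict[str, Any] = {"suite": self.suite, "scores": dict(self.scores)}
        # Row-level results are attached only on request.
        if include_results:
            data["results"] = [dict(row) for row in self.results]
        return data


toy = ToyResult(
    suite="demo",
    scores={"accuracy": 0.75},
    results=[{"id": 1, "correct": True}],
)
compact = toy.to_dict(include_results=False)
print(compact)  # {'suite': 'demo', 'scores': {'accuracy': 0.75}}
```

Because the flag is keyword-only, a call such as `toy.to_dict(False)` raises a `TypeError`, which keeps call sites self-describing.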

metrics_to_dict ¶

metrics_to_dict() -> dict[str, Any]

Return only the JSON-safe aggregate metric scores.

to_json ¶

to_json(
    *,
    include_results: bool = True,
    indent: int | None = 2,
    sort_keys: bool = True,
) -> str

Return a strict JSON string for the benchmark run.
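This page does not define "strict". One common reading, assumed here, is JSON that any standards-compliant parser accepts, which rules out the `NaN` and `Infinity` tokens that Python's `json.dumps` emits by default. The sketch below shows that general technique with the standard library only; the helper `to_strict` is hypothetical and is not the library's implementation.

```python
import json
import math
from typing import Any


def to_strict(value: Any) -> Any:
    """Recursively replace non-finite floats with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {str(key): to_strict(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_strict(item) for item in value]
    return value


# Hypothetical scores; an undefined metric often shows up as NaN.
scores = {"f1": float("nan"), "accuracy": 0.75}

# allow_nan=False makes json.dumps raise ValueError on any remaining NaN/Infinity.
text = json.dumps(to_strict(scores), allow_nan=False, indent=2, sort_keys=True)
print(text)
```

The `indent=2` and `sort_keys=True` arguments mirror the defaults in the `to_json` signature, which give stable, diff-friendly output.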

metrics_to_json ¶

metrics_to_json(
    *, indent: int | None = 2, sort_keys: bool = True
) -> str

Return strict JSON containing only aggregate metric scores.

Report Formatters¶

report ¶

krisis/results/report.py

Human-readable summaries of BenchmarkResult.

format_report ¶

format_report(run: BenchmarkResult) -> str

Return a compact multi-line text summary suitable for logs or papers.

format_json_report ¶

format_json_report(
    run: BenchmarkResult, *, include_results: bool = True
) -> str

Return a strict JSON summary of a benchmark run.
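Since both formatters return plain strings, persisting them is a matter of ordinary file I/O. The sketch below uses placeholder strings in place of real formatter output; the helper `write_reports` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_reports(text_report: str, json_report: str, out_dir: Path) -> tuple[Path, Path]:
    """Write the text and JSON reports side by side and return their paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    text_path = out_dir / "report.txt"
    json_path = out_dir / "report.json"
    text_path.write_text(text_report, encoding="utf-8")
    json_path.write_text(json_report, encoding="utf-8")
    return text_path, json_path


# Placeholder strings; with the library installed they would come from
# format_report(run) and format_json_report(run, include_results=False).
with tempfile.TemporaryDirectory() as tmp:
    text_path, json_path = write_reports(
        "suite: demo\n", '{"suite": "demo"}', Path(tmp) / "reports"
    )
    saved_json = json_path.read_text(encoding="utf-8")
    print(text_path.name, json_path.name)  # report.txt report.json
```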

format_metrics_json_report ¶

format_metrics_json_report(run: BenchmarkResult) -> str