Results¶
Benchmark runs return a BenchmarkResult. It contains row-level evaluation
results, metric scores, suite metadata, backend name, and execution metadata.
Execution Extras¶
Operational fields live in BenchmarkResult.extras and are included in text and
JSON reports:
| Field | Meaning |
|---|---|
| batch_size | Configured batch size |
| max_concurrency | Configured maximum parallel backend batches |
| n_input_records | Total records evaluated |
| n_api_batches | Number of planned API batches |
| elapsed_seconds | Total wall-clock runtime |
| records_per_second | Throughput |
| input_tokens | Total input tokens when available |
| output_tokens | Total output tokens when available |
| token_total | Total token usage when available |
| prompt_capture | Where full prompt text is stored in full JSON |
| prompt_data_policy | Whether patient data are included or redacted from captured prompts |
| prompt_modes | Prompt invocation modes observed, such as single or batch |
| n_prompts_captured | Number of result rows with prompt text |
| prompt_templates_count | Number of unique redacted prompt templates used |
| prompt_templates | Deduplicated redacted prompt templates used in the run |
Example:
result = Benchmark(suite, backend, batch_size=8, max_concurrency=2).run()
print(result.extras["elapsed_seconds"])
print(result.extras["token_total"])
print(result.extras["prompt_templates"][0]["prompt"])
Full JSON includes row-level prompt, prompt_mode, and raw_response fields,
so prompt design and provider response quality can be audited together. The
prompt field redacts patient data and preserves the reusable instructions and
output format. Aggregate JSON also includes the deduplicated templates in
extras.prompt_templates.
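Because the token fields are only populated when available, code that reads extras should tolerate missing values. The sketch below works on a plain dictionary that mimics BenchmarkResult.extras. The helper name summarize_extras and the sample values are hypothetical, and it assumes an unavailable field is either absent or set to None.

```python
from typing import Any, Dict


def summarize_extras(extras: Dict[str, Any]) -> str:
    """Build a one-line run summary from an extras-like dictionary."""
    parts = [
        f"records={extras['n_input_records']}",
        f"batches={extras['n_api_batches']}",
        f"elapsed={extras['elapsed_seconds']:.2f}s",
        f"throughput={extras['records_per_second']:.2f}/s",
    ]
    # Token counts are documented as "when available", so read them defensively.
    token_total = extras.get("token_total")
    parts.append(f"tokens={token_total if token_total is not None else 'n/a'}")
    return " ".join(parts)


# Hypothetical values for illustration only; not output from a real run.
sample_extras = {
    "batch_size": 8,
    "max_concurrency": 2,
    "n_input_records": 40,
    "n_api_batches": 5,
    "elapsed_seconds": 12.5,
    "records_per_second": 3.2,
    "token_total": None,
}

print(summarize_extras(sample_extras))
```

With a real run, the same helper could be called as summarize_extras(result.extras).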
Result Container¶
BenchmarkResult dataclass ¶
All artefacts produced by Benchmark.run().
to_dict ¶
to_dict(*, include_results: bool = True) -> dict[str, Any]
Return a JSON-safe dictionary representation of the run.
include_results=False keeps only suite metadata and aggregate
scores, which is useful for compact logs.
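To illustrate the include_results switch, here is a minimal stand-in dataclass. MiniResult and its fields are hypothetical and far smaller than the real BenchmarkResult. The sketch only shows how a keyword-only flag can drop the row-level payload, not how the library implements it.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class MiniResult:
    """Hypothetical stand-in for BenchmarkResult; not the library's class."""

    suite: str
    scores: Dict[str, float]
    results: List[Dict[str, Any]] = field(default_factory=list)

    def to_dict(self, *, include_results: bool = True) -> Dict[str, Any]:
        # Suite metadata and aggregate scores are always present.
        payload: Dict[str, Any] = {"suite": self.suite, "scores": dict(self.scores)}
        if include_results:
            # Row-level results are the bulky part, so they are optional.
            payload["results"] = [dict(row) for row in self.results]
        return payload


run = MiniResult(
    suite="demo",
    scores={"accuracy": 0.75},
    results=[{"id": 1, "correct": True}, {"id": 2, "correct": False}],
)
compact = run.to_dict(include_results=False)
full = run.to_dict()
print(sorted(compact))       # keys kept in the compact form
print(len(full["results"]))  # number of rows kept in the full form
```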
metrics_to_dict ¶
metrics_to_dict() -> dict[str, Any]
Return only the JSON-safe aggregate metric scores.
to_json ¶
to_json(
*,
include_results: bool = True,
indent: int | None = 2,
sort_keys: bool = True,
) -> str
Return a strict JSON string for the benchmark run.
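The text here does not define "strict JSON". One common reading is output that contains no NaN or Infinity literals, which Python's json module emits by default. Under that assumption, a serializer could replace non-finite floats with null and pass allow_nan=False as a safety net. The helper names below are hypothetical and this is not necessarily how to_json works.

```python
import json
import math
from typing import Any, Dict, Optional


def _sanitize(value: Any) -> Any:
    """Recursively replace non-finite floats with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {str(k): _sanitize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [_sanitize(v) for v in value]
    return value


def to_strict_json(
    payload: Dict[str, Any], *, indent: Optional[int] = 2, sort_keys: bool = True
) -> str:
    # allow_nan=False makes json.dumps raise instead of emitting NaN/Infinity,
    # so any value missed by _sanitize fails loudly rather than silently.
    return json.dumps(
        _sanitize(payload), indent=indent, sort_keys=sort_keys, allow_nan=False
    )


print(to_strict_json({"f1": float("nan"), "accuracy": 0.75}, indent=None))
```

Output produced this way can be parsed by JSON readers that reject the NaN and Infinity extensions.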
metrics_to_json ¶
metrics_to_json(
*, indent: int | None = 2, sort_keys: bool = True
) -> str
Return strict JSON containing only aggregate metric scores.
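A metrics-only payload is convenient for comparing runs. The sketch below assumes the payload is a flat mapping from metric name to number, which the text here does not confirm. The metric names and values are hypothetical stand-ins for two metrics_to_json() outputs.

```python
import json
from typing import Dict


def metric_deltas(baseline_json: str, candidate_json: str) -> Dict[str, float]:
    """Return candidate minus baseline for numeric metrics present in both runs."""
    baseline = json.loads(baseline_json)
    candidate = json.loads(candidate_json)
    shared = sorted(set(baseline) & set(candidate))
    return {
        name: round(candidate[name] - baseline[name], 6)
        for name in shared
        if isinstance(baseline[name], (int, float))
        and isinstance(candidate[name], (int, float))
    }


# Hypothetical payloads for illustration only; not results from real runs.
baseline_json = '{"accuracy": 0.75, "f1": 0.5}'
candidate_json = '{"accuracy": 0.875, "f1": 0.5, "recall": 0.25}'
print(metric_deltas(baseline_json, candidate_json))
```

Metrics that appear in only one run, such as recall above, are skipped rather than treated as zero.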
Report Formatters¶
report ¶
krisis/results/report.py
Human-readable summaries of BenchmarkResult.
format_report ¶
format_report(run: BenchmarkResult) -> str
Return a compact multi-line text summary suitable for logs or papers.
format_json_report ¶
format_json_report(
run: BenchmarkResult, *, include_results: bool = True
) -> str
Return a strict JSON summary of a benchmark run.
format_metrics_json_report ¶
format_metrics_json_report(run: BenchmarkResult) -> str
Return strict JSON containing only aggregate metric scores.