Outputs

Krisis produces human-readable reports and JSON exports from the same benchmark result object. The goal is to make quick inspection easy while keeping structured data available for plots, tables, and preprint appendices.

Output Types

Output	Best for
Text report	Terminal review, logs, quick sanity checks
Full JSON	Auditing every row, debugging backend responses, preserving raw outputs
Metrics-only JSON	Model comparison tables and plotting

Text Report

Use `format_report` for terminal output:

from krisis.results.report import format_report

# `result` is the benchmark result object from a completed run.
print(format_report(result))

Example shape:

Krisis benchmark
================
Backend: api:openai/gpt-5.5

Suite
-----
  domain: Chronic Kidney Disease (CKD)
  task: staging
  feature_set: full
  n_total_eval_records: 160

Metrics
-------
  Accuracy: 0.7125
  Abstention Rate: 0.2000
  Answer Rate (Coverage): 0.8000
  Selective Accuracy (answered only): 0.8906
  Deferral Alignment: 1.0000

Execution
---------
  batch_size: 8
  max_concurrency: 2
  elapsed_seconds: 42.18
  records_per_second: 3.79
  token_total: 14400

Text reports are intentionally compact. They are good for answering "did the run work?" and "what are the headline numbers?"
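
The headline numbers in the example report above are arithmetically related, which gives a quick sanity check. The sketch below assumes that coverage is one minus the abstention rate and that abstained rows count as incorrect in overall accuracy. Both assumptions match the example figures but are not confirmed metric definitions, and the variable names are illustrative.

```python
import math

# Headline values copied from the example text report.
accuracy = 0.7125
abstention_rate = 0.2000
answer_rate = 0.8000
selective_accuracy = 0.8906

# Assumption: coverage is the complement of the abstention rate.
coverage_ok = math.isclose(answer_rate, 1.0 - abstention_rate, abs_tol=1e-4)

# Assumption: abstained rows count as incorrect, so overall accuracy
# equals coverage times answered-only accuracy (up to report rounding).
accuracy_ok = math.isclose(accuracy, answer_rate * selective_accuracy, abs_tol=1e-3)

print(coverage_ok, accuracy_ok)
```

If either check fails on a real run, the metric definitions probably differ from these assumptions, so treat a failure as a prompt to read the metric details, not as proof of a bug.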

Full JSON

Use full JSON when you need row-level results:

from krisis.results.report import format_json_report

print(format_json_report(result, include_results=True))

Full JSON includes:

backend name
suite metadata
metric scores and details
execution metadata
every `EvaluationResult`
raw model responses
row-level metadata such as `stage`, `eGFR`, and `should_abstain`

Note: Full JSON can be large. It includes every evaluated row, which makes it useful for audits and debugging but too noisy for most model comparison plots.
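
Because full JSON carries every row, it is often easier to filter it in code than to read it. The sketch below parses a small hand-written stand-in for the full JSON output. The `evaluation_results` key and the `stage`, `eGFR`, and `should_abstain` names come from this page; the `metadata` nesting and all sample values are assumptions for illustration only.

```python
import json

# Hand-written stand-in for format_json_report(result, include_results=True).
# The exact row layout is an assumption; only the field names are documented.
full_json = """
{
  "backend_name": "api:openai/gpt-5.5",
  "evaluation_results": [
    {"raw_response": "stage 3", "metadata": {"stage": 3, "eGFR": 45.0, "should_abstain": false}},
    {"raw_response": "abstain", "metadata": {"stage": null, "eGFR": null, "should_abstain": true}},
    {"raw_response": "stage 2", "metadata": {"stage": 2, "eGFR": 72.5, "should_abstain": false}}
  ]
}
"""

report = json.loads(full_json)
rows = report["evaluation_results"]

# Keep only the rows where abstaining was the expected behavior.
abstain_rows = [r for r in rows if r["metadata"]["should_abstain"]]

print(f"{len(rows)} rows, {len(abstain_rows)} expected abstentions")
```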

Metrics-Only JSON

Use metrics-only JSON for model comparison:

from krisis.results.report import format_metrics_json_report

print(format_metrics_json_report(result))

Metrics-only JSON keeps the aggregate fields (backend, suite, metrics, and execution metadata) and leaves out the row-level results:

{
  "backend_name": "api:openai/gpt-5.5",
  "suite": {
    "task": "staging",
    "feature_set": "full",
    "n_total_eval_records": 160
  },
  "metrics": {
    "accuracy": {"value": 0.7125},
    "abstention_rate": {"value": 0.2},
    "selective_accuracy": {"value": 0.8906}
  },
  "execution": {
    "batch_size": 8,
    "max_concurrency": 2,
    "elapsed_seconds": 42.18,
    "records_per_second": 3.79,
    "token_total": 14400,
    "prompt_capture": "evaluation_results.prompt",
    "prompt_data_policy": "patient_data_redacted",
    "prompt_modes": ["batch"],
    "prompt_templates_count": 1,
    "prompt_templates": [
      {
        "prompt_mode": "batch",
        "prompt": [
          {
            "role": "system",
            "content": "[provider-facing instructions]"
          },
          {
            "role": "user",
            "content": "[BATCH_PATIENT_DATA_REDACTED]"
          }
        ]
      }
    ]
  }
}
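
For comparison tables it usually helps to flatten each metrics-only report into one row. The sketch below does that for a trimmed copy of the example above, using only the standard library. The `to_row` helper is hypothetical and not part of Krisis.

```python
import json

# Trimmed copy of the metrics-only example.
metrics_json = """
{
  "backend_name": "api:openai/gpt-5.5",
  "suite": {"task": "staging", "feature_set": "full", "n_total_eval_records": 160},
  "metrics": {
    "accuracy": {"value": 0.7125},
    "abstention_rate": {"value": 0.2},
    "selective_accuracy": {"value": 0.8906}
  },
  "execution": {"elapsed_seconds": 42.18, "token_total": 14400}
}
"""

def to_row(report: dict) -> dict:
    """Flatten one metrics-only report into a single comparison-table row."""
    row = {
        "backend": report["backend_name"],
        "task": report["suite"]["task"],
        "feature_set": report["suite"]["feature_set"],
    }
    # Each metric is stored as {"value": ...}; lift the number to the top level.
    for name, payload in report["metrics"].items():
        row[name] = payload["value"]
    row["elapsed_seconds"] = report["execution"]["elapsed_seconds"]
    row["token_total"] = report["execution"]["token_total"]
    return row

row = to_row(json.loads(metrics_json))
print(row)
```

A list of such rows, one per model and task, can be handed to whatever table or plotting tool you already use.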

Execution Metadata

Reports include an execution block:

{
  "batch_size": 8,
  "max_concurrency": 2,
  "n_input_records": 160,
  "n_api_batches": 20,
  "elapsed_seconds": 42.18,
  "records_per_second": 3.79,
  "input_tokens": 12000,
  "output_tokens": 2400,
  "token_total": 14400,
  "prompt_capture": "evaluation_results.prompt",
  "prompt_data_policy": "patient_data_redacted",
  "prompt_modes": ["batch"],
  "n_prompts_captured": 160,
  "prompt_templates_count": 1,
  "prompt_templates": [
    {
      "prompt_mode": "batch",
      "prompt": [
        {
          "role": "system",
          "content": "[provider-facing instructions]"
        },
        {
          "role": "user",
          "content": "[BATCH_PATIENT_DATA_REDACTED]"
        }
      ]
    }
  ]
}

These fields let you compare model behavior and operational cost:

Field	Meaning
`batch_size`	records per provider call
`max_concurrency`	parallel provider calls
`n_input_records`	total records passed to the benchmark
`n_api_batches`	number of provider batches executed
`elapsed_seconds`	wall-clock benchmark duration
`records_per_second`	throughput
`input_tokens`	total input tokens, where provider usage is available
`output_tokens`	total output tokens, where provider usage is available
`token_total`	combined token usage
`prompt_capture`	where full prompt text is stored in full JSON
`prompt_data_policy`	whether patient data are included or redacted from captured prompts
`prompt_modes`	prompt invocation modes observed, usually `single`, `batch`, or both
`n_prompts_captured`	number of row-level results that include prompt text
`prompt_templates_count`	number of unique redacted prompt templates used
`prompt_templates`	deduplicated redacted prompt templates used in the run
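
Several useful cost figures can be derived from these fields. The sketch below works from the example execution block above. The ceiling-division relationship between records and batches is an assumption that happens to match the example, and the derived names (`tokens_per_record`, `seconds_per_batch`) are illustrative, not Krisis fields.

```python
import math

# Execution block values copied from the example.
execution = {
    "batch_size": 8,
    "n_input_records": 160,
    "n_api_batches": 20,
    "elapsed_seconds": 42.18,
    "input_tokens": 12000,
    "output_tokens": 2400,
    "token_total": 14400,
}

# Assumption: batches are filled in order, so the count is a ceiling division.
expected_batches = math.ceil(execution["n_input_records"] / execution["batch_size"])

# Average token cost per evaluated record.
tokens_per_record = execution["token_total"] / execution["n_input_records"]

# Wall-clock time per batch. With max_concurrency > 1 calls overlap,
# so this is not the latency of a single provider call.
seconds_per_batch = execution["elapsed_seconds"] / execution["n_api_batches"]

# Recompute throughput to cross-check records_per_second.
throughput = execution["n_input_records"] / execution["elapsed_seconds"]

print(expected_batches, tokens_per_record, round(seconds_per_batch, 3), round(throughput, 2))
```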

Prompt Audit Trail

Full JSON includes the provider-facing prompt template for each evaluated row, with patient data redacted:

from krisis.results.report import format_json_report

full = format_json_report(result, include_results=True)

Each row in `evaluation_results` can include:

Field	Meaning
`prompt`	Provider-facing messages serialized as JSON, with patient data redacted
`prompt_mode`	`single` or `batch`
`raw_response`	Provider response for that row

For batched calls, every row in the same provider batch carries the same redacted batch prompt. This is intentional: it lets you inspect the instructions and batched JSON shape when comparing reliability across providers, without storing the full patient payload repeatedly.

Aggregate JSON also includes deduplicated templates under `execution.prompt_templates`, so you can inspect the prompt shape without using `include_results=True`.
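
One way to audit the trail is to count how many rows carry a prompt and how many distinct templates they share, then compare those counts with `n_prompts_captured` and `prompt_templates_count`. The sketch below uses hand-written rows as a stand-in for `evaluation_results`; the row contents, including the row without a prompt, are assumptions for illustration.

```python
import json

# Per the field table, "prompt" holds provider-facing messages serialized as JSON.
batch_prompt = json.dumps([
    {"role": "system", "content": "[provider-facing instructions]"},
    {"role": "user", "content": "[BATCH_PATIENT_DATA_REDACTED]"},
])

# Hand-written rows standing in for evaluation_results from full JSON.
rows = [
    {"prompt": batch_prompt, "prompt_mode": "batch", "raw_response": "..."},
    {"prompt": batch_prompt, "prompt_mode": "batch", "raw_response": "..."},
    {"prompt": None, "prompt_mode": "batch", "raw_response": "..."},
]

# Rows that actually carry prompt text (compare with n_prompts_captured).
captured = [r for r in rows if r.get("prompt")]

# Distinct (mode, prompt) pairs (compare with prompt_templates_count).
templates = {(r["prompt_mode"], r["prompt"]) for r in captured}

# Decode one template to inspect the message roles.
roles = [m["role"] for m in json.loads(captured[0]["prompt"])]

print(len(captured), len(templates), roles)
```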

Plotting Model Comparisons

For bar charts across models, the most useful fields are:

Plot	JSON path
Overall accuracy	`metrics.accuracy.value`
Balanced accuracy	`metrics.balanced_accuracy.value`
Answered-only accuracy	`metrics.selective_accuracy.value`
Abstention rate	`metrics.abstention_rate.value`
Coverage	`metrics.answer_rate.value`
Deferral alignment	`metrics.deferral_alignment.value`
Calibration error	`metrics.expected_calibration_error.value`
Runtime	`execution.elapsed_seconds`
Throughput	`execution.records_per_second`
Token use	`execution.token_total`

Tip: Use grouped charts by task. Detection, staging, and progression measure different behaviors, so compare models within the same task instead of mixing all tasks into one score.
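
The JSON paths in the table can be resolved with a small dotted-path lookup, and grouping by task before plotting keeps each chart to one task. The sketch below is a minimal illustration: `get_path` is a hypothetical helper, and the model names and scores are made-up placeholders, not benchmark results.

```python
from collections import defaultdict
from functools import reduce

def get_path(doc: dict, dotted: str):
    """Follow a dotted JSON path such as 'metrics.accuracy.value'."""
    return reduce(lambda node, key: node[key], dotted.split("."), doc)

# Hypothetical metrics-only reports for two models on two tasks.
reports = [
    {"backend_name": "model-a", "suite": {"task": "staging"}, "metrics": {"accuracy": {"value": 0.71}}},
    {"backend_name": "model-b", "suite": {"task": "staging"}, "metrics": {"accuracy": {"value": 0.65}}},
    {"backend_name": "model-a", "suite": {"task": "detection"}, "metrics": {"accuracy": {"value": 0.90}}},
]

# Group by task first so each chart compares models on the same task.
by_task = defaultdict(dict)
for report in reports:
    task = get_path(report, "suite.task")
    by_task[task][report["backend_name"]] = get_path(report, "metrics.accuracy.value")

for task, scores in by_task.items():
    print(task, scores)
```

Each entry in `by_task` then maps directly onto one group of bars, with one bar per model.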

Recommended Artifacts To Save

For reproducible comparisons, save:

metrics-only JSON for each model/task run
the full JSON for at least one audited run per model
the exact model ID
task name
feature set
seed
synthetic record count
batch size and max concurrency
output-token cap
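
One way to keep these items together is a small manifest file saved next to the JSON reports. The sketch below is only an illustration: the key names, file names, seed, and output-token cap are hypothetical placeholders, and Krisis does not necessarily read or write such a file.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical manifest; key names are illustrative, values are placeholders.
manifest = {
    "model_id": "api:openai/gpt-5.5",
    "task": "staging",
    "feature_set": "full",
    "seed": 0,
    "n_synthetic_records": 160,
    "batch_size": 8,
    "max_concurrency": 2,
    "max_output_tokens": 512,
    "metrics_json": "metrics.json",
    "full_json": "full.json",
}

# Write the manifest next to the saved reports (a temp dir is used here).
out_dir = Path(tempfile.mkdtemp())
manifest_path = out_dir / "manifest.json"
manifest_path.write_text(json.dumps(manifest, indent=2, sort_keys=True))

# Reading it back confirms that the file round-trips cleanly.
loaded = json.loads(manifest_path.read_text())
print(sorted(loaded))
```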