Outputs

Krisis produces human-readable reports and JSON exports from the same benchmark result object. The goal is to make quick inspection easy while keeping structured data available for plots, tables, and preprint appendices.

Output Types

| Output | Best for |
| --- | --- |
| Text report | Terminal review, logs, quick sanity checks |
| Full JSON | Auditing every row, debugging backend responses, preserving raw outputs |
| Metrics-only JSON | Model comparison tables and plotting |

Text Report

Use format_report for terminal output:

from krisis.results.report import format_report

print(format_report(result))

Example shape:

Krisis benchmark
================
Backend: openai:gpt-5.5

Suite
-----
  domain: Chronic Kidney Disease (CKD)
  task: staging
  feature_set: full
  n_total_eval_records: 160

Metrics
-------
  Accuracy: 0.7125
  Abstention Rate: 0.2000
  Answer Rate (Coverage): 0.8000
  Selective Accuracy (answered only): 0.8906
  Deferral Alignment: 1.0000

Execution
---------
  batch_size: 8
  max_concurrency: 2
  elapsed_seconds: 42.18
  records_per_second: 3.79
  token_total: 14400

Text reports are intentionally compact. They are good for answering "did the run work?" and "what are the headline numbers?"
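
The headline numbers in the example shape are linked by simple arithmetic, which makes a quick sanity check possible. The sketch below assumes that abstained rows count as not correct for Accuracy; the example values are consistent with that assumption, but this page does not state it. The counts are back-calculated from the example report, not taken from a real run.

```python
# Values copied from the example report above.
n_total = 160
reported_accuracy = 0.7125
reported_abstention_rate = 0.2000

# Back-calculated counts (assumption: abstentions are scored as not correct).
n_answered = round(n_total * (1 - reported_abstention_rate))
n_correct = round(reported_accuracy * n_total)

coverage = n_answered / n_total
accuracy = n_correct / n_total
selective_accuracy = n_correct / n_answered

# Under this assumption, Accuracy = Coverage x Selective Accuracy.
print(f"coverage={coverage:.4f} accuracy={accuracy:.4f} "
      f"selective={selective_accuracy:.4f}")
```

If a report's Accuracy differs noticeably from coverage times selective accuracy, the scoring convention is probably different from the one assumed here.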

Full JSON

Use full JSON when you need row-level results:

from krisis.results.report import format_json_report

print(format_json_report(result, include_results=True))

Full JSON includes:

  • backend name
  • suite metadata
  • metric scores and details
  • execution metadata
  • every EvaluationResult
  • raw model responses
  • row-level metadata such as stage, eGFR, and should_abstain

Note: Full JSON can be large. It includes every evaluated row, so it is useful for audits and debugging but too noisy for most model comparison plots.

Metrics-Only JSON

Use metrics-only JSON for model comparison:

from krisis.results.report import format_metrics_json_report

print(format_metrics_json_report(result))

Metrics-only JSON keeps the important aggregate fields:

{
  "backend_name": "openai:gpt-5.5",
  "suite": {
    "task": "staging",
    "feature_set": "full",
    "n_total_eval_records": 160
  },
  "metrics": {
    "Accuracy": {"value": 0.7125},
    "Abstention Rate": {"value": 0.2},
    "Selective Accuracy (answered only)": {"value": 0.8906}
  },
  "execution": {
    "batch_size": 8,
    "max_concurrency": 2,
    "elapsed_seconds": 42.18,
    "records_per_second": 3.79,
    "token_total": 14400,
    "prompt_capture": "evaluation_results.prompt",
    "prompt_data_policy": "patient_data_redacted",
    "prompt_modes": ["batch"],
    "prompt_templates_count": 1,
    "prompt_templates": [
      {
        "prompt_mode": "batch",
        "prompt": [
          {
            "role": "system",
            "content": "[provider-facing instructions]"
          },
          {
            "role": "user",
            "content": "[BATCH_PATIENT_DATA_REDACTED]"
          }
        ]
      }
    ]
  }
}
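
A minimal sketch of flattening this structure into one comparison row. It assumes format_metrics_json_report returns a JSON string (the example above prints it, which suggests so); the literal below is a trimmed stand-in for that string so the snippet runs without Krisis installed.

```python
import json

# Stand-in for format_metrics_json_report(result), trimmed to a few fields.
metrics_json = """
{
  "backend_name": "openai:gpt-5.5",
  "suite": {"task": "staging", "feature_set": "full",
            "n_total_eval_records": 160},
  "metrics": {
    "Accuracy": {"value": 0.7125},
    "Abstention Rate": {"value": 0.2},
    "Selective Accuracy (answered only)": {"value": 0.8906}
  },
  "execution": {"elapsed_seconds": 42.18, "token_total": 14400}
}
"""

report = json.loads(metrics_json)

# Metric names contain spaces and parentheses, so they must be read with
# dictionary key lookup rather than attribute-style access.
row = {
    "backend": report["backend_name"],
    "task": report["suite"]["task"],
    **{name: entry["value"] for name, entry in report["metrics"].items()},
}

print(row)
```

One such row per model/task run is usually enough to build a comparison table.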

Execution Metadata

Reports include an execution block:

{
  "batch_size": 8,
  "max_concurrency": 2,
  "n_input_records": 160,
  "n_api_batches": 20,
  "elapsed_seconds": 42.18,
  "records_per_second": 3.79,
  "input_tokens": 12000,
  "output_tokens": 2400,
  "token_total": 14400,
  "prompt_capture": "evaluation_results.prompt",
  "prompt_data_policy": "patient_data_redacted",
  "prompt_modes": ["batch"],
  "n_prompts_captured": 160,
  "prompt_templates_count": 1,
  "prompt_templates": [
    {
      "prompt_mode": "batch",
      "prompt": [
        {
          "role": "system",
          "content": "[provider-facing instructions]"
        },
        {
          "role": "user",
          "content": "[BATCH_PATIENT_DATA_REDACTED]"
        }
      ]
    }
  ]
}

These fields let you compare model behavior and operational cost:

| Field | Meaning |
| --- | --- |
| batch_size | records per provider call |
| max_concurrency | parallel provider calls |
| n_input_records | total records passed to the benchmark |
| n_api_batches | number of provider batches executed |
| elapsed_seconds | wall-clock benchmark duration |
| records_per_second | throughput |
| input_tokens | total input tokens, where provider usage is available |
| output_tokens | total output tokens, where provider usage is available |
| token_total | combined token usage |
| prompt_capture | where full prompt text is stored in full JSON |
| prompt_data_policy | whether patient data are included or redacted from captured prompts |
| prompt_modes | prompt invocation modes observed, usually single, batch, or both |
| n_prompts_captured | number of row-level results that include prompt text |
| prompt_templates_count | number of unique redacted prompt templates used |
| prompt_templates | deduplicated redacted prompt templates used in the run |
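
Several of these fields can be cross-checked against each other before results go into a comparison. The sketch below is not part of Krisis; check_execution is a hypothetical helper, and the three relations it tests are inferred from the example values (they may not hold if, for instance, a run retries failed batches).

```python
import math

# Subset of the example execution block above.
execution = {
    "batch_size": 8,
    "n_input_records": 160,
    "n_api_batches": 20,
    "elapsed_seconds": 42.18,
    "records_per_second": 3.79,
    "input_tokens": 12000,
    "output_tokens": 2400,
    "token_total": 14400,
}


def check_execution(ex, tolerance=0.01):
    """Return a list of inconsistencies found (empty if none)."""
    problems = []

    # Assumed relation: one provider batch per batch_size records, rounded up.
    expected_batches = math.ceil(ex["n_input_records"] / ex["batch_size"])
    if ex["n_api_batches"] != expected_batches:
        problems.append(
            f"n_api_batches={ex['n_api_batches']}, expected {expected_batches}"
        )

    # Assumed relation: throughput is records divided by wall-clock seconds.
    throughput = ex["n_input_records"] / ex["elapsed_seconds"]
    if abs(throughput - ex["records_per_second"]) > tolerance:
        problems.append(f"records_per_second differs from {throughput:.2f}")

    # Token counts may be absent when the provider reports no usage.
    tokens_in, tokens_out = ex.get("input_tokens"), ex.get("output_tokens")
    if tokens_in is not None and tokens_out is not None:
        if tokens_in + tokens_out != ex["token_total"]:
            problems.append("token_total is not input_tokens + output_tokens")

    return problems


print(check_execution(execution))
```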

Prompt Audit Trail

Full JSON includes the provider-facing prompt template for each evaluated row, with patient data redacted:

from krisis.results.report import format_json_report

full = format_json_report(result, include_results=True)

Each row in evaluation_results can include:

| Field | Meaning |
| --- | --- |
| prompt | Provider-facing messages serialized as JSON, with patient data redacted |
| prompt_mode | single or batch |
| raw_response | Provider response for that row |

For batched calls, every row in the same provider batch carries the same redacted batch prompt. This is intentional: it lets you inspect the instructions and batched JSON shape when comparing reliability across providers, without storing the full patient payload repeatedly.

Aggregate JSON also includes deduplicated templates under execution.prompt_templates, so you can inspect the prompt shape without using include_results=True.
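
A minimal sketch of auditing row-level prompts. It assumes the full JSON has a top-level evaluation_results list and that each row's prompt field is a string holding serialized JSON messages, as the table above describes; the two rows below are hypothetical stand-ins for real output.

```python
import json
from collections import Counter

# Hypothetical redacted batch prompt, in the shape shown on this page.
batch_prompt = [
    {"role": "system", "content": "[provider-facing instructions]"},
    {"role": "user", "content": "[BATCH_PATIENT_DATA_REDACTED]"},
]

# Stand-in for format_json_report(result, include_results=True).
full = json.dumps({
    "evaluation_results": [
        {"prompt_mode": "batch", "prompt": json.dumps(batch_prompt),
         "raw_response": "placeholder"},
        {"prompt_mode": "batch", "prompt": json.dumps(batch_prompt),
         "raw_response": "placeholder"},
    ]
})

rows = json.loads(full)["evaluation_results"]

# Rows from the same provider batch carry the same redacted prompt,
# so counting identical prompt strings groups rows by template.
prompt_counts = Counter(row["prompt"] for row in rows if row.get("prompt"))

for prompt_text, n_rows in prompt_counts.items():
    messages = json.loads(prompt_text)  # the prompt field is itself JSON
    roles = [message["role"] for message in messages]
    print(f"{n_rows} rows share a prompt with roles {roles}")
```

The number of distinct prompts found this way should line up with execution.prompt_templates_count for the same run.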

Plotting Model Comparisons

For bar charts across models, the most useful fields are:

| Plot | JSON path |
| --- | --- |
| Overall accuracy | metrics.Accuracy.value |
| Balanced accuracy | metrics.Balanced Accuracy.value |
| Answered-only accuracy | metrics.Selective Accuracy (answered only).value |
| Abstention rate | metrics.Abstention Rate.value |
| Coverage | metrics.Answer Rate (Coverage).value |
| Deferral alignment | metrics.Deferral Alignment.value |
| Calibration error | metrics.Expected Calibration Error.value |
| Runtime | execution.elapsed_seconds |
| Throughput | execution.records_per_second |
| Token use | execution.token_total |
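
Because several metric names contain spaces and parentheses, the dotted paths above are easier to follow as key tuples than to parse by splitting on dots. The sketch below collects a few of these fields across runs; lookup and comparison_table are hypothetical helpers, and the second report is a made-up placeholder included only to show how a missing metric is handled.

```python
# A few of the paths from the table, written as key tuples.
PLOT_FIELDS = {
    "Overall accuracy": ("metrics", "Accuracy", "value"),
    "Answered-only accuracy":
        ("metrics", "Selective Accuracy (answered only)", "value"),
    "Abstention rate": ("metrics", "Abstention Rate", "value"),
    "Runtime": ("execution", "elapsed_seconds"),
    "Token use": ("execution", "token_total"),
}


def lookup(report, path):
    """Follow a key path through nested dicts; return None if a key is missing."""
    node = report
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def comparison_table(reports):
    """Map each backend name to its plot-ready values."""
    return {
        report["backend_name"]: {
            label: lookup(report, path) for label, path in PLOT_FIELDS.items()
        }
        for report in reports
    }


reports = [
    {   # values from the example on this page
        "backend_name": "openai:gpt-5.5",
        "metrics": {
            "Accuracy": {"value": 0.7125},
            "Abstention Rate": {"value": 0.2},
            "Selective Accuracy (answered only)": {"value": 0.8906},
        },
        "execution": {"elapsed_seconds": 42.18, "token_total": 14400},
    },
    {   # placeholder report with no metrics, to exercise the None path
        "backend_name": "hypothetical:model-b",
        "metrics": {},
        "execution": {"elapsed_seconds": 1.0, "token_total": 0},
    },
]

table = comparison_table(reports)
for backend, values in table.items():
    print(backend, values)
```

Returning None for a missing metric keeps one incomplete run from breaking the whole chart; the plotting code can then skip or flag that bar.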

Tip: Use grouped charts by task. Detection, staging, and progression measure different behaviors, so compare models within the same task instead of mixing all tasks into one score.

For reproducible comparisons, save:

  • metrics-only JSON for each model/task run
  • the full JSON for at least one audited run per model
  • the exact model ID
  • task name
  • feature set
  • seed
  • synthetic record count
  • batch size and max concurrency
  • output-token cap
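
One way to keep these items together is a small manifest file saved next to each metrics-only JSON. This is a sketch, not a Krisis feature: write_run_manifest, the file names, and the manifest keys are all hypothetical, and the seed and output-token cap in the usage example are placeholder values.

```python
import json
import tempfile
from pathlib import Path


def write_run_manifest(out_dir, *, metrics_json, model_id, task, feature_set,
                       seed, n_synthetic_records, batch_size,
                       max_concurrency, max_output_tokens):
    """Save metrics JSON plus the settings needed to reproduce the run."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Store the metrics-only JSON string unchanged.
    (out / "metrics.json").write_text(metrics_json, encoding="utf-8")

    manifest = {
        "model_id": model_id,
        "task": task,
        "feature_set": feature_set,
        "seed": seed,
        "n_synthetic_records": n_synthetic_records,
        "batch_size": batch_size,
        "max_concurrency": max_concurrency,
        "max_output_tokens": max_output_tokens,
        "metrics_file": "metrics.json",
    }
    manifest_path = out / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


# Usage with a temporary directory; a real run would use a per-run folder.
with tempfile.TemporaryDirectory() as tmp:
    path = write_run_manifest(
        tmp,
        metrics_json='{"metrics": {}}',  # stand-in for the real export
        model_id="openai:gpt-5.5",
        task="staging",
        feature_set="full",
        seed=0,                  # placeholder
        n_synthetic_records=160,
        batch_size=8,
        max_concurrency=2,
        max_output_tokens=512,   # placeholder
    )
    saved = json.loads(path.read_text(encoding="utf-8"))

print(saved["model_id"], saved["task"])
```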