# Minimal Reproduction: Decision-Aligned Evaluation of UQ

Paper: **Decision-Aligned Evaluation of Uncertainty Quantification**  
ArXiv: https://arxiv.org/abs/2606.26990  
Version checked: v1, submitted 2026-06-25.

## What I reproduced

The paper argues that common uncertainty-quantification metrics such as NLL, Brier score, ECE, and accuracy may not align with the utility of downstream decisions. It proposes decision-aligned / prior-weighted utility metrics as a better evaluation target when decisions have an implicit cost prior.

This is a **minimal toy reproduction**, not the full benchmark suite. I reproduced the core binary-decision phenomenon:

> A predictor that looks best under a generic UQ metric can be ranked differently from a predictor that gives better downstream decision utility under a task-specific cost prior.

## Setup

- Python 3 only; no external dependencies.
- Synthetic binary classification labels.
- Three probabilistic predictors:
  - `A_calibratedish`: generally reasonable probabilities.
  - `B_decision_oriented`: optimized for the chosen decision regime.
  - `C_underconfident`: perfect 0.5-threshold accuracy (mean 1.0000 over seeds) but poor decision utility under high false-positive costs.
- Downstream decision: choose positive action iff predicted probability `f > c`.
- Prior over the false-positive cost `c` (the same `c` that serves as the decision threshold above): uniform grid over `c ∈ [0.70, 0.85]`.
- Repeated over 30 seeds.
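
The decision rule and cost prior above can be sketched as follows. This is a minimal illustration, not the code in `reproduce_decision_alignment.py`: the cost model (a false positive costs `c`, a false negative costs `1 - c`, correct decisions cost nothing), the 16-point grid, and the names `decision_utility`, `prior_weighted_utility`, `sharp`, and `timid` are all assumptions made for this example.

```python
def decision_utility(probs, labels, c):
    """Mean utility of taking the positive action iff p > c.

    Assumed cost model (not taken from the paper): a false positive
    costs c, a false negative costs 1 - c, correct decisions cost 0.
    """
    total = 0.0
    for p, y in zip(probs, labels):
        act = p > c
        if act and y == 0:
            total -= c          # false positive
        elif not act and y == 1:
            total -= 1.0 - c    # false negative
    return total / len(labels)


def prior_weighted_utility(probs, labels, lo=0.70, hi=0.85, steps=16):
    """Average decision utility over a uniform grid of costs in [lo, hi].

    The grid resolution (steps) is an assumption of this sketch.
    """
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return sum(decision_utility(probs, labels, c) for c in grid) / len(grid)


# Hypothetical toy data: both predictors separate the classes at 0.5.
labels = [1, 1, 0, 0]
sharp = [0.95, 0.90, 0.10, 0.05]   # confident enough to act for every c in the grid
timid = [0.60, 0.60, 0.40, 0.40]   # never exceeds any c in [0.70, 0.85]

print(prior_weighted_utility(sharp, labels))
print(prior_weighted_utility(timid, labels))
```

Under these assumptions the `timid` predictor never takes the positive action, so every positive example becomes a false negative and its utility is strictly lower than that of `sharp`, even though both are perfectly accurate at a 0.5 threshold.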

## Key results

Aggregate rankings:

| Metric | Ranking |
|---|---|
| NLL, lower better | B > C > A |
| Brier, lower better | B > C > A |
| ECE, lower better | A > B > C |
| Accuracy@0.5, higher better | C > B > A |
| Prior-weighted utility, higher better | B > A > C |

Mean values:

| Model | NLL ↓ | Brier ↓ | ECE ↓ | Accuracy@0.5 ↑ | Prior-weighted utility ↑ |
|---|---:|---:|---:|---:|---:|
| A_calibratedish | 0.4811 | 0.1555 | 0.0957 | 0.7897 | -0.0856 |
| B_decision_oriented | 0.3233 | 0.0881 | 0.1508 | 0.9277 | -0.0641 |
| C_underconfident | 0.4353 | 0.1263 | 0.3414 | 1.0000 | -0.1056 |
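
For reference, the generic metrics in the table can be computed with the standard library along these lines. This is a sketch of the textbook definitions, not necessarily the exact variants used in the reproduction script; in particular, the ECE here bins the positive-class probability into 10 equal-width bins, which is an assumption, and the function names and toy data are hypothetical.

```python
import math


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def brier(probs, labels):
    """Mean squared error between probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def ece(probs, labels, n_bins=10):
    """Expected calibration error on the positive-class probability.

    Assumption: equal-width bins; other binning schemes give other values.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    total = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        total += len(members) / len(labels) * abs(mean_p - freq)
    return total


def accuracy_at_half(probs, labels):
    """Fraction of examples classified correctly with a 0.5 threshold."""
    hits = sum((p > 0.5) == (y == 1) for p, y in zip(probs, labels))
    return hits / len(labels)


# Hypothetical underconfident predictor: right side of 0.5, but never confident.
labels = [1, 1, 0, 0]
probs = [0.60, 0.60, 0.40, 0.40]
print(nll(probs, labels), brier(probs, labels), ece(probs, labels),
      accuracy_at_half(probs, labels))
```

On this toy input the predictor has perfect accuracy at 0.5 while its calibration error is large, which is the same qualitative pattern as `C_underconfident` in the table.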

Rank alignment with prior-weighted utility, mean Spearman over seeds:

- NLL: `0.50`
- Brier: `0.50`
- ECE: `0.50`
- Accuracy@0.5: `-0.50`
- Prior-weighted utility: `1.00` (trivially, since it is the reference ranking)
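
As a consistency check, the same coefficients follow from the mean values in the table above. The sketch below ranks the three models per metric and applies the no-ties Spearman formula `1 - 6 * sum(d^2) / (n * (n^2 - 1))`. It works on the aggregate means rather than per-seed values, so it only approximates the per-seed averaging; the helper names are hypothetical.

```python
def ranks(values, higher_is_better):
    """Return 1-based ranks (1 = best); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    out = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        out[i] = rank
    return out


def spearman(rank_a, rank_b):
    """Spearman rank correlation for two tie-free rank vectors."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Mean values from the table, in model order A, B, C.
means = {
    "NLL": ([0.4811, 0.3233, 0.4353], False),
    "Brier": ([0.1555, 0.0881, 0.1263], False),
    "ECE": ([0.0957, 0.1508, 0.3414], False),
    "Accuracy@0.5": ([0.7897, 0.9277, 1.0000], True),
}
utility_ranks = ranks([-0.0856, -0.0641, -0.1056], higher_is_better=True)

alignment = {
    name: spearman(ranks(values, higher), utility_ranks)
    for name, (values, higher) in means.items()
}
for name, rho in alignment.items():
    print(f"{name}: {rho:+.2f}")
```

With only three models, Spearman can take just the values 1, 0.5, -0.5, and -1, so a coefficient of 0.50 corresponds to a single adjacent swap relative to the utility ranking.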

## Interpretation

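In this toy setup, none of the generic metrics reproduces the utility ranking B > A > C, which is consistent with the paper's central point that such metrics can be misleading for downstream decisions. NLL and Brier agree with utility on the best model (B) but swap A and C, and ECE ranks A first. Accuracy@0.5 is especially pathological: the underconfident model gets perfect 0.5-threshold accuracy but is worst under the high false-positive-cost decision prior.

The prior-weighted utility metric aligns by construction with the chosen downstream utility, so its Spearman of `1.00` confirms the expected ranking rather than providing independent evidence.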

## Files

- `reproduce_decision_alignment.py` — dependency-free reproduction script.
- `summary.json` — aggregate metrics and rankings.
- `metrics_by_seed.csv` — raw per-seed model metrics.
- `rankings_by_seed.csv` — per-seed rankings.
- `decision_alignment_results.svg` — visual summary.

To rerun:

```bash
python3 reproductions/decision_aligned_uq_2606_26990/reproduce_decision_alignment.py
```