Validation · Deployed-model QA

MAZE deployed-model validation — UK-AMo · universal_tempo

✓ Integration verified — universal_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = -0.0011).

Configuration

Aggregate metrics — published benchmark vs MAZE deployed

Metrics are computed over the flattened (N × 96) array; R² uses a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric   Benchmark   MAZE deployed   Δ         Mark
R²       0.5988      0.5977          -0.0011   ✓
RMSE     2.3184      2.3219          +0.0035   ✓
MAE      1.4517      1.4542          +0.0025   ✓
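
As a rough illustration of how the aggregate figures above could be computed, the sketch below flattens the (N × 96) arrays and scores R² against a single grand mean. The function names `aggregate_metrics` and `mark` are hypothetical and are not taken from the MAZE codebase, and the mark rule assumes the ±5% / ±15% thresholds are relative to the benchmark value, which this report does not state explicitly.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """Aggregate R2, RMSE and MAE over the flattened (N x H) arrays."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = t - p
    ss_res = float(np.sum(err ** 2))
    # Single grand-mean baseline: one mean for every sample and horizon.
    ss_tot = float(np.sum((t - t.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Assumed rule: thresholds are relative to the benchmark value."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"
```

Under that assumed rule, `mark(0.5988, 0.5977)` returns `"✓"`, since the relative difference is roughly 0.2%.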

The ±0.05 sanity bound on aggregate R² is enforced by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
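
The bound check itself is simple enough to sketch. The helpers below are hypothetical (the real test body in tests/test_validation.py is not shown in this report) and only illustrate the two pieces the sentence above describes: an opt-in environment flag and an absolute tolerance on aggregate R².

```python
import os

R2_BOUND = 0.05  # sanity bound on aggregate R2 stated in this report


def full_validation_enabled(env=None):
    """True when the opt-in flag is set to "1" (assumed flag semantics)."""
    env = os.environ if env is None else env
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def r2_within_bound(benchmark_r2, deployed_r2, bound=R2_BOUND):
    """Absolute difference between deployed and benchmark R2 is within bound."""
    return abs(deployed_r2 - benchmark_r2) <= bound
```

With the values reported here, `r2_within_bound(0.5988, 0.5977)` is `True`. In a pytest suite, the flag check would typically feed a `pytest.mark.skipif` condition so that the test is skipped unless the flag is set.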

Per-horizon metrics — supplementary

Per-horizon R² is computed against a per-step baseline, the variance of y_true[:, h], rather than the flattened grand mean. Values are shown at the standard reporting horizons.

h    R² (bench)   R² (MAZE)   Δ R²     RMSE (bench)   RMSE (MAZE)   MAE (bench)   MAE (MAZE)
1    0.8787       0.8805      +0.002   1.2748         1.2773        0.7878        0.7933
6    0.6967       0.6981      +0.001   2.0158         2.0082        1.2454        1.2586
12   0.7025       0.6954      -0.007   1.9966         2.0515        1.2345        1.2513
24   0.6831       0.6799      -0.003   2.0604         2.0598        1.2463        1.2353
48   0.5998       0.5945      -0.005   2.3156         2.3683        1.4540        1.4675
72   0.5421       0.5637      +0.022   2.4770         2.3919        1.5722        1.5446
96   0.4814       0.4738      -0.008   2.6362         2.6586        1.6968        1.7047
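
The per-horizon calculation can be sketched as follows. The function name `per_horizon_metrics` is hypothetical, and the sketch assumes the horizons in the table are 1-based, so that h = 96 maps to the last column (index 95) of an (N × 96) array.

```python
import numpy as np

# Reporting horizons taken from the table above (assumed 1-based).
REPORT_HORIZONS = (1, 6, 12, 24, 48, 72, 96)


def per_horizon_metrics(y_true, y_pred, horizons=REPORT_HORIZONS):
    """R2, RMSE and MAE per forecast step, each with its own baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rows = {}
    for h in horizons:
        t = y_true[:, h - 1]  # 1-based horizon -> 0-based column
        p = y_pred[:, h - 1]
        err = t - p
        # Per-step baseline: variance of the targets at this horizon only.
        ss_tot = np.sum((t - t.mean()) ** 2)
        rows[h] = {
            "R2": float(1.0 - np.sum(err ** 2) / ss_tot),
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
            "MAE": float(np.mean(np.abs(err))),
        }
    return rows
```

Because each step is scored against its own mean, these R² values generally do not average to the aggregate R², which uses one grand mean across all steps.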

Validation methodology

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint universal_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint universal_tempo --full --output validation/results/