MAZE deployed-model validation — SE-Htm · universal_tempo

✓ Integration verified — universal_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = -0.0010).

Configuration

Aggregate metrics — published benchmark vs MAZE deployed

All three metrics are computed over the flattened (N × 96) array; R² is taken against a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric  Benchmark  MAZE deployed  Δ        Mark
R²      0.7281     0.7271         -0.0010  ✓
RMSE    3.7935     3.7998         +0.0063  ✓
MAE     2.6019     2.6047         +0.0028  ✓
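
A minimal sketch of how the aggregate rows could be computed, assuming targets and predictions are NumPy arrays of shape (N, 96). The names `aggregate_metrics` and `mark` are hypothetical and not taken from the MAZE codebase. Reading the mark thresholds as relative differences against the benchmark value is also an assumption.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """R², RMSE and MAE over the flattened array (grand-mean R² baseline)."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = p - t
    ss_res = float(np.sum(err ** 2))
    # One baseline for all horizons: the mean of every target value.
    ss_tot = float(np.sum((t - t.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Assumption: thresholds apply to |deployed - benchmark| / |benchmark|."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"
```

Under that reading, `mark(0.7281, 0.7271)` returns ✓, because the relative difference is roughly 0.14%.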

The ±0.05 sanity bound on aggregate R² is enforced by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when that test is run with MAZE_RUN_FULL_VALIDATION=1.
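
The gate can be pictured as two small checks: one on the environment variable and one on the absolute R² difference. The sketch below is illustrative only. The helper names are hypothetical, and how the actual test reads MAZE_RUN_FULL_VALIDATION is an assumption.

```python
import os

R2_BOUND = 0.05  # absolute tolerance on aggregate R², as stated in this report


def full_validation_enabled(environ=None):
    """True when MAZE_RUN_FULL_VALIDATION=1 (assumed opt-in convention)."""
    env = os.environ if environ is None else environ
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def r2_within_bound(benchmark_r2, deployed_r2, bound=R2_BOUND):
    """True when the deployed R² is within ±bound of the benchmark R²."""
    return abs(deployed_r2 - benchmark_r2) <= bound
```

In a pytest suite, a condition like `not full_validation_enabled()` would typically be passed to `pytest.mark.skipif`, so the expensive check is skipped unless explicitly requested. With the figures reported here, `r2_within_bound(0.7281, 0.7271)` holds.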

Per-horizon metrics — supplementary

Per-horizon R² is computed against the variance of y_true[:, h] at each step h, not against the flattened grand mean. Values are shown at the standard reporting horizons.

h   R² (bench)  R² (MAZE)  Δ R²    RMSE (bench)  RMSE (MAZE)  MAE (bench)  MAE (MAZE)
1   0.8864      0.8816     -0.005  2.4346        2.4788       1.5743       1.5911
6   0.7778      0.7796     +0.002  3.4071        3.3791       2.2463       2.2339
12  0.7745      0.7702     -0.004  3.4338        3.5233       2.2961       2.3143
24  0.7478      0.7487     +0.001  3.6428        3.5932       2.4267       2.3959
48  0.7297      0.7166     -0.013  3.7898        3.8858       2.5971       2.6701
72  0.7056      0.6939     -0.012  3.9612        4.0544       2.7437       2.7883
96  0.6813      0.6901     +0.009  4.1320        4.0814       2.9204       2.8869
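
The per-horizon rows could be produced along the following lines. This is a sketch under stated assumptions: the function name `per_horizon_metrics` is hypothetical, inputs are NumPy arrays of shape (N, 96), and horizon h is treated as a 1-based step stored in column h - 1.

```python
import numpy as np


def per_horizon_metrics(y_true, y_pred, horizons):
    """R², RMSE and MAE per horizon, each with its own per-step baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rows = {}
    for h in horizons:
        t = y_true[:, h - 1]  # assumption: h is 1-based, columns are 0-based
        p = y_pred[:, h - 1]
        err = p - t
        ss_res = float(np.sum(err ** 2))
        # Baseline is the mean of this horizon only, not the grand mean.
        ss_tot = float(np.sum((t - t.mean()) ** 2))
        rows[h] = {
            "R2": 1.0 - ss_res / ss_tot,
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
            "MAE": float(np.mean(np.abs(err))),
        }
    return rows
```

Because each horizon has its own baseline, these R² values are not directly comparable to the aggregate R², whose denominator is the spread of all target values around a single grand mean.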

Validation methodology

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint universal_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint universal_tempo --full --output validation/results/