Validation · Deployed-model QA

MAZE deployed-model validation — UK-AMo · forest_tempo

✓ Integration verified — forest_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = +0.0001).

Configuration

Aggregate metrics — published benchmark vs MAZE deployed

All three metrics are computed over the flattened (N × 96) array; R² uses a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric Benchmark MAZE deployed Δ Mark
R² 0.5418 0.5419 +0.0001 ✓
RMSE 2.4777 2.4776 -0.0001 ✓
MAE 1.5025 1.5033 +0.0008 ✓
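
The aggregate computation described above can be sketched as follows. This is a minimal illustration, not the code in scripts/run_validation.py: the function names `aggregate_metrics` and `mark` are hypothetical, and the sketch assumes the ±5% / ±15% thresholds are relative to the benchmark value, which the report does not state.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """R², RMSE and MAE over the flattened (N × H) arrays."""
    yt = np.asarray(y_true, dtype=float).ravel()
    yp = np.asarray(y_pred, dtype=float).ravel()
    err = yt - yp
    ss_res = np.sum(err ** 2)
    # Single grand-mean baseline: one mean over every sample and horizon.
    ss_tot = np.sum((yt - yt.mean()) ** 2)
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Classify a deployed value against its benchmark.

    Assumption: the percentage is the deviation relative to the benchmark.
    """
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"
```

Under that assumption, `mark(0.5418, 0.5419)` gives ✓, since the relative deviation is about 0.02%.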

The ±0.05 sanity bound on aggregate R² is asserted by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
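
The check behind that bound might look like the sketch below. The helper names `full_validation_enabled` and `r2_within_bound` are hypothetical, and the actual test body is not shown in this report; the sketch only assumes the bound is an absolute difference in R² and that the environment variable must equal "1".

```python
import os

R2_BOUND = 0.05  # the ±0.05 sanity bound quoted in the report


def full_validation_enabled(env=None):
    """True when MAZE_RUN_FULL_VALIDATION is set to "1" (assumed convention)."""
    env = os.environ if env is None else env
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def r2_within_bound(benchmark_r2, deployed_r2, bound=R2_BOUND):
    """True when the deployed R² is within ±bound of the benchmark R²."""
    return abs(deployed_r2 - benchmark_r2) <= bound
```

With the values reported here, `r2_within_bound(0.5418, 0.5419)` is true. A test would typically skip itself when `full_validation_enabled()` is false and assert the bound otherwise.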

Per-horizon metrics — supplementary

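Per-horizon R² is computed against a per-step baseline, the variance of y_true[:, h] about its own mean, rather than the flattened grand mean. Because the baselines differ, per-horizon R² values are not directly comparable to the aggregate R² above. Values are shown at the standard reporting horizons.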

h R² (bench) R² (MAZE) Δ R² RMSE (bench) RMSE (MAZE) MAE (bench) MAE (MAZE)
1 0.8791 0.8813 +0.002 1.2725 1.2732 0.7714 0.7806
6 0.6719 0.6824 +0.010 2.0968 2.0596 1.2803 1.2807
12 0.6777 0.6735 -0.004 2.0779 2.1241 1.2724 1.2909
24 0.6739 0.6702 -0.004 2.0901 2.0909 1.2481 1.2536
48 0.5556 0.5636 +0.008 2.4403 2.4569 1.4892 1.4764
72 0.4665 0.4705 +0.004 2.6737 2.6351 1.6271 1.6121
96 0.3965 0.3962 -0.000 2.8437 2.8479 1.7409 1.7556
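
The per-step baseline used in this table can be sketched as below. The function name `per_horizon_r2` is hypothetical, and the sketch assumes that reporting horizon h (1 to 96) is stored in column h - 1 of a zero-indexed array; the report does not say how horizons map to columns.

```python
import numpy as np


def per_horizon_r2(y_true, y_pred, horizons=(1, 6, 12, 24, 48, 72, 96)):
    """R² per horizon, each against the mean of its own column."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    scores = {}
    for h in horizons:
        col = h - 1  # assumption: horizon h lives in zero-indexed column h - 1
        t, p = yt[:, col], yp[:, col]
        ss_res = np.sum((t - p) ** 2)
        # Per-step baseline: variance of this column only, not the grand mean.
        ss_tot = np.sum((t - t.mean()) ** 2)
        scores[h] = float(1.0 - ss_res / ss_tot)
    return scores
```

Each column is scored against its own spread, so a horizon with low variance can score lower than the aggregate even when its absolute errors are small.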

Validation methodology

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint forest_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint forest_tempo --full --output validation/results/