Validation · Deployed-model QA

MAZE deployed-model validation — SE-Htm · forest_tempo

✓ Integration verified — forest_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = +0.0002).

Configuration

Aggregate metrics — published benchmark vs MAZE deployed

All three metrics are computed over the flattened (N × 96) array; R² uses a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric  Benchmark  MAZE deployed  Δ        Mark
R²      0.7661     0.7663         +0.0002  ✓
RMSE    4.6537     4.6527         -0.0010  ✓
MAE     3.5557     3.5550         -0.0007  ✓
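
As an illustration of the aggregate computation described above, here is a minimal sketch, not the project's actual code. The function names `aggregate_metrics` and `mark` are hypothetical, and the sketch assumes the ±5% / ±15% thresholds are relative to the benchmark value, which the report does not state explicitly.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """R², RMSE and MAE over the flattened (N x H) arrays."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = t - p
    ss_res = float(np.sum(err ** 2))
    # Single grand-mean baseline: one mean over every (sequence, step) pair.
    ss_tot = float(np.sum((t - t.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Classify a deviation; ASSUMES thresholds are relative to the benchmark."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"
```

Under that relative-threshold assumption, `mark(0.7661, 0.7663)` falls in the ✓ band, since the deviation is far below 5% of the benchmark value.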

The ±0.05 sanity bound on aggregate R² is enforced by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when the test suite is run with MAZE_RUN_FULL_VALIDATION=1.
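
The body of that test is not shown in this report. The sketch below only illustrates the two ingredients the sentence above names: an environment-variable switch and an absolute ±0.05 check on R². The helper names `full_validation_enabled` and `r2_within_bound` are hypothetical and may not match anything in tests/test_validation.py.

```python
import os

R2_BOUND = 0.05  # absolute bound on aggregate R², as stated in this report


def full_validation_enabled(env=None):
    """True when MAZE_RUN_FULL_VALIDATION=1 is set (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def r2_within_bound(benchmark_r2, deployed_r2, bound=R2_BOUND):
    """Check that the deployed R² lies within ±bound of the benchmark R²."""
    return abs(deployed_r2 - benchmark_r2) <= bound
```

With the aggregate figures reported here, `r2_within_bound(0.7661, 0.7663)` holds, because the difference of 0.0002 is well inside the bound.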

Per-horizon metrics — supplementary

Per-horizon R² is computed against the variance of y_true[:, h] at each step h, not against the flattened grand mean. Values are reported at the standard reporting horizons.

h   R² (bench)  R² (MAZE)  Δ R²    RMSE (bench)  RMSE (MAZE)  MAE (bench)  MAE (MAZE)
1   0.8748      0.8732     -0.002  3.4180        3.4686       2.4835       2.5285
6   0.7692      0.7700     +0.001  4.6372        4.6035       3.5063       3.4642
12  0.7695      0.7717     +0.002  4.6217        4.5802       3.4866       3.4362
24  0.7625      0.7648     +0.002  4.7016        4.6653       3.4459       3.4298
48  0.7766      0.7845     +0.008  4.5601        4.4836       3.4617       3.4238
72  0.7562      0.7666     +0.010  4.7478        4.6562       3.6112       3.5350
96  0.7594      0.7701     +0.011  4.7029        4.6079       3.6282       3.5496
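
The per-step baseline used for the table above can be sketched as follows. This is an illustrative reading, not the project's implementation: `per_horizon_metrics` is a hypothetical name, and the sketch assumes horizon h is 1-based and therefore maps to column h - 1 of the (N × 96) array. The report does not say whether y_true[:, h] is 0- or 1-indexed.

```python
import numpy as np

# Reporting horizons taken from the table above.
HORIZONS = (1, 6, 12, 24, 48, 72, 96)


def per_horizon_metrics(y_true, y_pred, horizons=HORIZONS):
    """R², RMSE and MAE per horizon, each with its own per-step baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rows = {}
    for h in horizons:
        col = h - 1  # ASSUMPTION: horizon h is 1-based, stored in column h - 1
        t, p = y_true[:, col], y_pred[:, col]
        err = t - p
        # Baseline is the mean of this step only, not the flattened grand mean.
        ss_tot = float(np.sum((t - t.mean()) ** 2))
        rows[h] = {
            "R2": 1.0 - float(np.sum(err ** 2)) / ss_tot,
            "RMSE": float(np.sqrt(np.mean(err ** 2))),
            "MAE": float(np.mean(np.abs(err))),
        }
    return rows
```

Because each step has its own mean and variance, the per-horizon R² values need not average to the aggregate R², which is why the two tables are reported separately.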

Validation methodology

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint forest_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint forest_tempo --full --output validation/results/