✓ Integration verified — universal_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = -0.0010).
maze/models/checkpoints/universal_tempo.pth)

Computed across the flattened (N × 96) array with a single grand-mean R² baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

| Metric | Benchmark | MAZE deployed | Δ | Mark |
|---|---|---|---|---|
| R² | 0.7281 | 0.7271 | -0.0010 | ✓ |
| RMSE | 3.7935 | 3.7998 | +0.0063 | ✓ |
| MAE | 2.6019 | 2.6047 | +0.0028 | ✓ |

The ±0.05 sanity bound on aggregate R² is enforced by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
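The aggregate numbers above can be reproduced in spirit with a few lines of NumPy. This is a minimal sketch, not the validation runner's code: the function names (`aggregate_metrics`, `mark`, `within_r2_bound`) are hypothetical, and it assumes the ✓/⚠/✗ marks are based on the relative difference to the benchmark value, which the report does not state explicitly.

```python
import numpy as np

def aggregate_metrics(y_true, y_pred):
    """R², RMSE and MAE over the flattened (N x H) array."""
    yt = np.asarray(y_true, dtype=float).ravel()
    yp = np.asarray(y_pred, dtype=float).ravel()
    err = yt - yp
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((yt - yt.mean()) ** 2)  # single grand-mean baseline
    return {
        "r2": float(1.0 - ss_res / ss_tot),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
    }

def mark(benchmark, deployed):
    """Assumption: thresholds are relative differences to the benchmark."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"

def within_r2_bound(benchmark_r2, deployed_r2, bound=0.05):
    """Absolute sanity bound on aggregate R² (±0.05 in this report)."""
    return abs(deployed_r2 - benchmark_r2) <= bound
```

With the values from the table, `mark(0.7281, 0.7271)` yields ✓ and `within_r2_bound(0.7281, 0.7271)` is true, since |Δ R²| = 0.0010 is well inside the 0.05 bound.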
Per-horizon R² uses the per-step variance of y_true[:, h] as its baseline (not the flattened grand mean). Values are reported at the standard reporting horizons.

| h | R² (bench) | R² (MAZE) | Δ R² | RMSE (bench) | RMSE (MAZE) | MAE (bench) | MAE (MAZE) |
|---|---|---|---|---|---|---|---|
| 1 | 0.8864 | 0.8816 | -0.005 | 2.4346 | 2.4788 | 1.5743 | 1.5911 |
| 6 | 0.7778 | 0.7796 | +0.002 | 3.4071 | 3.3791 | 2.2463 | 2.2339 |
| 12 | 0.7745 | 0.7702 | -0.004 | 3.4338 | 3.5233 | 2.2961 | 2.3143 |
| 24 | 0.7478 | 0.7487 | +0.001 | 3.6428 | 3.5932 | 2.4267 | 2.3959 |
| 48 | 0.7297 | 0.7166 | -0.013 | 3.7898 | 3.8858 | 2.5971 | 2.6701 |
| 72 | 0.7056 | 0.6939 | -0.012 | 3.9612 | 4.0544 | 2.7437 | 2.7883 |
| 96 | 0.6813 | 0.6901 | +0.009 | 4.1320 | 4.0814 | 2.9204 | 2.8869 |
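The per-horizon baseline described above differs from the aggregate one only in which mean is subtracted. The sketch below illustrates that difference under stated assumptions: `per_horizon_r2` is a hypothetical helper, the arrays are taken to be 2-D with shape (N, 96), and the reported horizon h is treated as 1-based (so it maps to column h - 1).

```python
import numpy as np

def per_horizon_r2(y_true, y_pred, horizons=(1, 6, 12, 24, 48, 72, 96)):
    """R² per forecast step, each with its own per-step variance baseline."""
    yt = np.asarray(y_true, dtype=float)
    yp = np.asarray(y_pred, dtype=float)
    scores = {}
    for h in horizons:
        col = h - 1  # assumption: horizon labels are 1-based, columns 0-based
        ss_res = np.sum((yt[:, col] - yp[:, col]) ** 2)
        # Baseline is the mean of this step only, not the flattened grand mean.
        ss_tot = np.sum((yt[:, col] - yt[:, col].mean()) ** 2)
        scores[h] = float(1.0 - ss_res / ss_tot)
    return scores
```

Because each step is scored against its own variance, per-horizon values are not expected to average to the aggregate R², which uses one baseline for the whole flattened array.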
NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (7.SE-Htm.csv), gap-filled with ffill/bfill, sliced into 336+96 sliding windows. Data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.

model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), output sliced [:, -96:, :].

/predict: the deployed path uses MC-dropout with ensemble_size=5, which adds forward-pass noise. The validation runner uses ensemble_size=1 deterministic mode for an apples-to-apples comparison against the published benchmark. Per Gal & Ghahramani (2016), the expected R² penalty from averaging over MC-dropout samples is small (~0.02) but is not measured in this report.

np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.

    PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint universal_tempo --output validation/results/
    # or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
    PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint universal_tempo --full --output validation/results/
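The data preparation described above (gap-filling, 336+96 windows, evenly spaced subsampling) can be sketched as follows. This is an illustration only, not the code in scripts/run_validation.py: the helper names are hypothetical, the gap-filling is a NumPy stand-in for pandas-style ffill/bfill, and a window stride of 1 is an assumption. Only the `np.linspace(0, N-1, K, dtype=int)` call is taken directly from the report.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

INPUT_LEN, HORIZON = 336, 96  # window sizes quoted in the report

def fill_gaps(values):
    """Forward-fill NaNs, then back-fill any leading gap."""
    v = np.asarray(values, dtype=float)
    valid = ~np.isnan(v)
    if not valid.any():
        raise ValueError("series contains no valid observations")
    # Index of the most recent valid sample at or before each position.
    last_valid = np.maximum.accumulate(np.where(valid, np.arange(v.size), 0))
    filled = v[last_valid]        # fancy indexing returns a copy
    first = int(np.argmax(valid))
    filled[:first] = v[first]     # back-fill values before the first valid sample
    return filled

def make_windows(series, input_len=INPUT_LEN, horizon=HORIZON):
    """Split a 1-D series into (inputs, targets); stride 1 is an assumption."""
    windows = sliding_window_view(series, input_len + horizon)
    return windows[:, :input_len], windows[:, input_len:]

def subsample_indices(n, k):
    """K evenly spaced window indices across the full record."""
    if k >= n:
        return np.arange(n)
    return np.linspace(0, n - 1, k, dtype=int)
```

Spacing the indices evenly over 0..N-1, rather than taking the first K windows, keeps samples from every part of the record, which is how the subsample preserves seasonal coverage.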