✓ Integration verified — universal_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = -0.0011).
maze/models/checkpoints/universal_tempo.pth)Computed across the flattened (N × 96) array with a single grand-mean R² baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.
| Metric | Benchmark | MAZE deployed | Δ | Mark |
|---|---|---|---|---|
| R2 | 0.5988 | 0.5977 | -0.0011 | ✓ |
| RMSE | 2.3184 | 2.3219 | +0.0035 | ✓ |
| MAE | 1.4517 | 1.4542 | +0.0025 | ✓ |
The ±0.05 R² sanity bound on aggregate R² gates tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
Per-horizon R² uses a per-step variance baseline of y_true[:, h] (not the flattened grand mean). Reported at the standard reporting horizons.
| h | R² (bench) | R² (MAZE) | Δ R² | RMSE (bench) | RMSE (MAZE) | MAE (bench) | MAE (MAZE) |
|---|---|---|---|---|---|---|---|
| 1 | 0.8787 | 0.8805 | +0.002 | 1.2748 | 1.2773 | 0.7878 | 0.7933 |
| 6 | 0.6967 | 0.6981 | +0.001 | 2.0158 | 2.0082 | 1.2454 | 1.2586 |
| 12 | 0.7025 | 0.6954 | -0.007 | 1.9966 | 2.0515 | 1.2345 | 1.2513 |
| 24 | 0.6831 | 0.6799 | -0.003 | 2.0604 | 2.0598 | 1.2463 | 1.2353 |
| 48 | 0.5998 | 0.5945 | -0.005 | 2.3156 | 2.3683 | 1.4540 | 1.4675 |
| 72 | 0.5421 | 0.5637 | +0.022 | 2.4770 | 2.3919 | 1.5722 | 1.5446 |
| 96 | 0.4814 | 0.4738 | -0.008 | 2.6362 | 2.6586 | 1.6968 | 1.7047 |
NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (6.UK-AMo.csv), gap-filled with ffill/bfill, sliced into 336+96 sliding windows. Data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), output sliced [:, -96:, :]./predict: the deployed path uses MC-dropout with ensemble_size=5, which adds forward-pass noise. The validation runner uses ensemble_size=1 deterministic mode for an apples-to-apples comparison against the published benchmark. Per Gal & Ghahramani (2016), the expected R² penalty from averaging over MC-dropout samples is small (~0.02) but is not measured in this report.np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint universal_tempo --output validation/results/
# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint universal_tempo --full --output validation/results/