✓ Integration verified — forest_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = +0.0001).
maze/models/checkpoints/forest_tempo.pth)Computed across the flattened (N × 96) array with a single grand-mean R² baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.
| Metric | Benchmark | MAZE deployed | Δ | Mark |
|---|---|---|---|---|
| R2 | 0.5418 | 0.5419 | +0.0001 | ✓ |
| RMSE | 2.4777 | 2.4776 | -0.0001 | ✓ |
| MAE | 1.5025 | 1.5033 | +0.0008 | ✓ |
The ±0.05 R² sanity bound on aggregate R² gates tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
Per-horizon R² uses a per-step variance baseline of y_true[:, h] (not the flattened grand mean). Reported at the standard reporting horizons.
| h | R² (bench) | R² (MAZE) | Δ R² | RMSE (bench) | RMSE (MAZE) | MAE (bench) | MAE (MAZE) |
|---|---|---|---|---|---|---|---|
| 1 | 0.8791 | 0.8813 | +0.002 | 1.2725 | 1.2732 | 0.7714 | 0.7806 |
| 6 | 0.6719 | 0.6824 | +0.010 | 2.0968 | 2.0596 | 1.2803 | 1.2807 |
| 12 | 0.6777 | 0.6735 | -0.004 | 2.0779 | 2.1241 | 1.2724 | 1.2909 |
| 24 | 0.6739 | 0.6702 | -0.004 | 2.0901 | 2.0909 | 1.2481 | 1.2536 |
| 48 | 0.5556 | 0.5636 | +0.008 | 2.4403 | 2.4569 | 1.4892 | 1.4764 |
| 72 | 0.4665 | 0.4705 | +0.004 | 2.6737 | 2.6351 | 1.6271 | 1.6121 |
| 96 | 0.3965 | 0.3962 | -0.000 | 2.8437 | 2.8479 | 1.7409 | 1.7556 |
NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (6.UK-AMo.csv), gap-filled with ffill/bfill, sliced into 336+96 sliding windows. Data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), output sliced [:, -96:, :]./predict: the deployed path uses MC-dropout with ensemble_size=5, which adds forward-pass noise. The validation runner uses ensemble_size=1 deterministic mode for an apples-to-apples comparison against the published benchmark. Per Gal & Ghahramani (2016), the expected R² penalty from averaging over MC-dropout samples is small (~0.02) but is not measured in this report.np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint forest_tempo --output validation/results/
# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint forest_tempo --full --output validation/results/