✓ Integration verified — forest_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = +0.0002).
maze/models/checkpoints/forest_tempo.pth)

Aggregate metrics are computed across the flattened (N × 96) array with a single grand-mean R² baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.
| Metric | Benchmark | MAZE deployed | Δ | Mark |
|---|---|---|---|---|
| R² | 0.7661 | 0.7663 | +0.0002 | ✓ |
| RMSE | 4.6537 | 4.6527 | -0.0010 | ✓ |
| MAE | 3.5557 | 3.5550 | -0.0007 | ✓ |
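
The aggregate metrics and marks in the table above can be reproduced along the following lines. This is a minimal sketch, not the validation runner's code: the function names `aggregate_metrics` and `mark` are hypothetical, and the ±5% / ±15% bands are assumed to be relative to the benchmark value, which the report does not state explicitly.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """R2, RMSE and MAE over the flattened (N x H) array."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = p - t
    ss_res = float(np.sum(err ** 2))
    # Single grand-mean baseline: one mean over every sample and horizon.
    ss_tot = float(np.sum((t - t.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Classify a deployed value against its benchmark.

    Assumption: the bands are relative deviations from the benchmark.
    """
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"
```

With the values from the table, `mark(0.7661, 0.7663)` and `mark(4.6537, 4.6527)` both fall well inside the ±5% band.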

The ±0.05 sanity bound on aggregate R² is the pass/fail criterion for tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
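
An environment-gated bound check of this kind might look like the sketch below. It is illustrative only: the helper names are hypothetical, the real test body is not shown in this report, and skipping the check when the variable is unset is an assumption. A pytest suite would more likely express the gate with a skip marker than with an early return.

```python
import os

R2_BOUND = 0.05  # absolute bound on aggregate R2 quoted in this report


def full_validation_enabled(env=None):
    """True when MAZE_RUN_FULL_VALIDATION=1 is set (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def aggregate_r2_within_bound(r2_benchmark, r2_deployed, bound=R2_BOUND):
    """Absolute-difference check against the published benchmark R2."""
    return abs(r2_deployed - r2_benchmark) <= bound


def test_aggregate_r2_within_benchmark_bound():
    # Assumption: the check is skipped unless the env var is set.
    if not full_validation_enabled():
        return
    # Aggregate values taken from the table in this report.
    assert aggregate_r2_within_bound(0.7661, 0.7663)
```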
Per-horizon R² uses the variance of y_true[:, h] at each step as its baseline (not the flattened grand mean). Values are reported at the standard reporting horizons.
| h | R² (bench) | R² (MAZE) | Δ R² | RMSE (bench) | RMSE (MAZE) | MAE (bench) | MAE (MAZE) |
|---|---|---|---|---|---|---|---|
| 1 | 0.8748 | 0.8732 | -0.002 | 3.4180 | 3.4686 | 2.4835 | 2.5285 |
| 6 | 0.7692 | 0.7700 | +0.001 | 4.6372 | 4.6035 | 3.5063 | 3.4642 |
| 12 | 0.7695 | 0.7717 | +0.002 | 4.6217 | 4.5802 | 3.4866 | 3.4362 |
| 24 | 0.7625 | 0.7648 | +0.002 | 4.7016 | 4.6653 | 3.4459 | 3.4298 |
| 48 | 0.7766 | 0.7845 | +0.008 | 4.5601 | 4.4836 | 3.4617 | 3.4238 |
| 72 | 0.7562 | 0.7666 | +0.010 | 4.7478 | 4.6562 | 3.6112 | 3.5350 |
| 96 | 0.7594 | 0.7701 | +0.011 | 4.7029 | 4.6079 | 3.6282 | 3.5496 |
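
The difference between the per-step baseline used in the per-horizon table and the grand-mean baseline used for the aggregate figure can be sketched as follows. The function names are hypothetical, and mapping reporting horizon h to column h - 1 is an assumption, since the report does not say whether h is 0- or 1-indexed.

```python
import numpy as np

REPORT_HORIZONS = (1, 6, 12, 24, 48, 72, 96)  # horizons listed in the table


def r2_per_horizon(y_true, y_pred):
    """R2 for each forecast step, using that step's own mean as baseline."""
    t = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((t - p) ** 2, axis=0)
    ss_tot = np.sum((t - t.mean(axis=0)) ** 2, axis=0)  # per-step baseline
    return 1.0 - ss_res / ss_tot


def r2_grand_mean(y_true, y_pred):
    """R2 over the flattened array, using one grand mean as baseline."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    return 1.0 - np.sum((t - p) ** 2) / np.sum((t - t.mean()) ** 2)


def r2_at_report_horizons(y_true, y_pred, horizons=REPORT_HORIZONS):
    """Assumption: horizon h is stored in column h - 1."""
    per_step = r2_per_horizon(y_true, y_pred)
    return {h: float(per_step[h - 1]) for h in horizons}
```

The two baselines can disagree sharply. A predictor that outputs each column's mean scores 0 at every horizon under the per-step baseline, yet it can score close to 1 under the grand-mean baseline when the column means differ, so the aggregate and per-horizon figures are not directly comparable.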

- NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (7.SE-Htm.csv), gap-filled with ffill/bfill and sliced into 336+96 sliding windows. The data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.
- Inference runs deterministically (model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), with the output sliced [:, -96:, :].
- /predict: the deployed path uses MC-dropout with ensemble_size=5, which adds forward-pass noise. The validation runner uses deterministic mode (ensemble_size=1) for an apples-to-apples comparison against the published benchmark. Per Gal & Ghahramani (2016), the expected R² penalty from averaging over MC-dropout samples is small (~0.02) but is not measured in this report.
- Sequences are subsampled at indices np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.

```shell
PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint forest_tempo --output validation/results/
# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint forest_tempo --full --output validation/results/
```
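
The data preparation described above (gap-filling, 336+96 windowing, evenly spaced subsampling) can be sketched as below. This is not the code in scripts/run_validation.py: the function names are hypothetical, a window stride of 1 is assumed, and the gap-fill is a numpy stand-in for the pandas ffill/bfill calls the report mentions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

SEQ_LEN, PRED_LEN = 336, 96  # window sizes quoted in this report


def fill_gaps(series):
    """Forward-fill, then back-fill any leading NaNs."""
    values = np.asarray(series, dtype=float)

    def ffill(v):
        # Index of the most recent non-NaN entry (0 if none seen yet).
        idx = np.where(np.isnan(v), 0, np.arange(v.size))
        return v[np.maximum.accumulate(idx)]

    forward = ffill(values)
    # Back-fill is a forward-fill applied to the reversed series.
    return ffill(forward[::-1])[::-1]


def make_windows(series, seq_len=SEQ_LEN, pred_len=PRED_LEN):
    """Split a 1-D series into (input, target) windows. Assumes stride 1."""
    windows = sliding_window_view(np.asarray(series), seq_len + pred_len)
    return windows[:, :seq_len], windows[:, seq_len:]


def sample_indices(n, k):
    """K evenly spaced window indices across the N eligible sequences."""
    # Indices repeat when k > n, so callers may want to cap k at n.
    return np.linspace(0, n - 1, k, dtype=int)
```

Because the indices span the whole range from 0 to N-1, a subsample of K windows touches every part of the record rather than only its start, which is what preserves seasonal coverage.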