MAZE deployed-model validation — UK-AMo · forest_tempo

✓ Integration verified — forest_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = +0.0001).

Configuration

Site: UK-AMo
Checkpoint: forest_tempo (maze/models/checkpoints/forest_tempo.pth)
Sequences evaluated: 1,000 of 9,055 eligible / 9,055 total (stratified-uniform-1000)
Predictor: forest_tempo (deterministic, ensemble_size=1)
Model load: 3.9s
Inference wall-clock: 24s (0.02s / sample on CPU)

Aggregate metrics — published benchmark vs MAZE deployed

Metrics are computed over the flattened (N × 96) array; R² uses a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric	Benchmark	MAZE deployed	Δ	Mark
R²	0.5418	0.5419	+0.0001	✓
RMSE	2.4777	2.4776	-0.0001	✓
MAE	1.5025	1.5033	+0.0008	✓
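
The aggregate rows and marks above can be reproduced along the following lines. This is a minimal sketch, not the runner's code: the function names are hypothetical, and the marks are assumed to compare the relative difference against the benchmark value, which this report does not spell out.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """R², RMSE and MAE over the flattened (N x 96) arrays."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = t - p
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((t - t.mean()) ** 2))  # single grand-mean baseline
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Assumed rule: relative difference against the benchmark value."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"


# R² row from the table above.
print(mark(0.5418, 0.5419))
```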

The ±0.05 sanity bound on aggregate R² gates tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
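
A gate of this kind might look like the sketch below. It is illustrative only: the actual test in tests/test_validation.py is not shown in this report, the helper r2_within_bound is a hypothetical name, and the standard-library unittest module is used here in place of whatever test framework the repository uses.

```python
import os
import unittest

R2_BOUND = 0.05  # absolute bound on |R²(deployed) - R²(benchmark)|


def r2_within_bound(r2_deployed, r2_benchmark, bound=R2_BOUND):
    """True when the deployed R² is within the absolute bound of the benchmark."""
    return abs(r2_deployed - r2_benchmark) <= bound


@unittest.skipUnless(
    os.environ.get("MAZE_RUN_FULL_VALIDATION") == "1",
    "set MAZE_RUN_FULL_VALIDATION=1 to run the full validation",
)
class AggregateR2Bound(unittest.TestCase):
    def test_aggregate_r2_within_benchmark_bound(self):
        # Stand-in values copied from the aggregate table; a real test
        # would presumably compute the deployed R² from a validation run.
        self.assertTrue(r2_within_bound(0.5419, 0.5418))
```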

Per-horizon metrics — supplementary

Per-horizon R² is computed against the variance of y_true[:, h] at each step, not against the flattened grand mean. Metrics are reported at the standard reporting horizons.

h	R² (bench)	R² (MAZE)	Δ R²	RMSE (bench)	RMSE (MAZE)	MAE (bench)	MAE (MAZE)
1	0.8791	0.8813	+0.002	1.2725	1.2732	0.7714	0.7806
6	0.6719	0.6824	+0.010	2.0968	2.0596	1.2803	1.2807
12	0.6777	0.6735	-0.004	2.0779	2.1241	1.2724	1.2909
24	0.6739	0.6702	-0.004	2.0901	2.0909	1.2481	1.2536
48	0.5556	0.5636	+0.008	2.4403	2.4569	1.4892	1.4764
72	0.4665	0.4705	+0.004	2.6737	2.6351	1.6271	1.6121
96	0.3965	0.3962	-0.000	2.8437	2.8479	1.7409	1.7556
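
The per-step baseline behind the R² columns above can be sketched as follows. This is an assumption-laden illustration rather than the runner's implementation: per_horizon_r2 is a hypothetical name, the arrays are assumed to be 2-D with shape (N, 96) after dropping the channel axis, and the horizon h is assumed to be 1-based (h = 1 is the first forecast step).

```python
import numpy as np


def per_horizon_r2(y_true, y_pred, horizons=(1, 6, 12, 24, 48, 72, 96)):
    """R² per forecast step, each using the mean of y_true[:, h] as its baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    out = {}
    for h in horizons:
        # Assumption: horizons are 1-based, so step h lives in column h - 1.
        t, p = y_true[:, h - 1], y_pred[:, h - 1]
        ss_res = np.sum((t - p) ** 2)
        ss_tot = np.sum((t - t.mean()) ** 2)  # per-step baseline, not the grand mean
        out[h] = float(1.0 - ss_res / ss_tot)
    return out
```

Because each step has its own baseline, the per-horizon values are not expected to average to the aggregate R² reported earlier.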

Validation methodology

Test data: raw NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (6.UK-AMo.csv), gap-filled with ffill/bfill and sliced into sliding windows of 336 input steps plus 96 target steps. The data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.
Inference: single deterministic forward pass per sample (model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), output sliced [:, -96:, :].
Deviation from deployed /predict: the deployed path uses MC-dropout (Gal & Ghahramani, 2016) with ensemble_size=5, which adds forward-pass noise. The validation runner uses deterministic mode (ensemble_size=1) for an apples-to-apples comparison against the published benchmark. The R² difference between MC-dropout averaging and the deterministic pass is expected to be small (~0.02), but it is not measured in this report.
Subsampling: stratified-uniform indices via np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.
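
The data preparation and subsampling steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the runner's code: make_windows and subsample_indices are hypothetical names, "336+96" is read as 336 input steps followed by 96 target steps, and a window stride of 1 is assumed.

```python
import numpy as np
import pandas as pd

INPUT_LEN, HORIZON = 336, 96  # assumed reading of "336+96"


def make_windows(series: pd.Series):
    """Gap-fill with ffill/bfill, then cut stride-1 windows of INPUT_LEN + HORIZON."""
    values = series.ffill().bfill().to_numpy(dtype=float)
    total = INPUT_LEN + HORIZON
    windows = np.lib.stride_tricks.sliding_window_view(values, total)
    return windows[:, :INPUT_LEN], windows[:, INPUT_LEN:]


def subsample_indices(n: int, k: int):
    """Evenly spaced indices over the full range, as in np.linspace(0, N-1, K, dtype=int)."""
    return np.linspace(0, n - 1, k, dtype=int)


# With the counts from the Configuration section: 1,000 of 9,055 sequences.
idx = subsample_indices(9055, 1000)
print(len(idx), idx[0], idx[-1])
```

Because the indices are spread evenly from the first to the last eligible window, every part of the record contributes samples, which is how this scheme keeps seasonal coverage.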

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint forest_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint forest_tempo --full --output validation/results/