MAZE deployed-model validation — SE-Htm · forest_tempo

✓ Integration verified — forest_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = +0.0002).

Configuration

Site: SE-Htm
Checkpoint: forest_tempo (maze/models/checkpoints/forest_tempo.pth)
Sequences evaluated: 1,000 of 1,503 eligible / 5,007 total (stratified-uniform-1000)
Predictor: forest_tempo (deterministic, ensemble_size=1)
Model load: 2.4s
Inference wall-clock: 26s (0.03s / sample on CPU)

Forest-TEMPO was fine-tuned on the first 70% of SE-Htm, so this evaluation uses the aligned subset (windows 3504:5007 of 5,007, i.e. the last 1,503 sliding windows) to avoid training-data leakage. The published benchmark for this variant is computed on the same aligned subset.
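
The aligned-subset arithmetic can be checked with a short sketch. The helper name and the floor-at-70% split rule are assumptions made for illustration; the report only states the resulting bounds (3504:5007) and the count (1,503).

```python
# Hypothetical helper: the floor-based 70% split is an assumption that
# happens to reproduce the bounds quoted in this report.
def aligned_subset(total_windows: int, train_percent: int = 70) -> range:
    """Return the window indices that fall after the fine-tuning split."""
    # Integer arithmetic avoids floating-point rounding at the boundary.
    start = (total_windows * train_percent) // 100
    return range(start, total_windows)


eligible = aligned_subset(5007)
print(eligible.start, eligible.stop, len(eligible))  # 3504 5007 1503
```

Only indices in this range are eligible for evaluation, so no evaluated window starts inside the fine-tuning split.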

Aggregate metrics — published benchmark vs MAZE deployed

All metrics are computed over the flattened (N × 96) array; R² uses a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric	Benchmark	MAZE deployed	Δ	Mark
R²	0.7661	0.7663	+0.0002	✓
RMSE	4.6537	4.6527	-0.0010	✓
MAE	3.5557	3.5550	-0.0007	✓
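
As a rough illustration of how the aggregate rows above could be computed, the sketch below flattens the (N × 96) arrays and uses one grand mean for the R² baseline. The function names are hypothetical, and reading the marks as relative deviation from the benchmark value is an assumption, not something the report confirms.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred) -> dict:
    """R2, RMSE and MAE over the flattened (N x H) array."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = t - p
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((t - t.mean()) ** 2))  # single grand-mean baseline
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark: float, deployed: float) -> str:
    """Assumed rule: relative deviation within 5%, within 15%, or larger."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"


# Toy data only; these are not values from the validation run.
y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 2.0], [3.0, 5.0]])
m = aggregate_metrics(y_true, y_pred)
print(m)
print(mark(0.7661, 0.7663))
```

On the toy arrays the single wrong prediction gives R² = 0.8, RMSE = 0.5 and MAE = 0.25.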

The ±0.05 sanity bound on aggregate R² is asserted by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when the test suite is run with MAZE_RUN_FULL_VALIDATION=1.
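
The gate described above might look roughly like the following. This is a minimal sketch, not the contents of tests/test_validation.py: both helper names are hypothetical, and treating the value "1" as the only enabling setting is an assumption.

```python
import os

R2_BOUND = 0.05  # absolute bound on aggregate R2, as stated in this report


def full_validation_enabled(env=None) -> bool:
    """True when MAZE_RUN_FULL_VALIDATION is set to "1" (assumed convention)."""
    env = os.environ if env is None else env
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def r2_within_bound(benchmark_r2: float, deployed_r2: float,
                    bound: float = R2_BOUND) -> bool:
    """Check the absolute difference in aggregate R2 against the bound."""
    return abs(deployed_r2 - benchmark_r2) <= bound


# Values from the aggregate table: |0.7663 - 0.7661| = 0.0002 <= 0.05.
print(r2_within_bound(0.7661, 0.7663))  # True
```

Note that this bound is absolute (±0.05 in R² units), whereas the marks in the aggregate table are expressed as percentages.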

Per-horizon metrics — supplementary

Per-horizon R² is computed against a per-step baseline, the variance of y_true[:, h] around its own mean, rather than the flattened grand mean. Values are shown at the standard reporting horizons.

h	R² (bench)	R² (MAZE)	Δ R²	RMSE (bench)	RMSE (MAZE)	MAE (bench)	MAE (MAZE)
1	0.8748	0.8732	-0.002	3.4180	3.4686	2.4835	2.5285
6	0.7692	0.7700	+0.001	4.6372	4.6035	3.5063	3.4642
12	0.7695	0.7717	+0.002	4.6217	4.5802	3.4866	3.4362
24	0.7625	0.7648	+0.002	4.7016	4.6653	3.4459	3.4298
48	0.7766	0.7845	+0.008	4.5601	4.4836	3.4617	3.4238
72	0.7562	0.7666	+0.010	4.7478	4.6562	3.6112	3.5350
96	0.7594	0.7701	+0.011	4.7029	4.6079	3.6282	3.5496
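
The per-step baseline used for the per-horizon R² column can be sketched as follows. The function name is hypothetical, and the code assumes y_true and y_pred are (N × H) arrays with one column per forecast step.

```python
import numpy as np


def per_horizon_r2(y_true, y_pred) -> np.ndarray:
    """R2 per forecast step, each against that step's own mean."""
    t = np.asarray(y_true, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((t - p) ** 2, axis=0)
    # Per-column baseline: deviation of y_true[:, h] from its own mean.
    ss_tot = np.sum((t - t.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot


# Toy data only: two forecast steps, three samples.
y_true = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y_pred = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 33.0]])
r2 = per_horizon_r2(y_true, y_pred)
print(r2)
```

Because each column has its own baseline, per-horizon values are not directly comparable to the aggregate R², which uses a single grand mean.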

Validation methodology

Test data: raw NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (7.SE-Htm.csv), gap-filled with ffill/bfill, then sliced into sliding windows of 336 input steps plus 96 forecast steps. The data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.
Inference: single deterministic forward pass per sample (model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), output sliced [:, -96:, :].
Deviation from deployed /predict: the deployed path uses MC-dropout with ensemble_size=5, which adds forward-pass noise. The validation runner uses ensemble_size=1 deterministic mode for an apples-to-apples comparison against the published benchmark. Per Gal & Ghahramani (2016), the expected R² penalty from averaging over MC-dropout samples is small (~0.02) but is not measured in this report.
Subsampling: stratified-uniform indices via np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.
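
The windowing and subsampling steps above can be sketched as below. The helper names and the stride of 1 are assumptions, gap-filling is omitted, and the synthetic series stands in for the real NEE_VUT_REF column; only the np.linspace call is taken directly from this report.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

INPUT_LEN, HORIZON = 336, 96  # window layout stated in this report


def make_windows(series):
    """Split a 1-D series into (input, target) sliding windows, stride 1."""
    values = np.asarray(series, dtype=float)
    windows = sliding_window_view(values, INPUT_LEN + HORIZON)
    return windows[:, :INPUT_LEN], windows[:, INPUT_LEN:]


def subsample_indices(n: int, k: int) -> np.ndarray:
    """Evenly spaced indices over [0, n-1]; all indices when k >= n."""
    if k >= n:
        return np.arange(n)
    return np.linspace(0, n - 1, k, dtype=int)


# Synthetic series only, not FLUXNET data.
series = np.arange(1000.0)
x, y = make_windows(series)
idx = subsample_indices(len(x), 10)
print(x.shape, y.shape)  # (569, 336) (569, 96)
print(idx[0], idx[-1])   # 0 568
```

Evenly spaced indices always include the first and last eligible window. For K = 1,000 of N = 1,503 the spacing is greater than 1, so the truncation to int does not produce duplicate indices.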

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint forest_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint forest_tempo --full --output validation/results/