MAZE deployed-model validation — SE-Htm · universal_tempo

✓ Integration verified — universal_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = -0.0010).

Configuration

Site: SE-Htm
Checkpoint: universal_tempo (maze/models/checkpoints/universal_tempo.pth)
Sequences evaluated: 1,000 of 5,007 eligible (5,007 total; sampling: stratified-uniform-1000)
Predictor: universal_tempo (deterministic, ensemble_size=1)
Model load: 2.5s
Inference wall-clock: 28s (0.03s / sample on CPU)

Aggregate metrics — published benchmark vs MAZE deployed

All three metrics are computed over the flattened (N × 96) array; R² uses a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric	Benchmark	MAZE deployed	Δ	Mark
R²	0.7281	0.7271	-0.0010	✓
RMSE	3.7935	3.7998	+0.0063	✓
MAE	2.6019	2.6047	+0.0028	✓
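
The aggregate figures above can be reproduced in outline with the sketch below. It is a minimal illustration, not the MAZE runner itself: the function names aggregate_metrics and mark are hypothetical, and the mark rule assumes the ±5% / ±15% thresholds apply to |Δ| relative to the benchmark value, which this report does not state explicitly.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """R², RMSE and MAE over the flattened (N x H) arrays."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = p - t
    ss_res = float(np.sum(err ** 2))
    # Single grand-mean baseline: one mean over every sample and horizon.
    ss_tot = float(np.sum((t - t.mean()) ** 2))
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Assumed rule: thresholds are relative to the benchmark value."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"
```

Under that assumed rule, mark(0.7281, 0.7271) yields ✓, matching the R² row of the table.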

The ±0.05 sanity bound on aggregate R² is enforced by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound when run with MAZE_RUN_FULL_VALIDATION=1.
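
A gate of this kind might look roughly like the following. This is only a sketch of the idea (an opt-in environment flag plus an absolute bound on Δ R²); the helper names are hypothetical and the actual test body in tests/test_validation.py may differ.

```python
import os

R2_BOUND = 0.05  # absolute bound on |Δ R²|, as stated in this report


def full_validation_enabled(env=None):
    """True only when the opt-in flag is set to "1" (assumed semantics)."""
    env = os.environ if env is None else env
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def r2_within_bound(benchmark_r2, deployed_r2, bound=R2_BOUND):
    """Check the deployed aggregate R² against the published benchmark."""
    return abs(deployed_r2 - benchmark_r2) <= bound


# The check runs only when full validation is requested.
if full_validation_enabled():
    assert r2_within_bound(0.7281, 0.7271)
```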

Per-horizon metrics — supplementary

Per-horizon R² is computed against the variance of y_true[:, h] at each step h, not against the flattened grand mean, so these values are not directly comparable to the aggregate R². Values are reported at the standard reporting horizons.

h	R² (bench)	R² (MAZE)	Δ R²	RMSE (bench)	RMSE (MAZE)	MAE (bench)	MAE (MAZE)
1	0.8864	0.8816	-0.005	2.4346	2.4788	1.5743	1.5911
6	0.7778	0.7796	+0.002	3.4071	3.3791	2.2463	2.2339
12	0.7745	0.7702	-0.004	3.4338	3.5233	2.2961	2.3143
24	0.7478	0.7487	+0.001	3.6428	3.5932	2.4267	2.3959
48	0.7297	0.7166	-0.013	3.7898	3.8858	2.5971	2.6701
72	0.7056	0.6939	-0.012	3.9612	4.0544	2.7437	2.7883
96	0.6813	0.6901	+0.009	4.1320	4.0814	2.9204	2.8869
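
For illustration, the per-step baseline can be sketched as follows. The function name per_horizon_r2 is hypothetical, and the code assumes horizon h is 1-based and maps to column h - 1 of an (N × 96) array; the report does not say which indexing the runner uses.

```python
import numpy as np

REPORT_HORIZONS = (1, 6, 12, 24, 48, 72, 96)


def per_horizon_r2(y_true, y_pred, horizons=REPORT_HORIZONS):
    """R² per horizon, each with its own per-step mean as the baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    out = {}
    for h in horizons:
        t = y_true[:, h - 1]  # assumption: horizon h is 1-based
        p = y_pred[:, h - 1]
        ss_res = np.sum((t - p) ** 2)
        ss_tot = np.sum((t - t.mean()) ** 2)  # variance of this step only
        out[h] = float(1.0 - ss_res / ss_tot)
    return out
```

Because each horizon is normalized by its own variance, a predictor that outputs the per-step mean scores 0 at every horizon, whereas the same predictor can score above 0 under the grand-mean baseline if step means differ.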

Validation methodology

Test data: raw NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (7.SE-Htm.csv), gap-filled with ffill/bfill, then sliced into sliding windows of 336 + 96 time steps. The data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.
Inference: single deterministic forward pass per sample (model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), output sliced [:, -96:, :].
Deviation from deployed /predict: the deployed path uses MC-dropout with ensemble_size=5, which adds forward-pass noise. The validation runner uses ensemble_size=1 deterministic mode for an apples-to-apples comparison against the published benchmark. Per Gal & Ghahramani (2016), the expected R² penalty from averaging over MC-dropout samples is small (~0.02) but is not measured in this report.
Subsampling: stratified-uniform indices via np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.
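
The data preparation and subsampling steps above can be sketched as below. This is an illustrative outline under stated assumptions, not the runner's code: the helper names are hypothetical, the 336 steps are taken to be model input and the 96 steps the forecast target, and the windows are assumed to advance one step at a time.

```python
import numpy as np
import pandas as pd

INPUT_LEN, HORIZON = 336, 96  # assumed split: 336 input steps, 96 target steps


def make_windows(series, input_len=INPUT_LEN, horizon=HORIZON):
    """Gap-fill with ffill then bfill, then cut stride-1 sliding windows."""
    filled = pd.Series(series, dtype=float).ffill().bfill().to_numpy()
    total = input_len + horizon
    if len(filled) < total:
        return np.empty((0, input_len)), np.empty((0, horizon))
    windows = np.lib.stride_tricks.sliding_window_view(filled, total)
    return windows[:, :input_len], windows[:, input_len:]


def stratified_uniform_indices(n_eligible, k):
    """Evenly spaced indices over the eligible windows, endpoints included."""
    if k >= n_eligible:
        return np.arange(n_eligible)
    return np.linspace(0, n_eligible - 1, k, dtype=int)
```

With n_eligible=5007 and k=1000, the indices run from 0 to 5006 in steps of 5 or 6, so the sample spans the whole record rather than clustering in one season.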

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint universal_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site se-htm --checkpoint universal_tempo --full --output validation/results/