MAZE deployed-model validation — UK-AMo · universal_tempo

✓ Integration verified — universal_tempo aggregate R² matches the published benchmark to within ±0.05 (Δ R² = -0.0011).

Configuration

Site: UK-AMo
Checkpoint: universal_tempo (maze/models/checkpoints/universal_tempo.pth)
Sequences evaluated: 1,000 of 9,055 eligible / 9,055 total (stratified-uniform-1000)
Predictor: universal_tempo (deterministic, ensemble_size=1)
Model load: 4.2s
Inference wall-clock: 26s (0.03s / sample on CPU)

Aggregate metrics — published benchmark vs MAZE deployed

Computed over the flattened (N × 96) array; R² uses a single grand-mean baseline. Marks: ✓ within ±5%, ⚠ within ±15%, ✗ larger.

Metric	Benchmark	MAZE deployed	Δ	Mark
R2	0.5988	0.5977	-0.0011	✓
RMSE	2.3184	2.3219	+0.0035	✓
MAE	1.4517	1.4542	+0.0025	✓
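
The aggregate computation described above can be sketched as follows. This is a minimal illustration, not the validation runner's actual code: the function names `aggregate_metrics` and `mark` are hypothetical, and treating the ±5% / ±15% thresholds as relative to the benchmark value is an assumption, since the report does not say whether they are relative or absolute.

```python
import numpy as np


def aggregate_metrics(y_true, y_pred):
    """R2, RMSE and MAE over the flattened (N x 96) arrays."""
    t = np.asarray(y_true, dtype=float).ravel()
    p = np.asarray(y_pred, dtype=float).ravel()
    err = p - t
    ss_res = np.sum(err ** 2)
    # Single grand-mean baseline: one mean over every sample and every step.
    ss_tot = np.sum((t - t.mean()) ** 2)
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "MAE": float(np.mean(np.abs(err))),
    }


def mark(benchmark, deployed):
    """Assign a mark. ASSUMPTION: thresholds are relative to the benchmark."""
    rel = abs(deployed - benchmark) / abs(benchmark)
    if rel <= 0.05:
        return "✓"
    if rel <= 0.15:
        return "⚠"
    return "✗"
```

Under that assumption, `mark(0.5988, 0.5977)` gives ✓, which agrees with the R2 row of the table.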

The deployed aggregate R² must stay within ±0.05 of the benchmark. This sanity bound is enforced by tests/test_validation.py::test_aggregate_r2_within_benchmark_bound, which runs only when MAZE_RUN_FULL_VALIDATION=1 is set.
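
For illustration, the bound check and its environment-variable switch might look roughly like the sketch below. The body of the real test is not shown in this report, so the helper names `full_validation_enabled` and `r2_within_bound` are hypothetical; only the 0.05 bound and the MAZE_RUN_FULL_VALIDATION variable come from the text above.

```python
import os

R2_BOUND = 0.05  # absolute bound on aggregate R2 quoted in this report


def full_validation_enabled(env=None):
    """True when MAZE_RUN_FULL_VALIDATION=1 is set (hypothetical helper)."""
    env = os.environ if env is None else env
    return env.get("MAZE_RUN_FULL_VALIDATION") == "1"


def r2_within_bound(benchmark_r2, deployed_r2, bound=R2_BOUND):
    """True when the deployed R2 lies within +/- bound of the benchmark."""
    return abs(deployed_r2 - benchmark_r2) <= bound
```

With the values reported here, `r2_within_bound(0.5988, 0.5977)` is true, since the absolute difference is 0.0011.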

Per-horizon metrics — supplementary

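Per-horizon R² is computed against the variance of y_true[:, h] at each step, not against the flattened grand mean, so these values use a different baseline from the aggregate R² above. Values are reported at the standard reporting horizons.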

h	R² (bench)	R² (MAZE)	Δ R²	RMSE (bench)	RMSE (MAZE)	MAE (bench)	MAE (MAZE)
1	0.8787	0.8805	+0.002	1.2748	1.2773	0.7878	0.7933
6	0.6967	0.6981	+0.001	2.0158	2.0082	1.2454	1.2586
12	0.7025	0.6954	-0.007	1.9966	2.0515	1.2345	1.2513
24	0.6831	0.6799	-0.003	2.0604	2.0598	1.2463	1.2353
48	0.5998	0.5945	-0.005	2.3156	2.3683	1.4540	1.4675
72	0.5421	0.5637	+0.022	2.4770	2.3919	1.5722	1.5446
96	0.4814	0.4738	-0.008	2.6362	2.6586	1.6968	1.7047
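
A minimal sketch of the per-horizon R² described above, assuming `y_true` and `y_pred` are arrays of shape (N, 96) with any trailing feature axis already squeezed. The function name `per_horizon_r2` is hypothetical, and mapping the 1-based horizon h to column h - 1 is an assumption about the indexing convention.

```python
import numpy as np

HORIZONS = (1, 6, 12, 24, 48, 72, 96)  # horizons listed in the table above


def per_horizon_r2(y_true, y_pred, horizons=HORIZONS):
    """R2 per forecast step, each with its own per-step mean baseline."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    out = {}
    for h in horizons:
        # ASSUMPTION: horizon h is 1-based, so it lives in column h - 1.
        t = y_true[:, h - 1]
        p = y_pred[:, h - 1]
        ss_res = np.sum((t - p) ** 2)
        ss_tot = np.sum((t - t.mean()) ** 2)  # baseline is this step's own mean
        out[h] = float(1.0 - ss_res / ss_tot)
    return out
```

Because each step is scored against its own mean, the per-horizon values do not need to average to the aggregate R².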

Validation methodology

Test data: raw NEE_VUT_REF from MAZE's held-out FLUXNET2015 site CSV (6.UK-AMo.csv), gap-filled with ffill/bfill and sliced into sliding windows of 336 input steps plus 96 forecast steps. The data location is configurable via the MAZE_BENCHMARK_DATA_DIR env var.
Inference: single deterministic forward pass per sample (model.eval(), dropout layers OFF). Forward signature: model(x, itr=0, trend=x, season=x, noise=x), output sliced [:, -96:, :].
Deviation from deployed /predict: the deployed path uses MC-dropout with ensemble_size=5, which adds forward-pass noise. The validation runner instead uses deterministic ensemble_size=1 mode for a like-for-like comparison against the published benchmark. The R² difference between the two modes is expected to be small (~0.02; see Gal & Ghahramani, 2016, for MC-dropout) but is not measured in this report.
Subsampling: stratified-uniform indices via np.linspace(0, N-1, K, dtype=int) to preserve seasonal coverage. Use --full to evaluate every eligible sequence.
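
The test-data preparation and subsampling steps above can be sketched as follows. This is an illustration under stated assumptions, not the runner's code: the helper names are hypothetical, the gap fill is a NumPy stand-in for the ffill/bfill step, and a window stride of 1 is assumed because the report does not state the stride.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

INPUT_LEN, HORIZON = 336, 96


def ffill_bfill(values):
    """Forward-fill NaN gaps, then back-fill any leading NaNs."""
    v = np.asarray(values, dtype=float)
    valid = ~np.isnan(v)
    if not valid.any():
        raise ValueError("series contains no valid values")
    # Forward fill: index of the most recent valid sample at each position.
    idx = np.maximum.accumulate(np.where(valid, np.arange(v.size), 0))
    filled = v[idx]
    # Back fill: positions before the first valid sample take its value.
    first = int(np.argmax(valid))
    filled[:first] = filled[first]
    return filled


def make_windows(series):
    """Split a 1-D series into (336-step input, 96-step target) pairs."""
    filled = ffill_bfill(series)
    # ASSUMPTION: stride of 1 between consecutive windows.
    windows = sliding_window_view(filled, INPUT_LEN + HORIZON)
    return windows[:, :INPUT_LEN], windows[:, INPUT_LEN:]


def subsample_indices(n, k):
    """Evenly spaced indices over [0, n - 1], as in the np.linspace call above."""
    if k >= n:
        return np.arange(n)
    return np.linspace(0, n - 1, k, dtype=int)
```

For example, `subsample_indices(9055, 1000)` returns 1,000 increasing indices from 0 to 9054, so the evaluated windows span the whole record rather than a single season.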

Reproduce

PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint universal_tempo --output validation/results/

# or, for every eligible sequence (Forest-TEMPO on SE-Htm honours the aligned-1503 subset):
PYTHONPATH=. python scripts/run_validation.py --site uk-amo --checkpoint universal_tempo --full --output validation/results/