Evaluation Metrics#
The Common Task Framework (CTF) for Science employs a standardized suite of 12 evaluation metrics (E1–E12) to assess model performance across various data regimes and prediction tasks. Scores are normalized so that:

- 100 indicates a perfect match to the ground truth,
- 0 corresponds to predicting all zeros, and
- negative values indicate performance worse than the zero baseline.

Note: scores are not guaranteed to stay within the [-100, 100] range for every dataset; the Lorenz dataset, for example, can produce large negative scores.
This document outlines each metric’s purpose, method of evaluation, and its corresponding dataset pair.
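As a rough illustration of the scoring convention above (a sketch only, not the framework's actual formula; the real computations live in eval_module.py and vary per metric), an MSE-based skill score relative to the zero baseline behaves this way:

```python
import numpy as np

def normalized_score(pred, truth):
    """Hypothetical skill score on the CTF-style scale (illustrative only).

    100  -> prediction matches the truth exactly
    0    -> prediction no better than the all-zeros baseline
    < 0  -> prediction worse than the all-zeros baseline
    """
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    mse_pred = np.mean((pred - truth) ** 2)  # model error
    mse_zero = np.mean(truth ** 2)           # error of predicting all zeros
    return 100.0 * (1.0 - mse_pred / mse_zero)
```

Unbounded negative scores (as seen on Lorenz) arise naturally here: a forecast whose error greatly exceeds the zero baseline drives the ratio, and thus the score, arbitrarily far below zero.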
Summary Table#
| Metric | Name | Task Scenario | Input Noise | Data Regime | Forecast Type | Evaluation Type | Dataset Pair ID |
|---|---|---|---|---|---|---|---|
| E1 | Short-Time Forecast Accuracy | Baseline | None | Full | Short-term | Short-time | 1 |
| E2 | Long-Time Forecast Accuracy | Baseline | None | Full | Long-term | Long-time | 1 |
| E3 | Reconstruction (Medium Noise) | Medium-noise denoising | Medium | Full | N/A | Short-time | 2 |
| E4 | Short-Time Forecast (Medium Noise) | Forecast from noise | Medium | Full | Short-term | Long-time | 3 |
| E5 | Reconstruction (High Noise) | High-noise denoising | High | Full | N/A | Short-time | 4 |
| E6 | Short-Time Forecast (High Noise) | Forecast from noise | High | Full | Short-term | Long-time | 5 |
| E7 | Short-Time Forecast (Low Data, Clean) | Few-shot clean | None | Sparse | Short-term | Short-time | 6 |
| E8 | Long-Time Forecast (Low Data, Clean) | Few-shot clean | None | Sparse | Long-term | Long-time | 6 |
| E9 | Short-Time Forecast (Low Data, Noisy) | Few-shot noisy | Medium/High | Sparse | Short-term | Short-time | 7 |
| E10 | Long-Time Forecast (Low Data, Noisy) | Few-shot noisy | Medium/High | Sparse | Long-term | Long-time | 7 |
| E11 | Parametric Gen. (Interpolation) | Interpolation across λ | None | Full | Short-term | Short-time | 8 |
| E12 | Parametric Gen. (Extrapolation) | Extrapolation beyond λ | None | Full | Short-term | Short-time | 9 |
Metric Descriptions#
E1 – Short-Time Forecast Accuracy#
Pair: ID 1
Measures: Accuracy over initial prediction steps.
How: Computes the root-mean-square error over the first k time steps between forecast and truth.
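A minimal sketch of this computation, assuming time runs along the first array axis and `k` is a dataset-dependent horizon (the exact conventions are defined in eval_module.py):

```python
import numpy as np

def short_time_rmse(forecast, truth, k):
    """Root-mean-square error over the first k time steps (sketch).

    Assumes axis 0 of both arrays is time; later steps are ignored.
    """
    forecast, truth = np.asarray(forecast, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((forecast[:k] - truth[:k]) ** 2)))
```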
E2 – Long-Time Forecast Accuracy#
Pair: ID 1
Measures: Fidelity of long-term behavior via statistics.
How: L2 distance between log power spectra (log-PSD) of forecast and truth over dominant modes.
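One possible sketch for a 1-D series, treating `n_modes` as a hypothetical stand-in for the "dominant modes" truncation (the actual mode selection is defined in eval_module.py):

```python
import numpy as np

def log_psd_distance(forecast, truth, n_modes=None):
    """L2 distance between log power spectra of two 1-D series (sketch)."""
    eps = 1e-12  # guard against log(0) for empty frequency bins
    psd_f = np.abs(np.fft.rfft(forecast)) ** 2
    psd_t = np.abs(np.fft.rfft(truth)) ** 2
    log_f, log_t = np.log(psd_f + eps), np.log(psd_t + eps)
    if n_modes is not None:  # keep only the leading frequency bins
        log_f, log_t = log_f[:n_modes], log_t[:n_modes]
    return float(np.linalg.norm(log_f - log_t))
```

Comparing spectra rather than trajectories is what makes this a long-time metric: for chaotic systems, pointwise errors grow quickly, but the statistical character of the dynamics can still be matched.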
E3 – Reconstruction (Medium Noise)#
Pair: ID 2
Measures: Ability to reconstruct clean signals from moderately noisy data.
How: L2 error between denoised output and noise-free reference.
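A minimal sketch of the L2 error (for multi-dimensional arrays, `np.linalg.norm` reduces over all entries, i.e. the Frobenius norm):

```python
import numpy as np

def reconstruction_error(denoised, clean):
    """L2 (Euclidean) error between a denoised output and the clean reference."""
    denoised, clean = np.asarray(denoised, float), np.asarray(clean, float)
    return float(np.linalg.norm(denoised - clean))
```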
E4 – Short-Time Forecast (Medium Noise)#
Pair: ID 3
Measures: Short-term accuracy when initialized from noisy input.
How: Same as E1 but starting from medium-noise initial conditions.
E5 – Reconstruction (High Noise)#
Pair: ID 4
Measures: Denoising capability under high noise conditions.
How: Same as E3, but on data with stronger degradation.
E6 – Short-Time Forecast (High Noise)#
Pair: ID 5
Measures: Forecasting skill with severely noisy initializations.
How: Same as E1, but with high-noise input data.
E7 – Short-Time Forecast (Low Data, Clean)#
Pair: ID 6
Measures: Forecasting accuracy from small clean datasets.
How: Same as E1 with training on just 51 time steps.
E8 – Long-Time Forecast (Low Data, Clean)#
Pair: ID 6
Measures: Long-time accuracy from limited clean data.
How: Same as E2.
E9 – Short-Time Forecast (Low Data, Noisy)#
Pair: ID 7
Measures: Short-term forecasting from limited, noisy data.
How: Same as E1.
E10 – Long-Time Forecast (Low Data, Noisy)#
Pair: ID 7
Measures: Long-range statistical alignment under low data + noise.
How: Same as E2.
E11 – Parametric Generalization (Interpolation)#
Pair: ID 8
Measures: Predictive generalization to interpolated physical parameters.
How: Forecast accuracy in unseen but interpolated parametric regime.
E12 – Parametric Generalization (Extrapolation)#
Pair: ID 9
Measures: Generalization to extrapolated dynamics.
How: Forecast skill in unseen extrapolated physical regimes.
For implementation details of how each metric is computed, see the source in eval_module.py. The evaluation logic automatically selects the appropriate metric per dataset using the metrics list specified in each dataset YAML configuration.