Evaluation Metrics#
The Common Task Framework (CTF) for Science employs a standardized suite of 12 evaluation metrics (E1–E12) to assess model performance across various data regimes and prediction tasks. Scores are normalized so that:

- 100 indicates a perfect match to the ground truth,
- 0 corresponds to predicting all zeros, and
- negative values indicate performance worse than the zero baseline.

Note: scores are not guaranteed to stay within the [-100, 100] range for every dataset; the Lorenz dataset, for example, can produce large negative scores.
This document outlines each metric’s purpose, method of evaluation, and its corresponding dataset pair.
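As a rough illustration of the scoring convention above (a sketch only, not the framework's actual formula; the real computations live in eval_module.py and vary per metric), an MSE-based skill score relative to the zero baseline behaves this way:

```python
import numpy as np

def normalized_score(pred, truth):
    """Hypothetical skill score on the CTF-style scale (illustrative only).

    100  -> prediction matches the truth exactly
    0    -> prediction no better than the all-zeros baseline
    < 0  -> prediction worse than the all-zeros baseline
    """
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    mse_pred = np.mean((pred - truth) ** 2)  # model error
    mse_zero = np.mean(truth ** 2)           # error of predicting all zeros
    return 100.0 * (1.0 - mse_pred / mse_zero)
```

Unbounded negative scores (as seen on Lorenz) arise naturally here: a forecast whose error greatly exceeds the zero baseline drives the ratio, and thus the score, arbitrarily far below zero.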
Summary Table#
| Metric | Name | Task Scenario | Input Noise | Data Regime | Forecast Type | Evaluation Type | Dataset Pair ID |
|---|---|---|---|---|---|---|---|
| E1 | Short-Time Forecast Accuracy | Baseline | None | Full | Short-term | Short-time | 1 |
| E2 | Long-Time Forecast Accuracy | Baseline | None | Full | Long-term | Long-time | 1 |
| E3 | Reconstruction (Medium Noise) | Medium-noise denoising | Medium | Full | N/A | Short-time | 2 |
| E4 | Short-Time Forecast (Medium Noise) | Forecast from noise | Medium | Full | Short-term | Long-time | 3 |
| E5 | Reconstruction (High Noise) | High-noise denoising | High | Full | N/A | Short-time | 4 |
| E6 | Short-Time Forecast (High Noise) | Forecast from noise | High | Full | Short-term | Long-time | 5 |
| E7 | Short-Time Forecast (Low Data, Clean) | Few-shot clean | None | Sparse | Short-term | Short-time | 6 |
| E8 | Long-Time Forecast (Low Data, Clean) | Few-shot clean | None | Sparse | Long-term | Long-time | 6 |
| E9 | Short-Time Forecast (Low Data, Noisy) | Few-shot noisy | Medium/High | Sparse | Short-term | Short-time | 7 |
| E10 | Long-Time Forecast (Low Data, Noisy) | Few-shot noisy | Medium/High | Sparse | Long-term | Long-time | 7 |
| E11 | Parametric Gen. (Interpolation) | Interpolation across λ | None | Full | Short-term | Short-time | 8 |
| E12 | Parametric Gen. (Extrapolation) | Extrapolation beyond λ | None | Full | Short-term | Short-time | 9 |
Metric Descriptions#
E1 – Short-Time Forecast Accuracy#
Pair: ID 1
Measures: Accuracy over initial prediction steps.
How: Computes the root-mean-square error over the first k time steps between forecast and truth.
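A minimal sketch of this computation, assuming time runs along the first array axis and `k` is a dataset-dependent horizon (the exact conventions are defined in eval_module.py):

```python
import numpy as np

def short_time_rmse(forecast, truth, k):
    """Root-mean-square error over the first k time steps (sketch).

    Assumes axis 0 of both arrays is time; later steps are ignored.
    """
    forecast, truth = np.asarray(forecast, float), np.asarray(truth, float)
    return float(np.sqrt(np.mean((forecast[:k] - truth[:k]) ** 2)))
```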
E2 – Long-Time Forecast Accuracy#
Pair: ID 1
Measures: Fidelity of long-term behavior via statistics.
How: L2 distance between log power spectra (log-PSD) of forecast and truth over dominant modes.
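One possible sketch for a 1-D series, treating `n_modes` as a hypothetical stand-in for the "dominant modes" truncation (the actual mode selection is defined in eval_module.py):

```python
import numpy as np

def log_psd_distance(forecast, truth, n_modes=None):
    """L2 distance between log power spectra of two 1-D series (sketch)."""
    eps = 1e-12  # guard against log(0) for empty frequency bins
    psd_f = np.abs(np.fft.rfft(forecast)) ** 2
    psd_t = np.abs(np.fft.rfft(truth)) ** 2
    log_f, log_t = np.log(psd_f + eps), np.log(psd_t + eps)
    if n_modes is not None:  # keep only the leading frequency bins
        log_f, log_t = log_f[:n_modes], log_t[:n_modes]
    return float(np.linalg.norm(log_f - log_t))
```

Comparing spectra rather than trajectories is what makes this a long-time metric: for chaotic systems, pointwise errors grow quickly, but the statistical character of the dynamics can still be matched.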
E3 – Reconstruction (Medium Noise)#
Pair: ID 2
Measures: Ability to reconstruct clean signals from moderately noisy data.
How: L2 error between denoised output and noise-free reference.
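A minimal sketch of the L2 error (for multi-dimensional arrays, `np.linalg.norm` reduces over all entries, i.e. the Frobenius norm):

```python
import numpy as np

def reconstruction_error(denoised, clean):
    """L2 (Euclidean) error between a denoised output and the clean reference."""
    denoised, clean = np.asarray(denoised, float), np.asarray(clean, float)
    return float(np.linalg.norm(denoised - clean))
```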
E4 – Short-Time Forecast (Medium Noise)#
Pair: ID 3
Measures: Short-term accuracy when initialized from noisy input.
How: Same as E1 but starting from medium-noise initial conditions.
E5 – Reconstruction (High Noise)#
Pair: ID 4
Measures: Denoising capability under high noise conditions.
How: Same as E3, but on data with stronger degradation.
E6 – Short-Time Forecast (High Noise)#
Pair: ID 5
Measures: Forecasting skill with severely noisy initializations.
How: Same as E1, but with high-noise input data.
E7 – Short-Time Forecast (Low Data, Clean)#
Pair: ID 6
Measures: Forecasting accuracy from small clean datasets.
How: Same as E1 with training on just 51 time steps.
E8 – Long-Time Forecast (Low Data, Clean)#
Pair: ID 6
Measures: Long-time accuracy from limited clean data.
How: Same as E2.
E9 – Short-Time Forecast (Low Data, Noisy)#
Pair: ID 7
Measures: Short-term forecasting from limited, noisy data.
How: Same as E1.
E10 – Long-Time Forecast (Low Data, Noisy)#
Pair: ID 7
Measures: Long-range statistical alignment under low data + noise.
How: Same as E2.
E11 – Parametric Generalization (Interpolation)#
Pair: ID 8
Measures: Predictive generalization to interpolated physical parameters.
How: Forecast accuracy in unseen but interpolated parametric regime.
E12 – Parametric Generalization (Extrapolation)#
Pair: ID 9
Measures: Generalization to extrapolated dynamics.
How: Forecast skill in unseen extrapolated physical regimes.
For implementation details of how each metric is computed, see the source in eval_module.py. The evaluation logic automatically selects the appropriate metric per dataset using the metrics list specified in each dataset YAML configuration.