Datasets#
This document summarizes the core datasets used in the CTF for Science framework. Each dataset comprises a collection of train/test pairs, each associated with one or more evaluation metrics (E1–E12). For a detailed explanation of these metrics, see Evaluation Metrics. For visual outputs associated with each dataset, see Visualization Module.
Dataset Summary Table#
| Name | Type | Delta t | Spatial Dim | Long-Time Eval | Visualizations |
|---|---|---|---|---|---|
| ODE_Lorenz | Dynamical | 0.05 | 3 | histogram_L2_error | trajectories, histograms |
| PDE_KS | Spatio-temporal | 0.025 | 1024 | spectral_L2_error | psd, 2d_comparison |
| Lorenz_Official | Dynamical | 0.05 | 3 | histogram_L2_error | trajectories, histograms |
| KS_Official | Spatio-temporal | 0.025 | 1024 | spectral_L2_error | psd, 2d_comparison |
| sst | Spatio-temporal | 1.0 | 90601 | spectral_L2_error | psd, 2d_comparison |
| seismo | Spatio-temporal | 1.0 | 2048 | spectral_L2_error | psd, 2d_comparison |
| ocean_das | Spatio-temporal | 1.0 | 3000 | spectral_L2_error | psd, 2d_comparison |
| crustal_3d | Spatio-temporal | 1.0 | 62451, 26508 | spectral_L2_error | psd, 2d_comparison |
ODE_Lorenz#
A 3D dynamical system based on the Lorenz attractor. This dataset tests forecasting and reconstruction capabilities across varied noise levels and training regimes.
Time step: 0.05
Spatial dimension: 3
Evaluation: histogram L2 error for long-time metrics
Relevant metrics:
ID 1: E1 (short_time), E2 (long_time)
ID 2: E3 (reconstruction)
ID 3: E4 (long_time)
ID 4: E5 (reconstruction)
ID 5: E6 (long_time)
ID 6: E7, E8 (short_time, long_time)
ID 7: E9, E10 (short_time, long_time)
ID 8: E11 (short_time)
ID 9: E12 (short_time)
Visualizations: Trajectories, Histograms
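The pair-to-metric list above can be written down as a small lookup table. The sketch below is only a transcription of that list into Python; the names `PAIR_METRICS`, `METRIC_TYPE`, and `pairs_with_type` are hypothetical and are not part of the framework. It also assumes that where a pair lists two metrics and two types (e.g., "E7, E8 (short_time, long_time)"), the types apply in order.

```python
# Pair ID -> evaluation metrics, transcribed from the list above.
# Hypothetical names; this is not the framework's own data structure.
PAIR_METRICS = {
    1: ["E1", "E2"],
    2: ["E3"],
    3: ["E4"],
    4: ["E5"],
    5: ["E6"],
    6: ["E7", "E8"],
    7: ["E9", "E10"],
    8: ["E11"],
    9: ["E12"],
}

# Metric -> metric type. Assumption: for pairs with two metrics,
# the listed types apply in order (E7 short_time, E8 long_time, ...).
METRIC_TYPE = {
    "E1": "short_time", "E2": "long_time",
    "E3": "reconstruction",
    "E4": "long_time",
    "E5": "reconstruction",
    "E6": "long_time",
    "E7": "short_time", "E8": "long_time",
    "E9": "short_time", "E10": "long_time",
    "E11": "short_time",
    "E12": "short_time",
}


def pairs_with_type(metric_type):
    """Return the pair IDs that carry at least one metric of the given type."""
    return sorted(
        pair_id
        for pair_id, metrics in PAIR_METRICS.items()
        if any(METRIC_TYPE[m] == metric_type for m in metrics)
    )


long_time_pairs = pairs_with_type("long_time")
reconstruction_pairs = pairs_with_type("reconstruction")
```

Under these assumptions, `long_time_pairs` collects the pairs scored with the histogram L2 error, and `reconstruction_pairs` collects the two reconstruction pairs.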
PDE_KS#
A spatio-temporal dataset based on the Kuramoto-Sivashinsky (KS) partial differential equation. It challenges models to learn dynamics over space and time using dense 1024-dimensional spatial grids.
Time step: 0.025
Spatial dimension: 1024
Evaluation: spectral L2 error for long-term behavior (e.g., E2, E8, E10)
Relevant metrics:
ID 1: E1 (short_time), E2 (long_time)
ID 2: E3 (reconstruction)
ID 3: E4 (long_time)
ID 4: E5 (reconstruction)
ID 5: E6 (long_time)
ID 6: E7, E8 (short_time, long_time)
ID 7: E9, E10 (short_time, long_time)
ID 8: E11 (short_time)
ID 9: E12 (short_time)
Visualizations: Power Spectral Density (PSD), 2D comparison
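To make the idea of a spectral L2 error concrete, here is a minimal sketch. It is not the framework's implementation; the authoritative definition is in Evaluation Metrics. The sketch assumes the error is the relative L2 distance between time-averaged spatial power spectra, and the function names are hypothetical. It uses a naive DFT from the standard library so that it has no dependencies; an FFT routine would be the practical choice for 1024-point grids.

```python
import cmath
import math


def power_spectrum(u):
    """Power |U_k|^2 / n of one spatial snapshot, via a naive O(n^2) DFT."""
    n = len(u)
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative wavenumbers (real-valued input)
        coeff = sum(u[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def mean_power_spectrum(snapshots):
    """Average the spatial power spectrum over all time steps."""
    spectra = [power_spectrum(u) for u in snapshots]
    return [sum(column) / len(spectra) for column in zip(*spectra)]


def spectral_l2_error(truth, prediction):
    """Relative L2 distance between time-averaged power spectra (assumed form)."""
    p_true = mean_power_spectrum(truth)
    p_pred = mean_power_spectrum(prediction)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(p_true, p_pred)))
    den = math.sqrt(sum(a ** 2 for a in p_true))
    return num / den


# Toy data: 4 snapshots on a 16-point periodic grid (not real KS data).
n = 16
truth = [
    [math.sin(2 * math.pi * 2 * j / n + 0.3 * t) + 0.5 * math.cos(2 * math.pi * 5 * j / n)
     for j in range(n)]
    for t in range(4)
]
# A spatially shifted copy has the same power spectrum as the original.
shifted = [[u[(j + 3) % n] for j in range(n)] for u in truth]
# Doubling the amplitude quadruples the power at every wavenumber.
doubled = [[2.0 * x for x in u] for u in truth]

err_same = spectral_l2_error(truth, truth)
err_shifted = spectral_l2_error(truth, shifted)
err_doubled = spectral_l2_error(truth, doubled)
```

The shifted copy illustrates why a spectral measure suits long-time evaluation of chaotic dynamics: a prediction can drift out of phase with the truth pointwise while still reproducing the correct distribution of energy across wavenumbers.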
Lorenz_Official#
The official Lorenz dataset with longer sequences and standardized splits for benchmarking. The test data is not included; predictions must be submitted for scoring on the test set.
Time step: 0.05
Spatial dimension: 3
Evaluation: histogram L2 error for long-time metrics
Relevant metrics:
Pair IDs 1–9 map to metrics E1–E12 exactly as listed for ODE_Lorenz
Visualizations: Trajectories, Histograms
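The histogram L2 error used for long-time evaluation can be sketched as follows. This is an illustration under stated assumptions, not the framework's code; see Evaluation Metrics for the actual definition. The sketch assumes the comparison is made on one state variable at a time, with equal-width bins over the combined range of both series and a relative L2 distance between normalized histograms. The bin count and function names are hypothetical.

```python
import math


def normalized_histogram(values, bins, lo, hi):
    """Fraction of values falling in each of `bins` equal-width bins on [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # put v == hi in the last bin
        counts[idx] += 1
    return [c / len(values) for c in counts]


def histogram_l2_error(truth, prediction, bins=50):
    """Relative L2 distance between histograms of two scalar series (assumed form)."""
    lo = min(min(truth), min(prediction))
    hi = max(max(truth), max(prediction))
    if hi == lo:  # degenerate case: every value is identical
        hi = lo + 1.0
    h_true = normalized_histogram(truth, bins, lo, hi)
    h_pred = normalized_histogram(prediction, bins, lo, hi)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_true, h_pred)))
    den = math.sqrt(sum(a ** 2 for a in h_true))
    return num / den


# Toy series (not real Lorenz data).
series = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0]
reordered = list(reversed(series))

err_reordered = histogram_l2_error(series, reordered, bins=4)
err_disjoint = histogram_l2_error([0.0] * 10, [1.0] * 10, bins=2)
```

Because a histogram ignores the order of samples, the reordered series scores a zero error. That is the intended behavior for a chaotic system, where long forecasts cannot track the true trajectory pointwise but should still visit the attractor with the right statistics.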
KS_Official#
The official Kuramoto-Sivashinsky dataset designed for rigorous testing of spatio-temporal forecasting and generalization. The test data is not included; predictions must be submitted for scoring on the test set.
Time step: 0.025
Spatial dimension: 1024
Evaluation: spectral L2 error for long-term behavior
Relevant metrics:
Pair IDs 1–9 map to metrics E1–E12 exactly as listed for PDE_KS
Visualizations: PSD, 2D comparison
seismo#
A spatio-temporal dataset of synthetic seismic waveforms generated using the Instaseis library. This dataset challenges models to learn complex wave propagation dynamics across 2048 virtual seismometer stations, testing forecasting capabilities for earthquake-induced ground motion patterns.
Time step: 1.0
Spatial dimension: 2048
Evaluation: spectral L2 error for long-term behavior
Data characteristics:
* Velocity seismograms (m/s) in the vertical (Z) component
* Synthetic earthquakes with randomized magnitude, location, etc.
* Normalized for each earthquake event
Relevant metrics:
ID 1: E1 (short_time), E2 (long_time)
ID 2: E3 (reconstruction)
ID 3: E4 (long_time)
ID 4: E5 (reconstruction)
ID 5: E6 (long_time)
ID 6: E7, E8 (short_time, long_time)
ID 7: E9, E10 (short_time, long_time)
ID 8: E11 (short_time)
ID 9: E12 (short_time)
ocean_das#
A spatio-temporal dataset recorded with a novel geophysical sensing technology called Distributed Acoustic Sensing (DAS). The data comes from a shallow offshore DAS about 30 m below sea level, where surface gravity waves are particularly dispersive. The data is sampled at 5 Hz but low-pass filtered to 1 Hz.
Time step: 1.0
Spatial dimension: 3000
Evaluation: spectral L2 error for long-term behavior
Data characteristics:
* Real-world sensor measurements of acoustic-frequency strain signals
Relevant metrics:
ID 1: E1 (short_time), E2 (long_time)
ID 2: E3 (reconstruction)
ID 3: E4 (long_time)
ID 4: E5 (reconstruction)
ID 5: E6 (long_time)
ID 6: E7, E8 (short_time, long_time)
ID 7: E9, E10 (short_time, long_time)
ID 8: E11 (short_time)
ID 9: E12 (short_time)
crustal_3d#
A spatio-temporal dataset of synthetic 3D seismic wavefields in a heterogeneous 3D crustal model. Each simulation yields three-component velocity seismograms on a \(32\times32\times32\) heterogeneous grid. Virtual sensors form a \(94\times94\) grid arranged on top of the model volume with 100 m spacing. The seismograms are sampled for 6 seconds at 50 Hz.
Time step: 1.0
Spatial dimension: 62451, 26508
Evaluation: spectral L2 error for long-term behavior
Data characteristics:
* For tasks \(E_{1}\) to \(E_{10}\), the velocity seismograms, virtual sensors, and point sources are provided, yielding 62451 data points per time step. For tasks \(E_{11}\) and \(E_{12}\), only the velocity seismograms are provided, yielding 26508 data points per time step.
Relevant metrics:
ID 1: E1 (short_time), E2 (long_time)
ID 2: E3 (reconstruction)
ID 3: E4 (long_time)
ID 4: E5 (reconstruction)
ID 5: E6 (long_time)
ID 6: E7, E8 (short_time, long_time)
ID 7: E9, E10 (short_time, long_time)
ID 8: E11 (short_time)
ID 9: E12 (short_time)
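The two spatial dimensions quoted for crustal_3d can be checked against the sensor layout described above. The arithmetic below is a consistency check, not a statement of how the files are laid out: it assumes the 26508 seismogram values per time step are the three velocity components at each of the \(94\times94\) virtual sensors. The text does not say how the remaining values are divided between sensor and source information, so that split is left unspecified.

```python
# Consistency check on the crustal_3d spatial dimensions.
# Assumption: seismogram values = 3 components x (94 x 94) virtual sensors.
SENSOR_GRID = (94, 94)
COMPONENTS = 3

sensors = SENSOR_GRID[0] * SENSOR_GRID[1]   # 8836 virtual sensors
seismogram_points = COMPONENTS * sensors    # values per time step, E11 and E12

FULL_DIM = 62451    # stated dimension for tasks E1 to E10
SEISMO_DIM = 26508  # stated dimension for tasks E11 and E12

# Values per time step beyond the seismograms in tasks E1 to E10
# (virtual sensor and point source information; breakdown not given).
extra_points = FULL_DIM - SEISMO_DIM

# Recording length implied by 6 seconds at 50 Hz.
samples_per_record = 6 * 50

print(sensors, seismogram_points, extra_points, samples_per_record)
```

The product of three components and 8836 sensors matches the stated 26508 exactly, which supports reading that figure as the flattened three-component sensor grid.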
SST#
This dataset contains Global Sea Surface Temperature (SST) data from NASA’s Group for High Resolution Sea Surface Temperature (GHRSST) product. SST data exhibits complex, multiscale features of turbulent flows with intermittent events and quasi-periodic behavior, making it a challenging benchmark for forecasting, reconstruction, and prediction tasks. Unlike the synthetic KS and Lorenz datasets, this represents real-world geophysical observations, providing a critical testbed for evaluating data-driven methods on actual scientific data.
Time step: 1.0
Spatial dimension: 90601
Evaluation: spectral L2 error for long-term behavior
Data characteristics:
* Real-world physical data
Relevant metrics:
ID 1: E1 (short_time), E2 (long_time)
ID 2: E3 (reconstruction)
ID 3: E4 (long_time)
ID 4: E5 (reconstruction)
ID 5: E6 (long_time)
ID 6: E7, E8 (short_time, long_time)
ID 7: E9, E10 (short_time, long_time)
ID 8: E11 (short_time)
ID 9: E12 (short_time)
Each dataset configuration file (e.g., ODE_Lorenz.yaml) includes:
The full list of train/test matrix files.
Pair ID mappings to metrics.
Matrix shapes and time offsets.
To inspect these settings programmatically, see ctf4science/data_module.py. For guidance on configuration format, see Configuration File Overview.