Getting Started#
This guide provides detailed instructions to help new users install, use, and contribute to the CTF for Science Framework.
Installation#
To set up the framework on your system:
Clone the Repository:
Using SSH (recommended):
git clone --recursive git@github.com:CTF-for-Science/ctf4science.git
Using HTTPS (requires GitHub authentication):
git clone --recursive https://github.com/CTF-for-Science/ctf4science.git
Install the Framework: Install the framework as a Python package in editable mode:
pip install -e .
This installs core dependencies:
numpy, scipy, pandas, pyyaml, seaborn, matplotlib, importlib-resources
To install all optional dependencies as well, run
pip install -e .[all]
instead. This also installs the optional dependencies: plotly, optuna, jupyterlab, ray[data,train,tune,serve]
Note
zsh users should quote the argument, because zsh treats unquoted square brackets as a glob pattern:
pip install -e '.[all]'
Note
Some models may require additional dependencies, specified in their own requirements.txt files.
Quick Start#
To test the framework with a baseline model:
Navigate to the naive baselines directory:
cd models/CTF_NaiveBaselines
Run the ‘average’ baseline on the Lorenz system:
python run.py config/config_Lorenz_average_batch_1-6.yaml
This will:
Load the Lorenz dataset for sub-datasets 1 through 6.
Generate predictions using the ‘average’ method for each sub-dataset.
Save results, including visualizations, to
results/ODE_Lorenz/CTF_NaiveBaselines_average/<batch_id>/.
Note
Check the results/ directory for outputs like predictions.npy, evaluation_results.yaml, and visualization plots (e.g., trajectories.png, histograms.png) for each sub-dataset.
Understanding the Core Modules#
The framework relies on three key modules:
data_module.py:
Purpose: Handles dataset loading and sub-dataset selection.
Key Functions:
load_dataset(dataset_name, pair_id, transpose=False): Loads train and test data for a specific sub-dataset.
parse_pair_ids(dataset_config): Interprets the pair_id configuration to determine which sub-datasets to process. It supports multiple formats (see “Configuring Your Run” below).
eval_module.py:
Purpose: Computes evaluation metrics for model predictions.
Key Functions:
evaluate(dataset_name, pair_id, prediction, metrics=None): Calculates metrics like short-time forecast, reconstruction, and long-time forecast.
save_results(...): Saves config, predictions, and metrics to the results/ directory for each sub-dataset.
visualization_module.py:
Purpose: Generates plots to visualize predictions and metrics.
Key Features: Auto-generates plots (e.g., trajectories, histograms, PSD) during runs, saved in results/**/visualizations/.
Configuring Your Run#
Configuration files (e.g., config_Lorenz_average_batch_1-6.yaml) control dataset and model parameters. The dataset section is required in every config file and specifies the dataset name and the sub-datasets to run on.
Dataset Configuration Options#
dataset (Required):
name: The name of the dataset (e.g., ODE_Lorenz, PDE_KS).
pair_id: Specifies which sub-datasets to run on. This field is optional within the dataset section, but if omitted, the framework will run on all available sub-datasets. Supported formats include:
Single integer: Run on a specific sub-dataset.
pair_id: 3 # Run on sub-dataset 3 only
List of integers: Run on multiple specific sub-datasets.
pair_id: [1, 2, 3, 4, 5, 6] # Run on sub-datasets 1 through 6
Range string: Run on a range of sub-datasets.
pair_id: '1-3' # Run on sub-datasets 1, 2, and 3
Omitted or all: Run on all available sub-datasets for the dataset.
dataset:
  name: PDE_KS # pair_id omitted or set to 'all' to run on all sub-datasets
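The formats listed above can be normalized into a plain list of integers. The sketch below is only an illustration of that idea: normalize_pair_ids and its available argument are hypothetical names, and this is not the framework's parse_pair_ids, whose internals may differ.

```python
from typing import Iterable, List, Optional, Union

PairIdSetting = Optional[Union[int, str, Iterable[int]]]


def normalize_pair_ids(pair_id: PairIdSetting, available: List[int]) -> List[int]:
    """Hypothetical helper: expand a pair_id setting into a list of integers.

    `available` is an assumed list of all sub-dataset ids for the dataset.
    """
    # Omitted (None) or the string 'all': use every available sub-dataset.
    if pair_id is None or pair_id == "all":
        return list(available)
    # Single integer: one sub-dataset.
    if isinstance(pair_id, int):
        return [pair_id]
    # Range string such as '1-3' (inclusive on both ends).
    if isinstance(pair_id, str):
        if "-" in pair_id:
            start, end = (int(part) for part in pair_id.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {pair_id!r}")
            return list(range(start, end + 1))
        return [int(pair_id)]
    # Otherwise assume an iterable of integers, e.g. [1, 2, 3].
    return [int(p) for p in pair_id]
```

For example, normalize_pair_ids('1-3', available) expands the range string into three ids, while normalize_pair_ids(None, available) returns a copy of available.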
Example configuration:
dataset:
  name: ODE_Lorenz
  pair_id: '1-6' # Run on sub-datasets 1 through 6
model:
  name: CTF_NaiveBaselines
  method: average
See Configuration File Overview for full details (to be created).
Contributing a New Model#
To integrate your own model into the framework, follow these steps:
Step 1: Create a Model Directory#
Create a new directory under models/ (e.g., models/MyModel):
mkdir models/MyModel
Step 2: Implement Your Model#
In models/MyModel/, create a Python file (e.g., my_model.py) with your model’s logic. It should:
Accept a config dictionary and optional train_data during initialization.
Provide a predict method for generating predictions.
Example:
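class MyModel:
    def __init__(self, config, train_data=None):
        self.config = config
        self.train_data = train_data
        # Add initialization logic

    def predict(self, test_data):
        # Add prediction logic that builds `predictions` from test_data
        predictions = None  # placeholder until the logic is implemented
        return predictions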
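As a concrete illustration of this interface, here is a minimal sketch of a model that always predicts the per-column mean of its training data. ColumnMeanModel is a hypothetical name, and the data layout (a list of time steps, each a list of floats) is an assumption made so the example runs without dependencies. The framework's real datasets are likely arrays with their own shape conventions.

```python
from statistics import fmean


class ColumnMeanModel:
    """Hypothetical example model: predicts the training mean of each column."""

    def __init__(self, config, train_data=None):
        self.config = config
        # Assumed layout: train_data is a list of rows (time steps),
        # each row a list of floats (one value per state variable).
        self.train_data = train_data or []
        columns = list(zip(*self.train_data))
        self.means = [fmean(column) for column in columns]

    def predict(self, test_data):
        # Return one row of column means per row of test_data.
        return [list(self.means) for _ in test_data]
```

Because it exposes the same constructor and predict signature as the template above, a class like this can be dropped into the same places while you develop the real prediction logic.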
Step 3: Create a run.py File#
Add a run.py file in models/MyModel/ to handle batch runs across multiple sub-datasets. The framework uses a batch run approach, processing all specified sub-datasets and saving results under a unique batch identifier. Below is an example:
import argparse
import datetime
from pathlib import Path

import yaml

from ctf4science.data_module import load_dataset, parse_pair_ids, get_applicable_plots
from ctf4science.eval_module import evaluate, save_results
from ctf4science.visualization_module import Visualization
from my_model import MyModel


def main(config_path):
    # Load configuration
    with open(config_path, 'r') as f:
        config = yaml.safe_load(f)

    # Extract dataset name and parse pair_ids
    dataset_name = config['dataset']['name']
    pair_ids = parse_pair_ids(config['dataset'])
    model_name = "MyModel"

    # Generate a unique batch_id for this run
    batch_id = f"batch_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"

    # Initialize batch results dictionary
    batch_results = {
        'batch_id': batch_id,
        'model': model_name,
        'dataset': dataset_name,
        'pairs': []
    }

    # Initialize visualization object
    viz = Visualization()
    applicable_plots = get_applicable_plots(dataset_name)

    # Process each sub-dataset
    for pair_id in pair_ids:
        # Load sub-dataset
        train_data, test_data, init_data = load_dataset(dataset_name, pair_id)

        # Initialize model
        model = MyModel(config, train_data)

        # Generate predictions (predict takes test_data, as defined in Step 2)
        predictions = model.predict(test_data)

        # Evaluate predictions, following the documented signature
        # evaluate(dataset_name, pair_id, prediction, metrics=None)
        results = evaluate(dataset_name, pair_id, predictions)

        # Save results and get directory
        results_directory = save_results(dataset_name, model_name, batch_id, pair_id, config, predictions, results)

        # Append metrics to batch results
        batch_results['pairs'].append({
            'pair_id': pair_id,
            'metrics': results
        })

        # Generate and save visualizations
        for plot_type in applicable_plots:
            fig = viz.plot_from_batch(dataset_name, pair_id, results_directory, plot_type=plot_type)
            viz.save_figure_results(fig, dataset_name, model_name, batch_id, pair_id, plot_type)

    # Save aggregated batch results next to the per-pair result directories
    with open(results_directory.parent / 'batch_results.yaml', 'w') as f:
        yaml.dump(batch_results, f)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('config', type=str, help="Path to the configuration file")
    args = parser.parse_args()
    main(args.config)
Step 4: Add Configuration Files#
In models/MyModel/config/, create a YAML file (e.g., my_config.yaml):
dataset:
  name: ODE_Lorenz
  pair_id: '1-3' # Example: run on sub-datasets 1 to 3
model:
  param1: value1
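Since the dataset section is required in every config file, it can help to check a loaded configuration before starting a run. The sketch below assumes the YAML has already been parsed into a dictionary. check_config is a hypothetical helper, not part of the framework.

```python
from typing import Any, Dict


def check_config(config: Dict[str, Any]) -> None:
    """Hypothetical check: raise ValueError if the dataset section is unusable."""
    dataset = config.get("dataset")
    if not isinstance(dataset, dict):
        raise ValueError("config needs a 'dataset' section")
    if not dataset.get("name"):
        raise ValueError("the 'dataset' section needs a 'name'")
    # pair_id is optional: when it is omitted, the run covers all sub-datasets,
    # so its absence is not treated as an error here.
```

Calling it right after loading the YAML file turns a missing section into an immediate, readable error instead of a KeyError later in the run.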
Step 5: Document Your Model#
Add a README.md in models/MyModel/ explaining your model, dependencies, and usage.
Step 6: Test Your Model#
Run your model:
python models/MyModel/run.py models/MyModel/config/my_config.yaml
Verify the output in results/.
Running Models#
To run any model:
python models/<model_name>/run.py models/<model_name>/config/<config_file>.yaml
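When a model ships several configuration files, the same command can be generated for each of them. The following sketch only builds the command lines and assumes the models/<model_name>/config/ layout used in this guide. run_commands is a hypothetical helper name.

```python
import sys
from pathlib import Path
from typing import List


def run_commands(models_root: Path, model_name: str) -> List[List[str]]:
    """Build one `python run.py <config>` command per YAML config of a model."""
    model_dir = models_root / model_name
    # Assumed layout: configs live in <model_dir>/config/ with a .yaml suffix.
    configs = sorted((model_dir / "config").glob("*.yaml"))
    return [[sys.executable, str(model_dir / "run.py"), str(cfg)] for cfg in configs]
```

Each returned list can be passed to subprocess.run(cmd, check=True) to execute the runs one after another.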
Adding Your Model to ctf4science: Best Practices#
Once you have developed your own model for the CTF in its own branch, you can add it to the ctf4science repository. To maintain code quality and support the review process, we recommend adding your model as a submodule through a development branch rather than pushing directly to the main branch.
First, clone the ctf4science repository:
git clone --recursive git@github.com:CTF-for-Science/ctf4science.git
cd ctf4science
Create a new development branch:
git checkout -b add-mymodel-submodule
Add your model repository as a submodule in the models directory:
git submodule add git@github.com:MyGithubName/MyModelRepo.git models/MyModelRepo
Commit the changes to your development branch:
git commit -m "Add MyModelRepo as a submodule"
Push your development branch to the remote repository:
git push origin add-mymodel-submodule
Create a pull request (PR) from your branch to the main branch:
Go to the ctf4science repository on GitHub
Click “Pull requests” > “New pull request”
Set the base branch to main and the compare branch to add-mymodel-submodule
Add a description explaining your model and its integration
Submit the pull request for review
Results and Visualization#
After running a model, the framework saves results for each sub-dataset in results/<dataset>/<model>/<batch_id>/<pair_id>/, including:
config.yaml: Configuration used.
predictions.npy: Predicted data array.
evaluation_results.yaml: Metrics (e.g., short_time, reconstruction, long_time).
visualizations/: Auto-generated plots (e.g., trajectories.png, histograms.png, psd.png).
A batch_results.yaml file is also saved in results/<dataset>/<model>/<batch_id>/, summarizing the metrics for all sub-datasets in the batch.
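The layout described above can be checked with a short script after a run. This is a minimal sketch: missing_outputs is a hypothetical helper, and it assumes that every subdirectory of a batch directory corresponds to one sub-dataset, without assuming how those directories are named.

```python
from pathlib import Path
from typing import Dict, List

# Files each sub-dataset directory is expected to contain, per this guide.
EXPECTED_FILES = ("config.yaml", "predictions.npy", "evaluation_results.yaml")


def missing_outputs(batch_dir: Path) -> Dict[str, List[str]]:
    """Map each sub-dataset directory name to the expected outputs it lacks."""
    report: Dict[str, List[str]] = {}
    for pair_dir in sorted(p for p in batch_dir.iterdir() if p.is_dir()):
        missing = [name for name in EXPECTED_FILES if not (pair_dir / name).is_file()]
        if not (pair_dir / "visualizations").is_dir():
            missing.append("visualizations/")
        if missing:
            report[pair_dir.name] = missing
    return report
```

An empty dictionary means every sub-dataset directory in the batch contains the expected files and a visualizations/ folder.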
Use the Jupyter notebooks in the notebooks/ directory for further analysis or custom visualizations.
Additional Documentation#
Datasets: Datasets overview.
Evaluation Metrics: Metrics overview.
Evaluation Module: Evaluation Module overview.
Visualization Module: Visualization instructions.
Configuration File Overview: Config file structure.
Contributing to the CTF for Science Framework: Developer instructions.
Tune Module: Information about hyperparameter optimization.
API: API documentation.