ctf4science.benchmark_module.ModelBenchmarker#

class ctf4science.benchmark_module.ModelBenchmarker(config_path: str, num_runs: int = 5)#

Bases: object

Benchmarks a model with optimal hyperparameters for a given dataset and pair_id.

Runs multiple independent training and evaluation runs with different random seeds. Designed to be run from within each model directory.

Parameters:
config_path : str

Path to the configuration file (must exist).

num_runs : int, optional

Number of independent evaluation runs to perform, by default 5.

Methods

run_benchmark()

Run multiple benchmarking evaluations and save results.

Raises:
FileNotFoundError

If config file does not exist.

ValueError

If the dataset pair_id is not a single integer or a list containing exactly one integer.

Notes

Method details:

run_benchmark():

  • Run multiple benchmarking evaluations and save results. Runs all evaluations, computes statistics (mean/std) when 3+ runs succeed.

  • Returns:
    • Dict[str, Any] benchmark_results (model_name, dataset_name, pair_id, planned_num_runs, successful_runs, run_results, statistics, performance_summary, timestamp, output_file).

_construct_output_dir():

  • Construct the output directory path for benchmark results.

  • Returns:
    • Path results/benchmark_results/{dataset_name}/{model_name}/pair_id_{pair_id}/{timestamp}/.
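The documented layout can be sketched with `pathlib`; the timestamp format below is an assumption, not the library's actual choice:

```python
from datetime import datetime
from pathlib import Path

def construct_output_dir(dataset_name: str, model_name: str, pair_id: int) -> Path:
    # Mirrors the documented layout:
    # results/benchmark_results/{dataset_name}/{model_name}/pair_id_{pair_id}/{timestamp}/
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed format
    return (
        Path("results")
        / "benchmark_results"
        / dataset_name
        / model_name
        / f"pair_id_{pair_id}"
        / timestamp
    )
```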

_create_run_config(self, run_idx, seed):

  • Create a configuration file for a specific run with a given seed.

  • Parameters:
    • run_idx : int. Index of the run (0-based).

    • seed : int. Random seed for this run.

  • Returns:
    • Path to the created config file for the run.

_run_single_evaluation(self, run_idx, seed):

  • Run a single evaluation of the model.

  • Parameters:
    • run_idx : int. Index of the run.

    • seed : int. Random seed for this run.

  • Returns:
    • Dict[str, Any] run results (run_idx, seed, duration, config_path, results, success) or error info.
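The success-or-error contract above can be sketched as follows; `train_and_evaluate` is a hypothetical callable standing in for the model's real training/evaluation entry point:

```python
import time

def run_single_evaluation(run_idx, seed, train_and_evaluate):
    # Time the run and capture either its results or the error info.
    start = time.time()
    try:
        results = train_and_evaluate(seed)
        return {"run_idx": run_idx, "seed": seed,
                "duration": time.time() - start,
                "results": results, "success": True}
    except Exception as exc:
        return {"run_idx": run_idx, "seed": seed,
                "duration": time.time() - start,
                "error": str(exc), "success": False}
```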

_find_and_load_results_for_run(self, run_idx):

  • Find and load the results from the most recent run (for this pair_id).

  • Parameters:
    • run_idx : int. Index of the run.

  • Returns:
    • Dict[str, Any] evaluation results loaded from evaluation_results.yaml.

_extract_run_results(self, all_runs):

  • Extract run results for each successful run, keyed by run identifier.

  • Parameters:
    • all_runs : List[Dict]. List of all run results.

  • Returns:
    • Dict[str, Any] run results keyed by run_{n}_seed_{seed}.
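The keying scheme can be sketched directly from the documented run-result fields:

```python
def extract_run_results(all_runs):
    # Keep only successful runs, keyed as run_{n}_seed_{seed}.
    return {
        f"run_{r['run_idx']}_seed_{r['seed']}": r["results"]
        for r in all_runs
        if r.get("success")
    }
```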

_calculate_statistics(self, all_runs):

  • Calculate mean and standard deviation for all metrics; requires 3+ successful runs.

  • Parameters:
    • all_runs : List[Dict]. List of all run results.

  • Returns:
  • Dict[str, Any] metric means, standard deviations, and timing statistics.
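A minimal sketch of the aggregation, assuming each successful run carries a flat `results` dict of numeric metrics (the real result layout may be nested):

```python
import statistics

def calculate_statistics(all_runs, min_runs=3):
    # Aggregate only successful runs; return {} when fewer than min_runs succeeded.
    successful = [r for r in all_runs if r.get("success")]
    if len(successful) < min_runs:
        return {}
    stats = {}
    for metric in successful[0]["results"]:
        values = [r["results"][metric] for r in successful]
        stats[metric] = {
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),
        }
    return stats
```

The 3-run floor matches the minimum needed for a meaningful sample standard deviation plus a margin; `statistics.stdev` itself would raise on fewer than two values.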