ctf4science.benchmark_module.ModelBenchmarker#

class ctf4science.benchmark_module.ModelBenchmarker(config_path: str, num_runs: int = 5)#

Bases: object

Benchmarks a model with optimal hyperparameters for a given dataset and pair_id.

Runs multiple independent training and evaluation runs with different random seeds. Designed to be run from within each model directory.

Parameters:
config_path : str

Path to the configuration file (must exist).

num_runs : int, optional

Number of independent evaluation runs to perform, by default 5.

Methods

run_benchmark()

Run multiple benchmarking evaluations and save results.

Raises:
FileNotFoundError

If config file does not exist.

ValueError

If the dataset pair_id is not a single integer or a list containing exactly one integer.

Notes

Method details:

run_benchmark():

  • Run multiple benchmarking evaluations and save results. Runs all evaluations, computes statistics (mean/std) when 3+ runs succeed.

  • Returns:
    • Dict[str, Any] benchmark_results (model_name, dataset_name, pair_id, planned_num_runs, successful_runs, run_results, statistics, performance_summary, timestamp, output_file).

_construct_output_dir():

  • Construct the output directory path for benchmark results.

  • Returns:
    • Path results/benchmark_results/{dataset_name}/{model_name}/pair_id_{pair_id}/{timestamp}/.
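The documented layout can be sketched with `pathlib`; the timestamp format below is an assumption, not the library's actual choice:

```python
from datetime import datetime
from pathlib import Path

def construct_output_dir(dataset_name: str, model_name: str, pair_id: int) -> Path:
    # Mirrors the documented layout:
    # results/benchmark_results/{dataset_name}/{model_name}/pair_id_{pair_id}/{timestamp}/
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")  # assumed format
    return (
        Path("results")
        / "benchmark_results"
        / dataset_name
        / model_name
        / f"pair_id_{pair_id}"
        / timestamp
    )
```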

_create_run_config(self, run_idx, seed):

  • Create a configuration file for a specific run with a given seed.

  • Parameters:
    • run_idx : int. Index of the run (0-based).

    • seed : int. Random seed for this run.

  • Returns:
    • Path to the created config file for the run.

_run_single_evaluation(self, run_idx, seed):

  • Run a single evaluation of the model.

  • Parameters:
    • run_idx : int. Index of the run.

    • seed : int. Random seed for this run.

  • Returns:
    • Dict[str, Any] run results (run_idx, seed, duration, config_path, results, success) or error info.
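The success-or-error contract above can be sketched as follows; `train_and_evaluate` is a hypothetical callable standing in for the model's real training/evaluation entry point:

```python
import time

def run_single_evaluation(run_idx, seed, train_and_evaluate):
    # Time the run and capture either its results or the error info.
    start = time.time()
    try:
        results = train_and_evaluate(seed)
        return {"run_idx": run_idx, "seed": seed,
                "duration": time.time() - start,
                "results": results, "success": True}
    except Exception as exc:
        return {"run_idx": run_idx, "seed": seed,
                "duration": time.time() - start,
                "error": str(exc), "success": False}
```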

_find_and_load_results_for_run(self, run_idx):

  • Find and load the results from the most recent run (for this pair_id).

  • Parameters:
    • run_idx : int. Index of the run.

  • Returns:
    • Dict[str, Any] evaluation results loaded from evaluation_results.yaml.

_extract_run_results(self, all_runs):

  • Extract run results for each successful run, keyed by run identifier.

  • Parameters:
    • all_runs : List[Dict]. List of all run results.

  • Returns:
    • Dict[str, Any] run results keyed by run_{n}_seed_{seed}.
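The keying scheme can be sketched directly from the documented run-result fields:

```python
def extract_run_results(all_runs):
    # Keep only successful runs, keyed as run_{n}_seed_{seed}.
    return {
        f"run_{r['run_idx']}_seed_{r['seed']}": r["results"]
        for r in all_runs
        if r.get("success")
    }
```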

_calculate_statistics(self, all_runs):

  • Calculate mean and standard deviation for all metrics; requires 3+ successful runs.

  • Parameters:
    • all_runs : List[Dict]. List of all run results.

  • Returns:
  • Dict[str, Any] metric means, standard deviations, and timing statistics.
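A minimal sketch of the aggregation, assuming each successful run carries a flat `results` dict of numeric metrics (the real result layout may be nested):

```python
import statistics

def calculate_statistics(all_runs, min_runs=3):
    # Aggregate only successful runs; return {} when fewer than min_runs succeeded.
    successful = [r for r in all_runs if r.get("success")]
    if len(successful) < min_runs:
        return {}
    stats = {}
    for metric in successful[0]["results"]:
        values = [r["results"][metric] for r in successful]
        stats[metric] = {
            "mean": statistics.mean(values),
            "std": statistics.stdev(values),
        }
    return stats
```

The 3-run floor matches the minimum needed for a meaningful sample standard deviation plus a margin; `statistics.stdev` itself would raise on fewer than two values.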