Benchmark Module#

The Benchmark Module (benchmark_module.py) provides systematic evaluation of CTF models against a hidden test set. It also assesses model stability by running models multiple times with different random seeds.

Overview#

The ModelBenchmarker class orchestrates benchmarking by:

  • Running multiple independent evaluations with different random seeds (default: 5, configurable)

  • Computing statistical summaries (mean and standard deviation) when 3+ runs are successful

  • Extracting individual run results for detailed analysis

  • Monitoring wall-clock time for each run

  • Saving results for analysis

Usage#

Command Line Interface#

Run benchmarking from within a model directory:

cd models/YourModel
python -m ctf4science.benchmark_module --config path/to/your/config.yaml

To specify the number of evaluation runs:

python -m ctf4science.benchmark_module --config path/to/your/config.yaml --num-evals 10

Programmatic Usage#

from ctf4science.benchmark_module import ModelBenchmarker

# Default: 5 runs
benchmarker = ModelBenchmarker("path/to/config.yaml")
results = benchmarker.run_benchmark()

# Custom number of runs
benchmarker = ModelBenchmarker("path/to/config.yaml", num_runs=10)
results = benchmarker.run_benchmark()

Configuration Requirements#

The benchmark module requires a standard CTF configuration file:

dataset:
  name: ODE_Lorenz
  pair_id: 1        # Single pair ID (not a list)

model:
  name: YourModel
  method: your_method  # Optional
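A minimal validation sketch for the fields above. The helper `validate_benchmark_config` is hypothetical (not part of ctf4science); it only checks the requirements stated here, notably that `pair_id` is a single integer rather than a list:

```python
# Hypothetical helper: check a parsed config dict for the fields the
# benchmark module requires. Illustrative only, not part of ctf4science.
def validate_benchmark_config(config: dict) -> None:
    dataset = config.get("dataset", {})
    model = config.get("model", {})
    if "name" not in dataset:
        raise ValueError("dataset.name is required")
    if not isinstance(dataset.get("pair_id"), int):
        # The benchmark module expects a single pair ID, not a list.
        raise ValueError("dataset.pair_id must be a single integer")
    if "name" not in model:
        raise ValueError("model.name is required")

config = {
    "dataset": {"name": "ODE_Lorenz", "pair_id": 1},
    "model": {"name": "YourModel", "method": "your_method"},
}
validate_benchmark_config(config)  # passes silently
```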

Parameters#

  • num_runs (default: 5): Number of independent evaluation runs to perform

Output Structure#

Benchmark results are saved in:

results/benchmark_results/
    {dataset_name}/
        {model_name}/
            pair_id_{pair_id}/
                {timestamp}/
                    benchmark_results_{model_name}_pair{pair_id}.json
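The layout above can be assembled programmatically. This sketch builds the expected results-file path from the documented directory structure (the helper function name is an assumption, not part of the package):

```python
# Illustrative helper: construct the benchmark results path from the
# documented directory layout. The function itself is hypothetical.
from pathlib import Path

def benchmark_results_path(dataset_name: str, model_name: str,
                           pair_id: int, timestamp: str) -> Path:
    return (Path("results") / "benchmark_results" / dataset_name
            / model_name / f"pair_id_{pair_id}" / timestamp
            / f"benchmark_results_{model_name}_pair{pair_id}.json")

p = benchmark_results_path("ODE_Lorenz", "YourModel", 1, "20250101_120000")
print(p)
```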

Results File Contents#

The main results file contains:

{
  "model_name": "YourModel",
  "dataset_name": "ODE_Lorenz",
  "pair_id": 1,
  "planned_num_runs": 5,
  "successful_runs": 5,
  "actual_num_runs": 5,
  "run_results": {
    "run_1_seed_42": {
      "results": {"short_time": 85.2},
      "duration": 45.1
    },
    "run_2_seed_123": {
      "results": {"short_time": 84.8},
      "duration": 44.9
    }
  },
  "statistics": {
    "short_time_mean": 85.2,
    "short_time_std": 2.1,
    "timing": {
      "duration_mean": 45.2,
      "duration_std": 2.8
    }
  },
  "performance_summary": {},
  "timestamp": "2025-01-XX..."
}

Statistical Analysis#

The benchmark module calculates statistics for all evaluation metrics:

  • Mean: Average score across successful runs

  • Standard Deviation: Measure of score variability

  • Timing Statistics: Mean and standard deviation of execution times

Success Criteria:

  • Requires at least 3 successful runs to calculate statistics

  • Individual run failures are logged but don’t stop the benchmark

  • All successful runs are recorded in run_results
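The aggregation rule above can be sketched as follows. This is illustrative only: the module's internals may differ, and whether it uses the sample or population standard deviation is an assumption here:

```python
import statistics

def summarize_runs(run_scores, min_runs=3):
    # Statistics are only computed when at least `min_runs` runs succeeded.
    if len(run_scores) < min_runs:
        return None
    return {
        "mean": statistics.mean(run_scores),
        "std": statistics.stdev(run_scores),  # sample std (assumption)
    }

print(summarize_runs([85.2, 84.8, 85.6, 84.9, 85.0]))
print(summarize_runs([85.2, 84.8]))  # too few runs -> None
```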

Model Compatibility#

The benchmark module works with any CTF model that follows the standard interface:

  • Model directory must contain a run.py file

  • run.py must have a main(config_path) function

  • Model should save results in the standard CTF format

Troubleshooting#

Common Issues#

“Config file not found”

  • Verify the config file path is correct

“No result directories found”

  • Check that the model ran successfully

  • Verify the model saves results in the expected format

“Only X runs successful”

  • Review error logs for failed runs

  • Check model stability and error handling

Integration#

The benchmark module integrates with CTF components:

  • Performance Module: Uses PerformanceMonitor for system monitoring

  • Evaluation Module: Relies on standard evaluation results format

  • Data Module: Works with standard dataset loading