Training

Training a network can require a significant amount of time (for production models, typically a week with a fast GPU). We therefore expect that this will almost always be done non-interactively using a command-line script. Dingo offers two options, dingo_train and dingo_train_condor, depending on whether your GPU is local or cluster-based.

Both scripts take a settings file as their main argument. It specifies options for data pre-processing, training strategy, neural network architecture, hardware, and checkpointing. The scripts produce a trained model in PyTorch .pt format, and they save checkpoints and the training history. The settings file is also saved within the model files, both for reproducibility and so that training can be resumed from a checkpoint. Finally, all precursor settings files (for the waveform or noise datasets) are also saved with the model.

Settings file

Example train_settings.yaml file. This is also available in the examples/ folder. The specific settings listed will train a production-size network, taking about a week on an NVIDIA A100. Consider reducing some model hyperparameters for experimentation.
data:
  waveform_dataset_path: /path/to/waveform_dataset.hdf5  # Contains intrinsic waveforms
  train_fraction: 0.95
  window:
    type: tukey
    f_s: 4096
    T: 8.0
    roll_off: 0.4
  domain_update:
    f_min: 20.0
    f_max: 1024.0
  svd_size_update: 200 
  detectors:
    - H1
    - L1
  extrinsic_prior:
    dec: default
    ra: default
    geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10, name='geocent_time')
    psi: default
    luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=1000.0, name='luminosity_distance')
  ref_time: 1126259462.391
  gnpe_time_shifts:
    kernel: bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001)
    exact_equiv: True
  inference_parameters: default

model:
  posterior_model_type: normalizing_flow
  posterior_kwargs:
    num_flow_steps: 30
    base_transform_kwargs:
      hidden_dim: 512
      num_transform_blocks: 5
      activation: elu
      dropout_probability: 0.0
      batch_norm: True
      num_bins: 8
      base_transform_type: rq-coupling
  embedding_kwargs:
    output_dim: 128
    hidden_dims: [1024, 1024, 1024, 1024, 1024, 1024,
                  512, 512, 512, 512, 512, 512,
                  256, 256, 256, 256, 256, 256,
                  128, 128, 128, 128, 128, 128]
    activation: elu
    dropout: 0.0
    batch_norm: True
    svd:
      num_training_samples: 20000
      num_validation_samples: 5000
      size: 200

# Training is divided in stages. They each require all settings as indicated below.
training:
  stage_0:
    epochs: 300
    asd_dataset_path: /path/to/asds_fiducial.hdf5
    freeze_rb_layer: True
    optimizer:
      type: adam
      lr: 0.0001
    scheduler:
      type: cosine
      T_max: 300
    batch_size: 64

  stage_1:
    epochs: 150
    asd_dataset_path: /path/to/asds.hdf5
    freeze_rb_layer: False
    optimizer:
      type: adam
      lr: 0.00001
    scheduler:
      type: cosine
      T_max: 150
    batch_size: 64

# Local settings that have no impact on the final trained network.
local:
  device: cpu  # Change this to 'cuda' for training on a GPU.
  num_workers: 6
#  wandb:
#    project: dingo
#    group: O4
  runtime_limits:
    max_time_per_run: 36000
    max_epochs_per_run: 500
  checkpoint_epochs: 10
  leave_waveforms_on_disk: True
  local_cache_path: tmp
#   condor:
#     bid: 100
#     num_cpus: 16
#     memory_cpus: 128000
#     num_gpus: 1
#     memory_gpus: 8000
#     request_disk: 50GB
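
The comment in the example notes that every stage requires the full set of settings. A small pre-flight check along the following lines could catch an incomplete stage before a long run starts. This is only a sketch: check_stages is a hypothetical helper and not part of Dingo, the requirement that stages be numbered consecutively from stage_0 is an assumption, and the training dictionary simply mirrors the example above.

```python
# Keys that each stage in the example settings file defines.
REQUIRED_STAGE_KEYS = {
    "epochs", "asd_dataset_path", "freeze_rb_layer",
    "optimizer", "scheduler", "batch_size",
}


def check_stages(training):
    """Return the total number of epochs over all stages.

    Hypothetical helper: raises ValueError if a stage is missing a key
    or if stages are not numbered stage_0, stage_1, ... (an assumption).
    """
    total = 0
    for i in range(len(training)):
        name = f"stage_{i}"
        if name not in training:
            raise ValueError(f"expected {name}")
        missing = REQUIRED_STAGE_KEYS - training[name].keys()
        if missing:
            raise ValueError(f"{name} is missing {sorted(missing)}")
        total += training[name]["epochs"]
    return total


# The `training` section of the example file, written as a Python dict.
training = {
    "stage_0": {
        "epochs": 300,
        "asd_dataset_path": "/path/to/asds_fiducial.hdf5",
        "freeze_rb_layer": True,
        "optimizer": {"type": "adam", "lr": 0.0001},
        "scheduler": {"type": "cosine", "T_max": 300},
        "batch_size": 64,
    },
    "stage_1": {
        "epochs": 150,
        "asd_dataset_path": "/path/to/asds.hdf5",
        "freeze_rb_layer": False,
        "optimizer": {"type": "adam", "lr": 0.00001},
        "scheduler": {"type": "cosine", "T_max": 150},
        "batch_size": 64,
    },
}

print(check_stages(training))  # 450 epochs in total
```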

The train settings file is grouped into four sections:

data

These settings point to a saved dataset of waveform polarizations and describe the transforms to obtain detector waveforms. A detailed description of these settings is available here.

model

This describes the model architecture, including network type and hyperparameters. All of these settings are described in the section on Neural network architecture.

training

This describes the training strategy. Training is divided into stages, which can differ in the settings listed below. Stages are numbered (stage_0, stage_1, …) and executed in that order. Each stage is defined by the following settings:

epochs

Total number of training epochs for the stage. The network sees the entire training set once per epoch.

asd_dataset_path

Points to an ASDDataset file. Each stage can have its own ASD dataset, which is useful for implementing a pre-training stage with fixed ASD and a fine-tuning stage with variable ASD.

freeze_rb_layer

Whether to freeze the first layer of the embedding network in nsf+embedding models. This layer is seeded with reduced (SVD) basis vectors, so freezing this layer during pre-training simply projects data onto the basis coefficients. In the fine-tuning stage, when other weights are more stable, unfreezing this can be useful.

optimizer

Specify optimizer type and parameters such as initial learning rate.

scheduler

Use a learning rate scheduler to reduce the learning rate over time. This can improve overall optimization.
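
To get a feel for the schedule used in the example file, the sketch below evaluates the standard cosine annealing formula. It assumes that type: cosine with T_max corresponds to the usual closed form (as in PyTorch's CosineAnnealingLR) with a minimum learning rate of zero. That mapping is an assumption about Dingo, and cosine_lr is a hypothetical helper.

```python
import math


def cosine_lr(epoch, lr_initial, t_max, lr_min=0.0):
    """Cosine annealing: lr_initial at epoch 0, lr_min at epoch t_max."""
    cos_term = 1 + math.cos(math.pi * epoch / t_max)
    return lr_min + 0.5 * (lr_initial - lr_min) * cos_term


# Values from stage_0 of the example: lr = 0.0001, T_max = 300.
for epoch in (0, 75, 150, 225, 300):
    print(epoch, f"{cosine_lr(epoch, 1e-4, 300):.2e}")
```

Under these assumptions the learning rate falls to half its initial value at the midpoint of the stage and reaches the minimum when the stage ends, which is why T_max matches epochs in each stage of the example.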

batch_size

Number of training samples per mini-batch. For a training dataset of size \(N\), each epoch consists of \(N / \text{batch_size}\) batches. Training is generally faster per epoch with a larger batch size, but it may then need more epochs to converge.
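
As a worked example of the batch count, the sketch below combines train_fraction and batch_size from the example file with a made-up dataset size of 5,000,000 waveforms. The helper batches_per_epoch is hypothetical, and whether a final partial batch is kept or dropped is an assumption exposed through the drop_last argument.

```python
import math


def batches_per_epoch(num_waveforms, train_fraction, batch_size, drop_last=False):
    """Number of mini-batches in one pass over the training split."""
    num_train = round(num_waveforms * train_fraction)
    if drop_last:
        return num_train // batch_size  # discard the final partial batch
    return math.ceil(num_train / batch_size)  # keep the final partial batch


# 5,000,000 is a hypothetical dataset size, not a value from the settings file.
print(batches_per_epoch(5_000_000, 0.95, 64))  # 74219
```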

Important

The stage-training framework allows for separate pre-training and fine-tuning stages. We found that having a pre-training stage where we freeze certain network weights and fix the noise ASD improves overall training results.

local

The local settings are the only group that has no impact on the final trained network. Indeed, they are not even saved in the .pt files; rather, they are split off and saved in a new file, local_settings.yaml.

device

cpu or cuda. Training on a GPU with CUDA is highly recommended.

num_workers

Number of CPU worker processes to use for pre-processing training data before copying to the GPU. Data pre-processing (including decompression, projection to detectors, and noise generation) is quite expensive, so using 16 or 32 processes is recommended; otherwise this can become a bottleneck. We recommend monitoring the GPU utilization percentage as well as the time spent on pre-processing (output during training) to fine-tune this number.

wandb

Settings for Weights & Biases. If you have an account, you can use this to track your training progress and compare different runs.

runtime_limits

Maximum time (in seconds) or maximum number of epochs per run. This is useful in a cluster environment, where a job must finish before the scheduler's time limit.
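
The logic of these limits can be sketched as a check performed between epochs. This is an illustration rather than Dingo's implementation: limit_reached is a hypothetical helper, and it assumes that a run ends as soon as either limit is met.

```python
import time


def limit_reached(start_time, epochs_this_run, runtime_limits, now=None):
    """Return True if the time limit or the epoch limit of this run is met."""
    now = time.time() if now is None else now
    max_time = runtime_limits.get("max_time_per_run")
    max_epochs = runtime_limits.get("max_epochs_per_run")
    if max_time is not None and now - start_time >= max_time:
        return True
    if max_epochs is not None and epochs_this_run >= max_epochs:
        return True
    return False


# Limits from the example settings file.
limits = {"max_time_per_run": 36000, "max_epochs_per_run": 500}
print(limit_reached(start_time=0, epochs_this_run=12, runtime_limits=limits, now=36000))
```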

checkpoint_epochs

Dingo saves a temporary checkpoint in model_latest.pt after every epoch, but this is overwritten by the next checkpoint. This setting saves a permanent checkpoint after the specified number of epochs. Having these checkpoints can help in recovering from training failures that do not result in program termination.

leave_waveforms_on_disk

To improve memory efficiency during training, the waveforms are not loaded into memory at the beginning of training, but separately for each batch during training. When training on a cluster, it is highly recommended to include a local path where the dataset is cached at the beginning of training (see local_cache_path). If RAM is not a constraint, leave_waveforms_on_disk can be changed from its default of True to False.

local_cache_path

When training on a cluster and loading waveforms during training (i.e., leave_waveforms_on_disk=True), the waveform dataset should be copied to the disk storage of the local node at the beginning of training. This prevents unexpectedly long data loading times during training due to network traffic. Usually, paths for local storage are tmp or dev/shm. When submitting the job with condor, the condor settings should include a request_disk entry (for example, request_disk: 50GB) that is larger than the size of the waveform dataset used for training.
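
Before submitting, it can be worth checking that the disk request covers the dataset. The sketch below is a hypothetical helper, not part of Dingo. It assumes that HTCondor interprets the KB/MB/GB/TB suffixes as powers of 1024, and the 10% headroom is an arbitrary choice.

```python
# Assumption: size suffixes are read as powers of 1024.
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_disk_request(text):
    """Convert a string such as '50GB' to a number of bytes."""
    text = text.strip().upper()
    for suffix, factor in _UNITS.items():
        if text.endswith(suffix):
            return int(float(text[: -len(suffix)]) * factor)
    raise ValueError(f"unrecognized disk request: {text!r}")


def request_covers_dataset(request_disk, dataset_bytes, headroom=1.1):
    """True if the request exceeds the dataset size by the given headroom."""
    return parse_disk_request(request_disk) >= dataset_bytes * headroom


# A hypothetical 40 GiB waveform dataset against the example request.
print(request_covers_dataset("50GB", 40 * 1024**3))
```

In practice dataset_bytes could come from os.path.getsize applied to the waveform dataset file.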

condor

Settings for HTCondor. The condor script will (re)submit itself according to these options.

Command-line scripts

dingo_train

On a local machine, simply pass the settings file (or checkpoint) and an output directory to dingo_train. It will train until complete, or until a runtime limit is reached.

usage: dingo_train [-h] [--settings_file SETTINGS_FILE] --train_dir TRAIN_DIR [--checkpoint CHECKPOINT]

Train a neural network for gravitational-wave single-event inference.

This program can be called in one of two ways:
    a) with a settings file. This will create a new network based on the 
    contents of the settings file.
    b) with a checkpoint file. This will resume training from the checkpoint.

optional arguments:
  -h, --help            show this help message and exit
  --settings_file SETTINGS_FILE
                        YAML file containing training settings.
  --train_dir TRAIN_DIR
                        Directory for Dingo training output.
  --checkpoint CHECKPOINT
                        Checkpoint file from which to resume training.
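
The two calling modes can be illustrated by assembling the command line in Python, for instance when launching training from a wrapper script. The flags are taken from the usage text above. The helper dingo_train_command and the file names are hypothetical, and rejecting a call that passes both a settings file and a checkpoint is an assumption of this sketch.

```python
import shlex


def dingo_train_command(train_dir, settings_file=None, checkpoint=None):
    """Build the argument list for a new run (a) or a resumed run (b)."""
    if (settings_file is None) == (checkpoint is None):
        raise ValueError("pass exactly one of settings_file or checkpoint")
    cmd = ["dingo_train", "--train_dir", train_dir]
    if settings_file is not None:
        cmd += ["--settings_file", settings_file]
    else:
        cmd += ["--checkpoint", checkpoint]
    return cmd


# (a) start a new run from a settings file
print(shlex.join(dingo_train_command("training", settings_file="train_settings.yaml")))
# (b) resume from the most recent checkpoint
print(shlex.join(dingo_train_command("training", checkpoint="training/model_latest.pt")))
```

The resulting list can be passed to subprocess.run(cmd, check=True) to launch training.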

dingo_train_condor

On a cluster using HTCondor, use dingo_train_condor. This calls itself recursively as follows:

  1. The first time you call it, use the flag --start_submission. This creates a condor submission file submission_file.sub that again calls the executable dingo_train_condor (now without the flag) and submits it. This will run dingo_train_condor directly on the cluster node that is assigned.

  2. On the cluster node, dingo_train_condor first trains the network until done or a runtime limit is reached (be careful to set this shorter than the condor time limit). Then it creates a new submission file that once again calls dingo_train_condor, and submits it. This will resume the run on a new node, and repeat.
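
To estimate how many times the job will resubmit itself, the steps above can be sketched as a simple calculation. The helper runs_needed is hypothetical, the 1200 seconds per epoch is a made-up figure, and the sketch assumes that the time limit is checked after each completed epoch, so a run may overshoot it by up to one epoch.

```python
import math


def runs_needed(total_epochs, seconds_per_epoch, max_time_per_run, max_epochs_per_run):
    """Estimate the number of condor runs needed to finish training."""
    # Epochs completed before the time limit is first met or exceeded.
    epochs_by_time = math.ceil(max_time_per_run / seconds_per_epoch)
    epochs_per_run = min(epochs_by_time, max_epochs_per_run)
    return math.ceil(total_epochs / epochs_per_run)


# 450 epochs in total (300 + 150) with the limits from the example file.
print(runs_needed(450, 1200, 36000, 500))  # 15
```

Because a run can overshoot under this assumption, the condor time limit should exceed max_time_per_run by at least the duration of one epoch.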

usage: dingo_train_condor [-h] --train_dir TRAIN_DIR [--checkpoint CHECKPOINT] [--start_submission]

optional arguments:
  -h, --help            show this help message and exit
  --train_dir TRAIN_DIR
                        Directory for Dingo training output.
  --checkpoint CHECKPOINT
  --start_submission

Output

Output from training is stored in the TRAIN_DIR folder passed to the training scripts. This consists of the following:

  • model_latest.pt checkpoints every epoch (overwritten);

  • model_XXX.pt checkpoints where XXX is the epoch number, every checkpoint_epochs epochs;

  • model_stage_X.pt at the end of training stage X;

  • history.txt with columns (epoch number, train loss, test loss, learning rate);

  • svd_L1.hdf5, …, storing SVD basis information used for seeding the embedding network;

  • local_settings.yaml with local settings for the run (not stored with checkpoints).
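
The training history can be inspected with a few lines of Python. The sketch below assumes that history.txt holds one whitespace-separated row per epoch in the column order listed above and has no header line. The exact file format is an assumption, the helpers are hypothetical, and the sample values are invented for illustration.

```python
def parse_history(text):
    """Parse rows of (epoch, train loss, test loss, learning rate)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        epoch, train_loss, test_loss, lr = fields
        rows.append({
            "epoch": int(float(epoch)),
            "train_loss": float(train_loss),
            "test_loss": float(test_loss),
            "lr": float(lr),
        })
    return rows


def best_epoch(rows):
    """Epoch with the lowest test loss."""
    return min(rows, key=lambda row: row["test_loss"])["epoch"]


# Invented sample contents; real values will differ.
sample = "1\t25.1\t24.8\t1.0e-04\n2\t22.4\t22.9\t9.9e-05\n3\t21.0\t23.5\t9.8e-05\n"
rows = parse_history(sample)
print(best_epoch(rows))  # 2
```

For a real run, the text would come from reading history.txt in TRAIN_DIR.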

The .pt and .hdf5 files may all be inspected using dingo_ls. This prints all the settings, as well as diagnostic information for SVD bases. The saved settings include all the settings provided in the settings file, as well as several derived quantities, such as parameter standardizations, additional context parameters (for GNPE), etc.

Modifying a checkpoint

Occasionally it may be necessary to change a setting of a partially trained model. For example, a model may have been successfully pre-trained, but the fine-tuning failed, and one may wish to change the fine-tuning settings without starting from scratch. Since the model settings are all stored with the checkpoint, they just need to be changed there.

The script dingo_append_training_stage allows for appending a model stage or replacing an existing planned stage. It will fail if the stage has already begun training, so be sure to use it on a sufficiently early checkpoint.

usage: dingo_append_training_stage [-h] --checkpoint CHECKPOINT --stage_settings_file STAGE_SETTINGS_FILE --out_file OUT_FILE [--replace REPLACE]

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
  --stage_settings_file STAGE_SETTINGS_FILE
  --out_file OUT_FILE
  --replace REPLACE

For more detailed adjustments to the training settings, one can use the script compatibility/update_model_metadata.py.

usage: update_model_metadata.py [-h] --checkpoint CHECKPOINT --key KEY [KEY ...] --value VALUE

optional arguments:
  -h, --help            show this help message and exit
  --checkpoint CHECKPOINT
  --key KEY [KEY ...]
  --value VALUE

Warning

Modifications to model metadata can easily break things. Do not use this unless completely sure what you are doing!