# Training Training a network can require a significant amount of time (for production models, typically a week with a fast GPU). We therefore expect that this will almost always be done non-interactively using a command-line script. Dingo offers two options, `dingo_train` and `dingo_train_condor`, depending on whether your GPU is local or cluster-based. Both of these scripts take as main argument a settings file, which specifies options relating to [](training_transforms.ipynb), training strategy, [](network_architecture.ipynb), hardware, and checkpointing. They produce a trained model in PyTorch `.pt` format, and they save checkpoints and the training history. The settings file is furthermore saved within the model files for reproducibility and to be able to resume training from a checkpoint. Finally, all *precursor* settings files (for the waveform or noise datasets) are also saved with the model. ## Settings file ```{code-block} yaml --- caption: Example `train_settings.yaml` file. This is also available in the examples/ folder. The specific settings listed will train a production-size network, taking about a week on an NVIDIA A100. Consider reducing some model hyperparameters for experimentation. --- data: waveform_dataset_path: /path/to/waveform_dataset.hdf5 # Contains intrinsic waveforms train_fraction: 0.95 window: type: tukey f_s: 4096 T: 8.0 roll_off: 0.4 domain_update: f_min: 20.0 f_max: 1024.0 svd_size_update: 200 detectors: - H1 - L1 extrinsic_prior: dec: default ra: default geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10, name='geocent_time') psi: default luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=1000.0, name='luminosity_distance') ref_time: 1126259462.391 gnpe_time_shifts: kernel: bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001) exact_equiv: True inference_parameters: default model: posterior_model_type: normalizing_flow posterior_kwargs: num_flow_steps: 30 base_transform_kwargs: hidden_dim: 512 num_transform_blocks: 5 activation: elu dropout_probability: 0.0 batch_norm: True num_bins: 8 base_transform_type: rq-coupling embedding_kwargs: output_dim: 128 hidden_dims: [1024, 1024, 1024, 1024, 1024, 1024, 512, 512, 512, 512, 512, 512, 256, 256, 256, 256, 256, 256, 128, 128, 128, 128, 128, 128] activation: elu dropout: 0.0 batch_norm: True svd: num_training_samples: 20000 num_validation_samples: 5000 size: 200 # Training is divided in stages. They each require all settings as indicated below. training: stage_0: epochs: 300 asd_dataset_path: /path/to/asds_fiducial.hdf5 freeze_rb_layer: True optimizer: type: adam lr: 0.0001 scheduler: type: cosine T_max: 300 batch_size: 64 stage_1: epochs: 150 asd_dataset_path: /path/to/asds.hdf5 freeze_rb_layer: False optimizer: type: adam lr: 0.00001 scheduler: type: cosine T_max: 150 batch_size: 64 # Local settings that have no impact on the final trained network. local: device: cpu # Change this to 'cuda' for training on a GPU. num_workers: 6 # wandb: # project: dingo # group: O4 runtime_limits: max_time_per_run: 36000 max_epochs_per_run: 500 checkpoint_epochs: 10 leave_waveforms_on_disk: True local_cache_path: tmp # condor: # bid: 100 # num_cpus: 16 # memory_cpus: 128000 # num_gpus: 1 # memory_gpus: 8000 # request_disk: 50GB ``` The train settings file is grouped into **four sections:** ### `data_settings` These settings point to a saved dataset of waveform polarizations and describe the transforms to obtain detector waveforms. A detailed description of these settings is available [here](training_transforms.ipynb#building-the-transforms). ### `model` This describes the model architecture, including network type and hyperparameters. All of these settings are described in the section on [](network_architecture.ipynb). ### `training` This describes the training strategy. Training is divided into **stages**, each of which can differ to some extent. Stages are numbered (`stage_0`, `stage_1`, ...) and executed in this order. Each stage is defined by the following settings: epochs : Total number of training epochs for the stage. The network sees the entire training set once per epoch. asd_dataset_path : Points to an `ASDDataset` file. Each stage can have its own ASD dataset, which is useful for implementing a pre-training stage with fixed ASD and a fine-tuning stage with variable ASD. freeze_rb_layer : Whether to freeze the first layer of the embedding network in `nsf+embedding` models. This layer is seeded with reduced (SVD) basis vectors, so freezing this layer during pre-training simply projects data onto the basis coefficients. In the fine-tuning stage, when other weights are more stable, unfreezing this can be useful. optimizer : Specify [optimizer](https://pytorch.org/docs/stable/optim.html) type and parameters such as initial learning rate. scheduler : Use a [learning rate scheduler](https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate) to reduce the learning rate over time. This can improve overall optimization. batch_size : Number of training samples per mini-batch. For a training dataset of size $N$, then each epoch will consist of $N / \text{batch_size}$ batches. Generally training will be faster for a larger batch size, but will require additional iterations. ```{important} The stage-training framework allows for separate pre-training and fine-tuning stages. We found that having a pre-training stage where we freeze certain network weights and fix the noise ASD improves overall training results. ``` ### `local` The `local` settings are the only group that have no impact on the final trained network. Indeed, they are not even saved in the `.pt` files; rather they are split off and saved in a new file `local_settings.yaml`. device : `cpu` or `cuda`. Training on a GPU with CUDA is highly recommended. num_workers : Number of CPU worker processes to use for pre-processing training data before copying to the GPU. Data pre-processing (inluding decompression, projection to detectors, and noise generation) is quite expensive, so using 16 or 32 processes is recommended, otherwise this can become a bottleneck. We recommend monitoring the GPU utilization percentage as well as time spent on pre-processing (output during training) to fine-tune this number. wandb : Settings for [Weights & Biases](https://wandb.ai/site). If you have an account, you can use this to track your training progress and compare different runs. runtime_limits : Maximum time (in seconds) or maximum number of epochs per run. Using this could make sense in a cluster environment. checkpoint_epochs : Dingo saves a temporary checkpoint in `model_latest.py` after every epoch, but this is later overwritten by the next checkpoint. This setting saves a permanent checkpoint after the specified number of epochs. Having these checkpoints can help in recovering from training failures that do not result in program termination. leave_waveforms_on_disk : To improve memory efficiency during training, the waveforms are not loaded into memory at the beginning of training, but separately for each batch during training. When training on a cluster, it is highly recommended to include a local path where the dataset is cached at the beginning of training (see `local_cache_path`). If RAM is not an issue, the default `leave_waveforms_on_disk=True` can be set to `False`. local_cache_path : When training on a cluster and loading waveforms during training (i.e., `leave_waveforms_on_disk=True`), the waveform dataset should be copied to the disk storage of the local node at the beginning of training. This prevents unexpected long data loading times during training due to network traffic. Usually, paths for local storage are `tmp` or `dev/shm`. When submitting the job with `condor`, `request_disk: 50GB` should be included in the `condor` settings with the requested disk space larger than the size of the waveform dataset used for training. condor : Settings for [HTCondor](https://htcondor.readthedocs.io/en/latest/index.html). The condor script will (re)submit itself according to these options. ## Command-line scripts ### `dingo_train` On a local machine, simply pass the settings file (or checkpoint) and an output directory to `dingo_train`. It will train until complete, or until a runtime limit is reached. ```text usage: dingo_train [-h] [--settings_file SETTINGS_FILE] --train_dir TRAIN_DIR [--checkpoint CHECKPOINT] Train a neural network for gravitational-wave single-event inference. This program can be called in one of two ways: a) with a settings file. This will create a new network based on the contents of the settings file. b) with a checkpoint file. This will resume training from the checkpoint. optional arguments: -h, --help show this help message and exit --settings_file SETTINGS_FILE YAML file containing training settings. --train_dir TRAIN_DIR Directory for Dingo training output. --checkpoint CHECKPOINT Checkpoint file from which to resume training. ``` ### `dingo_train_condor` On a cluster using HTCondor, use `dingo_train_condor`. This calls itself recursively as follows: 1. The first time you call it, use the flag `--start-submission`. This creates a condor submission file `submission_file.sub` that again calls the executable `dingo_train_condor` (now without the flag) and submits it. This will run `dingo_train_condor` directly on the cluster node that is assigned. 2. On the cluster node, `dingo_train_condor` first trains the network until done or a runtime limit is reached (be careful to set this shorter than the condor time limit). Then it creates a *new* submission file that once again calls `dingo_train_condor`, and submits it. This will resume the run on a new node, and repeat. ```text usage: dingo_train_condor [-h] --train_dir TRAIN_DIR [--checkpoint CHECKPOINT] [--start_submission] optional arguments: -h, --help show this help message and exit --train_dir TRAIN_DIR Directory for Dingo training output. --checkpoint CHECKPOINT --start_submission ``` ## Output Output from training is stored in the `TRAIN_DIR` folder passed to the training scripts. This consists of the following: * `model_latest.pt` checkpoints every epoch (overwritten); * `model_XXX.pt` checkpoints where `XXX` is the epoch number, every `checkpoint_epochs` epochs; * `model_stage_X.pt` at the end of training stage `X`; * `history.txt` with columns (epoch number, train loss, test loss, learning rate); * `svd_L1.hdf5`, ..., storing SVD basis information used for seeding the embedding network; * `local_settings.yaml` with local settings for the run (not stored with checkpoints). The `.pt` and `.hdf5` files may all be inspected using `dingo_ls`. This prints all the settings, as well as diagnostic information for SVD bases. The saved settings include all the settings provided in the settings file, as well as several derived quantities, such as parameter standardizations, additional context parameters (for GNPE), etc. ### Modifying a checkpoint Occasionally it may be necessary to change a setting of a partially trained model. For example, a model may have been successfully pre-trained, but the fine-tuning failed, and one may wish to change the fine-tuning settings without starting from scratch. Since the model setting are all stored with the checkpoint, they just need to be changed. The script `dingo_append_training_stage` allows for appending a model stage or replacing an existing planned stage. It will fail if the stage has already begun training, so be sure to use it on a sufficiently early checkpoint. ```text usage: dingo_append_training_stage [-h] --checkpoint CHECKPOINT --stage_settings_file STAGE_SETTINGS_FILE --out_file OUT_FILE [--replace REPLACE] optional arguments: -h, --help show this help message and exit --checkpoint CHECKPOINT --stage_settings_file STAGE_SETTINGS_FILE --out_file OUT_FILE --replace REPLACE ``` For more detailed adjustments to the training settings the script one can use the script `compatibility/update_model_metadata.py`. ```text usage: update_model_metadata.py [-h] --checkpoint CHECKPOINT --key KEY [KEY ...] --value VALUE optional arguments: -h, --help show this help message and exit --checkpoint CHECKPOINT --key KEY [KEY ...] --value VALUE ``` ```{warning} Modifications to model metadata can easily break things. Do not use this unless completely sure what you are doing! ```