Toy Example

The goal of the following tutorial is to take a user from start to finish analyzing GW150914 using dingo.

Caution

This is only a toy example which is useful for testing on a local machine. This is NOT meant be used for production gravitational wave analyses.

There are 4 main steps:

Generate the waveform dataset
Generate the ASD dataset
Train the network
Do inference

In this tutorial as well as the npe model and gnpe model the following file structure will be employed

toy_npe_model/

    #  config files
    waveform_dataset_settings.yaml
    asd_dataset_settings.yaml
    train_settings.yaml
    GW150914.ini

    training_data/
        waveform_dataset.hdf5
        asd_dataset/ # Contains the asd_dataset.hdf5 and also temp files for asd generation

    training/
        model_050.pt
        model_stage_0.pt
        model_latest.pt
        history.txt
        #  etc...

    outdir_GW150914/
        #  dingo_pipe output

The config files which are the only ones which need to be edited are contained in the top level directory. In the next few sections these config files will be explained. To download sample config files, please visit https://github.com/dingo-gw/dingo/tree/main/examples. In this tutorial the toy_npe_model folder will be used.

Step 1 Generating a waveform dataset

After downloading the files for the tutorial first run

cd toy_npe_model/
mkdir training_data
mkdir training

to set up the file structure. Then run

dingo_generate_dataset --settings waveform_dataset_settings.yaml --out_file training_data/waveform_dataset.hdf5

which will create a dingo.gw.waveform_generator.waveform_generator.WaveformGenerator object and store it at the location provided with --out_file. For convenience, here is the waveform dataset file

domain:
  type: UniformFrequencyDomain
  f_min: 20.0
  f_max: 1024.0
  delta_f: 0.25  # Expressions like 1.0/8.0 would require eval and are not supported

waveform_generator:
  approximant: IMRPhenomD
  f_ref: 20.0
  # f_start: 15.0  # Optional setting useful for EOB waveforms. Overrides f_min when generating waveforms.

# Dataset only samples over intrinsic parameters. Extrinsic parameters are chosen at train time.
intrinsic_prior:
  mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_1')
  mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_2')
  chirp_mass: bilby.gw.prior.UniformInComponentsChirpMass(minimum=15.0, maximum=100.0, name='chirp_mass')
  mass_ratio: bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0, name='mass_ratio')
  phase: default
  chi_1: bilby.gw.prior.AlignedSpin(name='chi_1', a_prior=Uniform(minimum=0, maximum=0.9))
  chi_2: bilby.gw.prior.AlignedSpin(name='chi_2', a_prior=Uniform(minimum=0, maximum=0.9))
  theta_jn: default
  # Reference values for fixed (extrinsic) parameters. These are needed to generate a waveform.
  luminosity_distance: 100.0  # Mpc
  geocent_time: 0.0  # s

# Dataset size
num_samples: 10000

compression: None

The file waveform_dataset_settings.yaml contains four sections: domain, waveform_generator, intrinsic_prior, and compression. The domain section defines the settings for storing the waveform. Note the type attribute; this does not refer to the native domain of the waveform model, but rather to the internal dingo.gw.domains.Domain class. This allows the use of time domain waveform models, which are transformed into Fourier domain before being passed to the network. Currently, only the dingo.gw.domains.FrequencyDomain class is supported for training the network. It is sometimes advisable to generate waveforms with a higher f_max and then truncate them at a lower f_max for training due to issues with generating short waveforms for some of the waveform models implemented in LALSuite’s LALSimulation package (https://lscsoft.docs.ligo.org/lalsuite/lalsimulation/).

The waveform_generator section specifies the approximant attribute. At present any waveform model, aka approximant, that is callable through LALSimulation’s SimInspiralFD API can be used to generate waveforms for dingo via the dingo.gw.waveform_generator.waveform_generator.WaveformGenerator module (see generating_waveforms).

The intrinsic_prior section is based on Bilby’s prior module. Default values can be found in dingo.gw.prior. Two priors to note are the chirp_mass and mass_ratio, whose minimum values are set to 15.0 and 0.125, respectively. Extending these priors towards lower chirp masses or more extreme mass-ratios may lead to poor performance of the embedding network and normalizing flow during training and would require changes to the network setup. Note that the luminosity_distance and geocent_time are defined as constants to generate the waveform at a fixed reference point.

The compression section can be set to None for testing purposes. For a practical example of how it is used, see the next tutorial.

Step 2 Generating the Amplitude Spectral Density (ASD) dataset

To generate an ASD dataset run

dingo_generate_asd_dataset --settings_file asd_dataset_settings.yaml --data_dir training_data/asd_dataset

This command will generate an dingo.gw.noise.asd_dataset.ASDDataset object in the form of an .hdf5 file, which will be used later for training. The reason for specifying a folder instead of a file, as in the waveform dataset example, is because some temporary data is downloaded to create Welch estimates of the ASD. This data can be removed later, but it is sometimes useful for understanding how the ASDs were estimated. For convenience here is a copy of the asd_dataset_settings.yaml file.

dataset_settings:
f_s: 4096
time_psd: 1024
T: 4
window:
    roll_off: 0.4
    type: tukey
time_gap: 0          # specifies the time skipped between to consecutive PSD estimates. If set < 0, the time segments overlap
num_psds_max: 1  # if set > 0, only a subset of all available PSDs will be used
detectors:
    - H1
    - L1
observing_run: O1

The asd_dataset_settings.yaml file includes several attributes. f_s is the sampling frequency in Hz, time_psd is the length of time used for an ASD estimate, and T is the duration of each ASD segment. Thus, the value of time_psd/T gives the number of segments analyzed to estimate one ASD. To avoid spectral leakage, a window is applied to each segment. We use the standard window used in LVK analyses, a Tukey window with a roll off of \(\alpha=0.4\). The next attribute, num_psds_max=1, defines the number of ASDs stored in the ASD dataset. For now, we will use only one. See the next tutorial for a more advanced setup.

Step 3 Training the network

To train the network, first the paths to the correct datasets must be specified before executing:

dingo_train --settings_file train_settings.yaml --train_dir training

While this file contains numerous settings that are discussed in training, we will cover the most significant ones here. Again here is the file.

data:
  waveform_dataset_path: training_data/waveform_dataset.hdf5  # Contains intrinsic waveforms
  train_fraction: 0.95
  detectors:
    - H1
    - L1
  extrinsic_prior:  # Sampled at train time
    dec: default
    ra: default
    geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10, name='geocent_time')
    psi: default
    luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=1000.0, name='luminosity_distance')
  ref_time: 1126259462.391
  inference_parameters: 
  - chirp_mass
  - mass_ratio
  - chi_1
  - chi_2
  - theta_jn
  - dec
  - ra
  - geocent_time
  - luminosity_distance
  - psi
  - phase

# Model architecture
model:
  posterior_model_type: normalizing_flow
  # kwargs for neural spline flow
  posterior_kwargs:
    num_flow_steps: 5
    base_transform_kwargs:
      hidden_dim: 64 
      num_transform_blocks: 5
      activation: elu
      dropout_probability: 0.0
      batch_norm: True
      num_bins: 8
      base_transform_type: rq-coupling
  # kwargs for embedding net
  embedding_kwargs:
    output_dim: 128
    hidden_dims: [1024, 512, 256, 128]
    activation: elu
    dropout: 0.0
    batch_norm: True
    svd:
      num_training_samples: 1000
      num_validation_samples: 100
      size: 50

# The first stage (and only) stage of training. 
training:
  stage_0:
    epochs: 20
    asd_dataset_path: training_data/asd_dataset/asds_O1.hdf5  # this should just contain a single fiducial ASD per detector for pretraining
    freeze_rb_layer: True
    optimizer:
      type: adam
      lr: 0.0001
    scheduler:
      type: cosine
      T_max: 20
    batch_size: 64

# Local settings for training that have no impact on the final trained network.
local:
  device: cpu  # Change this to 'cuda' for training on a GPU.
  num_workers: 6  # num_workers >0 does not work on Mac, see https://stackoverflow.com/questions/64772335/pytorch-w-parallelnative-cpp206
  runtime_limits:
    max_time_per_run: 36000
    max_epochs_per_run: 30
  checkpoint_epochs: 15
  leave_waveforms_on_disk: True

For training, several extrinsic_priors are set, which project the waveforms generated in step 1 onto the detector network according to the specified priors. This is considerably cheaper than generating waveforms sampled from the full intrinsic plus extrinsic prior in step 1.

Another crucial setting is inference_parameters. By default, all the parameters described in dingo.gw.prior are inferred. If a parameter needs to be marginalized over, this parameter can be omitted from inference_parameters.

Essential settings for the model architecture of the normalizing flow (i.e., the neural spline flow and the embedding network) are as follows: posterior_kwargs.num_flow_steps describes the number of flow transforms from the base distribution to the final distribution, while embedding_net_kwargs.hidden_dim defines the dimensions of the neural network’s hidden layer, which selects the most important data features. Finally, embedding_net_kwargs.svd describes the settings of the SVD used as a pre-processing step before passing data vectors to the embedding network. For a production network, these values should be much higher than those used in this tutorial.

Next, we turn to the training section. Here we only employ a single stage of training with settings provided under the stage_0 attribute. This stage uses the training dataset generated in step 1 for 30 epochs. We also specify the asd_dataset_path here, which was created in step 2.

Finally, the local settings section specifies technical details of the training setup. It contains information about, e.g., parallelization during training and the device used. An important setting here is num_workers, which determines how many PyTorch dataloader processes are spawned during training. If training is too slow, a potential cause is a lack of workers to load data into the network. This can be identified if the dataloader times in the dingo_train output exceed 100ms. The solution is generally to increase the number of workers.

Step 4 Doing Inference

The final step is to do inference, for example on GW150914. To do this we will use dingo_pipe. For a local run execute:

dingo_pipe GW150914.ini

This calls dingo_pipe on an INI file that specifies the event to run on,

################################################################################
##  Job submission arguments
################################################################################

local = True
accounting = dingo
request-cpus-importance-sampling = 2

################################################################################
##  Sampler arguments
################################################################################

model = training/model_latest.pt
device = 'cpu'
num-samples = 5000
batch-size = 5000
recover-log-prob = false
importance-sample = false

################################################################################
## Data generation arguments
################################################################################

trigger-time = GW150914
label = GW150914
outdir = outdir_GW150914
channel-dict = {H1:GWOSC, L1:GWOSC}
psd-length = 128
# sampling-frequency = 2048.0
# importance-sampling-updates = {'duration': 4.0}

################################################################################
## Plotting arguments
################################################################################

plot-corner = true
plot-weights = true
plot-log-probs = true

This will generate files which are described in dingo_pipe. To see the results, take a look in outdir_GW150914. We set the flag importance-sample = False in the INI file, which disables importance sampling for this simple example. Generally one would omit this (it defaults to True).

We can load and manipulate the data with the following code. For example, here we create a cornerplot

from dingo.gw.result import Result
result = Result(file_name="outdir_GW150914/result/GW150914_data0_1126259462-4_sampling.hdf5")
result.plot_corner()

Notice the results don’t look very promising, but this is expected as the settings used in this example are not enough to warrant convergence. Dingo should also automatically generate a cornerplot which will be displayed under outdir_GW150914.