Building a waveform dataset

For training neural networks, the more training samples the better. With too little training data, one runs the risk of overfitting. Waveforms, however, can be expensive to generate and take up significant storage. Dingo adopts several strategies to mitigate these problems:

  • Dingo partitions parameters into two types—intrinsic and extrinsic—and builds a training set based only on the intrinsic parameters. This consists of waveform polarizations \(h_+\) and \(h_\times\). Extrinsic parameters are selected during training, and applied to generate the detector waveforms \(h_I\). This augments the training set to provide unlimited samples from the extrinsic parameters.

  • Saved waveforms are compressed using a singular value decomposition. Although this is lossy, waveform mismatches can be monitored to ensure that they fall below the intrinsic error in the waveform model.
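
The first strategy can be illustrated with a toy projection. This is a minimal sketch, assuming the standard linear detector response \(h_I = F_+ h_+ + F_\times h_\times\); the function name project_onto_detector and the antenna-pattern values are hypothetical placeholders and are not part of Dingo's API.

```python
import numpy as np


def project_onto_detector(h_plus, h_cross, f_plus, f_cross):
    """Combine two polarizations into one detector strain (toy linear response)."""
    return f_plus * h_plus + f_cross * h_cross


rng = np.random.default_rng(0)
# One stored "intrinsic" sample: two complex frequency series (random stand-ins).
h_plus = rng.normal(size=8) + 1j * rng.normal(size=8)
h_cross = rng.normal(size=8) + 1j * rng.normal(size=8)

# Hypothetical antenna-pattern factors for three different sky positions.
# Each pair yields a different detector waveform from the same stored sample.
antenna_patterns = [(0.5, -0.3), (0.1, 0.9), (-0.7, 0.2)]
detector_waveforms = [
    project_onto_detector(h_plus, h_cross, fp, fc) for fp, fc in antenna_patterns
]
```

Because the projection is cheap, a single stored pair of polarizations can be reused with freshly drawn extrinsic parameters every time it is served.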

The WaveformDataset class

The WaveformDataset is a storage container for waveform polarizations and parameters, which can be used to serve samples to a neural network during training:

class dingo.gw.dataset.WaveformDataset(file_name: str | None = None, dictionary: dict | None = None, transform=None, precision: str | None = None, domain_update: dict | None = None, svd_size_update: int | None = None, leave_waveforms_on_disk: bool | None = False)

Bases: DingoDataset, Dataset

This class stores a dataset of waveforms (polarizations) and corresponding parameters.

It can load the dataset either from an HDF5 file or from a suitable dictionary.

It is possible either to load the entire dataset into memory or to load waveforms during training (leave_waveforms_on_disk=True) to reduce the memory footprint. At the moment, only the waveforms can be loaded on demand: the parameters are always loaded up front, since the standardization dict for all parameters in the dataset has to be computed at the beginning of training.

The waveform data is consumed through a __getitem__() or __getitems__() call which optionally loads the polarizations and applies a chain of transformations, which are classes that implement a __call__() method.

To construct the dataset, provide either file_name, or a dictionary containing data and settings entries, or neither.

Parameters:
  • file_name (str) – HDF5 file containing a dataset

  • dictionary (dict) – Contains settings and data entries. The dictionary keys should be ‘settings’, ‘parameters’, and ‘polarizations’.

  • transform (Transform) – Transform to be applied to dataset samples when accessed through __getitem__

  • precision (str ('single', 'double')) – If provided, changes precision of loaded dataset.

  • domain_update (dict) – If provided, update domain from existing domain using new settings.

  • svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).

  • leave_waveforms_on_disk (bool) – If True, the values for the waveforms are not loaded into RAM when initializing the waveform dataset. Instead, they are loaded lazily in __getitem__().

initialize_decompression(svd_size_update: int | None = None)

Sets up decompression transforms. These are applied to the raw dataset before self.transform. E.g., SVD decompression.

Parameters:

svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).

load_supplemental(domain_update: dict | None = None, svd_size_update: int | None = None)

Method called immediately after loading a dataset.

Creates (and possibly updates) domain, updates dtypes, and initializes any decompression transform. Also zeros data below f_min, and truncates above f_max.

Parameters:
  • domain_update (dict) – If provided, update domain from existing domain using new settings.

  • svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).

update_domain(domain_update: dict | None = None)

Update the domain based on new configuration.

The waveform dataset provides waveform polarizations in a particular domain. For a frequency domain, this covers the range [0, domain._f_max], and data is set to 0 below domain._f_min. In practice, one may want to train a network based on slightly different domain settings, which corresponds to truncating the likelihood integral.

This method provides functionality for that. It truncates and/or zeroes the dataset to the range specified by the domain, by calling domain.update_data.

Parameters:

domain_update (dict) – Settings dictionary. Must contain a subset of the keys contained in domain_dict.
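
The effect of a domain update on the stored data can be sketched with plain NumPy. This is an illustration only, assuming a uniform frequency grid starting at 0 Hz; the helper truncate_and_zero is hypothetical and does not reproduce the actual domain.update_data implementation.

```python
import numpy as np


def truncate_and_zero(data, delta_f, f_min, f_max):
    """Keep the bins in [0, f_max] and zero the bins below f_min (toy version)."""
    num_bins = int(round(f_max / delta_f)) + 1
    truncated = np.array(data[..., :num_bins], copy=True)
    frequencies = np.arange(num_bins) * delta_f
    truncated[..., frequencies < f_min] = 0.0
    return truncated


delta_f = 0.125
# Stand-in for one polarization defined on [0, 1024] Hz.
original = np.ones(int(1024.0 / delta_f) + 1, dtype=complex)
# Narrow the range to [30, 512] Hz without regenerating the waveform.
updated = truncate_and_zero(original, delta_f, f_min=30.0, f_max=512.0)
```

The new range must lie within the original one, since the data outside the stored range is not available.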

WaveformDataset subclasses dingo.core.dataset.DingoDataset and torch.utils.data.Dataset. The former provides generic functionality for saving and loading datasets as HDF5 files and dictionaries, and is used in several components of Dingo. The latter allows the WaveformDataset to be used with a PyTorch DataLoader. In general, we follow the PyTorch design framework for training, including Datasets, DataLoaders, and Transforms.
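
The Dataset and Transform pattern can be sketched without any dependencies. The classes below (Compose, ScaleWaveform, SelectPolarization, ToyWaveformDataset) are hypothetical stand-ins written for this illustration; they only mimic the idea of a map-style dataset whose __getitem__() applies a chain of callables.

```python
class Compose:
    """Apply a sequence of transforms, each a callable taking and returning a sample."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        for transform in self.transforms:
            sample = transform(sample)
        return sample


class ScaleWaveform:
    """Hypothetical transform: multiply every polarization by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, sample):
        waveform = {
            name: [self.factor * value for value in series]
            for name, series in sample["waveform"].items()
        }
        return {"parameters": sample["parameters"], "waveform": waveform}


class SelectPolarization:
    """Hypothetical transform: keep only one polarization."""

    def __init__(self, name):
        self.name = name

    def __call__(self, sample):
        waveform = {self.name: sample["waveform"][self.name]}
        return {"parameters": sample["parameters"], "waveform": waveform}


class ToyWaveformDataset:
    """Map-style dataset: it defines __len__ and __getitem__."""

    def __init__(self, parameters, polarizations, transform=None):
        self.parameters = parameters
        self.polarizations = polarizations
        self.transform = transform

    def __len__(self):
        return len(self.parameters)

    def __getitem__(self, idx):
        sample = {
            "parameters": dict(self.parameters[idx]),
            "waveform": {name: rows[idx] for name, rows in self.polarizations.items()},
        }
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


dataset = ToyWaveformDataset(
    parameters=[{"chirp_mass": 30.0}, {"chirp_mass": 40.0}],
    polarizations={"h_plus": [[1.0, 2.0], [3.0, 4.0]], "h_cross": [[0.5, 0.5], [1.0, 1.0]]},
    transform=Compose([ScaleWaveform(2.0), SelectPolarization("h_plus")]),
)
sample = dataset[1]
```

Keeping each processing step in its own callable makes it possible to reorder or swap steps through configuration rather than by editing the dataset class.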

Generating a simple dataset

As described above, the WaveformDataset class is just a container, and does not generate the contents itself. Dataset generation is instead carried out using functions in the dingo.gw.dataset.generate_dataset module. Although in practice, datasets are likely to be generated from a settings file using the command line interface, here we describe how to generate one interactively.

A dataset is based on an intrinsic prior and a waveform generator, so we build these as described here.

import warnings
warnings.filterwarnings("ignore", "Wswiglal-redir-stdio")
import lal
from dingo.gw.waveform_generator import WaveformGenerator
from bilby.core.prior import PriorDict
from dingo.gw.prior import default_intrinsic_dict
try:
    from dingo.gw.domains import FrequencyDomain
except ImportError:
    # Assumption: releases where this import fails expose the same class
    # under the name UniformFrequencyDomain.
    from dingo.gw.domains import UniformFrequencyDomain as FrequencyDomain

# Frequency grid on which the polarizations are generated.
domain = FrequencyDomain(f_min=20.0, f_max=1024.0, delta_f=0.125)
wfg = WaveformGenerator(approximant='IMRPhenomXPHM', domain=domain, f_ref=20.0)
# Default prior over the intrinsic parameters.
prior = PriorDict(default_intrinsic_dict)

We can use the following function to generate sets of parameters and associated waveforms:

from dingo.gw.dataset.generate_dataset import generate_parameters_and_polarizations

parameters, polarizations = generate_parameters_and_polarizations(wfg,
                                                                  prior,
                                                                  num_samples=100,
                                                                  num_processes=1)
Generating dataset of size 100
parameters
     mass_ratio  chirp_mass  luminosity_distance  theta_jn     phase       a_1       a_2    tilt_1    tilt_2    phi_12    phi_jl  geocent_time
  0    0.218187   73.845050               1000.0  1.255204  1.966362  0.197980  0.240156  1.972606  1.376228  2.186446  4.752777           0.0
  1    0.381173   87.704762               1000.0  2.033628  3.888862  0.460440  0.692240  1.754236  0.661015  0.790942  5.066653           0.0
  2    0.510406   93.479307               1000.0  1.859908  3.469898  0.023533  0.296818  2.552577  0.359922  2.138755  3.489143           0.0
  3    0.678305   92.145038               1000.0  0.758713  2.841377  0.172021  0.934613  0.359660  2.157047  3.599841  0.860001           0.0
  4    0.624489   33.540545               1000.0  1.582852  1.577590  0.413280  0.964930  1.929234  2.084173  1.543995  5.298489           0.0
...         ...         ...                  ...       ...       ...       ...       ...       ...       ...       ...       ...           ...
 95    0.540129   87.451546               1000.0  2.696406  5.270380  0.201667  0.187635  0.447384  1.944557  0.052446  0.952740           0.0
 96    0.803457   66.013454               1000.0  0.379665  0.175340  0.437341  0.730075  1.475004  2.752046  5.595977  2.047529           0.0
 97    0.861454   75.908534               1000.0  1.805871  1.334242  0.505140  0.566819  0.965326  0.194196  0.807147  2.357237           0.0
 98    0.380818   45.702456               1000.0  1.684684  3.820672  0.092019  0.228797  1.478859  1.849281  5.860794  0.562862           0.0
 99    0.941143   69.169888               1000.0  2.045144  0.209135  0.925224  0.975578  1.644663  1.359320  3.098630  4.976837           0.0

100 rows × 12 columns

polarizations
{'h_plus': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        ...,
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]]),
 'h_cross': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        ...,
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]])}

We can then put these in a WaveformDataset,

from dingo.gw.dataset import WaveformDataset

dataset_dict = {'parameters': parameters, 'polarizations': polarizations}
wfd = WaveformDataset(dictionary=dataset_dict)

Samples can then be easily indexed,

wfd[0]
{'parameters': {'mass_ratio': 0.21818708420007127,
  'chirp_mass': 73.84505046384619,
  'luminosity_distance': 1000.0,
  'theta_jn': 1.2552044083558263,
  'phase': 1.9663616795623784,
  'a_1': 0.19797999253228177,
  'a_2': 0.24015632412352614,
  'tilt_1': 1.9726056028558197,
  'tilt_2': 1.3762284791622097,
  'phi_12': 2.186446046131814,
  'phi_jl': 4.752777219226601,
  'geocent_time': 0.0},
 'waveform': {'h_plus': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),
  'h_cross': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])}}

Note

The sample is represented as a nested dictionary. This is a standard format for Dingo.
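
Working with the nested format mostly amounts to ordinary dictionary access. The sketch below uses a shortened toy sample (two parameters rounded from the output above, two-element waveforms) and a hypothetical helper flatten, which is not a Dingo function, to show how nested keys can be turned into flat paths when needed.

```python
# Shortened toy sample in the nested {"parameters": ..., "waveform": ...} layout.
sample = {
    "parameters": {"mass_ratio": 0.218187, "chirp_mass": 73.845050},
    "waveform": {"h_plus": [0j, 0j], "h_cross": [0j, 0j]},
}


def flatten(nested, prefix=""):
    """Hypothetical helper: map nested keys to 'outer/inner' paths."""
    flat = {}
    for key, value in nested.items():
        name = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


chirp_mass = sample["parameters"]["chirp_mass"]  # direct nested access
flat = flatten(sample)
```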

Automated dataset construction

The simple dataset constructed above is useful for illustrative purposes, but it lacks several important features:

  • Waveforms are not compressed. A dataset with many samples would therefore take up enormous storage space.

  • The dataset is not reproducible. It contains no metadata describing its construction (e.g., waveform approximant, domain, prior, …).

The generate_dataset function provides these features automatically:

dingo.gw.dataset.generate_dataset.generate_dataset(settings: Dict, num_processes: int) → WaveformDataset

Generate a waveform dataset.

Parameters:
  • settings (dict) – Dictionary of settings to configure the dataset

  • num_processes (int) – Number of parallel processes used to generate the waveforms

Returns:

A WaveformDataset based on the settings.

This function is in turn wrapped by the command-line functions dingo_generate_dataset and dingo_generate_dataset_dag. These take a .yaml file with the same contents as the settings dictionary.
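
For interactive use, the settings can be written directly as a Python dictionary instead of a .yaml file. The sketch below is hedged: the top-level section names are taken from the configuration described in this document, the small num_samples value is chosen only for illustration, and the helper missing_sections is hypothetical. It assumes that every section not marked optional is required.

```python
settings = {
    "domain": {"type": "FrequencyDomain", "f_min": 20.0, "f_max": 1024.0, "delta_f": 0.125},
    "waveform_generator": {"approximant": "IMRPhenomXPHM", "f_ref": 20.0},
    "intrinsic_prior": {
        "mass_1": "bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_1')",
        "mass_2": "bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_2')",
        "chirp_mass": "bilby.gw.prior.UniformInComponentsChirpMass(minimum=25.0, maximum=100.0, name='chirp_mass')",
        "mass_ratio": "bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0, name='mass_ratio')",
        "phase": "default",
        "a_1": "bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_1')",
        "a_2": "bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_2')",
        "tilt_1": "default",
        "tilt_2": "default",
        "phi_12": "default",
        "phi_jl": "default",
        "theta_jn": "default",
        "luminosity_distance": 100.0,  # Mpc
        "geocent_time": 0.0,  # s
    },
    "num_samples": 1000,  # small value for a quick interactive test
}

# Assumption: these sections are required; "compression" is optional.
REQUIRED_SECTIONS = ("domain", "waveform_generator", "intrinsic_prior", "num_samples")


def missing_sections(settings_dict):
    """Hypothetical sanity check: list required sections that are absent."""
    return [name for name in REQUIRED_SECTIONS if name not in settings_dict]


assert missing_sections(settings) == []

# With Dingo installed, the dictionary would then be passed on as:
# from dingo.gw.dataset.generate_dataset import generate_dataset
# wfd = generate_dataset(settings, num_processes=4)
```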

Configuration

A typical settings dictionary / .yaml config file takes the following form, described in detail below:

domain:
  type: FrequencyDomain
  f_min: 20.0
  f_max: 1024.0
  delta_f: 0.125

waveform_generator:
  approximant: IMRPhenomXPHM
  f_ref: 20.0
  # f_start: 15.0  # Optional setting useful for EOB waveforms. Overrides f_min when generating waveforms.
  # new_interface: true # Optional setting for employing new waveform interface. This is needed for SEOBNRv5 approximants, and optional for standard LAL approximants.
  spin_conversion_phase: 0.0

# Dataset only samples over intrinsic parameters. Extrinsic parameters are chosen at train time.
intrinsic_prior:
  mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_1')
  mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_2')
  chirp_mass: bilby.gw.prior.UniformInComponentsChirpMass(minimum=25.0, maximum=100.0, name='chirp_mass')
  mass_ratio: bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0, name='mass_ratio')
  phase: default
  a_1: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_1')
  a_2: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_2')
  tilt_1: default
  tilt_2: default
  phi_12: default
  phi_jl: default
  theta_jn: default
  # Reference values for fixed (extrinsic) parameters. These are needed to generate a waveform.
  luminosity_distance: 100.0  # Mpc
  geocent_time: 0.0  # s

# Dataset size
num_samples: 5000000

# Save a compressed representation of the dataset
compression:
  svd:
    # Truncate the SVD basis at this size. No truncation if zero.
    size: 200
    num_training_samples: 50000
    num_validation_samples: 10000
  whitening: aLIGO_ZERO_DET_high_P_asd.txt

domain

Specifies the data domain. Currently only FrequencyDomain is implemented.

waveform_generator

Choose the approximant and reference frequency. For EOB models that require time integration, it is usually necessary to specify a lower starting frequency with the optional f_start setting, which overrides f_min when generating waveforms. In this case, f_ref is ignored.

spin_conversion_phase (optional)

Value for phiRef when converting PE spins to Cartesian spins via bilby_to_lalsimulation_spins. When set to None (default), this uses the phase parameter. When set to 0.0, phase only refers to the azimuthal observation angle, allowing for it to be treated as an extrinsic parameter.

Important

It is necessary to set this to 0.0 if planning to train a phase-marginalized network, and then reconstruct the phase synthetically.

intrinsic_prior

Specify the prior over intrinsic parameters. Intrinsic parameters here refer to those parameters that are needed to generate waveform polarizations. Extrinsic parameters here refer to those parameters that can be sampled and applied rapidly during training. As shown in the example, it is also possible to specify default priors, which is convenient for certain parameters. These are listed in dingo.gw.prior.default_intrinsic_dict.

Intrinsic parameters include masses and spins, but also inclination, reference phase, luminosity distance, and time of coalescence at geocenter. Although inclination and phase are often considered extrinsic parameters, they are needed to generate waveform polarizations and cannot be easily transformed.

Luminosity distance and time of coalescence are considered as both intrinsic and extrinsic. They are needed to generate polarizations, but they can also be easily transformed during training to augment the dataset. We therefore fix them to fiducial values for generating polarizations.
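
The reason these two parameters are easy to transform can be made concrete. The sketch below assumes that the strain amplitude scales inversely with luminosity distance and that, under the Fourier convention \(\tilde h(f) = \int h(t)\, e^{-2\pi i f t}\, dt\), delaying a signal by \(\Delta t\) multiplies it by \(e^{-2\pi i f \Delta t}\). The function names and the toy waveform are hypothetical and are not Dingo's implementation.

```python
import numpy as np


def rescale_distance(h, d_ref, d_new):
    """Move a waveform from distance d_ref to d_new (amplitude scales as 1/d)."""
    return h * (d_ref / d_new)


def shift_time(h, frequencies, dt):
    """Delay a frequency-domain waveform by dt seconds (assumed sign convention)."""
    return h * np.exp(-2j * np.pi * frequencies * dt)


frequencies = np.arange(0.0, 64.0, 0.125)
# Toy polarization generated at the fiducial values d = 100 Mpc and t = 0 s.
h_ref = np.exp(-frequencies / 20.0).astype(complex)

h_far = rescale_distance(h_ref, d_ref=100.0, d_new=400.0)
h_late = shift_time(h_ref, frequencies, dt=0.01)
```

Both operations are elementwise multiplications, so they cost far less than generating a new waveform.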

num_samples

The number of samples to include in the dataset. For a production model, we typically use \(5 \times 10^6\) samples.
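
A rough storage estimate shows why compression matters at this dataset size. The numbers below use the example configuration from this page (f_max = 1024 Hz, delta_f = 0.125 Hz, SVD size 200) and assume double-precision complex values (16 bytes each); the estimate ignores the parameters, the SVD basis itself, and any file-format overhead.

```python
num_samples = 5_000_000
f_max, delta_f = 1024.0, 0.125
svd_size = 200
num_polarizations = 2
bytes_per_complex = 16  # assumption: complex128 storage

# Frequency bins on the uniform grid [0, f_max], endpoints included.
num_bins = int(f_max / delta_f) + 1

uncompressed_bytes = num_samples * num_polarizations * num_bins * bytes_per_complex
compressed_bytes = num_samples * num_polarizations * svd_size * bytes_per_complex

print(f"bins per waveform: {num_bins}")                       # 8193
print(f"uncompressed: {uncompressed_bytes / 1e12:.2f} TB")    # 1.31 TB
print(f"SVD coefficients: {compressed_bytes / 1e9:.0f} GB")   # 32 GB
```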

compression (optional)

How to compress the dataset.

svd (optional)

Construct an SVD basis based on a specified number of additional samples. Save the main dataset in terms of its SVD basis coefficients. The number of elements in the basis is specified by the size setting. The performance of the basis is also evaluated in terms of the mismatch against a number of validation samples. All of the validation information, as well as the basis itself, is saved along with the waveform dataset.
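
The procedure can be sketched with NumPy on synthetic data. This is a toy illustration under stated assumptions: the "waveforms" are random combinations of a few hidden modes rather than physical signals, the mismatch uses a flat (unweighted) inner product, and none of the names below belong to Dingo.

```python
import numpy as np

rng = np.random.default_rng(0)
num_bins, true_rank = 256, 5
# Toy stand-in for a waveform family: every "waveform" is a random complex
# combination of a few hidden modes, so it lies in a low-dimensional subspace.
hidden_modes = rng.normal(size=(true_rank, num_bins)) + 1j * rng.normal(size=(true_rank, num_bins))


def toy_waveforms(n):
    coeffs = rng.normal(size=(n, true_rank)) + 1j * rng.normal(size=(n, true_rank))
    return coeffs @ hidden_modes


def mismatch(a, b):
    """1 - normalized overlap, using a flat (unweighted) inner product."""
    overlap = np.real(np.sum(a * np.conj(b), axis=-1))
    norms = np.sqrt(np.sum(np.abs(a) ** 2, axis=-1) * np.sum(np.abs(b) ** 2, axis=-1))
    return 1.0 - overlap / norms


# 1. Build the basis from dedicated training samples.
_, _, Vh = np.linalg.svd(toy_waveforms(200), full_matrices=False)


def max_validation_mismatch(size, validation):
    V = Vh[:size].conj().T                     # (num_bins, size) truncated basis
    coefficients = validation @ V              # 2. compress: basis coefficients
    reconstructed = coefficients @ V.conj().T  # decompress
    return np.max(mismatch(validation, reconstructed))


# 3. Evaluate the basis on separate validation samples.
validation = toy_waveforms(50)
full_basis_mismatch = max_validation_mismatch(true_rank, validation)
truncated_mismatch = max_validation_mismatch(2, validation)
```

With a basis as large as the number of hidden modes the reconstruction is exact up to rounding, while truncating further makes the compression visibly lossy. This is the trade-off controlled by the size setting.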

whitening (optional)

Whether to save whitened waveforms, and in particular, whether to construct the basis based on whitened waveforms. The basis will be more efficient if whitening is used to adapt it to the detector noise characteristics. To use whitening, simply specify the desired ASD to use, from the Bilby list of ASDs. Note that the whitening is used only for the internal storage of the dataset. When accessing samples from the dataset, they will be unwhitened.

Dataset compression is implemented internally by setting the WaveformGenerator.transform operator, so that elements are compressed immediately after generation (avoiding the need to store many uncompressed waveforms in memory). Likewise, decompression is implemented by setting the WaveformDataset.decompression_transform operator to apply the inverse transformation. This will act on samples to decompress them when accessed through WaveformDataset.__getitem__().
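
The compress-on-generation and decompress-on-access pattern can be sketched with a pair of mutually inverse transforms. The example uses whitening as the only compression step and leaves out the SVD projection; the classes Whiten and Unwhiten, the function toy_generator, and the ASD values are hypothetical and do not correspond to Dingo's classes.

```python
import numpy as np


class Whiten:
    """Hypothetical compression step: divide each polarization by the noise ASD."""

    def __init__(self, asd):
        self.asd = np.asarray(asd)

    def __call__(self, polarizations):
        return {name: h / self.asd for name, h in polarizations.items()}


class Unwhiten:
    """Inverse step: multiply by the ASD to recover the original waveform."""

    def __init__(self, asd):
        self.asd = np.asarray(asd)

    def __call__(self, polarizations):
        return {name: h * self.asd for name, h in polarizations.items()}


def toy_generator(transform=None):
    """Stand-in for a waveform generator with an optional post-generation transform."""
    polarizations = {
        "h_plus": np.array([1 + 1j, 2 + 0j, 0 + 3j]),
        "h_cross": np.array([0 + 1j, 1 + 1j, 2 + 2j]),
    }
    return transform(polarizations) if transform is not None else polarizations


asd = np.array([2.0, 4.0, 8.0])  # hypothetical ASD values, one per frequency bin

stored = toy_generator(transform=Whiten(asd))  # compressed immediately after generation
decompression_transform = Unwhiten(asd)        # applied when a sample is accessed
recovered = decompression_transform(stored)
```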

Important

The automated dataset constructors store the configuration settings in WaveformDataset.settings. This makes the settings available to downstream tasks and keeps them on record for reference.

Command-line interface

In most cases the command-line interface will be used to generate a dataset. Given a settings file, one can call

dingo_generate_dataset --settings_file settings.yaml \
                       --num_processes N \
                       --out_file waveform_dataset.hdf5

This will generate a dataset following the configuration in settings.yaml and save it as waveform_dataset.hdf5, using N processes.

To inspect the dataset (or any other Dingo-generated file) use

dingo_ls waveform_dataset.hdf5

This will print the configuration settings, as well as a summary of the SVD compression performance (if available).

For larger datasets, or those based on slower waveform models, Dingo includes a script that builds a condor DAG, dingo_generate_dataset_dag. This splits the generation of waveforms across several nodes, and then reconstitutes the final dataset.