Building a waveform dataset

For training neural networks, the more training samples the better. With too little training data, one runs the risk of overfitting. Waveforms, however, can be expensive to generate and take up significant storage. Dingo adopts several strategies to mitigate these problems:

Dingo partitions parameters into two types—intrinsic and extrinsic—and builds a training set based only on the intrinsic parameters. This consists of waveform polarizations \(h_+\) and \(h_\times\). Extrinsic parameters are selected during training, and applied to generate the detector waveforms \(h_I\). This augments the training set to provide unlimited samples from the extrinsic parameters.
Saved waveforms are compressed using a singular value decomposition. Although this is lossy, waveform mismatches can be monitored to ensure that they fall below the intrinsic error in the waveform model.
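
The idea behind SVD compression and mismatch monitoring can be illustrated with a toy example. The sketch below is not Dingo's implementation: it uses NumPy only, a made-up two-parameter waveform family (toy_waveform), hypothetical helper names (compress, decompress, mismatch), and a flat, unweighted inner product for the mismatch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frequency grid and waveform family (illustrative only, not a physical model).
f = np.linspace(20.0, 1024.0, 512)

def toy_waveform(amp, tau):
    return amp * f ** (-7.0 / 6.0) * np.exp(-2j * np.pi * f * tau)

def make_batch(n):
    amps = rng.uniform(0.5, 2.0, n)
    taus = rng.uniform(0.0, 0.01, n)
    return np.stack([toy_waveform(a, t) for a, t in zip(amps, taus)])

train = make_batch(200)   # used to build the basis
valid = make_batch(50)    # used to check the compression error

# Rows of Vh are orthonormal basis vectors; keep the leading `size` of them.
_, _, Vh = np.linalg.svd(train, full_matrices=False)
size = 30
V = Vh[:size].conj().T    # shape (num_bins, size)

def compress(h):
    return h @ V           # project onto the truncated basis

def decompress(c):
    return c @ V.conj().T  # reconstruct from basis coefficients

def mismatch(a, b):
    norm = np.sqrt(np.vdot(a, a).real * np.vdot(b, b).real)
    return 1.0 - np.vdot(a, b).real / norm

mismatches = np.array([mismatch(h, decompress(compress(h))) for h in valid])
max_mismatch = mismatches.max()
print(f"compressed {f.size} bins to {size} coefficients, max mismatch {max_mismatch:.2e}")
```

Each waveform is stored as 30 coefficients instead of 512 frequency bins, and the validation mismatches quantify what the truncation costs.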

The `WaveformDataset` class

The WaveformDataset is a storage container for waveform polarizations and parameters, which can be used to serve samples to a neural network during training:

class dingo.gw.dataset.WaveformDataset(file_name: str | None = None, dictionary: dict | None = None, transform=None, precision: Literal['single', 'double'] | None = None, domain_update: dict | None = None, svd_size_update: int | None = None, leave_waveforms_on_disk: bool | None = False)

Bases: DingoDataset, Dataset

This class stores a dataset of waveforms (polarizations) and corresponding parameters.

It can load the dataset either from an HDF5 file or suitable dictionary.

The dataset can either be loaded entirely into memory, or its waveforms can be loaded on demand during training (leave_waveforms_on_disk=True) to reduce the memory footprint. At the moment, only the waveforms can be loaded on demand; the parameters are always loaded up front, since the standardization dict for all parameters in the dataset has to be computed at the beginning of training.

The waveform data is consumed through a __getitem__() or __getitems__() call which optionally loads the polarizations and applies a chain of transformations, which are classes that implement a __call__() method.

For constructing, provide either file_name, or dictionary containing data and settings entries, or neither.

Parameters:

file_name (str) – HDF5 file containing a dataset
dictionary (dict) – Contains settings and data entries. The dictionary keys should be ‘settings’, ‘parameters’, and ‘polarizations’.
transform (Transform) – Transform to be applied to dataset samples when accessed through __getitem__
precision (str ('single', 'double')) – If provided, changes precision of loaded dataset.
domain_update (dict) – If provided, update domain from existing domain using new settings.
svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).
leave_waveforms_on_disk (bool) – If True, the values for the waveforms are not loaded into RAM when initializing the waveform dataset. Instead, they are loaded lazily in __getitem__().

property dtype_map: Mapping[str, DTypeLike | DTypeMap] | None

Mapping from group names to target dtypes for HDF5 loading.

This enables direct dtype conversion during HDF5 read, avoiding intermediate memory allocation when changing precision.

initialize_decompression(svd_size_update: int | None = None)

Sets up decompression transforms. These are applied to the raw dataset before self.transform. E.g., SVD decompression.

Parameters: svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).

load_supplemental(domain_update: dict | None = None, svd_size_update: int | None = None)

Method called immediately after loading a dataset.

Creates (and possibly updates) domain, updates dtypes, and initializes any decompression transform. Also zeros data below f_min, and truncates above f_max.

Parameters:

domain_update (dict) – If provided, update domain from existing domain using new settings.
svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).

update_domain(domain_update: dict | None = None)

Update the domain based on new configuration.

The waveform dataset provides waveform polarizations in a particular domain. In the frequency domain, this is [0, domain._f_max], and data is set to 0 below domain._f_min. In practice one may want to train a network based on slightly different domain settings, which corresponds to truncating the likelihood integral.

This method provides functionality for that. It truncates and/or zeroes the dataset to the range specified by the domain, by calling domain.update_data.

Parameters: domain_update (dict) – Settings dictionary. Must contain a subset of the keys contained in domain_dict.
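
The truncation and zeroing described for update_domain can be pictured with plain NumPy arrays. This is a sketch under stated assumptions, not the code behind domain.update_data: the helper truncate_and_zero is hypothetical, and it assumes a uniform frequency grid that starts at 0 with spacing delta_f.

```python
import numpy as np

def truncate_and_zero(data, delta_f, f_min_new, f_max_new):
    """Truncate the last axis to [0, f_max_new] and zero bins below f_min_new."""
    n_keep = int(round(f_max_new / delta_f)) + 1
    f = np.arange(n_keep) * delta_f
    out = data[..., :n_keep].copy()   # copy so the original array is untouched
    out[..., f < f_min_new] = 0.0
    return out

# Two placeholder "waveforms" on the grid [0, 1024] Hz with delta_f = 0.125 Hz.
delta_f = 0.125
data = np.ones((2, 8193), dtype=complex)

updated = truncate_and_zero(data, delta_f, f_min_new=30.0, f_max_new=512.0)
print(updated.shape)
```

Narrowing the frequency range this way lets a single stored dataset serve networks trained with different f_min and f_max, as long as the new range lies inside the original one.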

WaveformDataset subclasses dingo.core.dataset.DingoDataset and torch.utils.data.Dataset. The former provides generic functionality for saving and loading datasets as HDF5 files and dictionaries, and is used in several components of Dingo. The latter allows the WaveformDataset to be used with a PyTorch DataLoader. In general, we follow the PyTorch design framework for training, including Datasets, DataLoaders, and Transforms.
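
The Dataset and Transform pattern can be sketched without PyTorch or Dingo. Everything in the example below is a hypothetical stand-in (ToyWaveformDataset, ScaleWaveform, SplitRealImag, Compose); it only mirrors the idea that __getitem__() builds a nested-dictionary sample and passes it through a chain of callables.

```python
class ToyWaveformDataset:
    """Minimal map-style dataset; a stand-in, not the Dingo class."""

    def __init__(self, parameters, polarizations, transform=None):
        self.parameters = parameters        # list of parameter dicts
        self.polarizations = polarizations  # dict: name -> list of waveforms
        self.transform = transform

    def __len__(self):
        return len(self.parameters)

    def __getitem__(self, idx):
        sample = {
            "parameters": dict(self.parameters[idx]),
            "waveform": {k: list(v[idx]) for k, v in self.polarizations.items()},
        }
        return self.transform(sample) if self.transform else sample


class ScaleWaveform:
    """Multiply every polarization by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, sample):
        waveform = {k: [self.factor * x for x in v] for k, v in sample["waveform"].items()}
        return {"parameters": sample["parameters"], "waveform": waveform}


class SplitRealImag:
    """Replace each complex series by separate real and imaginary lists."""

    def __call__(self, sample):
        waveform = {
            k: {"real": [x.real for x in v], "imag": [x.imag for x in v]}
            for k, v in sample["waveform"].items()
        }
        return {"parameters": sample["parameters"], "waveform": waveform}


class Compose:
    """Apply a list of transforms in order."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for transform in self.transforms:
            sample = transform(sample)
        return sample


dataset = ToyWaveformDataset(
    parameters=[{"chirp_mass": 30.0, "mass_ratio": 0.5}],
    polarizations={"h_plus": [[1 + 0j, 2 + 0j]], "h_cross": [[0 + 1j, 0 + 2j]]},
    transform=Compose([ScaleWaveform(0.5), SplitRealImag()]),
)
sample = dataset[0]
print(sample["waveform"]["h_cross"])
```

Because the toy class implements __len__() and __getitem__(), an object of this shape is what a PyTorch DataLoader expects from a map-style dataset.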

Generating a simple dataset

As described above, the WaveformDataset class is just a container, and does not generate the contents itself. Dataset generation is instead carried out using functions in the dingo.gw.dataset.generate_dataset module. Although in practice, datasets are likely to be generated from a settings file using the command line interface, here we describe how to generate one interactively.

A dataset is based on an intrinsic prior and a waveform generator, so we first build these.

import warnings
warnings.filterwarnings("ignore", "Wswiglal-redir-stdio")
import lal

from dingo.gw.waveform_generator import WaveformGenerator
from bilby.core.prior import PriorDict
from dingo.gw.prior import default_intrinsic_dict
from dingo.gw.domains import FrequencyDomain

domain = FrequencyDomain(f_min=20.0, f_max=1024.0, delta_f=0.125)
wfg = WaveformGenerator(approximant='IMRPhenomXPHM', domain=domain, f_ref=20.0)
prior = PriorDict(default_intrinsic_dict)

Note

In the documentation build that produced this page, the import of FrequencyDomain in the cell above failed with "ImportError: cannot import name 'FrequencyDomain' from 'dingo.gw.domains'". The name of the domain class may differ between Dingo versions, so check which classes dingo.gw.domains exports in your installation and adjust the import accordingly.

We can use the following function to generate sets of parameters and associated waveforms:

from dingo.gw.dataset.generate_dataset import generate_parameters_and_polarizations

parameters, polarizations = generate_parameters_and_polarizations(wfg,
                                                                  prior,
                                                                  num_samples=100,
                                                                  num_processes=1)

Generating dataset of size 100

parameters

	mass_ratio	chirp_mass	luminosity_distance	theta_jn	phase	a_1	a_2	tilt_1	tilt_2	phi_12	phi_jl	geocent_time
0	0.218187	73.845050	1000.0	1.255204	1.966362	0.197980	0.240156	1.972606	1.376228	2.186446	4.752777	0.0
1	0.381173	87.704762	1000.0	2.033628	3.888862	0.460440	0.692240	1.754236	0.661015	0.790942	5.066653	0.0
2	0.510406	93.479307	1000.0	1.859908	3.469898	0.023533	0.296818	2.552577	0.359922	2.138755	3.489143	0.0
3	0.678305	92.145038	1000.0	0.758713	2.841377	0.172021	0.934613	0.359660	2.157047	3.599841	0.860001	0.0
4	0.624489	33.540545	1000.0	1.582852	1.577590	0.413280	0.964930	1.929234	2.084173	1.543995	5.298489	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...
95	0.540129	87.451546	1000.0	2.696406	5.270380	0.201667	0.187635	0.447384	1.944557	0.052446	0.952740	0.0
96	0.803457	66.013454	1000.0	0.379665	0.175340	0.437341	0.730075	1.475004	2.752046	5.595977	2.047529	0.0
97	0.861454	75.908534	1000.0	1.805871	1.334242	0.505140	0.566819	0.965326	0.194196	0.807147	2.357237	0.0
98	0.380818	45.702456	1000.0	1.684684	3.820672	0.092019	0.228797	1.478859	1.849281	5.860794	0.562862	0.0
99	0.941143	69.169888	1000.0	2.045144	0.209135	0.925224	0.975578	1.644663	1.359320	3.098630	4.976837	0.0

100 rows × 12 columns

polarizations

{'h_plus': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        ...,
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]]),
 'h_cross': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        ...,
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]])}

We can then put these in a WaveformDataset,

from dingo.gw.dataset import WaveformDataset

dataset_dict = {'parameters': parameters, 'polarizations': polarizations}
wfd = WaveformDataset(dictionary=dataset_dict)

Samples can then be easily indexed,

wfd[0]

{'parameters': {'mass_ratio': 0.21818708420007127,
  'chirp_mass': 73.84505046384619,
  'luminosity_distance': 1000.0,
  'theta_jn': 1.2552044083558263,
  'phase': 1.9663616795623784,
  'a_1': 0.19797999253228177,
  'a_2': 0.24015632412352614,
  'tilt_1': 1.9726056028558197,
  'tilt_2': 1.3762284791622097,
  'phi_12': 2.186446046131814,
  'phi_jl': 4.752777219226601,
  'geocent_time': 0.0},
 'waveform': {'h_plus': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),
  'h_cross': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])}}

Note

The sample is represented as a nested dictionary. This is a standard format for Dingo.

Automated dataset construction

The simple dataset constructed above is useful for illustrative purposes, but it lacks several important features:

Waveforms are not compressed. A dataset with many samples would therefore take up enormous storage space.
Not reproducible. The dataset contains no metadata describing its construction (e.g., waveform approximant, domain, prior, …).

The generate_dataset function automates all of these advanced features:

dingo.gw.dataset.generate_dataset.generate_dataset(settings: Dict, num_processes: int) → WaveformDataset

Generate a waveform dataset.

Parameters:

settings (dict) – Dictionary of settings to configure the dataset
num_processes (int) – Number of processes to use when generating the waveforms

Return type:

A WaveformDataset based on the settings.

This function is in turn wrapped by the command-line functions dingo_generate_dataset and dingo_generate_dataset_dag. These take a .yaml file with the same contents as the settings dictionary.

Configuration

A typical settings dictionary / .yaml config file takes the following form, described in detail below:

domain:
  type: FrequencyDomain
  f_min: 20.0
  f_max: 1024.0
  delta_f: 0.125

waveform_generator:
  approximant: IMRPhenomXPHM
  f_ref: 20.0
  # f_start: 15.0  # Optional setting useful for EOB waveforms. Overrides f_min when generating waveforms.
  # new_interface: true # Optional setting for employing new waveform interface. This is needed for SEOBNRv5 approximants, and optional for standard LAL approximants.
  spin_conversion_phase: 0.0

# Dataset only samples over intrinsic parameters. Extrinsic parameters are chosen at train time.
intrinsic_prior:
  mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_1')
  mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_2')
  chirp_mass: bilby.gw.prior.UniformInComponentsChirpMass(minimum=25.0, maximum=100.0, name='chirp_mass')
  mass_ratio: bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0, name='mass_ratio')
  phase: default
  a_1: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_1')
  a_2: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_2')
  tilt_1: default
  tilt_2: default
  phi_12: default
  phi_jl: default
  theta_jn: default
  # Reference values for fixed (extrinsic) parameters. These are needed to generate a waveform.
  luminosity_distance: 100.0  # Mpc
  geocent_time: 0.0  # s

# Dataset size
num_samples: 5000000

# Save a compressed representation of the dataset
compression:
  svd:
    # Truncate the SVD basis at this size. No truncation if zero.
    size: 200
    num_training_samples: 50000
    num_validation_samples: 10000
  whitening: aLIGO_ZERO_DET_high_P_asd.txt

domain

Specifies the data domain. Currently only FrequencyDomain is implemented.

waveform_generator

Choose the approximant and reference frequency. For EOB models that require time integration, it is usually necessary to specify a lower starting frequency. In this case, f_ref is ignored.

spin_conversion_phase (optional): Value for phiRef when converting PE spins to Cartesian spins via bilby_to_lalsimulation_spins. When set to None (default), this uses the phase parameter. When set to 0.0, phase only refers to the azimuthal observation angle, allowing for it to be treated as an extrinsic parameter.

Important

It is necessary to set this to 0.0 if planning to train a phase-marginalized network, and then reconstruct the phase synthetically.

intrinsic_prior

Specify the prior over intrinsic parameters. Intrinsic parameters here refer to those parameters that are needed to generate waveform polarizations. Extrinsic parameters here refer to those parameters that can be sampled and applied rapidly during training. As shown in the example, it is also possible to specify default priors, which is convenient for certain parameters. These are listed in dingo.gw.prior.default_intrinsic_dict.

Intrinsic parameters include masses and spins, but also inclination, reference phase, luminosity distance, and time of coalescence at geocenter. Although inclination and phase are often considered extrinsic parameters, they are needed to generate waveform polarizations and cannot be easily transformed.

Luminosity distance and time of coalescence are treated as both intrinsic and extrinsic: they are needed to generate polarizations, but they can also be easily transformed during training to augment the dataset. We therefore fix them to fiducial values when generating polarizations.
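
To make "easily transformed" concrete: in the frequency domain, a change of luminosity distance is an overall amplitude rescaling, and a time shift is a frequency-dependent phase factor. The sketch below assumes a uniform frequency grid and the sign convention exp(-2πi f Δt) for a delay; the function name apply_distance_and_time is hypothetical and this is not Dingo's transform code.

```python
import numpy as np

def apply_distance_and_time(h, f, d_ref, d_new, dt):
    """Rescale h from distance d_ref to d_new (Mpc) and shift it by dt (s)."""
    amplitude = d_ref / d_new                 # strain scales as 1 / distance
    phase = np.exp(-2j * np.pi * f * dt)      # assumed sign convention for a delay
    return h * amplitude * phase

# Frequency grid matching the example configuration: [0, 1024] Hz, delta_f = 0.125 Hz.
f = np.arange(8193) * 0.125

# Placeholder polarization generated at the fiducial values d_ref = 100 Mpc, t = 0 s.
h_ref = np.ones_like(f, dtype=complex)

h_new = apply_distance_and_time(h_ref, f, d_ref=100.0, d_new=400.0, dt=0.01)
print(abs(h_new[200]), np.angle(h_new[200]))
```

Both operations are elementwise array products, which is why they can be applied on the fly during training instead of being stored in the dataset.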

num_samples

The number of samples to include in the dataset. For a production model, we typically use \(5 \times 10^6\) samples.
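
A rough storage estimate shows why compression matters at this scale. The numbers below use the example configuration (f_max = 1024, delta_f = 0.125, SVD size 200) and assume two polarizations stored as complex128 (16 bytes per value) on a grid covering [0, f_max]. File-format overhead and the storage of the basis itself are ignored, so treat the result as an order-of-magnitude figure only.

```python
num_samples = 5_000_000
num_polarizations = 2
bytes_per_value = 16                      # complex128 (assumption)

f_max, delta_f = 1024.0, 0.125
num_bins = int(f_max / delta_f) + 1       # bins on [0, f_max]
svd_size = 200

uncompressed_bytes = num_samples * num_polarizations * num_bins * bytes_per_value
compressed_bytes = num_samples * num_polarizations * svd_size * bytes_per_value

print(f"bins per waveform: {num_bins}")
print(f"uncompressed: {uncompressed_bytes / 1e12:.2f} TB")
print(f"compressed:   {compressed_bytes / 1e9:.0f} GB")
print(f"ratio:        {uncompressed_bytes / compressed_bytes:.1f}x")
```

Under these assumptions the full-resolution dataset would be over a terabyte, while the SVD coefficients need tens of gigabytes.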

compression (optional)

How to compress the dataset.

svd (optional): Construct an SVD basis based on a specified number of additional samples. Save the main dataset in terms of its SVD basis coefficients. The number of elements in the basis is specified by the size setting. The performance of the basis is also evaluated in terms of the mismatch against a number of validation samples. All of the validation information, as well as the basis itself, is saved along with the waveform dataset.
whitening (optional): Whether to save whitened waveforms, and in particular, whether to construct the basis based on whitened waveforms. The basis will be more efficient if whitening is used to adapt it to the detector noise characteristics. To use whitening, simply specify the desired ASD to use, from the Bilby list of ASDs. Note that the whitening is used only for the internal storage of the dataset. When accessing samples from the dataset, they will be unwhitened.
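
Whitening for storage and unwhitening on access amount to dividing and then multiplying by the ASD. The sketch below illustrates that round trip with NumPy; the ASD curve and the waveform are made up for illustration and are not taken from Bilby or Dingo.

```python
import numpy as np

f = np.arange(8193) * 0.125               # [0, 1024] Hz, delta_f = 0.125 Hz

# Made-up smooth ASD standing in for a real detector noise curve.
asd = 1e-23 * (1.0 + (50.0 / np.maximum(f, 1.0)) ** 4 + (f / 500.0) ** 2)

# Made-up waveform with a power-law amplitude and a linear phase.
h = 1e-22 * np.maximum(f, 20.0) ** (-7.0 / 6.0) * np.exp(-2j * np.pi * f * 0.003)

h_white = h / asd                         # representation used for storage and the SVD basis
h_restored = h_white * asd                # what a user would get back on access

max_rel_error = np.max(np.abs(h_restored - h) / np.abs(h))
print(f"max relative round-trip error: {max_rel_error:.1e}")
```

Whitening down-weights frequency bins where the noise is large, so a basis built on whitened waveforms spends its capacity where the detector is most sensitive.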

Dataset compression is implemented internally by setting the WaveformGenerator.transform operator, so that elements are compressed immediately after generation (avoiding the need to store many uncompressed waveforms in memory). Likewise, decompression is implemented by setting the WaveformDataset.decompression_transform operator to apply the inverse transformation. This will act on samples to decompress them when accessed through WaveformDataset.__getitem__().

Important

The automated dataset constructors store the configuration settings in WaveformDataset.settings. This makes the settings available to downstream tasks and keeps them on record for reference.

Command-line interface

In most cases the command-line interface will be used to generate a dataset. Given a settings file, one can call

dingo_generate_dataset --settings_file settings.yaml \
                       --num_processes N \
                       --out_file waveform_dataset.hdf5

This will generate a dataset following the configuration in settings.yaml and save it as waveform_dataset.hdf5, using N processes.

To inspect the dataset (or any other Dingo-generated file) use

dingo_ls waveform_dataset.hdf5

This will print the configuration settings, as well as a summary of the SVD compression performance (if available).

For larger datasets, or those based on slower waveform models, Dingo includes a script that builds a condor DAG, dingo_generate_dataset_dag. This splits the generation of waveforms across several nodes, and then reconstitutes the final dataset.
