Building a waveform dataset
For training neural networks, the more training samples the better. With too little training data, one runs the risk of overfitting. Waveforms, however, can be expensive to generate and take up significant storage. Dingo adopts several strategies to mitigate these problems:
Dingo partitions parameters into two types—intrinsic and extrinsic—and builds a training set based only on the intrinsic parameters. This consists of waveform polarizations \(h_+\) and \(h_\times\). Extrinsic parameters are selected during training, and applied to generate the detector waveforms \(h_I\). This augments the training set to provide unlimited samples from the extrinsic parameters.
Saved waveforms are compressed using a singular value decomposition. Although this is lossy, waveform mismatches can be monitored to ensure that they fall below the intrinsic error in the waveform model.
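The on-the-fly application of extrinsic parameters can be illustrated with a minimal sketch. It assumes the standard detector response \(h_I = F_+ h_+ + F_\times h_\times\), an amplitude rescaling by the ratio of reference to sampled luminosity distance, and a frequency-domain time shift \(e^{-2\pi i f \Delta t}\) (the sign depends on the Fourier convention). The function name project_onto_detector and the antenna-pattern values are hypothetical; this is not Dingo's implementation.

```python
import cmath
import math


def project_onto_detector(h_plus, h_cross, freqs, f_plus, f_cross,
                          d_ref, d_new, dt):
    """Toy detector projection of frequency-domain polarizations.

    h_plus, h_cross: lists of complex strain values generated at distance d_ref
    freqs: frequency (Hz) of each bin
    f_plus, f_cross: antenna pattern factors (assumed to be given)
    d_new: sampled luminosity distance; dt: time shift in seconds
    """
    scale = d_ref / d_new  # strain amplitude is inversely proportional to distance
    h_det = []
    for hp, hc, f in zip(h_plus, h_cross, freqs):
        shift = cmath.exp(-2j * math.pi * f * dt)  # time shift as a phase factor
        h_det.append(scale * shift * (f_plus * hp + f_cross * hc))
    return h_det


# One stored pair of polarizations yields many different detector waveforms.
freqs = [20.0, 20.125, 20.25]
h_plus = [1.0 + 0.0j, 0.5 + 0.5j, 0.0 + 1.0j]
h_cross = [0.0 + 1.0j, -0.5 + 0.5j, -1.0 + 0.0j]
h_a = project_onto_detector(h_plus, h_cross, freqs, 0.3, -0.6, 100.0, 400.0, 0.01)
h_b = project_onto_detector(h_plus, h_cross, freqs, -0.8, 0.1, 100.0, 1500.0, -0.02)
```

Because only cheap arithmetic is involved, a fresh draw of extrinsic parameters can be applied each time a sample is served, without regenerating the polarizations.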
The WaveformDataset class
The WaveformDataset is a storage container for waveform polarizations and parameters, which can be used to serve samples to a neural network during training:
- class dingo.gw.dataset.WaveformDataset(file_name: str | None = None, dictionary: dict | None = None, transform=None, precision: str | None = None, domain_update: dict | None = None, svd_size_update: int | None = None, leave_waveforms_on_disk: bool | None = False)
Bases: DingoDataset, Dataset
This class stores a dataset of waveforms (polarizations) and corresponding parameters.
It can load the dataset either from an HDF5 file or suitable dictionary.
It is possible to either load the entire dataset into memory or to load the waveforms lazily during training (leave_waveforms_on_disk=True) to reduce the memory footprint. At the moment, only the waveforms can be loaded on demand; the parameters are always loaded up front, since the standardization dict for all parameters in the dataset has to be computed at the beginning of training.
The waveform data is consumed through a __getitem__() or __getitems__() call which optionally loads the polarizations and applies a chain of transformations, which are classes that implement a __call__() method.
For constructing, provide either file_name, or dictionary containing data and settings entries, or neither.
- Parameters:
file_name (str) – HDF5 file containing a dataset
dictionary (dict) – Contains settings and data entries. The dictionary keys should be ‘settings’, ‘parameters’, and ‘polarizations’.
transform (Transform) – Transform to be applied to dataset samples when accessed through __getitem__
precision (str ('single', 'double')) – If provided, changes precision of loaded dataset.
domain_update (dict) – If provided, update domain from existing domain using new settings.
svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).
leave_waveforms_on_disk (bool) – If True, the values for the waveforms are not loaded into RAM when initializing the waveform dataset. Instead, they are loaded lazily in __getitem__().
- initialize_decompression(svd_size_update: int | None = None)
Sets up decompression transforms. These are applied to the raw dataset before self.transform. E.g., SVD decompression.
- Parameters:
svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).
- load_supplemental(domain_update: dict | None = None, svd_size_update: int | None = None)
Method called immediately after loading a dataset.
Creates (and possibly updates) domain, updates dtypes, and initializes any decompression transform. Also zeros data below f_min, and truncates above f_max.
- Parameters:
domain_update (dict) – If provided, update domain from existing domain using new settings.
svd_size_update (int) – If provided, reduces the SVD size when decompressing (for speed).
- update_domain(domain_update: dict | None = None)
Update the domain based on new configuration.
The waveform dataset provides waveform polarizations in a particular domain. For a FrequencyDomain, this is the range [0, domain._f_max]. Furthermore, data is set to 0 below domain._f_min. In practice one may want to train a network based on slightly different domain settings, which corresponds to truncating the likelihood integral.
This method provides functionality for that. It truncates and/or zeroes the dataset to the range specified by the domain, by calling domain.update_data.
- Parameters:
domain_update (dict) – Settings dictionary. Must contain a subset of the keys contained in domain_dict.
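As a rough illustration of what this truncation and zeroing amounts to, the following sketch operates on a plain list of frequency-domain samples on a uniform grid starting at 0 Hz. The helper update_data_toy is hypothetical and only mimics the behaviour described above; it is not the domain.update_data implementation, and it assumes that f_min and f_max lie on the frequency grid.

```python
def update_data_toy(data, delta_f, f_min, f_max):
    """Truncate data above f_max and zero it below f_min.

    data[i] is assumed to be the value at frequency i * delta_f, and
    f_min and f_max are assumed to be multiples of delta_f.
    """
    n_keep = int(round(f_max / delta_f)) + 1   # bins covering [0, f_max]
    i_min = int(round(f_min / delta_f))        # index of the bin at f_min
    truncated = list(data[:n_keep])
    for i in range(min(i_min, len(truncated))):
        truncated[i] = 0.0
    return truncated


# Original grid: [0, 8] Hz with delta_f = 1 Hz (9 bins), all ones for clarity.
original = [1.0] * 9
# A settings dictionary holding only the keys that change.
domain_update = {"f_min": 2.0, "f_max": 6.0}
updated = update_data_toy(original, delta_f=1.0, **domain_update)
print(updated)  # [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]
```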
WaveformDataset subclasses dingo.core.dataset.DingoDataset and torch.utils.data.Dataset. The former provides generic functionality for saving and loading datasets as HDF5 files and dictionaries, and is used in several components of Dingo. The latter allows the WaveformDataset to be used with a PyTorch DataLoader. In general, we follow the PyTorch design framework for training, including Datasets, DataLoaders, and Transforms.
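The Dataset-plus-Transforms pattern can be sketched without PyTorch or Dingo. The class names below (ToyWaveformDataset, ScaleAmplitude, Compose) are hypothetical stand-ins: the dataset stores parameters and polarizations, and __getitem__ returns a nested dictionary after passing it through a chain of callables.

```python
class ScaleAmplitude:
    """Transform: multiply both polarizations by a constant factor."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, sample):
        waveform = {k: [self.factor * x for x in v]
                    for k, v in sample["waveform"].items()}
        return {"parameters": sample["parameters"], "waveform": waveform}


class Compose:
    """Apply a list of transforms in order."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, sample):
        for t in self.transforms:
            sample = t(sample)
        return sample


class ToyWaveformDataset:
    """Map-style dataset: implements __len__ and __getitem__."""

    def __init__(self, parameters, polarizations, transform=None):
        self.parameters = parameters          # list of dicts, one per sample
        self.polarizations = polarizations    # dict of lists of waveforms
        self.transform = transform

    def __len__(self):
        return len(self.parameters)

    def __getitem__(self, idx):
        sample = {
            "parameters": dict(self.parameters[idx]),
            "waveform": {k: v[idx] for k, v in self.polarizations.items()},
        }
        if self.transform is not None:
            sample = self.transform(sample)
        return sample


parameters = [{"chirp_mass": 30.0}, {"chirp_mass": 45.0}]
polarizations = {"h_plus": [[1 + 0j, 2 + 0j], [3 + 0j, 4 + 0j]],
                 "h_cross": [[0 + 1j, 0 + 2j], [0 + 3j, 0 + 4j]]}
toy = ToyWaveformDataset(parameters, polarizations,
                         transform=Compose([ScaleAmplitude(0.5)]))
print(toy[1]["waveform"]["h_plus"])  # [(1.5+0j), (2+0j)]
```

An object that implements __len__ and __getitem__ in this way is what a PyTorch DataLoader expects from a map-style dataset.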
Generating a simple dataset
As described above, the WaveformDataset class is just a container, and does not generate the contents itself. Dataset generation is instead carried out using functions in the dingo.gw.dataset.generate_dataset module. In practice, datasets are likely to be generated from a settings file using the command-line interface, but here we describe how to generate one interactively.
A dataset is based on an intrinsic prior and a waveform generator, so we build these as described here.
import warnings
warnings.filterwarnings("ignore", "Wswiglal-redir-stdio")
import lal
from dingo.gw.waveform_generator import WaveformGenerator
from bilby.core.prior import PriorDict
from dingo.gw.prior import default_intrinsic_dict
from dingo.gw.domains import FrequencyDomain
domain = FrequencyDomain(f_min=20.0, f_max=1024.0, delta_f=0.125)
wfg = WaveformGenerator(approximant='IMRPhenomXPHM', domain=domain, f_ref=20.0)
prior = PriorDict(default_intrinsic_dict)
Note
In the documentation build for the latest Dingo version, the import of FrequencyDomain from dingo.gw.domains fails with ImportError: cannot import name 'FrequencyDomain' from 'dingo.gw.domains'. The name or location of the domain class may therefore differ depending on the installed version.
We can use the following function to generate sets of parameters and associated waveforms:
from dingo.gw.dataset.generate_dataset import generate_parameters_and_polarizations
parameters, polarizations = generate_parameters_and_polarizations(
    wfg, prior, num_samples=100, num_processes=1
)
Generating dataset of size 100
parameters
| | mass_ratio | chirp_mass | luminosity_distance | theta_jn | phase | a_1 | a_2 | tilt_1 | tilt_2 | phi_12 | phi_jl | geocent_time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.218187 | 73.845050 | 1000.0 | 1.255204 | 1.966362 | 0.197980 | 0.240156 | 1.972606 | 1.376228 | 2.186446 | 4.752777 | 0.0 |
| 1 | 0.381173 | 87.704762 | 1000.0 | 2.033628 | 3.888862 | 0.460440 | 0.692240 | 1.754236 | 0.661015 | 0.790942 | 5.066653 | 0.0 |
| 2 | 0.510406 | 93.479307 | 1000.0 | 1.859908 | 3.469898 | 0.023533 | 0.296818 | 2.552577 | 0.359922 | 2.138755 | 3.489143 | 0.0 |
| 3 | 0.678305 | 92.145038 | 1000.0 | 0.758713 | 2.841377 | 0.172021 | 0.934613 | 0.359660 | 2.157047 | 3.599841 | 0.860001 | 0.0 |
| 4 | 0.624489 | 33.540545 | 1000.0 | 1.582852 | 1.577590 | 0.413280 | 0.964930 | 1.929234 | 2.084173 | 1.543995 | 5.298489 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 0.540129 | 87.451546 | 1000.0 | 2.696406 | 5.270380 | 0.201667 | 0.187635 | 0.447384 | 1.944557 | 0.052446 | 0.952740 | 0.0 |
| 96 | 0.803457 | 66.013454 | 1000.0 | 0.379665 | 0.175340 | 0.437341 | 0.730075 | 1.475004 | 2.752046 | 5.595977 | 2.047529 | 0.0 |
| 97 | 0.861454 | 75.908534 | 1000.0 | 1.805871 | 1.334242 | 0.505140 | 0.566819 | 0.965326 | 0.194196 | 0.807147 | 2.357237 | 0.0 |
| 98 | 0.380818 | 45.702456 | 1000.0 | 1.684684 | 3.820672 | 0.092019 | 0.228797 | 1.478859 | 1.849281 | 5.860794 | 0.562862 | 0.0 |
| 99 | 0.941143 | 69.169888 | 1000.0 | 2.045144 | 0.209135 | 0.925224 | 0.975578 | 1.644663 | 1.359320 | 3.098630 | 4.976837 | 0.0 |
100 rows × 12 columns
polarizations
{'h_plus': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]]),
'h_cross': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
...,
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],
[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]])}
We can then put these in a WaveformDataset,
from dingo.gw.dataset import WaveformDataset
dataset_dict = {'parameters': parameters, 'polarizations': polarizations}
wfd = WaveformDataset(dictionary=dataset_dict)
Samples can then be easily indexed,
wfd[0]
{'parameters': {'mass_ratio': 0.21818708420007127,
'chirp_mass': 73.84505046384619,
'luminosity_distance': 1000.0,
'theta_jn': 1.2552044083558263,
'phase': 1.9663616795623784,
'a_1': 0.19797999253228177,
'a_2': 0.24015632412352614,
'tilt_1': 1.9726056028558197,
'tilt_2': 1.3762284791622097,
'phi_12': 2.186446046131814,
'phi_jl': 4.752777219226601,
'geocent_time': 0.0},
'waveform': {'h_plus': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),
'h_cross': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])}}
Note
The sample is represented as a nested dictionary. This is a standard format for Dingo.
Automated dataset construction
The simple dataset constructed above is useful for illustrative purposes, but it lacks several important features:
Waveforms are not compressed. A dataset with many samples would therefore take up enormous storage space.
Not reproducible. The dataset contains no metadata describing its construction (e.g., waveform approximant, domain, prior, …).
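A back-of-the-envelope estimate makes the storage cost concrete. The sketch below uses the domain from the example above (f_max = 1024 Hz, delta_f = 0.125 Hz) and assumes two polarizations stored as single-precision complex numbers (8 bytes each). The sample count of 5 × 10^6 and the basis size of 200 are the values used in the configuration example in this section. The figures ignore parameters, metadata, and file-format overhead.

```python
f_max = 1024.0
delta_f = 0.125
num_bins = int(f_max / delta_f) + 1      # uniform grid on [0, f_max]
num_samples = 5_000_000
num_polarizations = 2
bytes_per_complex = 8                    # assumption: complex64

uncompressed = num_samples * num_polarizations * num_bins * bytes_per_complex
svd_size = 200                           # coefficients kept per polarization
compressed = num_samples * num_polarizations * svd_size * bytes_per_complex

print(num_bins)                          # 8193
print(f"{uncompressed / 1e9:.1f} GB")    # 655.4 GB
print(f"{compressed / 1e9:.1f} GB")      # 16.0 GB
```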
The generate_dataset function automates all of these advanced features:
- dingo.gw.dataset.generate_dataset.generate_dataset(settings: Dict, num_processes: int) → WaveformDataset
Generate a waveform dataset.
- Parameters:
settings (dict) – Dictionary of settings to configure the dataset
num_processes (int) – Number of processes to use for generating waveforms
- Return type:
A WaveformDataset based on the settings.
This function is in turn wrapped by the command-line functions dingo_generate_dataset and dingo_generate_dataset_dag. These take a .yaml file with the same contents as the settings dictionary.
Configuration
A typical settings dictionary / .yaml config file takes the following form, described in detail below:
domain:
  type: FrequencyDomain
  f_min: 20.0
  f_max: 1024.0
  delta_f: 0.125
waveform_generator:
  approximant: IMRPhenomXPHM
  f_ref: 20.0
  # f_start: 15.0 # Optional setting useful for EOB waveforms. Overrides f_min when generating waveforms.
  # new_interface: true # Optional setting for employing new waveform interface. This is needed for SEOBNRv5 approximants, and optional for standard LAL approximants.
  spin_conversion_phase: 0.0
# Dataset only samples over intrinsic parameters. Extrinsic parameters are chosen at train time.
intrinsic_prior:
  mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_1')
  mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_2')
  chirp_mass: bilby.gw.prior.UniformInComponentsChirpMass(minimum=25.0, maximum=100.0, name='chirp_mass')
  mass_ratio: bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0, name='mass_ratio')
  phase: default
  a_1: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_1')
  a_2: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_2')
  tilt_1: default
  tilt_2: default
  phi_12: default
  phi_jl: default
  theta_jn: default
  # Reference values for fixed (extrinsic) parameters. These are needed to generate a waveform.
  luminosity_distance: 100.0 # Mpc
  geocent_time: 0.0 # s
# Dataset size
num_samples: 5000000
# Save a compressed representation of the dataset
compression:
  svd:
    # Truncate the SVD basis at this size. No truncation if zero.
    size: 200
    num_training_samples: 50000
    num_validation_samples: 10000
  whitening: aLIGO_ZERO_DET_high_P_asd.txt
- domain
Specifies the data domain. Currently only FrequencyDomain is implemented.
- waveform_generator
Choose the approximant and reference frequency. For EOB models that require time integration, it is usually necessary to specify a lower starting frequency (the f_start setting). In this case, f_ref is ignored.
- spin_conversion_phase (optional)
Value for phiRef when converting PE spins to Cartesian spins via bilby_to_lalsimulation_spins. When set to None (default), this uses the phase parameter. When set to 0.0, phase only refers to the azimuthal observation angle, allowing for it to be treated as an extrinsic parameter.
Important
It is necessary to set this to 0.0 if planning to train a phase-marginalized network and then reconstruct the phase synthetically.
- intrinsic_prior
Specify the prior over intrinsic parameters. Intrinsic parameters here refer to those parameters that are needed to generate waveform polarizations. Extrinsic parameters here refer to those parameters that can be sampled and applied rapidly during training. As shown in the example, it is also possible to specify default priors, which is convenient for certain parameters. These are listed in dingo.gw.prior.default_intrinsic_dict.
Intrinsic parameters include masses and spins, but also inclination, reference phase, luminosity distance, and time of coalescence at geocenter. Although inclination and phase are often considered extrinsic parameters, they are needed to generate waveform polarizations and cannot be easily transformed.
Luminosity distance and time of coalescence are considered both intrinsic and extrinsic: they are needed to generate polarizations, but they can also be easily transformed during training to augment the dataset. We therefore fix them to fiducial values for generating polarizations.
- num_samples
The number of samples to include in the dataset. For a production model, we typically use \(5 \times 10^6\) samples.
- compression (optional)
How to compress the dataset.
- svd (optional)
Construct an SVD basis based on a specified number of additional samples (num_training_samples). Save the main dataset in terms of its SVD basis coefficients. The number of elements in the basis is specified by the size setting. The performance of the basis is also evaluated in terms of the mismatch against a number of validation samples (num_validation_samples). All of the validation information, as well as the basis itself, is saved along with the waveform dataset.
- whitening (optional)
Whether to save whitened waveforms, and in particular, whether to construct the basis based on whitened waveforms. The basis will be more efficient if whitening is used to adapt it to the detector noise characteristics. To use whitening, simply specify the desired ASD to use, from the Bilby list of ASDs. Note that the whitening is used only for the internal storage of the dataset. When accessing samples from the dataset, they will be unwhitened.
Dataset compression is implemented internally by setting the WaveformGenerator.transform operator, so that elements are compressed immediately after generation (avoiding the need to store many uncompressed waveforms in memory). Likewise, decompression is implemented by setting the WaveformDataset.decompression_transform operator to apply the inverse transformation. This will act on samples to decompress them when accessed through WaveformDataset.__getitem__().
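The compress/decompress round trip and the mismatch check can be sketched with NumPy. This is a toy illustration rather than Dingo's implementation: the waveforms are synthetic decaying oscillations (hypothetical stand-ins for real polarizations), the inner product is unweighted (as if the data were already whitened), and all function names are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
freqs = np.linspace(20.0, 1024.0, 2000)


def sample_waveforms(n):
    """Synthetic complex 'waveforms', shape (n, num_bins); not a physical model."""
    f0 = rng.uniform(50.0, 100.0, size=(n, 1))
    tau = rng.uniform(100.0, 300.0, size=(n, 1))
    return np.exp(-(freqs - 20.0) / tau) * np.exp(2j * np.pi * freqs / f0)


def build_basis(training, size):
    """Return V of shape (num_bins, size) with orthonormal columns."""
    _, _, vh = np.linalg.svd(training, full_matrices=False)
    return vh[:size].conj().T            # keep the leading `size` vectors


def compress(h, V):
    return h @ V                         # basis coefficients


def decompress(c, V):
    return c @ V.conj().T                # approximate waveform


def mismatch(a, b):
    overlap = np.vdot(a, b).real
    return 1.0 - overlap / np.sqrt(np.vdot(a, a).real * np.vdot(b, b).real)


# Build the basis from separate training samples, then check it on validation samples.
V = build_basis(sample_waveforms(300), size=40)
validation = sample_waveforms(50)
coeffs = compress(validation, V)         # shape (50, 40) instead of (50, 2000)
reconstructed = decompress(coeffs, V)
worst = max(mismatch(h, r) for h, r in zip(validation, reconstructed))
print(coeffs.shape, f"worst validation mismatch: {worst:.2e}")
```

Reducing the number of basis vectors used in decompress trades accuracy for speed, which is the idea behind the svd_size_update option described earlier.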
Important
The automated dataset constructors store the configuration settings in WaveformDataset.settings. This keeps the settings available to downstream tasks and for later reference.
Command-line interface
In most cases the command-line interface will be used to generate a dataset. Given a settings file, one can call
dingo_generate_dataset --settings_file settings.yaml \
    --num_processes N \
    --out_file waveform_dataset.hdf5
This will generate a dataset following the configuration in settings.yaml and save it as waveform_dataset.hdf5, using N processes.
To inspect the dataset (or any other Dingo-generated file) use
dingo_ls waveform_dataset.hdf5
This will print the configuration settings, as well as a summary of the SVD compression performance (if available).
For larger datasets, or those based on slower waveform models, Dingo includes a script that builds a condor DAG, dingo_generate_dataset_dag. This splits the generation of waveforms across several nodes, and then reconstitutes the final dataset.