Data pre-processing

A sample from a WaveformDataset consists of labeled waveform polarizations \((\theta_{\text{intrinsic}}, (h_+,h_\times))\), represented as a nested dictionary. This must be transformed into noisy detector data \(d_I\) (with additional noise context data) in a form suitable for input to a neural network. Dingo accomplishes this by applying a sequence of transforms to the sample.

A transform is simply a class with a __call__() method, which takes a sample as input and returns a transformed sample. A sequence of transforms can then be composed to build a more complex transform in a modular way. Dingo’s training transform sequence is stored as WaveformDataset.transform, and is applied automatically when elements are accessed through indexing.
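The transform pattern described above can be illustrated with a minimal sketch. The classes AddOne, Double, and Compose below are hypothetical toy examples written for this illustration, not Dingo classes; Compose mimics the behavior of composition utilities such as torchvision.transforms.Compose.

```python
class AddOne:
    """Toy transform: return a copy of the sample with 'x' incremented."""

    def __call__(self, sample):
        out = dict(sample)
        out["x"] = out["x"] + 1
        return out


class Double:
    """Toy transform: return a copy of the sample with 'x' doubled."""

    def __call__(self, sample):
        out = dict(sample)
        out["x"] = out["x"] * 2
        return out


class Compose:
    """Apply a list of transforms in order, feeding each output to the next."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        for transform in self.transforms:
            sample = transform(sample)
        return sample


pipeline = Compose([AddOne(), Double()])
result = pipeline({"x": 3})  # (3 + 1) * 2
```

Because each transform only depends on the dictionary it receives, individual steps can be swapped or omitted without touching the rest of the chain.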

GW transform sequence

For Dingo, the flowchart below indicates the sequence of transforms applied to a sample from a WaveformDataset.

        flowchart TB
    sample[Sample from WaveformDataset]
    sample-->extrinsic([SampleExtrinsicParameters])
    subgraph det[Simulate waveforms in detectors]
        direction TB
        det_times[/GetDetectorTimes/]
        det_times-->gnpe_maybe{Using GNPE?}
        gnpe_maybe-- No -->project_det[/ProjectOntoDetectors/]
        gnpe_maybe-- Yes -->gnpe_times([GNPECoalescenceTimes])
        gnpe_times-->project_det
    end
    subgraph noise[Add noise]
        direction TB
        sample_asd([SampleNoiseASD])
        sample_asd-->whiten[/WhitenAndScaleStrain/]
        whiten-->add_noise([AddWhiteNoiseComplex])
    end
    subgraph output[Prepare output]
        direction TB
        standardize[/SelectStandardizeRepackageParameters/]   
        standardize-->repackage[/RepackageStrainsAndASDS/]
        repackage-->unpack[/UnpackDict/]
    end
    extrinsic-->det
    det-->noise
    noise-->output
    output-->E[End]
    

Flowchart for Dingo data-preprocessing pipeline for training, starting from a sample from a WaveformDataset. Transforms with rounded corners include an element of randomness, whereas trapezoidal items are deterministic.

Important

Some pre-processing transforms include an element of randomness. This serves to augment the training data and reduce overfitting.

Extrinsic parameters

The starting point for this chain of transforms is a dictionary sample with parameters and polarizations sub-dictionaries. The first transform samples the extrinsic parameters, and adds a new sub-dictionary extrinsic_parameters to sample. Extrinsic parameters include sky position (right ascension, declination), polarization, time of coalescence, and luminosity distance (the latter two of which are also considered intrinsic parameters).

class dingo.gw.transforms.SampleExtrinsicParameters(extrinsic_prior_dict)

Sample extrinsic parameters and add them to sample in a separate dictionary.
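As a rough illustration of this step, the sketch below draws extrinsic parameters with NumPy and stores them in a separate sub-dictionary. It does not use the Dingo or Bilby prior classes. The class name is hypothetical, the time and distance ranges are taken from the sample settings later on this page, and the distributions for ra, dec, and psi (isotropic sky position, uniform polarization angle) are assumptions about what the default priors look like.

```python
import numpy as np


class SampleExtrinsicSketch:
    """Hypothetical stand-in for an extrinsic-parameter sampling transform."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)

    def __call__(self, sample):
        extrinsic = {
            # Assumed defaults: isotropic on the sky, uniform polarization angle.
            "ra": self.rng.uniform(0.0, 2.0 * np.pi),
            "dec": np.arcsin(self.rng.uniform(-1.0, 1.0)),
            "psi": self.rng.uniform(0.0, np.pi),
            # Ranges copied from the sample data_settings on this page.
            "geocent_time": self.rng.uniform(-0.10, 0.10),
            "luminosity_distance": self.rng.uniform(100.0, 1000.0),
        }
        out = dict(sample)
        out["extrinsic_parameters"] = extrinsic
        return out


toy_sample = {"parameters": {"chirp_mass": 30.0}, "polarizations": {}}
sampled = SampleExtrinsicSketch(seed=0)(toy_sample)
```

The intrinsic parameters are left untouched; the new values sit in their own sub-dictionary until they have been applied to the waveform.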

Detector waveforms

The next sequence of transforms applies the extrinsic parameters to sample["polarizations"] to produce detector waveforms in sample["waveform"]. First it calculates the arrival time \(t_I\) of the waveform in each detector, based on the time of coalescence at geocenter and the sky position, and stores this in sample["extrinsic_parameters"].

class dingo.gw.transforms.GetDetectorTimes(ifo_list, ref_time)

Compute the time shifts in the individual detectors based on the sky position (ra, dec), the geocent_time and the ref_time.
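The geometry behind this computation can be sketched as follows. This is a simplified stand-alone version, not the Dingo or Bilby implementation: the detector position is a hypothetical Earth-fixed vector, and the Greenwich mean sidereal time at ref_time (gmst_ref) is passed in directly rather than computed from a GPS time.

```python
import numpy as np

SPEED_OF_LIGHT = 299792458.0  # m/s


def source_direction(ra, dec, gmst):
    """Unit vector from the geocenter toward the source, Earth-fixed frame."""
    angle = ra - gmst
    return np.array(
        [np.cos(dec) * np.cos(angle), np.cos(dec) * np.sin(angle), np.sin(dec)]
    )


def detector_time(geocent_time, detector_position, ra, dec, gmst_ref):
    """Arrival time at a detector: geocenter time plus the light travel delay."""
    direction = source_direction(ra, dec, gmst_ref)
    delay = -np.dot(detector_position, direction) / SPEED_OF_LIGHT
    return geocent_time + delay


# Hypothetical detector on the equator, one Earth radius from the geocenter.
position = np.array([6.371e6, 0.0, 0.0])

# Source directly overhead: the signal reaches the detector before the geocenter.
t_overhead = detector_time(0.0, position, ra=1.0, dec=0.0, gmst_ref=1.0)

# Source on the opposite side of the Earth: the signal arrives later.
t_opposite = detector_time(0.0, position, ra=1.0 + np.pi, dec=0.0, gmst_ref=1.0)
```

The delay is bounded by the light travel time across the Earth's radius, roughly 21 ms.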

Important

Dingo models are trained for a fixed set of detectors. This must be selected prior to training, and a new model must be trained if one wishes to analyze data in a different set of detectors. Thus, e.g., separate models must be trained for HL and HLV configurations.

Note

During training, Dingo fixes the orientation of the Earth (and corresponding interferometer positions and orientations) to that at a fixed reference time ref_time. This is so that the model does not have to learn about the rotation of the Earth. This is corrected in post-processing by shifting the inferred right ascension by the difference between the true and reference sidereal times.
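The post-processing correction mentioned in this note amounts to a single rotation about the Earth's axis. The helper below is a hypothetical sketch: it takes the sidereal times (in radians) as inputs instead of computing them from GPS times, and the sign convention is an assumption chosen so that the hour angle of the source is the same at the true and reference times.

```python
import math


def correct_right_ascension(ra_inferred, gmst_true, gmst_ref):
    """Shift an inferred right ascension from the reference time to the true time.

    All angles are in radians. The result is wrapped into [0, 2*pi).
    """
    return (ra_inferred + gmst_true - gmst_ref) % (2.0 * math.pi)


ra_small_shift = correct_right_ascension(1.0, gmst_true=0.5, gmst_ref=0.2)
ra_wrapped = correct_right_ascension(6.0, gmst_true=1.0, gmst_ref=0.0)
```

Declination is unaffected, since the rotation is about the polar axis.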

Optionally, the times \(t_I\) are perturbed to give new “proxy times” as part of the GNPE algorithm.

class dingo.gw.transforms.GNPECoalescenceTimes(ifo_list, kernel, exact_global_equivariance=True, inference=False)

GNPE [1] Transformation for detector coalescence times.

For each of the detector coalescence times, a proxy is generated by adding a perturbation epsilon from the GNPE kernel to the true detector time. This proxy is subtracted from the detector time, such that the overall time shift only amounts to -epsilon in training. This standardizes the input data to the inference network, since the applied time shifts are always restricted to the range of the kernel.

To preserve information at inference time, conditioning of the inference network on the proxies is required. To that end, the proxies are stored in sample["gnpe_proxies"].

We can enforce an exact equivariance under global time translations, by subtracting one proxy (by convention: the first one, usually for H1 ifo) from all other proxies, and from the geocent time, see [1]. This is enabled with the flag exact_global_equivariance.

Note that this transform does not modify the data itself. It only determines the amount by which to time-shift the data.

[1]: arxiv.org/abs/2111.13139

Parameters:
  • ifo_list (bilby.gw.detector.InterferometerList) – List of interferometers.

  • kernel (str) – Defines a Bilby prior, to be used for all interferometers.

  • exact_global_equivariance (bool = True) – Whether to impose the exact global time translation symmetry.

  • inference (bool = False) – Whether to use inference or training mode.
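The training-mode logic described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the Dingo implementation: the function name is hypothetical, the kernel is taken to be uniform with a half-width of 1 ms (as in the sample settings later on this page), and the detector times are made-up values.

```python
import numpy as np


def gnpe_time_proxies(detector_times, geocent_time, kernel_width, rng,
                      exact_global_equivariance=True):
    """Return (residual time shifts, context proxies, shifted geocent_time)."""
    ifos = list(detector_times)
    # Proxy = true detector time + perturbation drawn from the kernel.
    proxies = {
        ifo: t + rng.uniform(-kernel_width, kernel_width)
        for ifo, t in detector_times.items()
    }
    # Subtracting the proxy leaves a residual shift of -epsilon per detector.
    shifts = {ifo: detector_times[ifo] - proxies[ifo] for ifo in ifos}
    if exact_global_equivariance:
        # Subtract the first proxy from the other proxies and from geocent_time.
        reference = proxies[ifos[0]]
        context = {ifo: proxies[ifo] - reference for ifo in ifos[1:]}
        geocent_time = geocent_time - reference
    else:
        context = dict(proxies)
    return shifts, context, geocent_time


rng = np.random.default_rng(0)
times = {"H1": 0.030, "L1": 0.025}  # hypothetical detector arrival times
shifts, context, shifted_geocent = gnpe_time_proxies(times, 0.02, 0.001, rng)
```

Note that the network is conditioned on one fewer proxy when the exact global equivariance is imposed, since the first proxy has been absorbed into the overall time reference.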

Finally, the detector waveforms \(h_I\) are calculated from the extrinsic parameters. (In the backend, these transforms use the Bilby interferometer libraries.) The contents of the extrinsic_parameters sub-dictionary are then moved into sample["parameters"]; this was essentially a holding place for parameters not yet applied to the waveform.

class dingo.gw.transforms.ProjectOntoDetectors(ifo_list, domain, ref_time)

Project the GW polarizations onto the detectors in ifo_list. This does not sample any new parameters, but relies on the parameters provided in sample["extrinsic_parameters"]. Specifically, this transform applies the following operations:

  1. Rescale polarizations to account for sampled luminosity distance

  2. Project polarizations onto the antenna patterns using the ref_time and the extrinsic parameters (ra, dec, psi)

  3. Time shift the strains in the individual detectors according to the times <ifo.name>_time provided in the extrinsic parameters.
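The three operations above can be sketched with NumPy for a single detector. This is an illustrative stand-in, not the Dingo code: the antenna pattern values f_plus and f_cross are passed in as plain numbers (in practice they depend on ra, dec, psi, and ref_time), the reference distance d_ref at which the polarizations were generated is an assumed input, and the phase sign convention for the time shift is the common exp(-2πi f t) choice.

```python
import numpy as np


def project_onto_detector(h_plus, h_cross, freqs, f_plus, f_cross,
                          d_ref, d_lum, t_shift):
    """Frequency-domain detector strain from the two polarizations."""
    # 1. Rescale amplitude: strain is inversely proportional to distance.
    scale = d_ref / d_lum
    # 2. Combine the polarizations with the antenna pattern functions.
    strain = scale * (f_plus * h_plus + f_cross * h_cross)
    # 3. A time shift is a frequency-dependent phase in the frequency domain.
    return strain * np.exp(-2j * np.pi * freqs * t_shift)


freqs = np.linspace(20.0, 1024.0, 5)
h_plus = np.ones(5, dtype=complex)
h_cross = np.zeros(5, dtype=complex)

unshifted = project_onto_detector(h_plus, h_cross, freqs, 0.5, 0.3,
                                  d_ref=100.0, d_lum=200.0, t_shift=0.0)
shifted = project_onto_detector(h_plus, h_cross, freqs, 0.5, 0.3,
                                d_ref=100.0, d_lum=200.0, t_shift=0.01)
```

The time shift changes only the phase of each frequency bin, so the amplitude spectrum of shifted matches that of unshifted.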

Noise

Once the detector waveforms have been obtained, noise \(n_I\) must be added to simulate realistic data. First, noise ASDs are selected randomly for each detector from an ASDDataset for the relevant observing run. This is stored in sample["asds"]. For details see ASD dataset.

class dingo.gw.transforms.SampleNoiseASD(asd_dataset)

Sample a batch of random ASDs for each detector and place them in sample["asds"].

The waveform is then whitened by dividing by the sampled ASD, and furthermore scaled by the standard deviation of white noise. This is so that each input to the network will have unit variance, which is important for successful training.

class dingo.gw.transforms.WhitenAndScaleStrain(scale_factor)

Whiten the strain data by dividing by the corresponding ASDs, and scale it by 1/scale_factor.

In a uniform frequency domain the scale factor should be 1 / np.sqrt(4.0 * delta_f). This accounts for frequency binning.

For whitened waveforms, noise is white, so finally this is randomly sampled and added to sample["waveform"].

class dingo.gw.transforms.AddWhiteNoiseComplex

Adds white noise with a standard deviation determined by self.scale to the complex strain data.
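To see why this choice of scale factor yields unit variance, the sketch below whitens simulated stationary Gaussian noise and checks its standard deviation. It is a stand-alone NumPy illustration with a hypothetical flat ASD and made-up bin count, not the Dingo transforms themselves. It relies on the standard result that, for a one-sided PSD equal to asd**2, the real and imaginary parts of the frequency-domain noise each have standard deviation asd / sqrt(4 * delta_f).

```python
import numpy as np

rng = np.random.default_rng(0)
delta_f = 0.125          # frequency resolution in Hz (assumed)
num_bins = 20000         # large, so the sample standard deviation is stable
asd = np.full(num_bins, 1e-23)  # hypothetical flat ASD
scale_factor = 1.0 / np.sqrt(4.0 * delta_f)


def whiten_and_scale(strain, asd, scale_factor):
    """Divide by the ASD, then by the white-noise standard deviation."""
    return strain / asd / scale_factor


# Colored Gaussian noise consistent with the ASD above.
sigma = asd / np.sqrt(4.0 * delta_f)
noise = sigma * (rng.standard_normal(num_bins) + 1j * rng.standard_normal(num_bins))
white = whiten_and_scale(noise, asd, scale_factor)

# Equivalent training-time shortcut: whiten the signal, then add unit white noise.
signal = 5.0 * sigma * np.ones(num_bins, dtype=complex)
noisy = whiten_and_scale(signal, asd, scale_factor) + (
    rng.standard_normal(num_bins) + 1j * rng.standard_normal(num_bins)
)
```

Whitening first and adding unit-variance noise afterwards avoids generating colored noise for every training sample.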

Output

The final set of transforms prepares the sample for input to the neural network. First, the desired inference parameters are selected. By taking only a subset of parameters, one can train a marginalized posterior model. These parameters are also standardized to have zero mean and unit variance to improve training. (Standardization will be undone in post-processing after inference.) The parameters will then be repackaged into a numpy.ndarray, so that parameter labels are implicit based on ordering.

class dingo.gw.transforms.SelectStandardizeRepackageParameters(parameters_dict, standardization_dict, inverse=False, as_type=None, device='cpu')

This transformation selects the parameters in standardization_dict, normalizes them by setting p = (p - mean) / std, and repackages the selected parameters to a numpy array.

as_type: str = None

Only applies if self.inverse == True:

  • if None, data type is kept

  • if ‘dict’, dict with

  • if ‘pandas’, use pandas.DataFrame
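The select, standardize, and repackage steps, together with the inverse used in post-processing, can be sketched as below. The helper names and numerical values are hypothetical, and the layout of the standardization dictionary (separate "mean" and "std" sub-dictionaries keyed by parameter name) is an assumption made for this example.

```python
import numpy as np

standardization = {
    "mean": {"chirp_mass": 30.0, "mass_ratio": 0.5},
    "std": {"chirp_mass": 10.0, "mass_ratio": 0.25},
}


def select_and_standardize(parameters, standardization):
    """Keep only the listed parameters and return (p - mean) / std as an array."""
    names = list(standardization["mean"])
    values = [
        (parameters[n] - standardization["mean"][n]) / standardization["std"][n]
        for n in names
    ]
    return np.array(values, dtype=np.float32)


def destandardize(array, standardization):
    """Inverse operation: map a standardized array back to a labeled dict."""
    names = list(standardization["mean"])
    return {
        n: float(array[i]) * standardization["std"][n] + standardization["mean"][n]
        for i, n in enumerate(names)
    }


# 'phase' is not listed in the standardization dict, so it is dropped.
params = {"chirp_mass": 40.0, "mass_ratio": 0.5, "phase": 1.2}
x = select_and_standardize(params, standardization)
recovered = destandardize(x, standardization)
```

Since the array carries no labels, the ordering of names in the standardization dictionary must be the same for the forward and inverse operations.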

The waveform and asds dictionaries are also repackaged into a single array of shape suitable for input to the network. In particular, the complex frequency domain strain data are decomposed into real and imaginary parts.

class dingo.gw.transforms.RepackageStrainsAndASDS(ifos, first_index=0)

Repackage the strains and the asds into an [num_ifos, 3, num_bins] dimensional tensor. Order of ifos is provided by self.ifos. By convention, [:,i,:] is used for:

  • i = 0: strain.real

  • i = 1: strain.imag

  • i = 2: 1 / (asd * 1e23)
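The channel layout can be made concrete with a short NumPy sketch. The function below is a hypothetical stand-in that ignores the first_index option, and the input values are made up.

```python
import numpy as np


def repackage_strains_and_asds(waveform, asds, ifos):
    """Stack per-detector strains and ASDs into a [num_ifos, 3, num_bins] array."""
    num_bins = len(waveform[ifos[0]])
    out = np.empty((len(ifos), 3, num_bins), dtype=np.float32)
    for i, ifo in enumerate(ifos):
        out[i, 0] = waveform[ifo].real
        out[i, 1] = waveform[ifo].imag
        # The factor 1e23 brings the inverse ASD to order unity.
        out[i, 2] = 1.0 / (asds[ifo] * 1e23)
    return out


waveform = {
    "H1": np.array([1 + 2j, 3 + 4j, 0j, -1j]),
    "L1": np.array([0.5j, 1 + 0j, 2 - 2j, 0j]),
}
asds = {"H1": np.full(4, 2e-23), "L1": np.full(4, 4e-23)}
data = repackage_strains_and_asds(waveform, asds, ["H1", "L1"])
```

The detector ordering in the output follows the ifos list, not the dictionary ordering of the inputs.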

Finally, the samples dictionary of arrays is unpacked to a tuple of arrays for parameters and data.

class dingo.gw.transforms.UnpackDict(selected_keys)

Unpacks the dictionary to prepare it for final output of the dataloader. Only returns elements specified in selected_keys.
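A minimal sketch of this unpacking step is given below. The function name and the key names are hypothetical and chosen only for illustration.

```python
import numpy as np


def unpack_dict(sample, selected_keys):
    """Return the selected entries of the sample as a tuple, in the given order."""
    return tuple(sample[key] for key in selected_keys)


toy = {
    "inference_parameters": np.zeros(15, dtype=np.float32),
    "waveform": np.zeros((2, 3, 8), dtype=np.float32),
    "asds": {"H1": None, "L1": None},  # not selected, so it is dropped
}
theta, d = unpack_dict(toy, ["inference_parameters", "waveform"])
```

Returning a plain tuple of arrays is what allows a data loader to stack samples into batches.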

When used with a torch DataLoader, the final numpy arrays are automatically transformed into torch tensors.

Building the transforms

The following function will set the transform property of a WaveformDataset to the above transform sequence:

dingo.gw.training.set_train_transforms(wfd, data_settings, asd_dataset_path, omit_transforms=None)

Set the transform attribute of a waveform dataset based on a settings dictionary. The transform takes waveform polarizations, samples random extrinsic parameters, projects to detectors, adds noise, and formats the data for input to the neural network. It also implements optional GNPE transformations.

Note that the WaveformDataset is modified in-place, so this function returns nothing.

Parameters:
  • wfd (WaveformDataset)

  • data_settings (dict)

  • asd_dataset_path (str) – Path corresponding to the ASD dataset used to generate noise.

  • omit_transforms – List of sub-transforms to omit from the full composition.

The various options are specified by passing an appropriate data_settings dictionary. In practice, these settings will be specified along with other training settings.

Sample data_settings dictionary for configuring a sequence of training transforms. This dictionary includes several options not needed for set_train_transforms, but which are needed as part of other training settings.
waveform_dataset_path: /path/to/waveform_dataset.hdf5  # Contains intrinsic waveforms
train_fraction: 0.95
domain_update:
  f_min: 20.0
  f_max: 1024.0
svd_size_update: 200  # Optionally, reduce the SVD size when decompressing (for performance)
detectors:
  - H1
  - L1
extrinsic_prior:  # Sampled at train time
  dec: default
  ra: default
  geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10)
  psi: default
  luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=1000.0)
ref_time: 1126259462.391
gnpe_time_shifts:
  kernel: bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001)
  exact_equiv: True
inference_parameters: default

waveform_dataset_path

Points to the waveform dataset.

train_fraction

Fraction of waveform dataset to be used for training. The remainder are used to compute the test loss.

domain_update (optional)

Optionally specify new domain properties. These will update the domain associated to the WaveformDataset. They must necessarily describe a domain contained within the original.
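The containment requirement can be illustrated with a simple frequency mask. This is a simplified sketch with a hypothetical helper and a made-up frequency grid; it does not reproduce how Dingo domains store or update their data.

```python
import numpy as np


def truncate_domain(freqs, data, f_min, f_max):
    """Restrict data on a uniform frequency grid to the band [f_min, f_max]."""
    if f_min < freqs[0] or f_max > freqs[-1]:
        raise ValueError("New domain must be contained within the original.")
    mask = (freqs >= f_min) & (freqs <= f_max)
    return freqs[mask], data[..., mask]


freqs = np.arange(0.0, 2049.0, 1.0)   # original grid: 0 to 2048 Hz, 1 Hz spacing
data = np.ones((2, freqs.size))       # e.g. two detectors
new_freqs, new_data = truncate_domain(freqs, data, f_min=20.0, f_max=1024.0)
```

Requesting a band that extends beyond the original grid raises an error, because there is no stored waveform data to fill it.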

svd_size_update (optional)

If the WaveformDataset uses SVD compression, optionally use a smaller number of basis elements than stored in the dataset. Decompression of the waveforms is the slowest preprocessing operation, so using this option can improve training speed at the expense of accuracy.

detectors

Set the desired GW interferometers for the Dingo model.

extrinsic_prior

Specify the extrinsic prior. Default options are available.

ref_time

Reference time for the interferometer locations and orientations. See the note on ref_time above.

gnpe_time_shifts (optional)

GNPE kernel and additional options. See GNPE.

inference_parameters

Parameters to infer with the model. At present they must be a subset of sample["parameters"]. By specifying a strict subset, this can be used to marginalize over parameters. The default setting points to dingo.gw.prior.default_inference_parameters:

import warnings
warnings.filterwarnings("ignore", "Wswiglal-redir-stdio")
import lal
from dingo.gw.prior import default_inference_parameters
default_inference_parameters
['chirp_mass',
 'mass_ratio',
 'phase',
 'a_1',
 'a_2',
 'tilt_1',
 'tilt_2',
 'phi_12',
 'phi_jl',
 'theta_jn',
 'luminosity_distance',
 'geocent_time',
 'ra',
 'dec',
 'psi']