Detector noise

During training, simulated noise \(n_I\) is added to waveforms \(h_I(\theta)\) measured in detectors to produce realistic simulated data,

\[ d_I = h_I(\theta) + n_I. \]

Dingo assumes this noise to be stationary and Gaussian, so it is independent in each frequency bin, with variance given by a power spectral density (PSD).

Important

Similar to extrinsic parameters, detector noise is repeatedly sampled during training and added to the simulated signal. This augments the training set with new noise realizations for each epoch, reducing overfitting.
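The resampling described above can be sketched as follows. This is a minimal NumPy illustration, not Dingo's implementation: the function name `add_noise`, the toy signal, and the flat PSD are hypothetical, and it assumes the convention that the real and imaginary parts of the noise in each bin are drawn independently with variance \(S_n(f)/(4\delta f)\).

```python
import numpy as np


def add_noise(signal_fd, psd, delta_f, rng):
    """Return the signal plus a fresh stationary Gaussian noise realization.

    Assumed convention: real and imaginary parts of each frequency bin are
    independent, each with variance psd / (4 * delta_f).
    """
    std = np.sqrt(psd / (4.0 * delta_f))
    noise = std * (
        rng.standard_normal(len(psd)) + 1j * rng.standard_normal(len(psd))
    )
    return signal_fd + noise


rng = np.random.default_rng(seed=0)
delta_f = 0.125                                      # frequency resolution (Hz)
freqs = np.arange(20.0, 1024.0, delta_f)
psd = np.full_like(freqs, 1e-46)                     # hypothetical flat PSD
signal = 1e-23 * np.exp(2j * np.pi * freqs * 0.01)   # hypothetical toy signal

# The same signal receives a different noise realization each time it is used.
data_epoch_1 = add_noise(signal, psd, delta_f, rng)
data_epoch_2 = add_noise(signal, psd, delta_f, rng)
```

Because the noise is drawn on the fly, the network never sees the same noisy sample twice, even though the underlying waveform is reused.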

Although noise is mostly stationary and Gaussian during an LVK observing run, the PSD in each detector does tend to drift from event to event. In a usual likelihood-based PE run, this is taken into account by estimating the PSD at the time of the event (either using Welch’s method on signal-free data surrounding the event, or at the same time as the event using BayesWave), and using this in the likelihood integral.

Dingo also estimates the PSD just prior to an event and uses this at inference time in two ways:

It whitens the data with respect to this PSD.
It provides the PSD (or rather, the inverse ASD) as context to the neural network.

A suitably trained model can therefore make use of the PSD as needed to generate the posterior.
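The two uses of the PSD listed above can be illustrated with a short sketch. The function name `whiten_and_build_context` and the toy arrays are hypothetical; any additional scaling or packaging Dingo applies before passing data to the network is not shown here.

```python
import numpy as np


def whiten_and_build_context(data_fd, asd):
    """Whiten frequency-domain strain by an ASD and return the inverse ASD."""
    data_fd = np.asarray(data_fd, dtype=complex)
    asd = np.asarray(asd, dtype=float)
    if np.any(asd <= 0):
        raise ValueError("The ASD must be strictly positive in every bin.")
    whitened = data_fd / asd   # use 1: whiten the data
    inverse_asd = 1.0 / asd    # use 2: context for the network
    return whitened, inverse_asd


# Toy values chosen so that the result is easy to check by hand.
asd = np.array([2.0, 4.0, 8.0])
data = np.array([2.0 + 2.0j, 4.0 - 4.0j, 8.0j])
whitened, inverse_asd = whiten_and_build_context(data, asd)
```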

ASD dataset

To train a model to perform inference conditioned on the noise PSD, it is necessary to not just sample random noise realizations for a given PSD, but also sample the PSD from a distribution for a given observing run. Training in this way is necessary to perform fully amortized inference and account for the variation of PSDs from event to event.

The ASDDataset class stores a set of ASD samples for several detectors, allowing for sampling during training.

As with the noise realizations, a random ASD is chosen from the dataset when preparing each sample during training. This augments the training set compared to fixing the noise ASD for each sample prior to training.
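As a rough sketch of this per-sample draw, the toy container below holds several ASDs per detector and returns one at random for each. `ToyASDDataset` and its method are hypothetical stand-ins and do not reproduce the interface of Dingo's ASDDataset; drawing independently for each detector is also an assumption.

```python
import numpy as np


class ToyASDDataset:
    """Hypothetical stand-in for an ASD container (not Dingo's ASDDataset)."""

    def __init__(self, asds):
        # asds: dict mapping detector name -> array of shape (num_asds, num_bins)
        self.asds = {ifo: np.asarray(a, dtype=float) for ifo, a in asds.items()}

    def sample_random_asds(self, rng):
        # Assumption: one ASD is drawn independently for each detector.
        return {ifo: a[rng.integers(len(a))] for ifo, a in self.asds.items()}


rng = np.random.default_rng(seed=0)
num_asds, num_bins = 5, 4
dataset = ToyASDDataset(
    {
        "H1": np.arange(1, num_asds + 1)[:, None] * np.ones(num_bins) * 1e-23,
        "L1": np.arange(1, num_asds + 1)[:, None] * np.ones(num_bins) * 2e-23,
    }
)

# A new ASD is drawn every time a training sample is prepared.
for step in range(3):
    asds = dataset.sample_random_asds(rng)
```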

Similarly to the WaveformDataset, the ASDDataset is just a container. Dingo includes routines for building such a dataset from observational data.

Generating an ASDDataset

`dingo_generate_asd_dataset`

The basic approach is as follows:

Identify stretches of data within an observing run meeting certain criteria (sufficiently long, without events, and sufficiently high quality, …) or take in user-specified stretches.
Fetch data corresponding to these stretches using either
- GWOSC
- channels, optionally specified in the settings file.
Estimate ASDs using Welch’s method on these stretches.
Save the collection of ASDs.

usage: dingo_generate_asd_dataset [-h] --data_dir DATA_DIR [--settings_file SETTINGS_FILE] [--time_segments_file TIME_SEGMENTS_FILE] [--out_name OUT_NAME] [--verbose]

Generate an ASD dataset based on a settings file.

optional arguments:
  -h, --help            show this help message and exit
  --data_dir DATA_DIR   Path where the PSD data is to be stored. Must contain a 'settings.yaml' file.
  --settings_file SETTINGS_FILE
                        Path to a settings file in case two different datasets are generated in the same directory
  --time_segments_file TIME_SEGMENTS_FILE
                        Optional file containing a dictionary of a list of time segments that should be used for estimating PSDs.This has to be a pickle file.
  --out_name OUT_NAME   Path to resulting ASD dataset
  --verbose

where the settings file is of the form

dataset_settings:
  f_min: 0
  f_max: 2048
  f_s: 4096
  time_psd: 1024
  T: 8
  time_gap: 0
  window:
    roll_off: 0.4
    type: tukey
  num_psds_max: 20
  channels:
   H1: H1:DCS-CALIB_STRAIN_C02
   L1: L1:DCS-CALIB_STRAIN_C02
  detectors:
    - H1
    - L1
  observing_run: O2
condor:
  env_path: path/to/environment
  num_jobs: 2    # per detector
  num_cpus: 16
  memory_cpus: 16000

Options correspond to the following:

f_min, f_max (optional): Lower and upper bounds of the frequency range of the ASDs. Defaults to 0 and f_s/2, respectively.
Sampling rate f_s (Hz): This should be at least twice the value of f_max expected to be used.
Data length time_psd (s): The entire length of data from which to estimate a PSD using Welch’s method. Periodograms are calculated on segments of this, and then averaged using the median method.
Segment length T (s): The length of each segment on which to take the DFT and calculate a periodogram.
Gap time_gap (s): Gap between duration-T segments. E.g., if time_psd=1024, T=8, time_gap=8, then for each PSD, 64 periodograms are computed, each using data stretches 8 s long, with gaps of 8 s between segments. Segments would then be \([0~\text{s}, 8~\text{s}], [16~\text{s}, 24~\text{s}], \ldots\).
Window function: Parameters of the window function used before taking DFT of data segments.
num_psds_max (optional): If set, stop building the dataset after this number of PSDs have been estimated. This setting is useful for building a single-PSD dataset for pretraining a network.
Channels (optional): If set, data will be fetched from these channels, instead of using GWOSC.
Detectors: Which detectors (H1, L1, V1, …) to include in the dataset.
Observing run: Which observing run to use when estimating PSDs.
Condor (optional): Settings for HTCondor useful for parallelizing the ASD estimation across condor jobs.
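To make the roles of time_psd, T, time_gap and the window concrete, here is a minimal NumPy/SciPy sketch of a median-averaged Welch estimate. It is an illustration under stated assumptions, not the code Dingo runs: the function names are hypothetical, the Tukey shape parameter `alpha` is an illustrative value (how `roll_off` maps onto the window is not shown here), and the median is corrected with the asymptotic factor ln 2.

```python
import numpy as np
from scipy.signal.windows import tukey


def segment_starts(time_psd, T, time_gap):
    """Start times (s) of the duration-T segments that fit inside time_psd."""
    if time_psd < T:
        return []
    stride = T + time_gap
    num_segments = int((time_psd - T) // stride) + 1
    return [i * stride for i in range(num_segments)]


def median_welch_psd(strain, f_s, T, time_gap=0.0, alpha=0.1):
    """One-sided PSD from the median of windowed periodograms."""
    n_seg = int(round(T * f_s))
    stride = int(round((T + time_gap) * f_s))
    window = tukey(n_seg, alpha)
    norm = f_s * np.sum(window**2)
    periodograms = []
    for start in range(0, len(strain) - n_seg + 1, stride):
        segment = strain[start:start + n_seg] * window
        p = np.abs(np.fft.rfft(segment)) ** 2 / norm
        p[1:] *= 2.0              # fold negative frequencies into positive ones
        if n_seg % 2 == 0:
            p[-1] /= 2.0          # the Nyquist bin has no negative partner
        periodograms.append(p)
    # The median of an exponential distribution is ln(2) times its mean.
    psd = np.median(np.array(periodograms), axis=0) / np.log(2.0)
    freqs = np.fft.rfftfreq(n_seg, d=1.0 / f_s)
    return freqs, psd


# With time_psd=1024, T=8, time_gap=8 there are 64 segments, starting at 0 s, 16 s, ...
starts = segment_starts(1024, 8, 8)
print(len(starts), starts[:3])
```

The median is less sensitive than the mean to short glitches that contaminate a few segments, which is one reason it is a common choice for detector data.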

`dingo_generate_synthetic_asd_dataset`

This method generates a dataset of synthetic ASDs from a dataset of existing ASDs to enhance robustness against ASD distribution shifts. In particular, it can generate synthetic ASDs that are scaled by a fiducial ASD in order to adapt to a new observing run. This is particularly useful for training Dingo networks at the beginning of an observing run, when the number of training ASDs is limited. It can also generate smoother synthetic ASDs that more closely resemble those from BayesWave. The implementation follows the steps explained in this paper.

usage: dingo_generate_synthetic_asd_dataset [-h] --asd_dataset ASD_DATASET --settings_file SETTINGS_FILE [--num_processes NUM_PROCESSES] [--out_file OUT_FILE] [--verbose]

Generate a synthetic noise ASD dataset from an existing dataset of real ASDs.

optional arguments:
  -h, --help            show this help message and exit
  --asd_dataset ASD_DATASET
                        Path to existing ASD dataset to be parameterized and re-sampled
  --settings_file SETTINGS_FILE
                        YAML file containing database settings
  --num_processes NUM_PROCESSES
                        Number of processes to use in pool for parallel parameterization
  --out_file OUT_FILE   Name of file for storing dataset.
  --verbose

with a settings file of the form

parameterization_settings:
  num_spline_positions: 30
  num_spectral_segments: 400
  sigma: 0.14
  delta_f: -1
  smoothen: True
sampling_settings:
   bandwidth_spectral: 0.5
   bandwidth_spline: 0.25
   num_samples: 500
   split_frequencies:
     - 30
     - 100
   rescaling_psd_paths:
     H1: /path/to/rescaling_asd_H1.hdf5
     L1: /path/to/rescaling_asd_L1.hdf5

Options correspond to the following:

num_spline_positions: Number of nodes to use for the cubic spline interpolating the broad-band noise PSD.
num_spectral_segments: Maximum number of spectral lines to model.
sigma: Standard deviation of the Normal distribution parameterizing \(p(\log S_n|z)\).
delta_f: If > 0, truncates each spectral line.
smoothen: Whether to save the smooth ASDs (True) or the noisy ASDs (False). The noisy synthetic ASDs resemble real ASDs estimated with Welch’s method more closely, while the smooth ASDs are more similar to ASDs generated with BayesWave. (Default: False)
bandwidth_spectral, bandwidth_spline: Bandwidths for the KDEs modeling the distribution over spectral lines and broad-band noise, respectively. These determine the width of the resulting distribution.
num_samples: Number of synthetic ASDs to generate.
split_frequencies: One or more frequencies at which to divide the broad-band noise into independent segments, e.g. due to different dominant noise sources (shot noise, seismic noise, etc.).
rescaling_psd_paths: Paths to ASD datasets for each detector to which the synthetic ASDs should be rescaled, e.g. the PSDs from the target observing run. If the dataset contains multiple ASDs, we use the first one. (Optional; if not provided, no rescaling will be done.)
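The difference between smooth and noisy synthetic spectra can be pictured with a toy model of the broad-band component alone. The sketch below is an assumption-laden illustration, not the parameterization Dingo implements: it places a cubic spline through hypothetical log-PSD nodes at log-spaced frequencies and, when `smoothen` is False, adds Gaussian scatter of standard deviation `sigma` to \(\log S_n\). Spectral lines, the KDEs, and rescaling are all omitted.

```python
import numpy as np
from scipy.interpolate import CubicSpline


def toy_broadband_psd(freqs, node_freqs, node_log_psd, sigma=0.14,
                      smoothen=True, rng=None):
    """Toy broad-band PSD: spline in log-space, optionally with log-normal scatter."""
    spline = CubicSpline(np.log(node_freqs), node_log_psd)
    log_psd = spline(np.log(freqs))
    if not smoothen:
        rng = np.random.default_rng() if rng is None else rng
        log_psd = log_psd + sigma * rng.standard_normal(len(freqs))
    return np.exp(log_psd)


freqs = np.linspace(20.0, 1024.0, 2000)
node_freqs = np.geomspace(20.0, 1024.0, 30)   # 30 nodes, as in num_spline_positions
# Hypothetical node values: a simple bowl-shaped noise curve.
x = node_freqs / 100.0
node_log_psd = np.log(1e-46 * (x**-4 + 1.0 + x**2))

smooth_psd = toy_broadband_psd(freqs, node_freqs, node_log_psd, smoothen=True)
noisy_psd = toy_broadband_psd(
    freqs, node_freqs, node_log_psd, smoothen=False,
    rng=np.random.default_rng(seed=0),
)
smooth_asd, noisy_asd = np.sqrt(smooth_psd), np.sqrt(noisy_psd)
```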

Data conditioning

Importantly, the variance of white noise in each frequency bin is not 1, but rather

\[ \sigma^2_{\text{white}} = \frac{1}{4\delta f} \]

where \(\delta f\) is the frequency resolution.

The factor \(4\delta f\) in the denominator is most easily seen to arise from discretizing the noise-weighted inner product,

\[ (a | b) = 4 \text{Re} \int_{f_\text{min}}^{f_\text{max}} df\, \frac{a^\ast(f)b(f)}{S_{\text{n}}(f)} \]

The noise standard deviation is stored in the property UniformFrequencyDomain.noise_std.
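A quick numerical check of this normalization is sketched below. It assumes that the variance \(1/(4\delta f)\) applies separately to the real and imaginary parts of each bin; the helper names are hypothetical and do not refer to Dingo functions. Discretizing the inner product for whitened data gives \((a|b) \approx 4\,\delta f\,\text{Re}\sum_k a_k^\ast b_k\), so white noise drawn with this standard deviation contributes 1 per real degree of freedom on average.

```python
import numpy as np


def white_noise_std(delta_f):
    """Standard deviation 1 / sqrt(4 * delta_f) of whitened noise."""
    return 1.0 / np.sqrt(4.0 * delta_f)


def whitened_inner_product(a_white, b_white, delta_f):
    """Discretized (a|b) for series that were already divided by the ASD."""
    return 4.0 * delta_f * np.real(np.sum(np.conj(a_white) * b_white))


rng = np.random.default_rng(seed=1)
delta_f = 0.125
num_bins = 8192
std = white_noise_std(delta_f)
noise = std * (rng.standard_normal(num_bins) + 1j * rng.standard_normal(num_bins))

# (n|n) is expected to be close to 2 * num_bins: one unit per real degree of freedom.
norm_per_dof = whitened_inner_product(noise, noise, delta_f) / (2 * num_bins)
print(f"noise_std = {std:.4f}, (n|n) per real degree of freedom = {norm_per_dof:.3f}")
```

Dividing whitened data by this standard deviation therefore yields noise with unit variance in each real component.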

Prior to Dingo v0.9.0, the noise variance included a “window factor” correction \(w\): \(\sigma^2_{\text{white}} = w/(4\delta f)\). However, this approach was found to be incorrect and has been removed. For the full discussion, see Talbot et al.