Detector noise
During training, simulated noise \(n_I\) is added to waveforms \(h_I(\theta)\) measured in detectors to produce realistic simulated data,
\[d_I = h_I(\theta) + n_I.\]
Dingo assumes this noise to be stationary and Gaussian, thus it is independent in each frequency bin, with variance given by some power spectral density (PSD).
Important
Similar to extrinsic parameters, detector noise is repeatedly sampled during training and added to the simulated signal. This augments the training set with new noise realizations for each epoch, reducing overfitting.
Although noise is mostly stationary and Gaussian during an LVK observing run, the PSD in each detector does tend to drift from event to event. In a usual likelihood-based PE run, this is taken into account by estimating the PSD at the time of the event (either using Welch’s method on signal-free data surrounding the event, or at the same time as the event using BayesWave), and using this in the likelihood integral.
Dingo also estimates the PSD just prior to an event and uses this at inference time in two ways:
It whitens the data with respect to this PSD.
It provides the PSD (or rather, the inverse ASD) as context to the neural network.
A suitably trained model can therefore make use of the PSD as needed to generate the posterior.
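The two uses of the PSD listed above can be sketched in NumPy. This is an illustration only, not Dingo's transform pipeline: the array names, the flat toy ASD, the layout of the network input, and the `1e-23` rescaling are all assumptions made for the example. It also assumes that the per-bin noise variance applies to the real and imaginary parts separately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy frequency grid and ASD (assumed values, for illustration only).
delta_f = 0.125                       # frequency resolution in Hz (T = 8 s)
freqs = np.arange(20.0, 1024.0, delta_f)
asd = np.full(freqs.shape, 1e-23)     # flat ASD; a real ASD varies with frequency

# Stationary Gaussian noise: independent in each bin. Assumption: real and
# imaginary parts are each drawn with standard deviation asd / sqrt(4 * delta_f).
sigma = asd / np.sqrt(4.0 * delta_f)
noise = sigma * (rng.standard_normal(freqs.size)
                 + 1j * rng.standard_normal(freqs.size))

signal = np.zeros(freqs.size, dtype=complex)  # placeholder for h_I(theta)
data = signal + noise

# 1. Whiten the data with respect to the ASD.
whitened = data / asd

# 2. Provide the inverse ASD as context alongside the whitened strain.
#    The 1e-23 factor only keeps the values of order one (assumed choice).
network_input = np.stack([whitened.real, whitened.imag, 1e-23 / asd])
```

After whitening, the noise no longer depends on the detector's spectral shape, while the inverse-ASD channel still tells the network how sensitive the detector was in each bin.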
ASD dataset
To train a model to perform inference conditioned on the noise PSD, it is necessary to not just sample random noise realizations for a given PSD, but also sample the PSD from a distribution for a given observing run. Training in this way is necessary to perform fully amortized inference and account for the variation of PSDs from event to event.
The ASDDataset class stores a set of ASD samples for several detectors, allowing for sampling during training.
As with the noise realizations, a random ASD is chosen from the dataset when preparing each sample during training. This augments the training set compared to fixing the noise ASD for each sample prior to training.
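A minimal sketch of this per-sample ASD draw follows. `ToyASDDataset` and its `sample_random_asds` method are hypothetical stand-ins written for this example; they are not Dingo's `ASDDataset` API, and the ASD values are random toy numbers.

```python
import numpy as np


class ToyASDDataset:
    """Hypothetical stand-in for an ASD container (not Dingo's ASDDataset)."""

    def __init__(self, asds, seed=None):
        # asds: dict mapping detector name -> array of shape (num_asds, num_freqs)
        self.asds = asds
        self.rng = np.random.default_rng(seed)

    def sample_random_asds(self):
        """Draw one ASD per detector, independently for each detector."""
        return {
            det: arr[self.rng.integers(len(arr))]
            for det, arr in self.asds.items()
        }


# Toy data: 5 ASDs for H1 and 3 for L1, each on 16 frequency bins.
dataset = ToyASDDataset(
    {
        "H1": np.random.default_rng(1).uniform(1e-23, 2e-23, size=(5, 16)),
        "L1": np.random.default_rng(2).uniform(1e-23, 2e-23, size=(3, 16)),
    },
    seed=0,
)

# A fresh ASD is drawn each time a training sample is prepared.
for _ in range(3):
    sample = dataset.sample_random_asds()
```

Drawing inside the data-loading loop, rather than fixing one ASD per waveform beforehand, means the same waveform can be paired with a different ASD in every epoch.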
Similarly to the WaveformDataset, the ASDDataset is just a container. Dingo includes routines for building such a dataset from observational data.
Generating an ASDDataset
dingo_generate_asd_dataset
The basic approach is as follows:
Identify stretches of data within an observing run meeting certain criteria (sufficiently long, without events, and sufficiently high quality, …) or take in user-specified stretches.
Fetch data corresponding to these stretches using either GWOSC or channels, optionally specified in the settings file.
Estimate ASDs using Welch’s method on these stretches.
Save the collection of ASDs.
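The estimation step can be illustrated with `scipy.signal.welch`, using white Gaussian noise in place of detector strain. This is a sketch under assumptions, not the code Dingo runs. In particular it assumes the Tukey roll-off is a duration in seconds, so that the window shape parameter is `alpha = 2 * roll_off / T`. The data length is also shortened to keep the example fast.

```python
import numpy as np
from scipy.signal import welch

f_s = 4096        # sampling rate (Hz)
T = 8             # segment length (s)
time_psd = 64     # total data length (s); shortened here to keep the example fast
roll_off = 0.4    # Tukey roll-off (s)

# Stand-in for detector strain: white Gaussian noise with unit variance.
rng = np.random.default_rng(0)
strain = rng.standard_normal(time_psd * f_s)

# Assumption: the roll-off is a duration in seconds, so the Tukey shape
# parameter is alpha = 2 * roll_off / T.
alpha = 2 * roll_off / T

freqs, psd = welch(
    strain,
    fs=f_s,
    window=("tukey", alpha),
    nperseg=T * f_s,
    noverlap=0,          # adjacent, non-overlapping segments (no gap)
    average="median",    # median average of the periodograms
)
asd = np.sqrt(psd)
```

The resulting frequency grid has spacing `1 / T` and extends to `f_s / 2`. The median average is less sensitive than the mean to a single segment contaminated by a glitch.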
usage: dingo_generate_asd_dataset [-h] --data_dir DATA_DIR [--settings_file SETTINGS_FILE] [--time_segments_file TIME_SEGMENTS_FILE] [--out_name OUT_NAME] [--verbose]
Generate an ASD dataset based on a settings file.
optional arguments:
-h, --help show this help message and exit
--data_dir DATA_DIR Path where the PSD data is to be stored. Must contain a 'settings.yaml' file.
--settings_file SETTINGS_FILE
Path to a settings file in case two different datasets are generated in the same directory
--time_segments_file TIME_SEGMENTS_FILE
Optional file containing a dictionary of a list of time segments that should be used for estimating PSDs.This has to be a pickle file.
--out_name OUT_NAME Path to resulting ASD dataset
--verbose
where the settings file is of the form
dataset_settings:
  f_min: 0
  f_max: 2048
  f_s: 4096
  time_psd: 1024
  T: 8
  time_gap: 0
  window:
    roll_off: 0.4
    type: tukey
  num_psds_max: 20
  channels:
    H1: H1:DCS-CALIB_STRAIN_C02
    L1: L1:DCS-CALIB_STRAIN_C02
  detectors:
    - H1
    - L1
  observing_run: O2
condor:
  env_path: path/to/environment
  num_jobs: 2 # per detector
  num_cpus: 16
  memory_cpus: 16000
Options correspond to the following:
- f_min, f_max (optional)
Lower and upper frequency range of the ASDs. Defaults to 0 and f_s/2, respectively.
- Sampling rate f_s (Hz)
This should be at least twice the value of f_max expected to be used.
- Data length time_psd (s)
The entire length of data from which to estimate a PSD using Welch’s method. Periodograms are calculated on segments of this, and then averaged using the median method.
- Segment length T (s)
The length of each segment on which to take the DFT and calculate a periodogram.
- Gap time_gap (s)
Gap between duration-T segments. E.g., if time_psd=1024, T=8, time_gap=8, then for each PSD, 64 periodograms are computed, each using data stretches 8 s long, with gaps of 8 s between segments. Segments would then be \([0~\text{s}, 8~\text{s}], [16~\text{s}, 24~\text{s}], \ldots\).
- Window function
Parameters of the window function used before taking the DFT of data segments.
- num_psds_max (optional)
If set, stop building the dataset after this number of PSDs have been estimated. This setting is useful for building a single-PSD dataset for pretraining a network.
- Channels (optional)
If set, data will be fetched from these channels, instead of using GWOSC.
- Detectors
Which detectors (H1, L1, V1, …) to include in the dataset.
- Observing run
Which observing run to use when estimating PSDs.
- Condor (optional)
Settings for HTCondor useful for parallelizing the ASD estimation across condor jobs.
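The segment arithmetic in the time_gap example can be checked with a few lines of Python. The helper `segment_bounds` and the plain `settings` dictionary are hypothetical names introduced for this sketch. They follow the description of the options above, not Dingo's internal code.

```python
def segment_bounds(time_psd, T, time_gap):
    """Return (start, end) times in seconds of the duration-T segments
    that fit into time_psd seconds, separated by time_gap seconds."""
    stride = T + time_gap
    bounds = []
    start = 0
    while start + T <= time_psd:
        bounds.append((start, start + T))
        start += stride
    return bounds


settings = {"f_max": 2048, "f_s": 4096, "time_psd": 1024, "T": 8, "time_gap": 8}

# The sampling rate should be at least twice f_max.
assert settings["f_s"] >= 2 * settings["f_max"]

segments = segment_bounds(settings["time_psd"], settings["T"], settings["time_gap"])
print(len(segments))   # 64
print(segments[:2])    # [(0, 8), (16, 24)]
```

With `time_gap=0` the same 1024 s stretch yields 128 adjacent segments instead of 64.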
dingo_generate_synthetic_asd_dataset
This method generates a dataset of synthetic ASDs from a dataset of existing ASDs to enhance robustness against ASD distribution shifts. In particular, it can generate synthetic ASDs that are scaled by a fiducial ASD in order to adapt to a new observing run. This is particularly useful for training Dingo networks at the beginning of an observing run, when the number of training ASDs is limited. It can also generate smoother synthetic ASDs that more closely resemble those from BayesWave. The implementation follows the steps explained in this paper.
usage: dingo_generate_synthetic_asd_dataset [-h] --asd_dataset ASD_DATASET --settings_file SETTINGS_FILE [--num_processes NUM_PROCESSES] [--out_file OUT_FILE] [--verbose]
Generate a synthetic noise ASD dataset from an existing dataset of real ASDs.
optional arguments:
-h, --help show this help message and exit
--asd_dataset ASD_DATASET
Path to existing ASD dataset to be parameterized and re-sampled
--settings_file SETTINGS_FILE
YAML file containing database settings
--num_processes NUM_PROCESSES
Number of processes to use in pool for parallel parameterization
--out_file OUT_FILE Name of file for storing dataset.
--verbose
with a settings file of the form
parameterization_settings:
  num_spline_positions: 30
  num_spectral_segments: 400
  sigma: 0.14
  delta_f: -1
  smoothen: True
sampling_settings:
  bandwidth_spectral: 0.5
  bandwidth_spline: 0.25
  num_samples: 500
  split_frequencies:
    - 30
    - 100
  rescaling_psd_paths:
    H1: /path/to/rescaling_asd_H1.hdf5
    L1: /path/to/rescaling_asd_L1.hdf5
Options correspond to the following:
- num_spline_positions
Number of nodes to use for the cubic spline interpolating the broad-band noise PSD.
- num_spectral_segments
Maximum number of spectral lines to model.
- sigma
Standard deviation of the Normal distribution parameterizing \(p(\log S_n|z)\).
- delta_f
If > 0, truncates each spectral line.
- smoothen
Whether to save the smooth ASDs (True) or the noisy ASDs (False). The noisy synthetic ASDs resemble real ASDs estimated with Welch’s method more closely, while the smooth ASDs are more similar to ASDs generated with BayesWave. (Default: False)
- bandwidth_spectral, bandwidth_spline
Bandwidths for the KDEs modeling the distribution over spectral lines and broad-band noise, respectively. These determine the width of the resulting distribution.
- num_samples
Number of synthetic ASDs to generate.
- split_frequencies
(Set of) frequencies at which to divide the broad-band noise into independent segments, e.g. due to different dominant noise sources (shot noise, seismic noise, etc.).
- rescaling_psd_paths
Paths to ASD datasets for each detector to which the synthetic ASDs should be rescaled, e.g. the PSDs from the target observing run. If the dataset contains multiple ASDs, we use the first one. (Optional; if not provided, no rescaling will be done.)
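To give a feel for what num_spline_positions, sigma, and smoothen control, here is a heavily simplified sketch of the broad-band part only. It is not Dingo's implementation. It omits spectral lines, the KDE resampling, and rescaling. The toy PSD, the log-spaced node placement, and the way scatter is added are all assumptions made for illustration.

```python
import numpy as np
from scipy.interpolate import CubicSpline

rng = np.random.default_rng(0)

num_spline_positions = 30
sigma = 0.14

# Toy "real" PSD on a uniform frequency grid: a smooth power law, no lines.
freqs = np.linspace(20.0, 1024.0, 2000)
log_psd = np.log(1e-46 * (freqs / 100.0) ** -4 + 1e-47)

# Broad-band model: cubic spline through log-PSD values at log-spaced nodes
# (node placement is an assumption of this sketch).
nodes = np.geomspace(freqs[0], freqs[-1], num_spline_positions)
node_values = np.interp(nodes, freqs, log_psd)
smooth_log_psd = CubicSpline(nodes, node_values)(freqs)

# Noisy variant: Gaussian scatter of width sigma around the smooth log-PSD.
noisy_log_psd = smooth_log_psd + sigma * rng.standard_normal(freqs.size)

# Convert log-PSDs to ASDs. The smooth curve plays the role of smoothen: True,
# the noisy curve that of smoothen: False.
smooth_asd = np.exp(0.5 * smooth_log_psd)
noisy_asd = np.exp(0.5 * noisy_log_psd)
```

In this picture the spline node values are the low-dimensional parameters whose distribution would be modeled and resampled to produce new synthetic ASDs.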
Data conditioning
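Importantly, the variance of white noise in each frequency bin is not 1, but rather
\[\sigma^2_{\text{white}} = \frac{1}{4 \delta f},\]
where \(\delta f\) is the frequency resolution.
The denominator in the noise variance is seen to arise most easily in the noise-weighted inner product,
\[(a | b) = 4 \, \text{Re} \int df \, \frac{a^*(f) \, b(f)}{S_n(f)}.\]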
The noise standard deviation is stored in the property UniformFrequencyDomain.noise_std.
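As a numerical check of this normalization, the sketch below computes the white-noise standard deviation for a given frequency resolution and compares it with a discretized noise-weighted inner product. The helper `white_noise_std` is a hypothetical function written for this example and does not call Dingo. The check assumes the variance applies to the real and imaginary parts separately.

```python
import numpy as np


def white_noise_std(delta_f):
    """Standard deviation of white noise per frequency bin, 1 / sqrt(4 * delta_f)."""
    return 1.0 / np.sqrt(4.0 * delta_f)


T = 8.0                 # segment duration (s)
delta_f = 1.0 / T       # frequency resolution (Hz)
noise_std = white_noise_std(delta_f)
print(noise_std)        # sqrt(2), about 1.4142, for delta_f = 0.125 Hz

# Discretized inner product for whitened data (S_n = 1):
# (n|n) = 4 * delta_f * sum(|n|^2). Assumption: real and imaginary parts each
# have standard deviation noise_std, so (n|n) averages to 2 per bin.
rng = np.random.default_rng(0)
num_bins = 100_000
n = noise_std * (rng.standard_normal(num_bins)
                 + 1j * rng.standard_normal(num_bins))
inner = 4.0 * delta_f * np.sum(np.abs(n) ** 2)
print(inner / num_bins)  # close to 2
```

The factor `4 * delta_f` in the discretized inner product cancels the `1 / (4 * delta_f)` variance, leaving one unit per real degree of freedom in each bin.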
Prior to Dingo v0.9.0, the noise variance included a “window factor” correction \(w\): \(\sigma^2_{\text{white}} = w/(4\delta f)\). However, this approach was found to be incorrect and has been removed. For the full discussion, see Talbot et al.