{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3cef5021-4f26-43dd-b761-70632793e32e",
   "metadata": {},
   "source": [
    "# Detector noise\n",
    "\n",
    "During training, simulated noise $n_I$ is added to waveforms $h_I(\\theta)$ measured in detectors to produce realistic simulated data,\n",
    "\n",
    "$$\n",
    "d_I = h_I(\\theta) + n_I.\n",
    "$$\n",
    "\n",
    "Dingo assumes this noise to be stationary and Gaussian, thus it is independent in each frequency bin, with variance given by some power spectral density (PSD).\n",
    "```{important}\n",
    "Similar to extrinsic parameters, detector noise is repeatedly sampled **during training** and added to the simulated signal. This augments the training set with new noise realizations for each epoch, reducing overfitting. \n",
    "```\n",
    "\n",
    "Although noise is *mostly* stationary and Guassian during an LVK observing run, the PSD in each detector does tend to drift from event to event. In a usual likelihood-based PE run, this is taken into account by estimating the PSD at the time of the event (either using [Welch's method](https://en.wikipedia.org/wiki/Welch%27s_method) on signal-free data surrounding the event, or at the same time as the event using [BayesWave](https://git.ligo.org/lscsoft/bayeswave)), and using this in the likelihood integral.\n",
    "\n",
    "Dingo also estimates the PSD just prior to an event and uses this at inference time in two ways:\n",
    "1. It whitens the data with respect to this PSD.\n",
    "2. It provides the PSD (or rather, the inverse ASD) as context to the neural network.\n",
    "\n",
    "A suitably trained model can therefore make use of the PSD as needed to generate the posterior.\n",
    "\n",
    "(asd-dataset)=\n",
    "## ASD dataset\n",
    "\n",
    "To train a model to perform inference conditioned on the noise PSD, it is necessary to not just sample random noise realizations for a given PSD, but also **sample the PSD** from a distribution for a given observing run. Training in this way is necessary to perform fully amortized inference and account for the variation of PSDs from event to event.\n",
    "\n",
    "The `ASDDataset` class stores a set of ASD samples for several detectors, allowing for sampling during training.\n",
    "\n",
    "```{eval-rst}\n",
    ".. autoclass:: dingo.gw.ASD_dataset.noise_dataset.ASDDataset\n",
    "    :members:\n",
    "    :inherited-members:\n",
    "    :show-inheritance:\n",
    "```\n",
    "\n",
    "As with the noise realizations, a random ASD is chosen from the dataset when preparing each sample during training. This augments the training set compared to fixing the noise ASD for each sample prior to training.\n",
    "\n",
    "Similarly to the `WaveformDataset`, the `ASDDataset` is just a container. Dingo includes routines for building such a dataset from observational data.\n",
    "\n",
    "## Generating an ASDDataset\n",
    "\n",
    "### `dingo_generate_asd_dataset`\n",
    " The basic approach is as follows:\n",
    "1. Identify stretches of data within an observing run meeting certain criteria (sufficiently long, without events, and sufficiently high quality, ...) or take-in user-specified stretches.\n",
    "2. Fetch data corresponding to these stretches using either\n",
    "    - [GWOSC](https://www.gw-openscience.org)\n",
    "    - channels, optionally specified in the settings file.\n",
    "3. Estimate ASDs using Welch's method on these stretches.\n",
    "4. Save the collection of ASDs.\n",
    "\n",
    "```text\n",
    "usage: dingo_generate_asd_dataset [-h] --data_dir DATA_DIR [--settings_file SETTINGS_FILE] [--time_segments_file TIME_SEGMENTS_FILE] [--out_name OUT_NAME] [--verbose]\n",
    "\n",
    "Generate an ASD dataset based on a settings file.\n",
    "\n",
    "optional arguments:\n",
    "  -h, --help            show this help message and exit\n",
    "  --data_dir DATA_DIR   Path where the PSD data is to be stored. Must contain a 'settings.yaml' file.\n",
    "  --settings_file SETTINGS_FILE\n",
    "                        Path to a settings file in case two different datasets are generated in the same directory\n",
    "  --time_segments_file TIME_SEGMENTS_FILE\n",
    "                        Optional file containing a dictionary of a list of time segments that should be used for estimating PSDs.This has to be a pickle file.\n",
    "  --out_name OUT_NAME   Path to resulting ASD dataset\n",
    "  --verbose\n",
    "\n",
    "```\n",
    "where the settings file is of the form\n",
    "```yaml\n",
    "dataset_settings:\n",
    "  f_min: 0\n",
    "  f_max: 2048\n",
    "  f_s: 4096\n",
    "  time_psd: 1024\n",
    "  T: 8\n",
    "  time_gap: 0\n",
    "  window:\n",
    "    roll_off: 0.4\n",
    "    type: tukey\n",
    "  num_psds_max: 20\n",
    "  channels:\n",
    "   H1: H1:DCS-CALIB_STRAIN_C02\n",
    "   L1: L1:DCS-CALIB_STRAIN_C02\n",
    "  detectors:\n",
    "    - H1\n",
    "    - L1\n",
    "  observing_run: O2\n",
    "condor:\n",
    "  env_path: path/to/environment\n",
    "  num_jobs: 2    # per detector\n",
    "  num_cpus: 16\n",
    "  memory_cpus: 16000\n",
    "```\n",
    "\n",
    "Options correspond to the following:\n",
    "\n",
    "`f_min`, `f_max` (optional)\n",
    ": Lower and upper frequency range of the ASDs. Defaults to 0 and `f_s/2`, respectively.\n",
    "\n",
    "Sampling rate `f_s` (Hz)\n",
    ": This should be at least twice the value of `f_max` expected to be used.\n",
    "\n",
    "Data length `time_psd` (s)\n",
    ": The entire length of data from which to estimate a PSD using Welch's method. Periodigrams are calculated on segments of this, and then averaged using the `median` method.\n",
    "\n",
    "Segment length `T` (s)\n",
    ": The length of each segment on which to take the DFT and calculate a periodigram.\n",
    "\n",
    "Gap `time_gap` (s)\n",
    ": Gap between duration-`T` segments. E.g., if `time_psd=1024`, `T=8`, `time_gap=8`, then for each PSD, 64 periodigrams are computed, each using data stretches 8 s long, with gaps of 8 s between segments. Segments would then be $[0~\\text{s}, 8~\\text{s}], [16~\\text{s}, 24~\\text{s}], \\ldots$.\n",
    "\n",
    "Window function\n",
    ": Parameters of the window function used before taking DFT of data segments.\n",
    "\n",
    "`num_psds_max` (optional)\n",
    ": If set, stop building the dataset after this number of PSDs have been estimated. This setting is useful for building a single-PSD dataset for pretraining a network.\n",
    "\n",
    "Channels (optional)\n",
    ": If set, data will be fetched from these channels, instead of using GWOSC.\n",
    "\n",
    "Detectors\n",
    ": Which detectors (H1, L1, V1, ...) to include in the dataset.\n",
    "\n",
    "Observing run\n",
    ": Which observing run to use when estimating PSDs.\n",
    "\n",
    "Condor (optional)\n",
    ": Settings for [HTCondor](https://htcondor.readthedocs.io/en/latest/index.html) useful for parallelizing the ASD estimation across condor jobs.\n",
    "\n",
    "### `dingo_generate_synthetic_asd_dataset`\n",
    "This method generates a dataset of synthetic ASDs from a dataset of existing ASDs to enhance robustness against ASD distribution shifts. In particular, this allows to generate a dataset of synthetic ASDs that are *scaled* by a fiducial ASD in order to adapt to a new observing run. This is particularly useful for training Dingo networks at the beginning of an observing run, when the number of training ASDs is limited. It also allows to generate smoother synthetic ASDs that more closely resemble those from BayesWave. The implementation follows the steps explained in this [paper](https://inspirehep.net/literature/2182788).\n",
    "```text\n",
    "usage: dingo_generate_synthetic_asd_dataset [-h] --asd_dataset ASD_DATASET --settings_file SETTINGS_FILE [--num_processes NUM_PROCESSES] [--out_file OUT_FILE] [--verbose]\n",
    "\n",
    "Generate a synthetic noise ASD dataset from an existing dataset of real ASDs.\n",
    "\n",
    "optional arguments:\n",
    "  -h, --help            show this help message and exit\n",
    "  --asd_dataset ASD_DATASET\n",
    "                        Path to existing ASD dataset to be parameterized and re-sampled\n",
    "  --settings_file SETTINGS_FILE\n",
    "                        YAML file containing database settings\n",
    "  --num_processes NUM_PROCESSES\n",
    "                        Number of processes to use in pool for parallel parameterization\n",
    "  --out_file OUT_FILE   Name of file for storing dataset.\n",
    "  --verbose\n",
    "```\n",
    "with a settings file of the form\n",
    "```yaml\n",
    "parameterization_settings:\n",
    "  num_spline_positions: 30\n",
    "  num_spectral_segments: 400\n",
    "  sigma: 0.14\n",
    "  delta_f: -1\n",
    "  smoothen: True\n",
    "sampling_settings:\n",
    "   bandwidth_spectral: 0.5\n",
    "   bandwidth_spline: 0.25\n",
    "   num_samples: 500\n",
    "   split_frequencies:\n",
    "     - 30\n",
    "     - 100\n",
    "   rescaling_psd_paths:\n",
    "     H1: /path/to/rescaling_asd_H1.hdf5\n",
    "     L1: /path/to/rescaling_asd_L1.hdf5\n",
    "```\n",
    "Options correspond to the following:\n",
    "\n",
    "`num_spline_positions`\n",
    ": Number of nodes to use for the cubic spline interpolating the broad-band noise PSD.\n",
    "\n",
    "`num_spectral_segments`\n",
    ": Maximum number of spectral lines to model.\n",
    "\n",
    "`sigma`\n",
    ": Standard deviation of the Normal distribution parameterizing $p(\\log S_n|z)$.\n",
    "\n",
    "`delta_f`\n",
    ": If > 0, truncates each spectral line.\n",
    "\n",
    "`smoothen`\n",
    ": Whether to save the smooth ASDs (True) or the noisy ASDs (False). The noisy synthetic ASDs resemble real ASDs estimated with Welch's method more closely, while the smooth ASDs are more similar to ASDs generated with BayesWave. (Default: False) \n",
    "\n",
    "`bandwidth_spectral, bandwidth_spline`\n",
    ": Bandwidths for the KDEs modeling the distribution over spectral lines and broad-band noise, respectively. These determine the width of the resulting distribution.\n",
    "\n",
    "`num_samples`\n",
    ": Number of synthetic ASDs to generate.\n",
    "\n",
    "`split_frequencies`\n",
    ": (Set of) frequencies at dividing the broad-band noise into independent segments, e.g. due to different dominant noise sources (shot noise, seismic noise, etc.).\n",
    "\n",
    "`rescaling_psd_paths`\n",
    ": Paths to ASD datasets for each detector to which the synthetic ASDs should be rescaled, e.g. the PSDs from the target observing run. If the dataset contains multiple ASDs, we use the first one. (Optional; if not provided, no rescaling will be done.)\n",
    "\n",
    "(ref:window-factor)=\n",
    "## Data conditioning\n",
    "\n",
    "Importantly, the variance of *white* noise in each frequency bin is not 1, but rather\n",
    "\n",
    "$$\n",
    "\\sigma^2_{\\text{white}} = \\frac{1}{4\\delta f}\n",
    "$$\n",
    "\n",
    "where $\\delta f$ is the frequency resolution.\n",
    "\n",
    "The denominator in the noise variance is seen to arise most easily in the noise-weighted inner product,\n",
    "\n",
    "$$\n",
    "(a | b) = 4 \\text{Re} \\int_{f_\\text{min}}^{f_\\text{max}} df\\, \\frac{a^\\ast(f)b(f)}{S_{\\text{n}}(f)}\n",
    "$$\n",
    "\n",
    "The noise standard deviation is stored in the property `UniformFrequencyDomain.noise_std`. \n",
    "\n",
    "Prior to Dingo v0.9.0, the noise variance included a \"window factor\" correction $w$: $\\sigma^2_{\\text{white}} = w/(4\\delta f)$. However, this approach was found to be incorrect and has been removed. For the full discussion, see [Tablot et al.](https://arxiv.org/abs/2508.11091).      "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dde997d7-222c-4915-8208-56d55a865188",
   "metadata": {
    "pycharm": {
     "is_executing": true
    }
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "cccb21eac58f1df3",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}