{ "cells": [ { "cell_type": "markdown", "id": "45b4625b-e3f2-469a-b43e-33923240857f", "metadata": {}, "source": [ "# Data pre-processing\n", "\n", "A sample from a `WaveformDataset` consists of labeled waveform polarizations $(\\theta_{\\text{intrinsic}}, (h_+,h_\\times))$, represented as a nested dictionary. This must be transformed into noisy detector data $d_I$ (with additional noise context data) in a form suitable for input to a neural network. Dingo accomplishes this by applying a sequence of [**transforms**](https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html) to the sample.\n", "\n", "A transform is simply a class with a `__call__()` method, which takes a sample as input and returns a transformed sample. A sequence of transforms can then be [composed](https://pytorch.org/vision/stable/generated/torchvision.transforms.Compose.html#torchvision.transforms.Compose) to build a more complex transform in a modular way. Dingo's training transform sequence is stored as `WaveformDataset.transform`, and is applied automatically when elements are accessed through indexing.\n", "\n", "## GW transform sequence\n", "\n", "For Dingo, the flowchart below indicates the sequence of transforms applied to a sample from a `WaveformDataset`.\n", "\n", "```{mermaid}\n", " :caption: Flowchart for Dingo data-preprocessing pipeline for training, starting from a sample from a `WaveformDataset`. 
Transforms with rounded corners include an element of randomness, whereas parallelogram-shaped items are deterministic.\n", " \n", "flowchart TB\n", " sample[Sample from WaveformDataset]\n", " sample-->extrinsic([SampleExtrinsicParameters])\n", " subgraph det[Simulate waveforms in detectors]\n", " direction TB\n", " det_times[/GetDetectorTimes/]\n", " det_times-->gnpe_maybe{Using GNPE?}\n", " gnpe_maybe-- No -->project_det[/ProjectOntoDetectors/]\n", " gnpe_maybe-- Yes -->gnpe_times([GNPECoalescenceTimes])\n", " gnpe_times-->project_det\n", " end\n", " subgraph noise[Add noise]\n", " direction TB\n", " sample_asd([SampleNoiseASD])\n", " sample_asd-->whiten[/WhitenAndScaleStrain/]\n", " whiten-->add_noise([AddWhiteNoiseComplex])\n", " end\n", " subgraph output[Prepare output]\n", " direction TB\n", " standardize[/SelectStandardizeRepackageParameters/] \n", " standardize-->repackage[/RepackageStrainsAndASDS/]\n", " repackage-->unpack[/UnpackDict/]\n", " end\n", " extrinsic-->det\n", " det-->noise\n", " noise-->output\n", " output-->E[End]\n", "```\n", "\n", "```{important}\n", "Some pre-processing transforms include an element of randomness. This serves to augment the training data and reduce overfitting.\n", "```\n", "\n", "### Extrinsic parameters\n", "\n", "The starting point for this chain of transforms is a sample `sample` with `parameters` and `polarizations` sub-dictionaries. The first transform samples the extrinsic parameters, and adds a new sub-dictionary `extrinsic_parameters` to `sample`. Extrinsic parameters include sky position (right ascension, declination), polarization, time of coalescence, and luminosity distance (the latter two of which are also considered intrinsic parameters).\n", "\n", "```{eval-rst}\n", ".. 
autoclass:: dingo.gw.transforms.SampleExtrinsicParameters\n", " :members:\n", "```\n", "\n", "### Detector waveforms\n", "\n", "The next sequence of transforms applies the extrinsic parameters to `sample[\"polarizations\"]` to produce detector waveforms in `sample[\"waveform\"]`. First it calculates the arrival time $t_I$ of the waveform in each detector, based on the time of coalescence at geocenter and the sky position, and stores this in `sample[\"extrinsic_parameters\"]`.\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.GetDetectorTimes\n", " :members:\n", "```\n", "\n", "```{important}\n", "Dingo models are trained for a **fixed set of detectors.** This must be selected prior to training, and a new model must be trained if one wishes to analyze data in a different set of detectors. Thus, e.g., separate models must be trained for HL and HLV configurations.\n", "```\n", "\n", "(ref:ref-time)=\n", "```{note}\n", "During training, Dingo **fixes the orientation of the Earth** (and corresponding interferometer positions and orientations) to that at a fixed reference time `ref_time`. This is so that the model does not have to learn about the rotation of the Earth. This is corrected in post-processing by shifting the inferred right ascension by the difference between the true and reference sidereal times.\n", "```\n", "\n", "Optionally, the times $t_I$ are perturbed to give new \"proxy times\" as part of the [](gnpe.md) algorithm.\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.GNPECoalescenceTimes\n", " :members:\n", "```\n", "\n", "Finally, the detector waveforms $h_I$ are calculated from the extrinsic parameters. (In the backend, these transforms use the Bilby interferometer libraries.) The contents of the `extrinsic_parameters` sub-dictionary are then moved into `sample[\"parameters\"]`; this was essentially a holding place for parameters not yet applied to the waveform.\n", "\n", "```{eval-rst}\n", ".. 
autoclass:: dingo.gw.transforms.ProjectOntoDetectors\n", " :members:\n", "```\n", "\n", "### Noise\n", "\n", "Once the detector waveforms have been obtained, noise $n_I$ must be added to simulate realistic data. First, noise ASDs are selected randomly for each detector from an `ASDDataset` for the relevant observing run. These are stored in `sample[\"asds\"]`. For details see [](noise_dataset.ipynb#asd-dataset).\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.SampleNoiseASD\n", " :members:\n", "```\n", "\n", "The waveform is then whitened using the sampled ASD, and furthermore scaled by the standard deviation of white noise. This is so that each input to the network will have unit variance, which is important for successful training.\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.WhitenAndScaleStrain\n", " :members:\n", "```\n", "\n", "After whitening, the noise is white, so finally white noise is randomly sampled and added to `sample[\"waveform\"]`.\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.AddWhiteNoiseComplex\n", " :members:\n", "```\n", "\n", "### Output\n", "\n", "The final set of transforms prepares the sample for input to the neural network. First, the desired inference parameters are selected. By taking only a subset of `parameters`, one can train a marginalized posterior model. These parameters are also standardized to have zero mean and unit variance to improve training. (Standardization will be undone in post-processing after inference.) The parameters will then be repackaged into a `numpy.ndarray`, so that parameter labels are implicit based on ordering.\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.SelectStandardizeRepackageParameters\n", " :members:\n", "```\n", "\n", "The `waveform` and `asds` dictionaries are also repackaged into a single array of shape suitable for input to the network. 
In particular, the complex frequency-domain strain data are decomposed into real and imaginary parts.\n", "\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.RepackageStrainsAndASDS\n", " :members:\n", "```\n", "\n", "Finally, the `sample` dictionary of arrays is unpacked to a tuple of arrays for parameters and data.\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.transforms.UnpackDict\n", " :members:\n", "```\n", "\n", "When used with a PyTorch `DataLoader`, the final NumPy arrays are automatically converted to PyTorch tensors.\n", "\n", "\n", "## Building the transforms\n", "\n", "The following function will set the `transform` property of a `WaveformDataset` to the above transform sequence:\n", "\n", "```{eval-rst}\n", ".. autofunction:: dingo.gw.training.set_train_transforms\n", "```\n", "\n", "The various options are specified by passing an appropriate `data_settings` dictionary. In practice, these settings will be specified along with other [training settings](training).\n", "\n", "```{code-block} yaml\n", "---\n", "caption: Sample `data_settings` dictionary for configuring a sequence of training transforms. 
This dictionary includes several options not needed for `set_train_transforms`, but which are needed as part of other training settings.\n", "---\n", "waveform_dataset_path: /path/to/waveform_dataset.hdf5 # Contains intrinsic waveforms\n", "train_fraction: 0.95\n", "domain_update:\n", " f_min: 20.0\n", " f_max: 1024.0\n", "svd_size_update: 200 # Optionally, reduce the SVD size when decompressing (for performance)\n", "detectors:\n", " - H1\n", " - L1\n", "extrinsic_prior: # Sampled at train time\n", " dec: default\n", " ra: default\n", " geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10)\n", " psi: default\n", " luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=1000.0)\n", "ref_time: 1126259462.391\n", "gnpe_time_shifts:\n", " kernel: bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001)\n", " exact_equiv: True\n", "inference_parameters: default\n", "```\n", "\n", "waveform_dataset_path\n", ": Points to the waveform dataset.\n", "\n", "train_fraction\n", ": Fraction of waveform dataset to be used for training. The remainder are used to compute the test loss.\n", "\n", "domain_update (optional)\n", ": Optionally specify new domain properties. These will update the domain associated to the `WaveformDataset`. They must necessarily describe a domain contained within the original.\n", "\n", "svd_size_update (optional)\n", ": If the `WaveformDataset` uses SVD compression, optionally use a smaller number of basis elements than stored in the dataset. Decompression of the waveforms is the slowest preprocessing operation, so using this option can improve training speed at the expense of accuracy.\n", "\n", "detectors\n", ": Set the desired GW interferometers for the Dingo model.\n", "\n", "extrinsic_prior\n", ": Specify the extrinsic prior. Default options are available.\n", "\n", "ref_time\n", ": Reference time for the interferometer locations and orientations. 
See the [important note](ref:ref-time) above.\n", "\n", "gnpe_time_shifts (optional)\n", ": GNPE kernel and additional options. See [](gnpe.md).\n", "\n", "inference_parameters\n", ": Parameters to infer with the model. At present they must be a subset of `sample[\"parameters\"]`. By specifying a strict subset, this can be used to marginalize over parameters. The `default` setting points to `dingo.gw.prior.default_inference_parameters`:" ] }, { "cell_type": "code", "execution_count": 1, "id": "92aad1270d34fa9b", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:25:47.239983Z", "start_time": "2025-03-06T12:25:47.142317Z" } }, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\", \"Wswiglal-redir-stdio\")\n", "import lal" ] }, { "cell_type": "code", "execution_count": 2, "id": "1f8f1915-7da5-4559-be5e-604d3f325b5e", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:25:47.740156Z", "start_time": "2025-03-06T12:25:47.247422Z" } }, "outputs": [ { "data": { "text/plain": [ "['chirp_mass',\n", " 'mass_ratio',\n", " 'phase',\n", " 'a_1',\n", " 'a_2',\n", " 'tilt_1',\n", " 'tilt_2',\n", " 'phi_12',\n", " 'phi_jl',\n", " 'theta_jn',\n", " 'luminosity_distance',\n", " 'geocent_time',\n", " 'ra',\n", " 'dec',\n", " 'psi']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from dingo.gw.prior import default_inference_parameters\n", "default_inference_parameters" ] }, { "cell_type": "code", "execution_count": null, "id": "2bcfa50e-9109-486e-8f24-a2f246e4054d", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:25:47.781212Z", "start_time": "2025-03-06T12:25:47.779904Z" } }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": 
"python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }