{ "cells": [ { "cell_type": "markdown", "id": "08b18d01-1d05-459a-b463-b748618268db", "metadata": {}, "source": [ "# Building a waveform dataset\n", "\n", "For training neural networks, the more training samples the better. With too little training data, one runs the risk of overfitting. Waveforms, however, can be expensive to generate and take up significant storage. Dingo adopts several strategies to mitigate these problems:\n", "\n", "* Dingo partitions parameters into two types---intrinsic and extrinsic---and builds a training set based only on the intrinsic parameters. This consists of waveform polarizations $h_+$ and $h_\\times$. Extrinsic parameters are selected during training, and applied to generate the detector waveforms $h_I$. This augments the training set to provide unlimited samples from the extrinsic parameters.\n", "\n", "* Saved waveforms are compressed using a singular value decomposition. Although this is lossy, waveform mismatches can monitored to ensure that they fall below the intrinsic error in the waveform model. \n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "697a2506-0b8c-40ee-8036-2e2911b22e08", "metadata": {}, "source": [ "## The `WaveformDataset` class\n", "\n", "The `WaveformDataset` is a storage container for waveform polarizations and parameters, which can used to serve samples to a neural network during training:\n", "\n", "```{eval-rst}\n", ".. autoclass:: dingo.gw.dataset.WaveformDataset\n", " :members:\n", " :inherited-members:\n", " :show-inheritance:\n", "```\n", "\n", "`WaveformDataset` subclasses `dingo.core.dataset.DingoDataset` and `torch.utils.data.Dataset`. The former provides generic functionality for saving and loading datasets as HDF5 files and dictionaries, and is used in several components of Dingo. The latter allows the `WaveformDataset` to be used with a PyTorch `DataLoader`. In general, we follow the PyTorch design framework for training, including [Datasets, DataLoaders,](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) and [Transforms](https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html)." ] }, { "cell_type": "markdown", "id": "e16c9039-fbef-4996-8eac-96cc8f42dd52", "metadata": { "tags": [] }, "source": [ "## Generating a simple dataset\n", "\n", "As described above, the `WaveformDataset` class is just a container, and does not generate the contents itself. Dataset generation is instead carried out using functions in the `dingo.gw.dataset.generate_dataset` module. Although in practice, datasets are likely to be generated from a settings file using the command line interface, here we describe how to generate one interactively.\n", "\n", "A dataset is based on an intrinsic prior and a waveform generator, so we build these as described [here](generating_waveforms.ipynb)." ] }, { "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:13.184578Z", "start_time": "2025-03-06T12:26:13.100578Z" } }, "cell_type": "code", "source": [ "import warnings\n", "warnings.filterwarnings(\"ignore\", \"Wswiglal-redir-stdio\")\n", "import lal" ], "id": "c05d01b6b5c311a2", "outputs": [], "execution_count": 1 }, { "cell_type": "code", "id": "a40003ad-e6c4-4def-8935-dedb3e448935", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:14.282156Z", "start_time": "2025-03-06T12:26:13.197947Z" } }, "source": [ "from dingo.gw.waveform_generator import WaveformGenerator\n", "from bilby.core.prior import PriorDict\n", "from dingo.gw.prior import default_intrinsic_dict\n", "from dingo.gw.domains import FrequencyDomain\n", "\n", "domain = FrequencyDomain(f_min=20.0, f_max=1024.0, delta_f=0.125)\n", "wfg = WaveformGenerator(approximant='IMRPhenomXPHM', domain=domain, f_ref=20.0)\n", "prior = PriorDict(default_intrinsic_dict)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Setting spin_conversion_phase = None. Using phase parameter for conversion to cartesian spins.\n" ] } ], "execution_count": 2 }, { "cell_type": "markdown", "id": "99729561-9b18-4be7-8a67-3b6d59a69858", "metadata": {}, "source": [ "We can use the following function to generate sets of parameters and associated waveforms:" ] }, { "cell_type": "code", "id": "6ea82e56-dcff-4e84-8105-27fcee0c7566", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:15.636350Z", "start_time": "2025-03-06T12:26:14.323162Z" } }, "source": [ "from dingo.gw.dataset.generate_dataset import generate_parameters_and_polarizations\n", "\n", "parameters, polarizations = generate_parameters_and_polarizations(wfg,\n", " prior,\n", " num_samples=100,\n", " num_processes=1)" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generating dataset of size 100\n" ] } ], "execution_count": 3 }, { "cell_type": "code", "id": "93896ee8-100c-44aa-88d2-7ffab93ac1c3", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:15.651837Z", "start_time": "2025-03-06T12:26:15.644358Z" } }, "source": [ "parameters" ], "outputs": [ { "data": { "text/plain": [ " mass_ratio chirp_mass luminosity_distance theta_jn phase a_1 \\\n", "0 0.218187 73.845050 1000.0 1.255204 1.966362 0.197980 \n", "1 0.381173 87.704762 1000.0 2.033628 3.888862 0.460440 \n", "2 0.510406 93.479307 1000.0 1.859908 3.469898 0.023533 \n", "3 0.678305 92.145038 1000.0 0.758713 2.841377 0.172021 \n", "4 0.624489 33.540545 1000.0 1.582852 1.577590 0.413280 \n", ".. ... ... ... ... ... ... \n", "95 0.540129 87.451546 1000.0 2.696406 5.270380 0.201667 \n", "96 0.803457 66.013454 1000.0 0.379665 0.175340 0.437341 \n", "97 0.861454 75.908534 1000.0 1.805871 1.334242 0.505140 \n", "98 0.380818 45.702456 1000.0 1.684684 3.820672 0.092019 \n", "99 0.941143 69.169888 1000.0 2.045144 0.209135 0.925224 \n", "\n", " a_2 tilt_1 tilt_2 phi_12 phi_jl geocent_time \n", "0 0.240156 1.972606 1.376228 2.186446 4.752777 0.0 \n", "1 0.692240 1.754236 0.661015 0.790942 5.066653 0.0 \n", "2 0.296818 2.552577 0.359922 2.138755 3.489143 0.0 \n", "3 0.934613 0.359660 2.157047 3.599841 0.860001 0.0 \n", "4 0.964930 1.929234 2.084173 1.543995 5.298489 0.0 \n", ".. ... ... ... ... ... ... \n", "95 0.187635 0.447384 1.944557 0.052446 0.952740 0.0 \n", "96 0.730075 1.475004 2.752046 5.595977 2.047529 0.0 \n", "97 0.566819 0.965326 0.194196 0.807147 2.357237 0.0 \n", "98 0.228797 1.478859 1.849281 5.860794 0.562862 0.0 \n", "99 0.975578 1.644663 1.359320 3.098630 4.976837 0.0 \n", "\n", "[100 rows x 12 columns]" ], "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
mass_ratiochirp_massluminosity_distancetheta_jnphasea_1a_2tilt_1tilt_2phi_12phi_jlgeocent_time
00.21818773.8450501000.01.2552041.9663620.1979800.2401561.9726061.3762282.1864464.7527770.0
10.38117387.7047621000.02.0336283.8888620.4604400.6922401.7542360.6610150.7909425.0666530.0
20.51040693.4793071000.01.8599083.4698980.0235330.2968182.5525770.3599222.1387553.4891430.0
30.67830592.1450381000.00.7587132.8413770.1720210.9346130.3596602.1570473.5998410.8600010.0
40.62448933.5405451000.01.5828521.5775900.4132800.9649301.9292342.0841731.5439955.2984890.0
.......................................
950.54012987.4515461000.02.6964065.2703800.2016670.1876350.4473841.9445570.0524460.9527400.0
960.80345766.0134541000.00.3796650.1753400.4373410.7300751.4750042.7520465.5959772.0475290.0
970.86145475.9085341000.01.8058711.3342420.5051400.5668190.9653260.1941960.8071472.3572370.0
980.38081845.7024561000.01.6846843.8206720.0920190.2287971.4788591.8492815.8607940.5628620.0
990.94114369.1698881000.02.0451440.2091350.9252240.9755781.6446631.3593203.0986304.9768370.0
\n", "

100 rows × 12 columns

\n", "
" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 4 }, { "cell_type": "code", "id": "07e97ba2-8f27-4b6b-9bc1-eb74980ae4be", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:15.685047Z", "start_time": "2025-03-06T12:26:15.682234Z" } }, "source": [ "polarizations" ], "outputs": [ { "data": { "text/plain": [ "{'h_plus': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " ...,\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]]),\n", " 'h_cross': array([[0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " ...,\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j],\n", " [0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]])}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 5 }, { "cell_type": "markdown", "id": "47598d1a-719c-4d3c-aa80-288c80cdf7d2", "metadata": {}, "source": [ "We can then put these in a `WaveformDataset`," ] }, { "cell_type": "code", "id": "dc63f907-67cf-4cad-907e-e9cf8ad4ba27", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:15.745182Z", "start_time": "2025-03-06T12:26:15.743586Z" } }, "source": [ "from dingo.gw.dataset import WaveformDataset\n", "\n", "dataset_dict = {'parameters': parameters, 'polarizations':polarizations}\n", "wfd = WaveformDataset(dictionary=dataset_dict)" ], "outputs": [], "execution_count": 6 }, { "cell_type": "markdown", "id": "4410acfe-43a5-4312-8c73-2a7c0f2ab936", "metadata": {}, "source": [ "Samples can then be easily indexed," ] }, { "cell_type": "code", "id": "5d7e6e47-473d-4691-b718-64e08b15985f", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:15.784797Z", "start_time": "2025-03-06T12:26:15.781570Z" } }, "source": [ "wfd[0]" ], "outputs": [ { "data": { "text/plain": [ "{'parameters': {'mass_ratio': 0.21818708420007127,\n", " 'chirp_mass': 73.84505046384619,\n", " 'luminosity_distance': 1000.0,\n", " 'theta_jn': 1.2552044083558263,\n", " 'phase': 1.9663616795623784,\n", " 'a_1': 0.19797999253228177,\n", " 'a_2': 0.24015632412352614,\n", " 'tilt_1': 1.9726056028558197,\n", " 'tilt_2': 1.3762284791622097,\n", " 'phi_12': 2.186446046131814,\n", " 'phi_jl': 4.752777219226601,\n", " 'geocent_time': 0.0},\n", " 'waveform': {'h_plus': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j]),\n", " 'h_cross': array([0.+0.j, 0.+0.j, 0.+0.j, ..., 0.+0.j, 0.+0.j, 0.+0.j])}}" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "execution_count": 7 }, { "cell_type": "markdown", "id": "1bc66ea3-86d6-4e26-83cc-b3fadcbe1166", "metadata": {}, "source": [ "```{note}\n", "The sample is represented as a *nested dictionary*. This is a standard format for Dingo.\n", "```" ] }, { "attachments": {}, "cell_type": "markdown", "id": "11d4167a-e6a1-44c0-abd0-9184c56d40f5", "metadata": {}, "source": [ "## Automated dataset construction\n", "\n", "The simple dataset constructed above is useful for illustrative purposes, but it lacks the several important features:\n", "* Waveforms are not compressed. A dataset with many samples would therefore take up enormous storage space.\n", "* Not reproducible. The dataset contains no metadata describing its construction (e.g., waveform approximant, domain, prior, ...).\n", "\n", "The `generate_dataset` function automates all of these advanced features:\n", "```{eval-rst}\n", ".. autofunction:: dingo.gw.dataset.generate_dataset.generate_dataset\n", "```\n", "This function is in turn wrapped by the command-line functions `dingo_generate_dataset` and `dingo_generate_dataset_dag`. These take a `.yaml` file with the same contents as the settings dictionary.\n", "\n", "#### Configuration\n", "\n", "A typical settings dictionary / `.yaml` config file takes the following form, described in detail below:\n", "\n", "```yaml\n", "domain:\n", " type: FrequencyDomain\n", " f_min: 20.0\n", " f_max: 1024.0\n", " delta_f: 0.125\n", "\n", "waveform_generator:\n", " approximant: IMRPhenomXPHM\n", " f_ref: 20.0\n", " # f_start: 15.0 # Optional setting useful for EOB waveforms. Overrides f_min when generating waveforms.\n", " # new_interface: true # Optional setting for employing new waveform interface. This is needed for SEOBNRv5 approximants, and optional for standard LAL approximants.\n", " spin_conversion_phase: 0.0\n", "\n", "# Dataset only samples over intrinsic parameters. Extrinsic parameters are chosen at train time.\n", "intrinsic_prior:\n", " mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_1')\n", " mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_2')\n", " chirp_mass: bilby.gw.prior.UniformInComponentsChirpMass(minimum=25.0, maximum=100.0, name='chirp_mass')\n", " mass_ratio: bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0, name='mass_ratio')\n", " phase: default\n", " a_1: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_1')\n", " a_2: bilby.core.prior.Uniform(minimum=0.0, maximum=0.99, name='a_2')\n", " tilt_1: default\n", " tilt_2: default\n", " phi_12: default\n", " phi_jl: default\n", " theta_jn: default\n", " # Reference values for fixed (extrinsic) parameters. These are needed to generate a waveform.\n", " luminosity_distance: 100.0 # Mpc\n", " geocent_time: 0.0 # s\n", "\n", "# Dataset size\n", "num_samples: 5000000\n", "\n", "# Save a compressed representation of the dataset\n", "compression:\n", " svd:\n", " # Truncate the SVD basis at this size. No truncation if zero.\n", " size: 200\n", " num_training_samples: 50000\n", " num_validation_samples: 10000\n", " whitening: aLIGO_ZERO_DET_high_P_asd.txt\n", "```\n", "\n", "domain\n", ": Specifies the data domain. Currenly only `FrequencyDomain` is implemented.\n", "\n", "waveform_generator\n", ": Choose the approximant and reference frequency. For EOB models that require time integration, it is usually necessary to specify a lower starting frequency. In this case, `f_ref` is ignored.\n", "\n", " spin_conversion_phase (optional)\n", " : Value for `phiRef` when converting PE spins to Cartesian spins via `bilby_to_lalsimulation_spins`. When set to `None` (default), this uses the `phase`\n", " parameter. When set to 0.0, `phase` only refers to the azimuthal observation angle, allowing for it to be treated as an extrinsic parameter.\n", " ```{important}\n", " It is necessary to set this to 0.0 if planning to train a `phase`-marginalized network, and then reconstruct the `phase` synthetically.\n", " ```\n", "\n", "intrinsic_prior\n", ": Specify the prior over intrinsic parameters. Intrinsic parameters here refer to those parameters that are needed to generate waveform polarizations. Extrinsic parameters here refer to those parameters that can be sampled and applied rapidly during training. As shown in the example, it is also possible to specify `default` priors, which is convenient for certain parameters. These are listed in `dingo.gw.prior.default_intrinsic_dict`.\n", "\n", " Intrinsic parameters obviously include masses and spins, but also inclination, reference phase, luminosity distance, and time of coalescense at geocenter. Although inclination and phase are often considered extrinsic parameters, they are needed to generate waveform polarizations and cannot be easily transformed.\n", "\n", " Luminosity distance and time of coalescense are considered as *both* intrinsic and extrinsic. Indeed they are needed to generate polarizations, but they can also be easily transformed during training to augment the dataset. We therefore fix them to fiducial values for generating polarizations.\n", " \n", "num_samples\n", ": The number of samples to include in the dataset. For a production model, we typically use $5 \\times 10^6$ samples.\n", "\n", "compression (optional)\n", ": How to compress the dataset.\n", "\n", " svd (optional)\n", " : Construct an SVD basis based on a specified number of additional samples. Save the main dataset in terms of its SVD basis coefficients. The number of elements in the basis is specified by the `size` setting. The performance of the basis is also evaluated in terms of the mismatch against a number of validation samples. All of the validation information, as well as the basis itself, is saved along with the waveform dataset.\n", " \n", " whitening (optional)\n", " : Whether to save whitened waveforms, and in particular, whether to construct the basis based on whitened waveforms. The basis will be more efficient if whitening is used to adapt it to the detector noise characteristics. To use whitening, simply specify the desired ASD do use, from the Bilby [list of ASDs](https://git.ligo.org/lscsoft/bilby/-/tree/master/bilby/gw/detector/noise_curves). Note that the whitening is used only for the internal storage of the dataset. When accessing samples from the dataset, they will be unwhitened.\n", " \n", " Dataset compression is implemented internally by setting the `WaveformGenerator.transform` operator, so that elements are compressed immediately after generation (avoiding the need to store many uncompressed waveforms in memory). Likewise, decompression is implemented by setting the `WaveformDataset.decompression_transform` operator to apply the inverse transformation. This will act on samples to decompress them when accessed through `WaveformDataset.__getitem__()`.\n", " \n", "```{important}\n", "The automated dataset constructors store the configuration settings in `WaveformDataset.settings`. This is so that the settings can be accessed by more downstream tasks, and for reference.\n", "```" ] }, { "cell_type": "markdown", "id": "6870040b-b04f-4f60-b9a5-39b988076b1b", "metadata": {}, "source": [ "### Command-line interface\n", "\n", "In most cases the command-line interface will be used to generate a dataset. Given a settings file, one can call\n", "```bash\n", "dingo_generate_dataset --settings_file settings.yaml\n", " --num_processes N\n", " --out_file waveform_dataset.hdf5\n", "```\n", "This will generate a dataset following the configuration in `settings.yaml` and save it as `waveform_dataset.hdf5`, using `N` processes.\n", "\n", "To inspect the dataset (or any other Dingo-generated file) use\n", "```bash\n", "dingo_ls waveform_dataset.hdf5\n", "```\n", "This will print the configuration settings, as well as a summary of the SVD compression performance (if available).\n", "\n", "For larger datasets, or those based on slower waveform models, Dingo includes a script that builds a condor DAG, `dingo_generate_dataset_dag`. This splits the generation of waveforms across several nodes, and then reconstitutes the final dataset." ] }, { "cell_type": "code", "id": "cad20834-f89b-43a7-856f-34c8b5c47a8a", "metadata": { "ExecuteTime": { "end_time": "2025-03-06T12:26:15.822924Z", "start_time": "2025-03-06T12:26:15.821486Z" } }, "source": [], "outputs": [], "execution_count": null } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 5 }