# Toy Example

The goal of the following tutorial is to take a user from start to finish analyzing GW150914 using dingo.

```{caution}
This is only a toy example, useful for testing on a local machine. It
is NOT meant to be used for production gravitational wave analyses.
```

There are 4 main steps:

1. Generate the waveform dataset
2. Generate the ASD dataset
3. Train the network
4. Do inference

In this tutorial, as well as in the [npe model](example_npe_model) and [gnpe model](example_gnpe_model) tutorials, the following file structure will
be employed:

```
toy_npe_model/

    #  config files
    waveform_dataset_settings.yaml
    asd_dataset_settings.yaml
    train_settings.yaml
    GW150914.ini

    training_data/
        waveform_dataset.hdf5
        asd_dataset/ # Contains the asd_dataset.hdf5 and also temp files for asd generation

    training/
        model_050.pt
        model_stage_0.pt
        model_latest.pt
        history.txt
        #  etc...

    outdir_GW150914/
        #  dingo_pipe output
```

The only files that need to be edited are the config files in the top-level directory. The next
few sections explain these config files. To download sample config files, please visit
https://github.com/dingo-gw/dingo/tree/main/examples. In this tutorial the `toy_npe_model` folder will be used.
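
Before continuing, it can help to confirm that the four config files from the layout above are in place. The following is a minimal sketch using only the Python standard library; the helper name `missing_config_files` is hypothetical and not part of dingo.

```python
from pathlib import Path

# File names taken from the directory layout shown above.
CONFIG_FILES = [
    "waveform_dataset_settings.yaml",
    "asd_dataset_settings.yaml",
    "train_settings.yaml",
    "GW150914.ini",
]


def missing_config_files(root):
    """Return the config files that are absent from `root` (hypothetical helper)."""
    root = Path(root)
    return [name for name in CONFIG_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_config_files("toy_npe_model")
    if missing:
        print("Missing config files:", ", ".join(missing))
    else:
        print("All config files found.")
```

An empty list means all four files were found in the given directory.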


Step 1 Generating a waveform dataset
------------------------------------

After downloading the files for the tutorial, first run

```
cd toy_npe_model/
mkdir training_data
mkdir training
```

to set up the file structure. Then run

```
dingo_generate_dataset --settings waveform_dataset_settings.yaml --out_file training_data/waveform_dataset.hdf5
```

which will create a
{py:class}`dingo.gw.dataset.waveform_dataset.WaveformDataset`
object and store it at the location provided with `--out_file`. For convenience,
here is the waveform dataset settings file:

```yaml
domain:
  type: UniformFrequencyDomain
  f_min: 20.0
  f_max: 1024.0
  delta_f: 0.25  # Expressions like 1.0/8.0 would require eval and are not supported

waveform_generator:
  approximant: IMRPhenomD
  f_ref: 20.0
  # f_start: 15.0  # Optional setting useful for EOB waveforms. Overrides f_min when generating waveforms.

# Dataset only samples over intrinsic parameters. Extrinsic parameters are chosen at train time.
intrinsic_prior:
  mass_1: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_1')
  mass_2: bilby.core.prior.Constraint(minimum=10.0, maximum=80.0, name='mass_2')
  chirp_mass: bilby.gw.prior.UniformInComponentsChirpMass(minimum=15.0, maximum=100.0, name='chirp_mass')
  mass_ratio: bilby.gw.prior.UniformInComponentsMassRatio(minimum=0.125, maximum=1.0, name='mass_ratio')
  phase: default
  chi_1: bilby.gw.prior.AlignedSpin(name='chi_1', a_prior=Uniform(minimum=0, maximum=0.9))
  chi_2: bilby.gw.prior.AlignedSpin(name='chi_2', a_prior=Uniform(minimum=0, maximum=0.9))
  theta_jn: default
  # Reference values for fixed (extrinsic) parameters. These are needed to generate a waveform.
  luminosity_distance: 100.0  # Mpc
  geocent_time: 0.0  # s

# Dataset size
num_samples: 10000

compression: None
```

The file `waveform_dataset_settings.yaml` contains four
sections: `domain`, `waveform_generator`, `intrinsic_prior`, and `compression`. The
domain section defines the settings for storing the waveform. Note the `type`
attribute; this does not refer to the native domain of the waveform model, but
rather to the internal {py:class}`dingo.gw.domains.Domain` class. This allows the use
of time-domain waveform models, which are transformed into the Fourier domain before
being passed to the network. Currently, only
the {py:class}`dingo.gw.domains.FrequencyDomain` class is supported for training the
network. It is sometimes advisable to generate waveforms with a higher `f_max` and then
truncate them at a lower `f_max` for training due to issues with generating short waveforms
for some of the waveform models implemented in LALSuite's LALSimulation package
(https://lscsoft.docs.ligo.org/lalsuite/lalsimulation/).
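
To make the `domain` settings more concrete, the sketch below works out what the values in the settings file imply for a uniform frequency grid. It is plain arithmetic, not a call into dingo. The assumption that the full grid starts at 0 Hz is ours and should be checked against the domain class.

```python
# Values copied from the `domain` section of waveform_dataset_settings.yaml.
f_min = 20.0    # Hz
f_max = 1024.0  # Hz
delta_f = 0.25  # Hz

# A uniform frequency grid with spacing delta_f corresponds to a time
# segment of duration 1 / delta_f.
duration = 1.0 / delta_f  # seconds

# Grid points from 0 Hz up to and including f_max.
# Assumption: the grid starts at 0 Hz.
num_bins_total = int(round(f_max / delta_f)) + 1

# Grid points lying in the band [f_min, f_max].
num_bins_in_band = int(round((f_max - f_min) / delta_f)) + 1

print(f"segment duration: {duration} s")
print(f"grid points in [0, f_max]: {num_bins_total}")
print(f"grid points in [f_min, f_max]: {num_bins_in_band}")
```

The 4 s duration implied by `delta_f: 0.25` is the same as the segment length `T: 4` used for the ASD dataset in step 2.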


The `waveform_generator` section specifies the `approximant` attribute.
At present any waveform model (or `approximant`) that is callable through LALSimulation's
`SimInspiralFD` API can be used to generate waveforms for dingo via the
{py:class}`dingo.gw.waveform_generator.waveform_generator.WaveformGenerator` class (see
[generating_waveforms](generating_waveforms.md)).

The `intrinsic_prior` section is based on Bilby's prior module.
Default values can be found in `dingo.gw.prior`.
Two priors to note are the `chirp_mass` and `mass_ratio`, whose minimum values are set
to 15.0 and 0.125, respectively. Extending these priors towards lower chirp masses
or more extreme mass-ratios may lead to poor performance of the embedding network and normalizing 
flow during training and would require changes to the network setup.
Note that the `luminosity_distance` and `geocent_time` are defined as constants
to generate the waveform at a fixed reference point.
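
The interplay between the sampled `chirp_mass` and `mass_ratio` priors and the `mass_1` and `mass_2` constraints can be illustrated with the standard conversion between these quantities. This is a sketch under the assumption that the constraints reject samples whose derived component masses fall outside the stated range. The function names are hypothetical and are not Bilby or dingo APIs.

```python
def component_masses(chirp_mass, mass_ratio):
    """Convert (chirp_mass, mass_ratio = m2 / m1 <= 1) to (mass_1, mass_2)."""
    mass_1 = chirp_mass * (1.0 + mass_ratio) ** 0.2 / mass_ratio ** 0.6
    mass_2 = mass_ratio * mass_1
    return mass_1, mass_2


def chirp_mass_from_components(mass_1, mass_2):
    """Chirp mass: (m1 * m2)^(3/5) / (m1 + m2)^(1/5)."""
    return (mass_1 * mass_2) ** 0.6 / (mass_1 + mass_2) ** 0.2


def satisfies_constraints(chirp_mass, mass_ratio, minimum=10.0, maximum=80.0):
    """Check both derived component masses against the constraint range."""
    masses = component_masses(chirp_mass, mass_ratio)
    return all(minimum <= m <= maximum for m in masses)


# Corner of the prior box: lowest chirp mass with the most extreme mass ratio.
m1, m2 = component_masses(15.0, 0.125)
print(f"mass_1 = {m1:.2f}, mass_2 = {m2:.2f}")
print("allowed:", satisfies_constraints(15.0, 0.125))

# An equal-mass system well inside the prior.
print("allowed:", satisfies_constraints(30.0, 1.0))
```

Under this assumption, the corners of the `chirp_mass` and `mass_ratio` box are cut away by the component-mass constraints. The first example yields a secondary mass below 10 solar masses and would therefore be excluded.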

The compression section can be set to None for testing purposes. For a practical
example of how it is used, see the next tutorial.

Step 2 Generating the Amplitude Spectral Density (ASD) dataset
--------------------------------------------------------------

To generate an ASD dataset run

```
dingo_generate_asd_dataset --settings_file asd_dataset_settings.yaml --data_dir training_data/asd_dataset
```

This command will generate an {py:class}`dingo.gw.noise.asd_dataset.ASDDataset` object in the form of an .hdf5 file, which will be used later for training. A folder is specified here, rather than a file as in the waveform dataset example, because some temporary data is downloaded to create Welch estimates of the ASD. This data can be removed later, but it is sometimes useful for understanding how the ASDs were estimated. For convenience, here is a copy of the `asd_dataset_settings.yaml` file.

```yaml
dataset_settings:
  f_s: 4096
  time_psd: 1024
  T: 4
  window:
    roll_off: 0.4
    type: tukey
  time_gap: 0  # time skipped between two consecutive PSD estimates; if set < 0, the time segments overlap
  num_psds_max: 1  # if set > 0, only a subset of all available PSDs will be used
  detectors:
    - H1
    - L1
  observing_run: O1

The `asd_dataset_settings.yaml` file includes several attributes. `f_s` is the sampling frequency in Hz, `time_psd` is the length of time used for an ASD estimate, and `T` is the duration of each ASD segment. Thus, the value of `time_psd`/`T` gives the number of segments analyzed to estimate one ASD. To avoid spectral leakage, a window is applied to each segment. We use the standard window used in LVK analyses, a Tukey window with a roll off of $\alpha=0.4$. The next attribute, `num_psds_max=1`, defines the number of ASDs stored in the ASD dataset. For now, we will use only one. See the next [tutorial](example_npe_model.md) for a more advanced setup.
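
The relationships between these settings can be checked with a few lines of arithmetic. The sketch below only uses the values from the settings file and does not call dingo.

```python
# Values copied from asd_dataset_settings.yaml.
f_s = 4096       # sampling frequency in Hz
time_psd = 1024  # length of data used for one ASD estimate, in seconds
T = 4            # duration of each segment, in seconds

num_segments = time_psd // T   # segments contributing to one ASD estimate: 256
samples_per_segment = f_s * T  # time samples in each segment: 16384
delta_f = 1.0 / T              # frequency resolution of the estimate: 0.25 Hz
nyquist = f_s / 2              # highest resolvable frequency: 2048.0 Hz

print(f"segments per ASD estimate: {num_segments}")
print(f"samples per segment: {samples_per_segment}")
print(f"frequency resolution: {delta_f} Hz")
print(f"Nyquist frequency: {nyquist} Hz")
```

The resulting frequency resolution of 0.25 Hz equals the `delta_f` chosen for the waveform dataset in step 1, and the Nyquist frequency lies above the `f_max` of 1024 Hz used there.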

Step 3 Training the network
---------------------------

Before training, make sure the dataset paths in `train_settings.yaml` point to the files generated in steps 1 and 2. Then execute:

```
dingo_train --settings_file train_settings.yaml --train_dir training
```

While this file contains numerous settings that are discussed in [training](training.md), we will cover the most significant ones here. Here is the file again:


```yaml
data:
  waveform_dataset_path: training_data/waveform_dataset.hdf5  # Contains intrinsic waveforms
  train_fraction: 0.95
  detectors:
    - H1
    - L1
  extrinsic_prior:  # Sampled at train time
    dec: default
    ra: default
    geocent_time: bilby.core.prior.Uniform(minimum=-0.10, maximum=0.10, name='geocent_time')
    psi: default
    luminosity_distance: bilby.core.prior.Uniform(minimum=100.0, maximum=1000.0, name='luminosity_distance')
  ref_time: 1126259462.391
  inference_parameters: 
  - chirp_mass
  - mass_ratio
  - chi_1
  - chi_2
  - theta_jn
  - dec
  - ra
  - geocent_time
  - luminosity_distance
  - psi
  - phase

# Model architecture
model:
  posterior_model_type: normalizing_flow
  # kwargs for neural spline flow
  posterior_kwargs:
    num_flow_steps: 5
    base_transform_kwargs:
      hidden_dim: 64 
      num_transform_blocks: 5
      activation: elu
      dropout_probability: 0.0
      batch_norm: True
      num_bins: 8
      base_transform_type: rq-coupling
  # kwargs for embedding net
  embedding_kwargs:
    output_dim: 128
    hidden_dims: [1024, 512, 256, 128]
    activation: elu
    dropout: 0.0
    batch_norm: True
    svd:
      num_training_samples: 1000
      num_validation_samples: 100
      size: 50

# The first (and only) stage of training.
training:
  stage_0:
    epochs: 20
    asd_dataset_path: training_data/asd_dataset/asds_O1.hdf5  # this should just contain a single fiducial ASD per detector for pretraining
    freeze_rb_layer: True
    optimizer:
      type: adam
      lr: 0.0001
    scheduler:
      type: cosine
      T_max: 20
    batch_size: 64

# Local settings for training that have no impact on the final trained network.
local:
  device: cpu  # Change this to 'cuda' for training on a GPU.
  num_workers: 6  # num_workers >0 does not work on Mac, see https://stackoverflow.com/questions/64772335/pytorch-w-parallelnative-cpp206
  runtime_limits:
    max_time_per_run: 36000
    max_epochs_per_run: 30
  checkpoint_epochs: 15
  leave_waveforms_on_disk: True
```

For training, an `extrinsic_prior` is specified. At train time, extrinsic parameters are sampled from it and used to project the waveforms generated in step 1 onto the detector network. This is considerably cheaper than generating waveforms sampled from the full intrinsic plus extrinsic prior in step 1.

Another crucial setting is `inference_parameters`. By default, all the parameters described in `dingo.gw.prior` are inferred. If a parameter needs to be marginalized over, this parameter can be omitted from `inference_parameters`.
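
As an illustration of omitting a parameter, the sketch below builds an `inference_parameters` list with selected entries removed and prints it in the YAML list form used in the settings file. The helper `inference_parameters` is hypothetical, and `phase` is chosen only as an example.

```python
# The parameters listed under `inference_parameters` in train_settings.yaml.
ALL_PARAMETERS = [
    "chirp_mass", "mass_ratio", "chi_1", "chi_2", "theta_jn",
    "dec", "ra", "geocent_time", "luminosity_distance", "psi", "phase",
]


def inference_parameters(marginalize=()):
    """Return the parameter list with the marginalized ones left out."""
    unknown = set(marginalize) - set(ALL_PARAMETERS)
    if unknown:
        raise ValueError(f"Unknown parameters: {sorted(unknown)}")
    return [p for p in ALL_PARAMETERS if p not in marginalize]


# Example: leave out `phase` so that the network does not infer it.
params = inference_parameters(marginalize=["phase"])
print("inference_parameters:")
for name in params:
    print(f"- {name}")
```

The printed list can be pasted under `data` in `train_settings.yaml` in place of the full list.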

Essential settings for the model architecture (the neural spline flow and the embedding network) are as follows: `posterior_kwargs.num_flow_steps` sets the number of flow transforms from the base distribution to the final distribution, while `embedding_kwargs.hidden_dims` defines the sizes of the hidden layers of the embedding network, which compresses the data to its most important features. Finally, `embedding_kwargs.svd` describes the settings of the SVD used as a pre-processing step before passing data vectors to the embedding network. For a production network, these values should be much higher than those used in this tutorial.

Next, we turn to the training section. Here we only employ a single stage of training with settings provided under the `stage_0` attribute. This stage uses the training dataset generated in step 1 for 20 epochs. We also specify the `asd_dataset_path` here, which was created in step 2.
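
To get a feel for what the `stage_0` settings imply, the sketch below computes the size of the training split and a cosine learning-rate schedule from the values in the settings files. It assumes that `type: cosine` corresponds to the usual cosine annealing form with a minimum learning rate of zero, as in PyTorch's `CosineAnnealingLR`. This is an assumption about dingo's internals and not something stated in this tutorial.

```python
import math

# Values copied from waveform_dataset_settings.yaml and train_settings.yaml.
num_samples = 10000
train_fraction = 0.95
batch_size = 64
epochs = 20
lr = 0.0001
T_max = 20

# Size of the training split and the number of full batches per epoch.
num_train = round(train_fraction * num_samples)
full_batches, remainder = divmod(num_train, batch_size)
print(f"training samples: {num_train}")
print(f"full batches per epoch: {full_batches} (plus {remainder} leftover samples)")


def cosine_lr(epoch, base_lr=lr, t_max=T_max, eta_min=0.0):
    """Cosine annealing in closed form (assumed meaning of `type: cosine`)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


for epoch in (0, 5, 10, 15, 20):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch):.2e}")
```

With `T_max` equal to the number of epochs, the learning rate under this assumption decays from its initial value to zero over the single training stage.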

Finally, the local settings section specifies technical details of the training setup. It contains information about, e.g., parallelization during training and the device used. An important setting here is `num_workers`, which determines how many PyTorch dataloader processes are spawned during training. If training is too slow, a potential cause is a lack of workers to load data into the network. This can be identified if the dataloader times in the `dingo_train` output exceed 100 ms. The solution is generally to increase the number of workers.
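
One way to pick a starting value for `num_workers` is a simple heuristic based on the number of available CPU cores. The sketch below is not dingo's own rule. The helper `suggest_num_workers` is hypothetical, and it returns 0 on macOS only because the comment in the settings file notes that `num_workers > 0` does not work on Mac.

```python
import os
import platform


def suggest_num_workers(reserve=1, system=None, cpu_count=None):
    """Heuristic starting value for `num_workers` (hypothetical helper).

    Keeps `reserve` cores free for the main training process and returns 0
    on macOS, following the comment in train_settings.yaml.
    """
    if system is None:
        system = platform.system()
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    if system == "Darwin":
        return 0
    return max(cpu_count - reserve, 0)


print("suggested num_workers:", suggest_num_workers())
```

The suggested value is only a starting point. The dataloader times reported by `dingo_train` remain the better guide for tuning.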

Step 4 Doing Inference
----------------------

The final step is to do inference, for example on GW150914. To do this we will use
[dingo_pipe](dingo_pipe.md). For a local run execute:

```
dingo_pipe GW150914.ini
```

This calls `dingo_pipe` on an INI file that specifies the event to run on:

```ini
################################################################################
##  Job submission arguments
################################################################################

local = True
accounting = dingo
request-cpus-importance-sampling = 2

################################################################################
##  Sampler arguments
################################################################################

model = training/model_latest.pt
device = 'cpu'
num-samples = 5000
batch-size = 5000
recover-log-prob = false
importance-sample = false

################################################################################
## Data generation arguments
################################################################################

trigger-time = GW150914
label = GW150914
outdir = outdir_GW150914
channel-dict = {H1:GWOSC, L1:GWOSC}
psd-length = 128
# sampling-frequency = 2048.0
# importance-sampling-updates = {'duration': 4.0}

################################################################################
## Plotting arguments
################################################################################

plot-corner = true
plot-weights = true
plot-log-probs = true
```

This will generate files which are described in [dingo_pipe](dingo_pipe.md). To see the results, take a look in `outdir_GW150914`. We set `importance-sample = false` in the INI file, which disables importance sampling for this simple example. Generally one would omit this setting (it defaults to True).
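
For a quick look at the values in such an INI file, for instance to confirm the model path before launching a run, the file can be read with Python's standard `configparser`. This is only an inspection sketch and not how `dingo_pipe` itself parses its input. Because the file has no section headers, a dummy `[settings]` header is prepended. The helper `read_flat_ini` is hypothetical.

```python
import configparser


def read_flat_ini(text):
    """Parse a section-less INI string into a dict (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    # configparser requires a section, so add a dummy one in front.
    parser.read_string("[settings]\n" + text)
    return dict(parser["settings"])


# A short excerpt of GW150914.ini, embedded here to keep the sketch self-contained.
EXAMPLE = """
# Sampler arguments
model = training/model_latest.pt
device = 'cpu'
importance-sample = false

# Data generation arguments
label = GW150914
outdir = outdir_GW150914
"""

settings = read_flat_ini(EXAMPLE)
print("model:", settings["model"])
print("outdir:", settings["outdir"])
print("importance sampling:", settings["importance-sample"])
```

To inspect the real file, pass `open("GW150914.ini").read()` instead of `EXAMPLE`. All values are returned as raw strings, so quotes such as those around `'cpu'` are kept as written.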

We can load and manipulate the data with the following code. For example, here we create a corner plot:

```python
from dingo.gw.result import Result

# Load the samples written by dingo_pipe and draw a corner plot.
result = Result(file_name="outdir_GW150914/result/GW150914_data0_1126259462-4_sampling.hdf5")
result.plot_corner()
```

Notice that the results don't look very promising. This is expected, as the settings used in this
example are not sufficient for the network to converge. Dingo should also automatically generate a corner plot, which will
be saved under `outdir_GW150914`.