Neural network architecture

Dingo employs different network architectures for posterior estimation, e.g. neural posterior estimation (NPE) and flow matching posterior estimation (FMPE); see here for an introduction. A central object is the conditional neural density estimator, a deep neural network trained to represent the Bayesian posterior. This section describes the neural network architecture developed in [3], and subsequently used in [4], [5] and [6].

Neural spline flow with SVD compression

The NPE architecture consists of two components: the embedding network, which compresses the high-dimensional data to a lower-dimensional feature vector, and the conditional normalizing flow, which estimates the Bayesian posterior based on this feature vector. Both components are trained jointly and end-to-end with the objective described here. The network can be built with

from dingo.core.nn.nsf import create_nsf_with_rb_projection_embedding_net

Embedding network

The embedding network compresses the high-dimensional conditioning information (consisting of frequency domain strain and PSD data). The first layer of this network is initialized with an SVD matrix from a reduced basis built with noise-free waveforms. This projection filters out the noise that is orthogonal to the signal manifold, which significantly simplifies the task for the neural network.
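The effect of such a projection can be illustrated with a small NumPy sketch. This is not Dingo's implementation: the toy waveforms are real-valued vectors in a 3-dimensional subspace of a 64-dimensional space (actual strain data is complex and far larger), and all names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "signal manifold": noise-free waveforms that all lie in a 3-dimensional
# subspace of a 64-dimensional data space (stand-ins for real waveforms).
n_bins, n_modes, n_waveforms = 64, 3, 500
modes = rng.standard_normal((n_modes, n_bins))
waveforms = rng.standard_normal((n_waveforms, n_modes)) @ modes

# Reduced basis from the SVD of the noise-free training waveforms.
_, singular_values, Vh = np.linalg.svd(waveforms, full_matrices=False)
basis = Vh[:n_modes]  # rows are orthonormal basis vectors, shape (3, 64)


def project(data):
    """Compress data to its coefficients in the reduced basis."""
    return data @ basis.T


# A noisy observation: one waveform plus white noise in all 64 directions.
signal = waveforms[0]
noise = rng.standard_normal(n_bins)
coefficients = project(signal + noise)  # shape (3,)
reconstruction = coefficients @ basis   # mapped back to data space

# Noise-free waveforms are reproduced by the basis up to rounding ...
signal_error = np.linalg.norm(project(signal) @ basis - signal)
# ... while only the part of the noise inside the subspace survives.
noise_kept = np.linalg.norm(reconstruction - signal)
noise_total = np.linalg.norm(noise)
```

In this toy setting, `noise_kept` is much smaller than `noise_total` because the noise components in the 61 directions orthogonal to the basis are discarded by the projection.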

The initial compression layer is followed by a sequence of residual blocks consisting of dense layers for further compression. Example kwargs:

embedding_kwargs = {
    "input_dims": (2, 3, 8033),
    "output_dim": 128,
    "hidden_dims": [
        1024, 1024, 1024, 1024, 1024, 1024,
        512, 512, 512, 512, 512, 512,
        256, 256, 256, 256, 256, 256,
        128, 128, 128, 128, 128, 128
    ],
    "activation": "elu",
    "dropout": 0.0,
    "batch_norm": True,
    "svd": {
        "num_training_samples": 50000,
        "num_validation_samples": 5000,
        "size": 200,
    }
}

Here, input_dims=(2, 3, 8033) refers to the input dimension, for frequency domain data with 8033 frequency bins and 3 channels (real part, imaginary part, ASD) in 2 detectors. The embedding network compresses this to output_dim=128 components. The SVD initialization is controlled with the svd argument, and the residual blocks are specified with hidden_dims.
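As a quick sanity check of these numbers, the following sketch computes the flattened input size and the overall compression factor, and summarizes hidden_dims. It assumes that each entry of hidden_dims gives the width of one residual block; the variable names are only illustrative.

```python
from itertools import groupby
from math import prod

input_dims = (2, 3, 8033)  # (detectors, channels, frequency bins)
output_dim = 128
hidden_dims = [1024] * 6 + [512] * 6 + [256] * 6 + [128] * 6

flattened_size = prod(input_dims)          # values per data sample
compression = flattened_size / output_dim  # overall compression factor

# Assumption: each entry of hidden_dims is the width of one residual block.
blocks_per_width = [(width, len(list(group))) for width, group in groupby(hidden_dims)]

print(flattened_size)         # 48198
print(round(compression, 1))  # 376.5
print(blocks_per_width)       # [(1024, 6), (512, 6), (256, 6), (128, 6)]
```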

Note

Not all of these arguments have to be set in the configuration file when training Dingo. For example, the input_dims argument is automatically filled in based on the specified domain information and number of detectors. Similarly, the context_dim of the flow (see below) is filled in based on the output_dim of the embedding network and the number of GNPE proxies. See the Dingo examples for the corresponding configuration files and training commands.

Discrete Normalizing Flow

We use the neural spline flow as a density estimator. This takes the output of the embedding network as context information and estimates the Bayesian posterior distribution. Example kwargs:

posterior_kwargs = {
    "input_dim": 15,
    "context_dim": 129,
    "num_flow_steps": 30,
    "base_transform_kwargs": {
        "hidden_dim": 512,
        "num_transform_blocks": 5,
        "activation": "elu",
        "dropout_probability": 0.0,
        "batch_norm": True,
        "num_bins": 8,
        "base_transform_type": "rq-coupling",
    },
}

This creates a neural spline flow with input_dim=15 parameters, conditioned on a 129-dimensional context vector, corresponding to the 128-dimensional output of the embedding network and one GNPE proxy variable. The neural spline flow consists of num_flow_steps=30 layers, for which the transformation is specified with base_transform_kwargs. With both sets of kwargs, the density estimator is built as follows:

nde = create_nsf_with_rb_projection_embedding_net(posterior_kwargs, embedding_kwargs)
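When setting these values by hand, it can help to check that the two dictionaries are consistent. The helper below is a hypothetical illustration of the rule stated above (context dimension equals embedding output dimension plus number of GNPE proxies); it is not part of Dingo, which fills in context_dim automatically during training.

```python
def expected_context_dim(embedding_output_dim, num_gnpe_proxies):
    """Context size seen by the flow: embedding features plus GNPE proxies."""
    return embedding_output_dim + num_gnpe_proxies


# Only the entries relevant for the check are reproduced here.
embedding_kwargs = {"output_dim": 128}
posterior_kwargs = {"input_dim": 15, "context_dim": 129}

# One GNPE proxy variable, as in the example above.
context_dim = expected_context_dim(embedding_kwargs["output_dim"], 1)
assert context_dim == posterior_kwargs["context_dim"]  # 128 + 1 = 129
```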

Continuous Normalizing Flow

Flow Matching Posterior Estimation (FMPE) uses continuous normalizing flows to represent posterior distributions. Instead of a sequence of discrete transformations, FMPE models a velocity field that governs a continuous transformation over time, mapping a base distribution (e.g., standard normal) to the posterior.
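To make the role of the velocity field concrete, the sketch below integrates a hand-written one-dimensional velocity field with the Euler method. It is a toy stand-in: in FMPE the velocity is a neural network that also depends on the data, whereas here an analytic field (chosen as an assumption for illustration) carries a standard normal to a normal distribution with mean `mu` and standard deviation `scale`.

```python
def velocity(theta, t, mu=2.0, scale=0.5):
    """Velocity field that carries N(0, 1) at t=0 to N(mu, scale^2) at t=1.

    Trajectories are straight lines:
    theta_t = (1 - t + t * scale) * theta_0 + t * mu.
    """
    # Recover the starting point of the trajectory passing through (theta, t).
    theta_0 = (theta - t * mu) / (1.0 - t + t * scale)
    # The velocity is constant along each straight trajectory.
    return mu + (scale - 1.0) * theta_0


def integrate(theta_0, num_steps=100):
    """Euler integration of d(theta)/dt = velocity(theta, t) from t=0 to t=1."""
    theta, dt = theta_0, 1.0 / num_steps
    for step in range(num_steps):
        theta = theta + dt * velocity(theta, step * dt)
    return theta


# A base sample theta_0 = 0.5 should land at mu + scale * theta_0 = 2.25.
sample = integrate(0.5)
```

Drawing many base samples and integrating each one in this way yields samples from the target distribution, which is how a continuous flow generates samples once the velocity field is known.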

We use a dense residual network to parameterize the velocity field, which takes as input the parameters \( \theta \), time \( t \), and context \( d \). The model is trained by minimizing the flow matching loss, which ensures that the learned velocity field matches the optimal transport-inspired target velocity field.
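A minimal sketch of one term of such a loss is given below, assuming the optimal-transport path commonly used in the flow matching literature, \( \theta_t = (1 - (1 - \sigma_{\min}) t)\,\theta_0 + t\,\theta_1 \), with target velocity \( \theta_1 - (1 - \sigma_{\min})\,\theta_0 \). The function names are hypothetical, and the exact form used in Dingo should be checked against its source code.

```python
import random


def flow_matching_pair(theta_0, theta_1, t, sigma_min=0.001):
    """Point on the interpolation path and the regression target at time t.

    theta_0 is a base sample, theta_1 a posterior (training) sample.
    """
    theta_t = [(1 - (1 - sigma_min) * t) * a + t * b
               for a, b in zip(theta_0, theta_1)]
    # Time derivative of theta_t, which the network is trained to predict.
    target = [b - (1 - sigma_min) * a for a, b in zip(theta_0, theta_1)]
    return theta_t, target


def flow_matching_loss(predicted_velocity, target):
    """Mean squared error between the network output and the target velocity."""
    return sum((p - u) ** 2 for p, u in zip(predicted_velocity, target)) / len(target)


random.seed(0)
theta_1 = [1.5, -0.3]                                # parameters from the training set
theta_0 = [random.gauss(0.0, 1.0) for _ in theta_1]  # sample from the base distribution
t = random.random()                                  # time drawn uniformly from [0, 1)

theta_t, target = flow_matching_pair(theta_0, theta_1, t)
# A trained network would supply the prediction from (theta_t, t, d);
# a zero prediction is used here only as a placeholder.
loss = flow_matching_loss([0.0, 0.0], target)
```

In training, this quantity would be averaged over many draws of \( \theta_1 \), \( \theta_0 \) and \( t \), with the data \( d \) associated with \( \theta_1 \) passed to the network as context.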

Example kwargs for an FMPE model:

posterior_kwargs = {
    "activation": "gelu",
    "batch_norm": True,
    "context_with_glu": False,
    "dropout": 0.0,
    "hidden_dims": [
        1024, 1024, 1024, 1024, 1024, 1024,
        512, 512, 512, 512, 512, 512, 512, 512
    ],
    "sigma_min": 0.001,
    "theta_embedding_kwargs": {
        "embedding_net": {
            "activation": "gelu",
            "hidden_dims": [16, 32, 64, 128, 256],
            "output_dim": 256,
            "type": "DenseResidualNet",
        },
        "encoding": {
            "encode_all": False,
            "frequencies": 0,
        },
    },
    "theta_with_glu": True,
    "time_prior_exponent": 1,
    "type": "DenseResidualNet",
}

Explanation of Key Settings

theta_embedding_kwargs:
- Specifies how the parameter \( \theta \) is embedded for input into the velocity network.
- Contains settings for:
  - embedding_net: Defines the embedding network, such as a DenseResidualNet. This network transforms the input \( \theta \) into a higher-dimensional representation for downstream processing. The hidden_dims control the architecture depth and complexity, while the output_dim specifies the dimensionality of the embedding.
  - encoding: Configures the positional encoding for \( \theta \). The encode_all flag determines whether all components of \( \theta \) or only a subset (e.g., the first component) are encoded. frequencies adjusts the sinusoidal encoding settings.
hidden_dims:
- Defines the layer sizes of the velocity network. The dimensions decrease progressively (in the example, from 1024 to 512) to capture hierarchical features and allow for efficient computation. Larger dimensions can better approximate complex velocity fields but increase computational cost.
sigma_min:
- Controls the minimum noise level in the interpolation path between the base distribution and the target posterior distribution. Values of sigma_min that are too low can lead to sharp trajectories that are harder for the velocity network to model accurately, potentially resulting in unstable training.
theta_with_glu:
- If True, applies Gated Linear Units (GLUs) for processing \( \theta \). GLUs introduce an additional nonlinear gating mechanism, which can improve the network’s ability to model complex relationships between inputs.
context_with_glu:
- Similar to theta_with_glu, but applies GLUs to the context \( d \). Useful for enhancing feature extraction when the context data has intricate dependencies.
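Two of the mechanisms named in these settings, gating and sinusoidal encoding, can be sketched in plain Python. Both functions are hypothetical illustrations of the general techniques, not Dingo's implementation; in particular, the choice of frequencies \( 2^k \pi \) and the reading of "frequencies": 0 as "no encoding" are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def glu(values, gates):
    """Gated linear unit: each value is scaled by a gate in (0, 1)."""
    return [v * sigmoid(g) for v, g in zip(values, gates)]


def sinusoidal_encoding(x, num_frequencies):
    """Append cos/sin features of x at doubling frequencies (assumed form)."""
    features = [x]
    for k in range(num_frequencies):
        omega = (2 ** k) * math.pi
        features += [math.cos(omega * x), math.sin(omega * x)]
    return features


# A strongly negative gate suppresses a feature, a strongly positive one passes it.
gated = glu([1.0, 1.0], [-20.0, 20.0])

# With zero frequencies the input is passed through unchanged (assumed to
# correspond to "frequencies": 0 in the example configuration).
plain = sinusoidal_encoding(0.25, 0)
# With two frequencies: [x, cos(pi x), sin(pi x), cos(2 pi x), sin(2 pi x)].
encoded = sinusoidal_encoding(0.25, 2)
```

In a network, the gate values would be computed by a learned linear layer from \( \theta \) or from the context \( d \), so that one input can modulate how strongly each hidden feature is passed on.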