GNPE

GNPE (Gibbs- or Group-Equivariant Neural Posterior Estimation) is an algorithm that can generate significantly improved results by incorporating known physical symmetries into NPE.[1] The aim is to simplify the data seen by the network by using the symmetries to transform certain parameters to “standardized” values. This simplifies the learning task of the network. At inference time, the standardizing transform is initially unknown, so we use Gibbs sampling to simultaneously learn the transform (along with the rest of the parameters) and apply it to simplify the data.

For gravitational waves, we use GNPE to standardize the times of arrival of the signal in the individual interferometers. (This corresponds to translations of the time of arrival at geocenter, and approximate sky rotations.) In frequency domain, time translations correspond to multiplication of the data by \(e^{-2\pi i f \Delta t}\), and a standard NPE network would have to learn to interpret such transformations consistent with the prior from the data. We found this to be a challenging learning task, which limited inference performance on the other parameters. Instead, GNPE leverages our knowledge of the time translations to build a network that is only required to interpret a much narrower window of arrival times.

We now provide a brief description of the GNPE method. Readers more interested in getting started with GNPE may skip to Usage below.

Description of method

GNPE allows us to incorporate knowledge of joint symmetries of data and parameters. That is, if a parameter (e.g., coalescence time) is transformed by a certain amount (\(\Delta t\)), then there is a corresponding transformation of the data (multiplication by \(e^{-2\pi i f \Delta t}\)) such that the transformed data is equally likely to occur under the transformed parameter,

\[ p(t_c | d) = p(t_c + \Delta t | d\cdot e^{-2\pi i f \Delta t}). \]

It is based on two ideas:

Gibbs + NPE

Gibbs sampling is an algorithm for obtaining samples from a joint distribution \(p(x, y)\) if we are able to sample directly from each of the conditionals, \(p(x|y)\) and \(p(y|x)\). Starting from some point \(y_0\), we construct a Markov chain \(\{(x_i, y_i)\}\) by sampling

\(x_i \sim p(x_i | y_{i-1})\),
\(y_i \sim p(y_i | x_i)\),

and repeating until the chain is converged. The stationary distribution of the Markov chain is then \(p(x, y)\).

_images/gibbs_figure.jpg — Illustration of Gibbs sampling for a distribution \(p(x, y)\).

Gibbs sampling can be combined with NPE by first introducing blurred “proxy” versions of a subset of parameters, which we denote \(\hat\theta\) i.e., \(\hat\theta \sim p(\hat\theta | \theta)\) where \(p(\hat\theta | \theta)\) is defined by a blurring kernel. For example, for GWs we take \(\hat t_I = t_I + \epsilon_I\), where \(\epsilon_I \sim \text{Unif}(-1~\text{ms}, 1~\text{ms})\). We then train a network to model the posterior, but now conditioned also on \(\hat \theta\), i.e., \(p(\theta | d, \hat\theta)\). We can then apply Gibbs sampling to obtain samples from the joint distribution \(p(\theta, \hat \theta | d)\), since we are able to sample individually from the conditional distributions:

We can sample from \(p(\hat\theta | \theta)\) since we defined the blurring kernel.
We can sample from \(p(\theta | d, \hat\theta)\) since we are modeling it using NPE.

Finally, we can drop \(\hat \theta\) from the samples to obtain the desired posterior samples.

The trick now is that since \(p(\theta | d, \hat\theta)\) is conditional on \(\hat \theta\), we can apply any \(\hat\theta\)-dependent transformation to \(d\). Returning to the time translations, \(\hat t_I\) is a good approximation to \(t_I\), so we apply the inverse time shift \(d_I \to d_I\cdot e^{2 \pi i f \hat t_I}\), which brings \(d_I\) into a close approximation to having coalescence time \(0\) in each detector. This means that the network never sees any data with merger time further than \(1~\text{ms}\) from \(0\), greatly simplifying the learning task.

In practice, we generate many Monte Carlo chains in parallel—one for each desired sample and with different starting points—and keep only the final sample from each chain—rather than generating one long chain. Each individual chain in this ensemble is unlikely to converge, but if the individual chains are initialized from a distribution sufficiently close to \(p(\hat \theta | d)\) then the collection of final samples from each chain should be a good approximation to samples from \(p(\theta, \hat\theta|d)\).

Group-equivariant NPE

So far we have described how Gibbs sampling together with NPE can simplify data by allowing any \(\hat\theta\)-dependent transformation of \(d\), simplifying the data distribution. If we know the data and parameters to be equivariant under a particular transformation, however, we can go a step further and enforce this exactly. To do so, we simply drop the dependence of the neural density estimator on \(\hat\theta\).

For gravitational waves, the overall time translation symmetry (in each detector) of the time of coalescence at geocenter is an exact symmetry, so we fully enforce this. The sky rotation, however, corresponds to an approximate symmetry: it shifts the time of coalescence in each detector, but a subleading effect is to change angle of incidence on a detector and hence the combination of polarizations observed. For this latter symmetry, we simply do not drop the proxy dependence.

Tip

GNPE is a generic method to incorporate symmetries into NPE:

Any symmetry (exact or approximate) connecting data and parameters
Any architecture, as it just requires (at most) conditioning on the proxy variables

As far as we are aware, GNPE is the only way to incorporate symmetries connecting data and parameters into architectures such as normalizing flows.

Usage

Training

To use GNPE for GW inference one must train two Dingo models:

An initialization network modeling \(p(t_I | d)\). This gives the initial guess of the proxy variables for the staring point of the Gibbs sampler. Since this is only modeling two or three parameters and it does not need to give perfect results, this network can also be much smaller than typical Dingo networks.

For an HL detector network, to infer just the detector coalescence times, set this in the train configuration.
```
data:
  inference_parameters: [H1_time, L1_time]
```
A main “GNPE” network, conditional on the proxy variables, \(p(\theta | d, \hat t_I)\). Implicitly in this expression, the data are transformed by the proxies, and the exact time-translation symmetry is enforced.

To condition this network on the correct proxies, we configure it to use GNPE in the settings file.
```
data:
  gnpe_time_shifts:
    kernel: bilby.core.prior.Uniform(minimum=-0.001, maximum=0.001)
    exact_equiv: True
```
This sets the blurring kernel to be \(\text{Unif}(-1~\text{ms}, 1~\text{ms})\) for all \(\hat t_I\), and it specifies to enforce the overall time of coalescence symmetry exactly. Dingo will determine automatically from the detectors setting which proxy variables to condition on.

Complete example config files for both networks are provided in the /examples folder.

Inference

The inference script must be pointed to both trained networks in order to sample using GNPE.

dingo_analyze_event
  --model model
  --model_init model_init
  --gps_time_event gps_time_event
  --num_samples num_samples
  --num_gnpe_iterations num_gnpe_iterations
  --batch_size batch_size

The number of Gibbs iterations is also specified here (typically 30 is appropriate). This script will save the final samples from each Gibbs chain.

The `GNPESampler` class

The inference script above uses the GWSamplerGNPE class, which is based on GNPESampler,

class dingo.core.samplers.GNPESampler(model: BasePosteriorModel, init_sampler: Sampler, num_iterations: int = 1)

Bases: Sampler

Base class for GNPE sampler. It wraps a PosteriorModel and a standard Sampler for initialization. The former is used to generate initial samples for Gibbs sampling.

A GNPE network is conditioned on additional “proxy” context theta^, i.e.,

p(theta | theta^, d)

The theta^ depend on theta via a fixed kernel p(theta^ | theta). Combining these known distributions, this class uses Gibbs sampling to draw samples from the joint distribution,

p(theta, theta^ | d)

The advantage of this approach is that we are allowed to perform any transformation of d that depends on theta^. In particular, we can use this freedom to simplify the data, e.g., by aligning data to have merger times = 0 in each detector. The merger times are unknown quantities that must be inferred jointly with all other parameters, and GNPE provides a means to do this iteratively. See https://arxiv.org/abs/2111.13139 for additional details.

Gibbs sampling breaks access to the probability density, so this must be recovered through other means. One way is to train an unconditional flow to represent p(theta^ | d) for fixed d based on the samples produced through the GNPE Gibbs sampling. Starting from these, a single Gibbs iteration gives theta from the GNPE network, along with the probability density in the joint space. This is implemented in GNPESampler provided the init_sampler provides proxies directly and num_iterations = 1.

Attributes (beyond those of Sampler)

init_samplerSampler: Used for providing initial samples for Gibbs sampling.
num_iterationsint: Number of Gibbs iterations to perform.

iteration_tracker : IterationTracker not set up remove_init_outliers : float not set up

param model:
type model:: BasePosteriorModel
param init_sampler:: Used for generating initial samples
type init_sampler:: Sampler
param num_iterations:: Number of GNPE iterations to be performed by sampler.
type num_iterations:: int

property context: Data on which to condition the sampler. For injections, there should be a ‘parameters’ key with truth values.

property event_metadata: Metadata for data analyzed. Can in principle influence any post-sampling parameter transformations (e.g., sky position correction), as well as the likelihood detector positions.

log_prob(samples: DataFrame | dict) → ndarray

Calculate the model log probability at specific sample points.

Parameters:: samples (pd.DataFrame | dict) – Sample points at which to calculate the log probability.
Return type:: np.array of log probabilities.

property num_iterations: The number of GNPE iterations to perform when sampling.

run_sampler(num_samples: int, batch_size: int | None = None)

Generates samples and stores them in self.samples. Conditions the model on self.context if appropriate (i.e., if the model is not unconditional).

If possible, it also calculates the log_prob and saves it as a column in self.samples. When using GNPE it is not possible to obtain the log_prob due to the many Gibbs iterations. However, in the case of just one iteration, and when starting from a sampler for the proxy, the GNPESampler does calculate the log_prob.

Allows for batched sampling, e.g., if limited by GPU memory. Actual sampling for each batch is performed by _run_sampler(), which will differ for Sampler and GNPESampler.

Parameters:

num_samples (int) – Number of samples requested.
batch_size (int, optional) – Batch size for sampler.

to_result() → Result

Export samples, metadata, and context information to a Result instance, which can be used for saving or, e.g., importance sampling, training an unconditional flow, etc.

Return type:: Result

In addition to storing a PosteriorModel, a GNPESampler also stores a second Sampler instance, which is based on the initialization network. When run_sampler() is called, it first generates samples from the initialization network, perturbs them with the kernel to obtain proxy samples, and then performs num_iterations Gibbs steps to obtain the final samples.