dingo.core package

Subpackages

Submodules

dingo.core.dataset module

class dingo.core.dataset.DingoDataset(file_name: str | None = None, dictionary: dict | None = None, data_keys: List | None = None, leave_on_disk_keys: list | None = None)

Bases: object

This is a generic dataset class with save / load methods.

A common use case is to inherit from both DingoDataset and torch.utils.data.Dataset, in which case the subclass picks up these I/O methods and DingoDataset acts as a mixin class.

Alternatively, if the torch Dataset is not needed, then DingoDataset can be subclassed directly.

For constructing, provide either file_name, or dictionary containing data and settings entries, or neither.

Parameters:
  • file_name (str) – HDF5 file containing a dataset

  • dictionary (dict) – Contains settings and data entries. The data keys should be the same as save_keys

  • data_keys (list) – Variables that should be saved / loaded. This allows the class to store additional variables beyond those that are saved. Typically, this list would be provided by any subclass.

  • leave_on_disk_keys (Optional[list]) – Keys for which the values are not loaded into RAM when initializing the dataset. This reduces the memory footprint during training. Instead, the values are loaded from the HDF5 file during training.

dataset_type = 'dingo_dataset'
from_dictionary(dictionary: dict)
from_file(file_name: str)
to_dictionary()
to_file(file_name: str, mode: str = 'w')
dingo.core.dataset.recursive_hdf5_load(group, keys: List[str] | None = None, idx: int | List[int] | None = None)

This is a generic helper function to recursively load data from an HDF5 file.

Parameters:
  • group (h5py.Group) – Group from which to recursively load data.

  • keys (list[str] or None) – List of keys to load. If None, load all keys.

  • idx (int or list[int] or None) – If idx is provided, only the datapoints corresponding to the given indices are loaded.

dingo.core.dataset.recursive_hdf5_save(group, d)

dingo.core.likelihood module

class dingo.core.likelihood.Likelihood

Bases: object

log_likelihood(theta)
log_likelihood_multi(theta: DataFrame, num_processes: int = 1) → ndarray

Calculate the log likelihood at multiple points in parameter space. Works with multiprocessing.

This wraps the log_likelihood() method.

Parameters:
  • theta (pd.DataFrame) – Parameter values at which to evaluate the likelihood.

  • num_processes (int) – Number of processes to use.

Return type:

np.array of log likelihoods

dingo.core.multiprocessing module

dingo.core.multiprocessing.apply_func_with_multiprocessing(func: callable, theta: DataFrame, num_processes: int = 1) → ndarray

Call func(theta.iloc[idx].to_dict()) with multiprocessing.

Parameters:
  • func (callable)

  • theta (pd.DataFrame) – Parameters with multiple rows, evaluate func for each row.

  • num_processes (int) – Number of parallel processes to use.

Returns:

result – Output array, where result[idx] = func(theta.iloc[idx].to_dict())

Return type:

np.ndarray
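The row-wise evaluation described above can be sketched in plain Python. This is only an illustration, not the library's implementation: the name apply_rows is hypothetical, and a list of dicts stands in for the rows of the pandas DataFrame (each dict playing the role of theta.iloc[idx].to_dict()).

```python
from multiprocessing import Pool


def apply_rows(func, rows, num_processes=1):
    """Evaluate func(row) for every row and return the results in row order.

    `rows` is a list of dicts standing in for theta.iloc[idx].to_dict().
    """
    if num_processes > 1:
        # func must be picklable (for example, defined at module top level).
        with Pool(processes=num_processes) as pool:
            return pool.map(func, rows)
    # Serial path: no worker processes are started.
    return [func(row) for row in rows]


def toy_log_likelihood(theta):
    # Hypothetical Gaussian log likelihood in a single parameter "x".
    return -0.5 * theta["x"] ** 2


rows = [{"x": 0.0}, {"x": 1.0}, {"x": 2.0}]
result = apply_rows(toy_log_likelihood, rows, num_processes=1)
```

Because Pool.map preserves input order, result[idx] corresponds to rows[idx] regardless of the number of processes.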

dingo.core.result module

class dingo.core.result.Result(file_name=None, dictionary=None)

Bases: DingoDataset

A dataset class to hold a collection of samples, implementing I/O, importance sampling, and unconditional flow training.

Attributes:
samples : pd.DataFrame

Contains parameter samples, as well as (possibly) log_prob, log_likelihood, weights, log_prior, delta_log_prob_target.

domain : Domain

Should be implemented in a subclass.

prior : PriorDict

Should be implemented in a subclass.

likelihood : Likelihood

Should be implemented in a subclass.

context : dict

Context data from which the samples were produced (e.g., strain data, ASDs).

metadata : dict

event_metadata : dict

log_evidence : float

log_evidence_std : float (property)

effective_sample_size, n_eff : float (property)

sample_efficiency : float (property)

For constructing, provide either file_name, or dictionary containing data and settings entries, or neither.

Parameters:
  • file_name (str) – HDF5 file containing a dataset

  • dictionary (dict) – Contains settings and data entries. The data keys should be the same as save_keys

  • data_keys (list) – Variables that should be saved / loaded. This allows the class to store additional variables beyond those that are saved. Typically, this list would be provided by any subclass.

  • leave_on_disk_keys (Optional[list]) – Keys for which the values are not loaded into RAM when initializing the dataset. This reduces the memory footprint during training. Instead, the values are loaded from the HDF5 file during training.

property base_metadata
property constraint_parameter_keys
dataset_type = 'core_result'
property effective_sample_size
property fixed_parameter_keys
get_all_injection_credible_levels(keys: list[str] | None = None, weighted: bool = False)

Get credible levels for all parameters.

Adapted from Bilby.

Parameters:
  • keys (list, optional) – A list of keys for which to return the credible levels. If None, defaults to search_parameter_keys.

  • weighted (bool, optional) – Whether to use sample weights in calculating credible level.

Returns:

credible_levels – The credible levels at which the injected parameters are found.

Return type:

dict

get_injection_credible_level(parameter: str, weighted: bool = False)

Get the credible level of the injected parameter.

Calculated as CDF(injection value).

Adapted from Bilby.

Parameters:
  • parameter (str) – Parameter to get credible level for

  • weighted (bool, optional) – Whether to use sample weights in calculating credible level.

Returns:

credible level

Return type:

float
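As a rough illustration of "CDF(injection value)", the sketch below evaluates the empirical (optionally weighted) CDF of one parameter's samples at the injected value. The function name and the plain-list inputs are assumptions made for this example; the library operates on the samples DataFrame.

```python
def credible_level(samples, injection_value, weights=None):
    """Empirical CDF of `samples` evaluated at `injection_value`."""
    if weights is None:
        # Unweighted case: every sample counts equally.
        weights = [1.0] * len(samples)
    # Total weight of the samples lying below the injected value.
    below = sum(w for s, w in zip(samples, weights) if s < injection_value)
    return below / sum(weights)


samples = [0.1, 0.4, 0.5, 0.9]
unweighted = credible_level(samples, 0.45)
weighted = credible_level(samples, 0.45, weights=[1.0, 1.0, 3.0, 3.0])
```

Here two of the four samples lie below 0.45, so the unweighted level is 0.5, while the weighted level is 2/8 = 0.25 because the samples above the injection carry more weight.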

importance_sample(num_processes: int = 1, **likelihood_kwargs)

Calculate importance weights for samples.

Importance sampling starts with samples that have been generated from a proposal distribution q(theta), in this case a neural network model. Certain networks (i.e., non-GNPE) also provide the log probability of each sample, which is required for importance sampling.

Given the proposal, we re-weight samples according to the (un-normalized) target distribution, which we take to be the likelihood L(theta) times the prior pi(theta). This gives sample weights

w(theta) ~ pi(theta) L(theta) / q(theta),

where the overall normalization does not matter (we normalize the weights to have mean 1). Since q(theta) enters this expression, importance sampling is only possible when we know the log probability of each sample.

As byproducts, this method also estimates the evidence and effective sample size of the importance sampled points.

This method modifies the samples pd.DataFrame in-place, adding new columns for log_likelihood, log_prior, and weights. It also stores the log_evidence as an attribute.

Parameters:
  • num_processes (int) – Number of parallel processes to use when calculating likelihoods. (This is the most expensive task.)

  • likelihood_kwargs (dict) – kwargs that are forwarded to the likelihood constructor. E.g., options for marginalization.
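A minimal sketch of the weight calculation described for this method, assuming the three log-probability columns are available as plain lists. The evidence and effective-sample-size formulas used here are the standard importance sampling estimators; the function name is hypothetical and the library's own implementation may differ in detail.

```python
import math


def importance_weights(log_prior, log_likelihood, log_prob):
    """Return (weights with mean 1, log evidence estimate, effective sample size)."""
    n = len(log_prob)
    # Unnormalized log weights: log pi(theta) + log L(theta) - log q(theta).
    log_w = [lp + ll - lq for lp, ll, lq in zip(log_prior, log_likelihood, log_prob)]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(log_w)
    log_sum = m + math.log(sum(math.exp(x - m) for x in log_w))
    # The evidence estimate is the mean of the unnormalized weights.
    log_evidence = log_sum - math.log(n)
    # Dividing by the evidence estimate gives weights with mean 1.
    weights = [math.exp(x - log_evidence) for x in log_w]
    # Standard (Kish) effective sample size.
    n_eff = sum(weights) ** 2 / sum(w * w for w in weights)
    return weights, log_evidence, n_eff


weights, log_evidence, n_eff = importance_weights(
    log_prior=[0.0, 0.0, 0.0],
    log_likelihood=[-1.0, -1.0, -1.0],
    log_prob=[0.0, 0.0, 0.0],
)
```

In this toy case the proposal matches the target up to a constant, so all weights are equal and the effective sample size equals the number of samples.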

property injection_parameters
property log_bayes_factor
property log_evidence_std
classmethod merge(parts)

Merge several Result instances into one. Check that they are compatible, in the sense of having the same metadata. Finally, calculate a new log evidence for the combined result.

This is useful when recombining separate importance sampling jobs.

Parameters:

parts (list[Result]) – List of sub-Results to be combined.

Return type:

Combined Result.

property metadata
property n_eff
property num_samples
parameter_subset(parameters)

Return a new object of the same type, with only a subset of parameters. Drops all other columns in samples DataFrame as well (e.g., log_prob, weights).

Parameters:

parameters (list) – List of parameters to keep.

Return type:

Result

plot_corner(parameters: list | None = None, filename: str = 'corner.pdf', **kwargs)

Generate a corner plot of the samples.

Parameters:
  • parameters (list[str]) – List of parameters to include. If None, include all parameters. (Default: None)

  • filename (str) – Where to save the plot.

  • legend_font_size (int) – Font size of the legend.

plot_log_probs(filename='log_probs.png')

Make a scatter plot of the target versus proposal log probabilities. For the target, subtract off the log evidence.

plot_weights(filename='weights.png')

Make a scatter plot of sample weights vs log proposal.

print_summary()

Display the number of samples, and (if importance sampling is complete) the log evidence and number of effective samples.

rejection_sample(max_samples_per_draw: int = 1, clip_weights: bool = False, random_state=None)

Generate unweighted posterior samples from weighted ones via rejection sampling.

Each original sample contributes at most max_samples_per_draw copies to the output, so the result avoids the excessive duplication that sampling_importance_resampling() can produce for high-weight samples.

Algorithm (unbiased, maximum efficiency)

The weights are first scaled so that the largest weight equals max_samples_per_draw. Each sample i then contributes

  • floor(w_scaled[i]) copies deterministically (integer part), and

  • one additional copy with probability w_scaled[i] - floor(w_scaled[i]) (fractional part, a single Bernoulli draw).

The expected number of copies of sample i is therefore exactly w_scaled[i], which is proportional to w[i]; this guarantees an unbiased representation of the posterior. Using the integer part deterministically (rather than rounding stochastically) maximises the expected total number of output samples for a given max_samples_per_draw.
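The floor-plus-Bernoulli scheme above can be sketched as follows. This is an illustrative plain-Python version that returns the indices of the retained samples; the function name and the list-based interface are assumptions, not the library's API.

```python
import math
import random


def rejection_sample_indices(weights, max_samples_per_draw=1, seed=None):
    """Return indices of the kept samples (repeated for multiple copies)."""
    rng = random.Random(seed)
    w_max = max(weights)
    indices = []
    for i, w in enumerate(weights):
        # Scale so that the largest weight equals max_samples_per_draw.
        w_scaled = w / w_max * max_samples_per_draw
        copies = math.floor(w_scaled)        # deterministic integer part
        if rng.random() < w_scaled - copies:  # Bernoulli draw on the fractional part
            copies += 1
        indices.extend([i] * copies)
    return indices


weights = [0.5, 1.0, 2.0, 0.0]
kept = rejection_sample_indices(weights, max_samples_per_draw=2, seed=0)
```

With these weights the scaled values are 0.5, 1.0, 2.0 and 0.0, so sample 2 always contributes two copies, sample 1 one copy, sample 3 none, and sample 0 contributes one copy with probability 0.5.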

Optional weight clipping

When clip_weights=True, the ceil(sqrt(N)) largest weights are replaced by their mean and the weights are re-normalized to mean 1 before rejection sampling. This number of clips is the theoretically optimal choice that yields asymptotically unbiased results [1]. Using the mean (rather than the minimum) of the clipped group preserves their total weight, which minimises the bias introduced by clipping. The net effect is reduced weight variance and a larger expected number of output samples.
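The clipping step can be illustrated with a short sketch, assuming the weights are given as a plain list; the function name is hypothetical. It replaces the ceil(sqrt(N)) largest weights by their mean and then rescales all weights to mean 1.

```python
import math


def clip_largest_weights(weights):
    """Replace the ceil(sqrt(N)) largest weights by their mean, then rescale to mean 1."""
    n = len(weights)
    num_clip = math.ceil(math.sqrt(n))
    # Indices of the num_clip largest weights.
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    top = order[:num_clip]
    top_mean = sum(weights[i] for i in top) / num_clip
    clipped = list(weights)
    for i in top:
        clipped[i] = top_mean  # total weight of the clipped group is preserved
    mean = sum(clipped) / n
    return [w / mean for w in clipped]


raw = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 10.0]
clipped = clip_largest_weights(raw)
```

For N = 9 the three largest weights (10, 2 and one of the 1s) are replaced by their mean 13/3, which lowers the maximum weight considerably while leaving the total weight unchanged before renormalization.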

If the samples DataFrame has no weights column the samples are already unweighted and are returned unchanged.

Parameters:
  • max_samples_per_draw (int) – Maximum number of copies any single input sample may contribute to the output. Default is 1 (standard rejection sampling, no duplicates).

  • clip_weights (bool) – Whether to clip the ceil(sqrt(N)) largest weights to their mean before rejection sampling. Default is False.

  • random_state (int or None) – Seed for the random number generator, for reproducibility.

Returns:

Unweighted samples (the weights column is dropped).

Return type:

pd.DataFrame

References

reset_event(event_dataset)

Set the Result context and event_metadata based on an EventDataset.

If these attributes already exist, perform a comparison to check for changes. Update relevant objects appropriately. Note that setting context and event_metadata attributes directly would not perform these additional checks and updates.

Parameters:

event_dataset (EventDataset) – New event to be used for importance sampling.

property sample_efficiency
sampling_importance_resampling(num_samples=None, random_state=None)

Generate unweighted posterior samples from weighted ones. New samples are sampled with probability proportional to the sample weight. Resampling is done with replacement, until the desired number of unweighted samples is obtained.

Parameters:
  • num_samples (int) – Number of samples to resample.

  • random_state (int or None) – Sampling seed.

Returns:

Unweighted samples

Return type:

pd.DataFrame
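Resampling with replacement, with probability proportional to the weight, can be sketched using the standard library. The function name and the index-based return value are assumptions made for this illustration.

```python
import random


def resample_indices(weights, num_samples, seed=None):
    """Draw num_samples indices with replacement, proportional to weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=num_samples)


weights = [0.0, 1.0, 3.0]
drawn = resample_indices(weights, num_samples=1000, seed=1)
```

High-weight samples can appear many times in the output, which is the duplication that rejection_sample() limits through max_samples_per_draw.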

property search_parameter_keys
split(num_parts)

Split the Result into a set of smaller results. The samples are evenly divided among the sub-results. Additional information (metadata, context, etc.) is copied into each.

This is useful for splitting expensive tasks such as importance sampling across multiple jobs.

Parameters:

num_parts (int) – The number of parts to split the Result across.

Return type:

list of sub-Results.

train_unconditional_flow(parameters, nde_settings: dict, train_dir: str | None = None, threshold_std: float | None = inf)

Train an unconditional flow to represent the distribution of self.samples.

Parameters:
  • parameters (list) – List of parameters over which to train the flow. Can be a subset of the existing parameters.

  • nde_settings (dict) – Configuration settings for the neural density estimator.

  • train_dir (Optional[str]) – Where to save the output of network training, e.g., logs, checkpoints. If not provided, a temporary directory is used.

  • threshold_std (Optional[float]) – Drop samples more than threshold_std standard deviations away from the mean (in any parameter) before training the flow. This is meant to remove outlier samples.

Return type:

PosteriorModel
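The effect of threshold_std can be illustrated with a small sketch. It assumes the samples are stored as a dict of equal-length lists and uses the population standard deviation; both choices, and the function name, are assumptions for this example rather than a description of the library's code.

```python
import math
import statistics


def drop_outliers(samples, threshold_std=math.inf):
    """Drop rows that lie more than threshold_std standard deviations from
    the mean in any parameter. `samples` maps parameter name -> list of values."""
    n = len(next(iter(samples.values())))
    keep = [True] * n
    for values in samples.values():
        mean = statistics.fmean(values)
        std = statistics.pstdev(values)  # population std (an assumption)
        for i, v in enumerate(values):
            if abs(v - mean) > threshold_std * std:
                keep[i] = False
    # Apply the same row mask to every parameter.
    return {k: [v for v, ok in zip(vals, keep) if ok] for k, vals in samples.items()}


samples = {
    "x": [0.0] * 20 + [100.0],        # last row is an outlier in x
    "y": [float(i) for i in range(21)],
}
filtered = drop_outliers(samples, threshold_std=3.0)
unfiltered = drop_outliers(samples)   # default inf keeps everything
```

A row is removed from every parameter as soon as it is an outlier in any one of them, so the columns stay aligned.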

dingo.core.result.check_equal_dict_of_arrays(a, b)
dingo.core.result.freeze(d)
dingo.core.result.make_pp_plot(results: list[Result], filename=None, save=True, confidence_interval=[0.68, 0.95, 0.997], lines=None, legend_fontsize='x-small', keys=None, title=True, confidence_interval_alpha=0.1, weighted: bool = False, **kwargs)

Make a P-P plot for a set of runs with injected signals.

Adapted from Bilby.

Parameters:
  • results (list[Result]) – A list of Result objects, each of which should have injection_parameters

  • filename (str, optional) – The name of the file to save, the default is “outdir/pp.png”

  • save (bool, optional) – Whether to save the file, default=True

  • confidence_interval ((float, list), optional) – The confidence interval to be plotted, defaulting to 1-2-3 sigma

  • lines (list) – If given, a list of matplotlib line formats to use. The length of the list must be greater than the number of parameters.

  • legend_fontsize (float) – The font size for the legend

  • keys (list) – A list of keys to use, if None defaults to search_parameter_keys

  • title (bool) – Whether to add the number of results and total p-value as a plot title

  • confidence_interval_alpha (float, list, optional) – The transparency for the background confidence interval

  • weighted (bool, optional) – Whether to use weighted vs unweighted samples. It is useful to make PP plots using unweighted samples to test networks without importance sampling.

  • kwargs – Additional kwargs to pass to matplotlib.pyplot.plot

Returns:

matplotlib figure and a NamedTuple with attributes combined_pvalue, pvalues, and names.

Return type:

fig, pvals

dingo.core.samplers module

class dingo.core.samplers.GNPESampler(model: BasePosteriorModel, init_sampler: Sampler, num_iterations: int = 1)

Bases: Sampler

Base class for GNPE sampler. It wraps a PosteriorModel and a standard Sampler for initialization. The latter is used to generate initial samples for Gibbs sampling.

A GNPE network is conditioned on additional “proxy” context theta^, i.e.,

p(theta | theta^, d)

The theta^ depend on theta via a fixed kernel p(theta^ | theta). Combining these known distributions, this class uses Gibbs sampling to draw samples from the joint distribution,

p(theta, theta^ | d)

The advantage of this approach is that we are allowed to perform any transformation of d that depends on theta^. In particular, we can use this freedom to simplify the data, e.g., by aligning data to have merger times = 0 in each detector. The merger times are unknown quantities that must be inferred jointly with all other parameters, and GNPE provides a means to do this iteratively. See https://arxiv.org/abs/2111.13139 for additional details.

Gibbs sampling breaks access to the probability density, so this must be recovered through other means. One way is to train an unconditional flow to represent p(theta^ | d) for fixed d based on the samples produced through the GNPE Gibbs sampling. Starting from these, a single Gibbs iteration gives theta from the GNPE network, along with the probability density in the joint space. This is implemented in GNPESampler provided the init_sampler provides proxies directly and num_iterations = 1.
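The alternating structure of the Gibbs sampler can be illustrated with a one-dimensional toy model. Everything here is an assumption for illustration: the posterior p(theta | d) is taken to be Gaussian, the kernel p(theta^ | theta) is Gaussian, and the conditional p(theta | theta^, d) is then available in closed form. In GNPE that last conditional is provided by the neural network instead.

```python
import random

rng = random.Random(0)

POSTERIOR_MEAN, POSTERIOR_STD = 1.0, 1.0  # toy p(theta | d)
KERNEL_STD = 0.1                          # toy kernel p(theta^ | theta)


def sample_proxy(theta):
    # theta^ ~ p(theta^ | theta): blur theta with the fixed kernel.
    return rng.gauss(theta, KERNEL_STD)


def sample_theta(proxy):
    # theta ~ p(theta | theta^, d): product of two Gaussians in theta.
    precision = 1.0 / POSTERIOR_STD**2 + 1.0 / KERNEL_STD**2
    mean = (POSTERIOR_MEAN / POSTERIOR_STD**2 + proxy / KERNEL_STD**2) / precision
    return rng.gauss(mean, precision ** -0.5)


def gibbs(theta_init, num_iterations):
    """Alternate between the two conditionals, starting from theta_init."""
    theta, proxy = theta_init, None
    for _ in range(num_iterations):
        proxy = sample_proxy(theta)
        theta = sample_theta(proxy)
    return theta, proxy


theta, proxy = gibbs(theta_init=0.0, num_iterations=30)
```

After enough iterations the pair (theta, proxy) is approximately a draw from the joint p(theta, theta^ | d), but the density of that draw is not tracked along the way, which is why the text above recovers it through an unconditional flow for the proxies.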

Attributes (beyond those of Sampler)

init_sampler : Sampler

Used for providing initial samples for Gibbs sampling.

num_iterations : int

Number of Gibbs iterations to perform.

iteration_tracker : IterationTracker

not set up

remove_init_outliers : float

not set up

Parameters:
  • model (BasePosteriorModel)

  • init_sampler (Sampler) – Used for generating initial samples

  • num_iterations (int) – Number of GNPE iterations to be performed by sampler.

property gnpe_proxy_parameters
property init_sampler
property num_iterations

The number of GNPE iterations to perform when sampling.

class dingo.core.samplers.Sampler(model: BasePosteriorModel)

Bases: object

Sampler class that wraps a PosteriorModel. Allows for conditional and unconditional models.

Draws samples from the model based on (optional) context data.

This is intended for use either as a standalone sampler, or as a sampler producing initial sample points for a GNPE sampler.

run_sampler()
log_prob()
to_result()
to_hdf5()
model
Type:

BasePosteriorModel

inference_parameters
Type:

list

samples

Samples produced from the model by run_sampler().

Type:

DataFrame

context
Type:

dict

metadata
Type:

dict

event_metadata
Type:

dict

unconditional_model

Whether the model is unconditional, in which case it is not provided context information.

Type:

bool

transform_pre, transform_post

Transforms to be applied to data and parameters during inference. These are typically implemented in a subclass.

Type:

Transform

Parameters:

model (BasePosteriorModel)

property context

Data on which to condition the sampler. For injections, there should be a ‘parameters’ key with truth values.

property event_metadata

Metadata for data analyzed. Can in principle influence any post-sampling parameter transformations (e.g., sky position correction), as well as the likelihood detector positions.

log_prob(samples: DataFrame | dict) → ndarray

Calculate the model log probability at specific sample points.

Parameters:

samples (pd.DataFrame | dict) – Sample points at which to calculate the log probability.

Return type:

np.array of log probabilities.

run_sampler(num_samples: int, batch_size: int | None = None)

Generates samples and stores them in self.samples. Conditions the model on self.context if appropriate (i.e., if the model is not unconditional).

If possible, it also calculates the log_prob and saves it as a column in self.samples. When using GNPE it is not possible to obtain the log_prob due to the many Gibbs iterations. However, in the case of just one iteration, and when starting from a sampler for the proxy, the GNPESampler does calculate the log_prob.

Allows for batched sampling, e.g., if limited by GPU memory. Actual sampling for each batch is performed by _run_sampler(), which will differ for Sampler and GNPESampler.

Parameters:
  • num_samples (int) – Number of samples requested.

  • batch_size (int, optional) – Batch size for sampler.
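Batched sampling amounts to splitting the requested number of samples into chunks of at most batch_size. The helper below is a hypothetical sketch of that bookkeeping only; how the library divides the work internally is not stated here.

```python
def batch_sizes(num_samples, batch_size=None):
    """Split num_samples into chunks of at most batch_size."""
    if num_samples <= 0:
        return []
    if batch_size is None:
        # No batching requested: draw everything in one call.
        batch_size = num_samples
    full, remainder = divmod(num_samples, batch_size)
    return [batch_size] * full + ([remainder] if remainder else [])


sizes = batch_sizes(10, batch_size=4)
```

Each entry would correspond to one call of the per-batch sampling routine, with the per-batch results concatenated afterwards.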

to_hdf5(label='result', outdir='.')
to_result() → Result

Export samples, metadata, and context information to a Result instance, which can be used for saving or, e.g., importance sampling, training an unconditional flow, etc.

Return type:

Result

dingo.core.transforms module

class dingo.core.transforms.GetItem(key)

Bases: object

class dingo.core.transforms.RenameKey(old, new)

Bases: object

Module contents