dingo.core package
Subpackages
- dingo.core.density package
- dingo.core.nn package
- Submodules
- dingo.core.nn.cfnets module
- dingo.core.nn.enets module
- dingo.core.nn.nsf module
- Module contents
- dingo.core.posterior_models package
- Submodules
- dingo.core.posterior_models.base_model module
  - BasePosteriorModel
    - BasePosteriorModel.initialize_network()
    - BasePosteriorModel.initialize_optimizer_and_scheduler()
    - BasePosteriorModel.load_model()
    - BasePosteriorModel.log_prob()
    - BasePosteriorModel.loss()
    - BasePosteriorModel.network_to_device()
    - BasePosteriorModel.sample()
    - BasePosteriorModel.sample_and_log_prob()
    - BasePosteriorModel.save_model()
    - BasePosteriorModel.train()
  - test_epoch()
  - train_epoch()
- dingo.core.posterior_models.build_model module
- dingo.core.posterior_models.cflow_base module
  - ContinuousFlowPosteriorModel
    - ContinuousFlowPosteriorModel.evaluate_vector_field()
    - ContinuousFlowPosteriorModel.initialize_network()
    - ContinuousFlowPosteriorModel.integration_range
    - ContinuousFlowPosteriorModel.log_prob()
    - ContinuousFlowPosteriorModel.rhs_of_joint_ode()
    - ContinuousFlowPosteriorModel.sample()
    - ContinuousFlowPosteriorModel.sample_and_log_prob()
    - ContinuousFlowPosteriorModel.sample_t()
    - ContinuousFlowPosteriorModel.sample_theta_0()
  - compute_divergence()
  - compute_hutchinson_divergence()
  - compute_log_prior()
  - norm_without_divergence_component()
- dingo.core.posterior_models.flow_matching module
- dingo.core.posterior_models.normalizing_flow module
- dingo.core.posterior_models.score_matching module
  - ScoreDiffusionPosteriorModel
    - ScoreDiffusionPosteriorModel.alpha()
    - ScoreDiffusionPosteriorModel.beta()
    - ScoreDiffusionPosteriorModel.evaluate_vector_field()
    - ScoreDiffusionPosteriorModel.get_likelihood_weighting()
    - ScoreDiffusionPosteriorModel.get_t_theta_t_score()
    - ScoreDiffusionPosteriorModel.loss()
    - ScoreDiffusionPosteriorModel.mu()
    - ScoreDiffusionPosteriorModel.sigma()
- Module contents
- dingo.core.utils package
- Submodules
- dingo.core.utils.backward_compatibility module
- dingo.core.utils.condor_utils module
- dingo.core.utils.gnpeutils module
- dingo.core.utils.logging_utils module
- dingo.core.utils.misc module
- dingo.core.utils.plotting module
- dingo.core.utils.pt_to_hdf5 module
- dingo.core.utils.torchutils module
- dingo.core.utils.trainutils module
- Module contents
Submodules
dingo.core.dataset module
- class dingo.core.dataset.DingoDataset(file_name: str | None = None, dictionary: dict | None = None, data_keys: List | None = None, leave_on_disk_keys: list | None = None)
Bases:
object
This is a generic dataset class with save / load methods.
A common use case is to inherit multiply from DingoDataset and torch.utils.data.Dataset, in which case the subclass picks up these I/O methods, and DingoDataset is acting as a Mixin class.
Alternatively, if the torch Dataset is not needed, then DingoDataset can be subclassed directly.
For constructing, provide either file_name, or dictionary containing data and settings entries, or neither.
- Parameters:
file_name (str) – HDF5 file containing a dataset
dictionary (dict) – Contains settings and data entries. The data keys should be the same as save_keys
data_keys (list) – Variables that should be saved / loaded. This allows the class to store additional variables beyond those that are saved. Typically, this list would be provided by any subclass.
leave_on_disk_keys (Optional[list]) – Keys for which the values are not loaded into RAM when initializing the dataset. This reduces the memory footprint during training. Instead, the values are loaded from the HDF5 file during training.
- dataset_type = 'dingo_dataset'
- from_dictionary(dictionary: dict)
- from_file(file_name: str)
- to_dictionary()
- to_file(file_name: str, mode: str = 'w')
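The mixin pattern described above can be sketched without dingo or torch installed. Everything in this block is a hypothetical stand-in: SaveLoadMixin only imitates the dictionary-based part of the DingoDataset interface, and collections.abc.Sequence plays the role that torch.utils.data.Dataset would play in practice.

```python
from collections.abc import Sequence


class SaveLoadMixin:
    """Hypothetical stand-in for DingoDataset: dictionary-based I/O only."""

    def __init__(self, dictionary=None, data_keys=None):
        self._data_keys = list(data_keys or [])
        self.settings = None
        if dictionary is not None:
            self.from_dictionary(dictionary)

    def from_dictionary(self, dictionary):
        # Only the keys listed in data_keys are picked up from the dictionary.
        for key in self._data_keys:
            setattr(self, key, dictionary.get(key))
        self.settings = dictionary.get("settings")

    def to_dictionary(self):
        out = {key: getattr(self, key, None) for key in self._data_keys}
        out["settings"] = self.settings
        return out


class ToyDataset(SaveLoadMixin, Sequence):
    """Picks up I/O from the mixin and indexing behaviour from Sequence."""

    def __init__(self, dictionary=None):
        # The subclass, not the caller, decides which keys are saved / loaded.
        super().__init__(dictionary=dictionary, data_keys=["values"])

    def __len__(self):
        return len(self.values)

    def __getitem__(self, idx):
        return self.values[idx]


ds = ToyDataset(dictionary={"values": [1.0, 2.0, 3.0], "settings": {"name": "toy"}})
roundtrip = ToyDataset(dictionary=ds.to_dictionary())
```

The subclass supplies data_keys itself, which mirrors the remark that this list would typically be provided by any subclass.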
- dingo.core.dataset.recursive_hdf5_load(group, keys: List[str] | None = None, idx: int | List[int] | None = None)
This is a generic helper function to recursively load data from an HDF5 file.
- Parameters:
group (h5py.Group) – Group from which to recursively load data.
keys (list[str] or None) – List of keys to load. If None, load all keys.
idx (int or list[int] or None) – If idx is provided, only the datapoints corresponding to the given indices are loaded.
- dingo.core.dataset.recursive_hdf5_save(group, d)
dingo.core.likelihood module
- class dingo.core.likelihood.Likelihood
Bases:
object
- log_likelihood(theta)
- log_likelihood_multi(theta: DataFrame, num_processes: int = 1) → ndarray
Calculate the log likelihood at multiple points in parameter space. Works with multiprocessing.
This wraps the log_likelihood() method.
- Parameters:
theta (pd.DataFrame) – Parameter values at which to evaluate the likelihood.
num_processes (int) – Number of processes to use.
- Return type:
np.array of log likelihoods
dingo.core.multiprocessing module
- dingo.core.multiprocessing.apply_func_with_multiprocessing(func: callable, theta: DataFrame, num_processes: int = 1) → ndarray
Call func(theta.iloc[idx].to_dict()) with multiprocessing.
- Parameters:
func (callable)
theta (pd.DataFrame) – Parameters with multiple rows, evaluate func for each row.
num_processes (int) – Number of parallel processes to use.
- Returns:
result – Output array, where result[idx] = func(theta.iloc[idx].to_dict())
- Return type:
np.ndarray
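The row-wise evaluation described above can be illustrated with the standard library alone. This is a sketch and not the dingo implementation: apply_rows and toy_log_likelihood are hypothetical names, and a list of dicts stands in for the rows theta.iloc[idx].to_dict() of a DataFrame.

```python
import math
from multiprocessing import Pool


def apply_rows(func, rows, num_processes=1):
    """Evaluate func on each row (a dict) and return the results in row order."""
    if num_processes > 1:
        # func must be picklable, i.e. defined at module level.
        with Pool(processes=num_processes) as pool:
            return pool.map(func, rows)
    # Serial fallback: no worker processes are started.
    return [func(row) for row in rows]


def toy_log_likelihood(theta):
    # Hypothetical unit-variance Gaussian log likelihood centred on mass = 30.
    return -0.5 * (theta["mass"] - 30.0) ** 2 - 0.5 * math.log(2.0 * math.pi)


rows = [{"mass": 29.0}, {"mass": 30.0}, {"mass": 32.0}]
results = apply_rows(toy_log_likelihood, rows, num_processes=1)
```

When num_processes is greater than 1, the call is best placed under an `if __name__ == "__main__":` guard on platforms that start worker processes by spawning.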
dingo.core.result module
- class dingo.core.result.Result(file_name=None, dictionary=None)
Bases:
DingoDataset
A dataset class to hold a collection of samples, implementing I/O, importance sampling, and unconditional flow training.
- Attributes:
- samples : pd.DataFrame
Contains parameter samples, as well as (possibly) log_prob, log_likelihood, weights, log_prior, delta_log_prob_target.
- domain : Domain
Should be implemented in a subclass.
- prior : PriorDict
Should be implemented in a subclass.
- likelihood : Likelihood
Should be implemented in a subclass.
- context : dict
Context data from which the samples were produced (e.g., strain data, ASDs).
- metadata : dict
- event_metadata : dict
- log_evidence : float
- log_evidence_std : float (property)
- effective_sample_size, n_eff : float (property)
- sample_efficiency : float (property)
For constructing, provide either file_name, or dictionary containing data and settings entries, or neither.
- Parameters:
file_name (str) – HDF5 file containing a dataset
dictionary (dict) – Contains settings and data entries. The data keys should be the same as save_keys
data_keys (list) – Variables that should be saved / loaded. This allows the class to store additional variables beyond those that are saved. Typically, this list would be provided by any subclass.
leave_on_disk_keys (Optional[list]) – Keys for which the values are not loaded into RAM when initializing the dataset. This reduces the memory footprint during training. Instead, the values are loaded from the HDF5 file during training.
- property base_metadata
- property constraint_parameter_keys
- dataset_type = 'core_result'
- property effective_sample_size
- property fixed_parameter_keys
- get_all_injection_credible_levels(keys: list[str] | None = None, weighted: bool = False)
Get credible levels for all parameters.
Adapted from Bilby.
- Parameters:
keys (list, optional) – A list of keys for which to return the credible levels. If None, defaults to search_parameter_keys.
weighted (bool, optional) – Whether to use sample weights in calculating credible level.
- Returns:
credible_levels – The credible levels at which the injected parameters are found.
- Return type:
dict
- get_injection_credible_level(parameter: str, weighted: bool = False)
Get the credible level of the injected parameter.
Calculated as CDF(injection value).
Adapted from Bilby.
- Parameters:
parameter (str) – Parameter to get credible level for
weighted (bool, optional) – Whether to use sample weights in calculating credible level.
- Returns:
credible level
- Return type:
float
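As an illustration of "CDF(injection value)", the credible level can be read as the (optionally weighted) fraction of posterior samples lying below the injected value. The sketch below is not the dingo code; the function name and the use of a strict inequality are assumptions.

```python
def credible_level(samples, injected_value, weights=None):
    """Empirical CDF of the posterior samples, evaluated at the injected value."""
    if weights is None:
        # Unweighted case: every sample counts equally.
        weights = [1.0] * len(samples)
    below = sum(w for x, w in zip(samples, weights) if x < injected_value)
    return below / sum(weights)


samples = [0.1, 0.4, 0.5, 0.9]
unweighted = credible_level(samples, 0.45)
weighted = credible_level(samples, 0.45, weights=[3.0, 1.0, 1.0, 1.0])
```

For a well-calibrated posterior, credible levels collected over many injections should be uniformly distributed on [0, 1], which is what a P-P plot checks.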
- importance_sample(num_processes: int = 1, **likelihood_kwargs)
Calculate importance weights for samples.
Importance sampling starts with samples that have been generated from a proposal distribution q(theta), in this case a neural network model. Certain networks (i.e., non-GNPE) also provide the log probability of each sample, which is required for importance sampling.
Given the proposal, we re-weight samples according to the (un-normalized) target distribution, which we take to be the likelihood L(theta) times the prior pi(theta). This gives sample weights
w(theta) ~ pi(theta) L(theta) / q(theta),
where the overall normalization does not matter (and we take the weights to have mean 1). Since q(theta) enters this expression, importance sampling is only possible when we know the log probability of each sample.
As byproducts, this method also estimates the evidence and effective sample size of the importance sampled points.
This method modifies the samples pd.DataFrame in-place, adding new columns for log_likelihood, log_prior, and weights. It also stores the log_evidence as an attribute.
- Parameters:
num_processes (int) – Number of parallel processes to use when calculating likelihoods. (This is the most expensive task.)
likelihood_kwargs (dict) – kwargs that are forwarded to the likelihood constructor. E.g., options for marginalization.
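The weight formula above can be sketched in log space with the standard library. This is a minimal illustration under stated assumptions, not the dingo implementation: the function name is hypothetical, the evidence is estimated as the mean of the un-normalized weights, and the effective sample size uses the common estimator (sum w)^2 / sum w^2.

```python
import math


def importance_weights(log_prior, log_likelihood, log_prob):
    """Return (weights with mean 1, log evidence estimate, effective sample size)."""
    n = len(log_prob)
    # log w = log pi + log L - log q for each sample.
    log_w = [lp + ll - lq for lp, ll, lq in zip(log_prior, log_likelihood, log_prob)]
    # Subtract the maximum before exponentiating for numerical stability.
    shift = max(log_w)
    w = [math.exp(x - shift) for x in log_w]
    mean_w = sum(w) / n
    # Evidence estimate: mean of the un-normalized weights pi * L / q.
    log_evidence = shift + math.log(mean_w)
    weights = [x / mean_w for x in w]
    n_eff = sum(weights) ** 2 / sum(x * x for x in weights)
    return weights, log_evidence, n_eff


weights, log_evidence, n_eff = importance_weights(
    log_prior=[0.0, 0.0, 0.0],
    log_likelihood=[math.log(1.0), math.log(2.0), math.log(3.0)],
    log_prob=[0.0, 0.0, 0.0],
)
```

In this toy input the un-normalized weights are 1, 2 and 3, so the normalized weights are 0.5, 1.0 and 1.5 and the evidence estimate is their original mean, 2.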
- property injection_parameters
- property log_bayes_factor
- property log_evidence_std
- classmethod merge(parts)
Merge several Result instances into one. Check that they are compatible, in the sense of having the same metadata. Finally, calculate a new log evidence for the combined result.
This is useful when recombining separate importance sampling jobs.
- Parameters:
parts (list[Result]) – List of sub-Results to be combined.
- Return type:
Combined Result.
- property metadata
- property n_eff
- property num_samples
- parameter_subset(parameters)
Return a new object of the same type, with only a subset of parameters. Drops all other columns in samples DataFrame as well (e.g., log_prob, weights).
- Parameters:
parameters (list) – List of parameters to keep.
- Return type:
- plot_corner(parameters: list | None = None, filename: str = 'corner.pdf', **kwargs)
Generate a corner plot of the samples.
- Parameters:
parameters (list[str]) – List of parameters to include. If None, include all parameters. (Default: None)
filename (str) – Where to save the plot.
legend_font_size (int) – Font size of the legend.
- plot_log_probs(filename='log_probs.png')
Make a scatter plot of the target versus proposal log probabilities. For the target, subtract off the log evidence.
- plot_weights(filename='weights.png')
Make a scatter plot of sample weights vs log proposal.
- print_summary()
Display the number of samples, and (if importance sampling is complete) the log evidence and number of effective samples.
- rejection_sample(max_samples_per_draw: int = 1, clip_weights: bool = False, random_state=None)
Generate unweighted posterior samples from weighted ones via rejection sampling.
Each original sample contributes at most max_samples_per_draw copies to the output, so the result avoids the excessive duplication that sampling_importance_resampling() can produce for high-weight samples.
Algorithm (unbiased, maximum efficiency)
The weights are first scaled so that the largest weight equals max_samples_per_draw. Each sample i then contributes
- floor(w_scaled[i]) copies deterministically (integer part), and
- one additional copy with probability w_scaled[i] - floor(w_scaled[i]) (fractional part, a single Bernoulli draw).
The expected number of copies of sample i is therefore exactly w_scaled[i] ∝ w[i], which guarantees an unbiased representation of the posterior. Using the integer part deterministically (rather than rounding stochastically) maximises the expected total number of output samples for a given max_samples_per_draw.
Optional weight clipping
When clip_weights=True, the ceil(sqrt(N)) largest weights are replaced by their mean and the weights are re-normalized to mean 1 before rejection sampling. This number of clips is the theoretically optimal choice that yields asymptotically unbiased results [1]. Using the mean (rather than the minimum) of the clipped group preserves their total weight, which minimises the bias introduced by clipping. The net effect is reduced weight variance and a larger expected number of output samples.
If the samples DataFrame has no weights column, the samples are already unweighted and are returned unchanged.
- Parameters:
max_samples_per_draw (int) – Maximum number of copies any single input sample may contribute to the output. Default is 1 (standard rejection sampling, no duplicates).
clip_weights (bool) – Whether to clip the ceil(sqrt(N)) largest weights to their mean before rejection sampling. Default is False.
random_state (int or None) – Seed for the random number generator, for reproducibility.
- Returns:
Unweighted samples (the weights column is dropped).
- Return type:
pd.DataFrame
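The integer-part plus Bernoulli scheme described above can be sketched as follows. This is an illustration rather than the dingo implementation: it works on a plain list of weights, returns indices instead of a DataFrame, omits the optional weight clipping, and the seed argument is a stand-in for random_state.

```python
import math
import random


def rejection_sample(weights, max_samples_per_draw=1, seed=None):
    """Return indices of kept samples; index i may appear several times."""
    rng = random.Random(seed)
    # Scale so that the largest weight equals max_samples_per_draw.
    scale = max_samples_per_draw / max(weights)
    kept = []
    for i, w in enumerate(weights):
        w_scaled = w * scale
        # Integer part: deterministic number of copies.
        copies = math.floor(w_scaled)
        # Fractional part: one extra copy from a single Bernoulli draw.
        if rng.random() < w_scaled - copies:
            copies += 1
        kept.extend([i] * copies)
    return kept


kept_single = rejection_sample([1.0, 2.0, 4.0], max_samples_per_draw=1, seed=0)
kept_double = rejection_sample([1.0, 2.0, 4.0], max_samples_per_draw=2, seed=0)
```

With max_samples_per_draw=2 the scaled weights are 0.5, 1.0 and 2.0, so the last sample always appears twice, the middle one exactly once, and the first one with probability 0.5.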
References
- reset_event(event_dataset)
Set the Result context and event_metadata based on an EventDataset.
If these attributes already exist, perform a comparison to check for changes. Update relevant objects appropriately. Note that setting context and event_metadata attributes directly would not perform these additional checks and updates.
- Parameters:
event_dataset (EventDataset) – New event to be used for importance sampling.
- property sample_efficiency
- sampling_importance_resampling(num_samples=None, random_state=None)
Generate unweighted posterior samples from weighted ones. New samples are sampled with probability proportional to the sample weight. Resampling is done with replacement, until the desired number of unweighted samples is obtained.
- Parameters:
num_samples (int) – Number of samples to resample.
random_state (int or None) – Sampling seed.
- Returns:
Unweighted samples
- Return type:
pd.DataFrame
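Resampling with replacement in proportion to the weights can be illustrated with random.choices from the standard library. This is a sketch on plain lists, not the dingo code, and the seed argument is a stand-in for random_state.

```python
import random


def sampling_importance_resampling(samples, weights, num_samples, seed=None):
    """Draw num_samples items with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    return rng.choices(samples, weights=weights, k=num_samples)


resampled = sampling_importance_resampling(
    ["a", "b", "c"], [0.0, 1.0, 3.0], num_samples=1000, seed=0
)
```

Because draws are made with replacement, a sample with a large weight can appear many times in the output, which is the duplication that rejection_sample() limits through max_samples_per_draw.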
- property search_parameter_keys
- split(num_parts)
Split the Result into a set of smaller results. The samples are evenly divided among the sub-results. Additional information (metadata, context, etc.) is copied into each.
This is useful for splitting expensive tasks such as importance sampling across multiple jobs.
- Parameters:
num_parts (int) – The number of parts to split the Result across.
- Return type:
list of sub-Results.
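One way to divide samples evenly among parts is sketched below. How dingo distributes a remainder is not stated here, so giving the first few parts one extra sample is an assumption, and split_evenly is a hypothetical helper operating on a plain list.

```python
def split_evenly(items, num_parts):
    """Split items into num_parts contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_parts)
    parts, start = [], 0
    for k in range(num_parts):
        # The first `extra` parts receive one additional item.
        stop = start + base + (1 if k < extra else 0)
        parts.append(items[start:stop])
        start = stop
    return parts


parts = split_evenly(list(range(10)), 3)
```

Concatenating the parts in order recovers the original list, which is the property needed when sub-results are later recombined with merge().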
- train_unconditional_flow(parameters, nde_settings: dict, train_dir: str | None = None, threshold_std: float | None = inf)
Train an unconditional flow to represent the distribution of self.samples.
- Parameters:
parameters (list) – List of parameters over which to train the flow. Can be a subset of the existing parameters.
nde_settings (dict) – Configuration settings for the neural density estimator.
train_dir (Optional[str]) – Where to save the output of network training, e.g., logs, checkpoints. If not provided, a temporary directory is used.
threshold_std (Optional[float]) – Drop samples more than threshold_std standard deviations away from the mean (in any parameter) before training the flow. This is meant to remove outlier samples.
- Return type:
PosteriorModel
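The outlier cut controlled by threshold_std can be sketched on a list of parameter dicts. This is an illustration and not the dingo code: drop_outliers is a hypothetical name, and the use of the population standard deviation is an assumption.

```python
import math
import statistics


def drop_outliers(rows, threshold_std):
    """Keep rows within threshold_std standard deviations of the mean in every parameter."""
    if math.isinf(threshold_std):
        # Infinite threshold: nothing is removed.
        return list(rows)
    keys = rows[0].keys()
    mean = {k: statistics.fmean(r[k] for r in rows) for k in keys}
    std = {k: statistics.pstdev(r[k] for r in rows) for k in keys}
    # A row is dropped if it is an outlier in any single parameter.
    return [
        r for r in rows
        if all(abs(r[k] - mean[k]) <= threshold_std * std[k] for k in keys)
    ]


rows = [{"x": 0.0}] * 9 + [{"x": 100.0}]
kept = drop_outliers(rows, threshold_std=2.0)
```

In this toy input the mean is 10 and the standard deviation is 30, so the sample at 100 lies 3 standard deviations away and is removed.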
- dingo.core.result.check_equal_dict_of_arrays(a, b)
- dingo.core.result.freeze(d)
- dingo.core.result.make_pp_plot(results: list[Result], filename=None, save=True, confidence_interval=[0.68, 0.95, 0.997], lines=None, legend_fontsize='x-small', keys=None, title=True, confidence_interval_alpha=0.1, weighted: bool = False, **kwargs)
Make a P-P plot for a set of runs with injected signals.
Adapted from Bilby.
- Parameters:
results (list[Result]) – A list of Result objects, each of which should have injection_parameters
filename (str, optional) – The name of the file to save, the default is “outdir/pp.png”
save (bool, optional) – Whether to save the file, default=True
confidence_interval ((float, list), optional) – The confidence interval to be plotted, defaulting to 1-2-3 sigma
lines (list) – If given, a list of matplotlib line formats to use, must be greater than the number of parameters.
legend_fontsize (float) – The font size for the legend
keys (list) – A list of keys to use, if None defaults to search_parameter_keys
title (bool) – Whether to add the number of results and total p-value as a plot title
confidence_interval_alpha (float, list, optional) – The transparency for the background confidence interval
weighted (bool, optional) – Whether to use weighted vs unweighted samples. It is useful to make PP plots using unweighted samples to test networks without importance sampling.
kwargs – Additional kwargs to pass to matplotlib.pyplot.plot
- Returns:
matplotlib figure and a NamedTuple with attributes combined_pvalue, pvalues, and names.
- Return type:
fig, pvals
dingo.core.samplers module
- class dingo.core.samplers.GNPESampler(model: BasePosteriorModel, init_sampler: Sampler, num_iterations: int = 1)
Bases:
Sampler
Base class for GNPE sampler. It wraps a PosteriorModel and a standard Sampler for initialization. The former is used to generate initial samples for Gibbs sampling.
A GNPE network is conditioned on additional “proxy” context theta^, i.e.,
p(theta | theta^, d)
The theta^ depend on theta via a fixed kernel p(theta^ | theta). Combining these known distributions, this class uses Gibbs sampling to draw samples from the joint distribution,
p(theta, theta^ | d)
The advantage of this approach is that we are allowed to perform any transformation of d that depends on theta^. In particular, we can use this freedom to simplify the data, e.g., by aligning data to have merger times = 0 in each detector. The merger times are unknown quantities that must be inferred jointly with all other parameters, and GNPE provides a means to do this iteratively. See https://arxiv.org/abs/2111.13139 for additional details.
Gibbs sampling breaks access to the probability density, so this must be recovered through other means. One way is to train an unconditional flow to represent p(theta^ | d) for fixed d based on the samples produced through the GNPE Gibbs sampling. Starting from these, a single Gibbs iteration gives theta from the GNPE network, along with the probability density in the joint space. This is implemented in GNPESampler provided the init_sampler provides proxies directly and num_iterations = 1.
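The alternation between p(theta | theta^, d) and the kernel p(theta^ | theta) can be illustrated with a one-dimensional toy problem. Everything here is an assumption made for illustration: the "network" is replaced by the exact conditional of a standard normal posterior, the kernel is Gaussian with a hypothetical width SIGMA_KERNEL, and the data d is omitted.

```python
import random
import statistics

SIGMA_KERNEL = 1.0  # hypothetical width of the kernel p(theta^ | theta)


def sample_theta_given_proxy(proxy, rng):
    # Stand-in for the GNPE network: exact conditional for a N(0, 1) posterior.
    var = SIGMA_KERNEL**2 / (1.0 + SIGMA_KERNEL**2)
    mean = proxy / (1.0 + SIGMA_KERNEL**2)
    return rng.gauss(mean, var**0.5)


def sample_proxy_given_theta(theta, rng):
    # Fixed kernel: blur theta with Gaussian noise.
    return rng.gauss(theta, SIGMA_KERNEL)


def gibbs_chain(init_proxy, num_iterations, rng):
    """Run one Gibbs chain and return the final theta."""
    proxy = init_proxy
    for _ in range(num_iterations):
        theta = sample_theta_given_proxy(proxy, rng)
        proxy = sample_proxy_given_theta(theta, rng)
    return theta


rng = random.Random(0)
# Many independent chains, all started from a deliberately poor proxy value.
thetas = [gibbs_chain(init_proxy=5.0, num_iterations=30, rng=rng) for _ in range(2000)]
theta_mean = statistics.fmean(thetas)
theta_var = statistics.pvariance(thetas)
```

After enough iterations the chains forget the initial proxy and the final theta values are distributed approximately as the target N(0, 1). The density of each sample is not available from the chain itself, which is the limitation noted above.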
Attributes (beyond those of Sampler)
- init_sampler : Sampler
Used for providing initial samples for Gibbs sampling.
- num_iterations : int
Number of Gibbs iterations to perform.
- iteration_tracker : IterationTracker
not set up
- remove_init_outliers : float
not set up
- Parameters:
model (BasePosteriorModel)
init_sampler (Sampler) – Used for generating initial samples.
num_iterations (int) – Number of GNPE iterations to be performed by sampler.
- property gnpe_proxy_parameters
- property init_sampler
- property num_iterations
The number of GNPE iterations to perform when sampling.
- class dingo.core.samplers.Sampler(model: BasePosteriorModel)
Bases:
object
Sampler class that wraps a PosteriorModel. Allows for conditional and unconditional models.
Draws samples from the model based on (optional) context data.
This is intended for use either as a standalone sampler, or as a sampler producing initial sample points for a GNPE sampler.
- run_sampler()
- log_prob()
- to_result()
- to_hdf5()
- model
- Type:
- inference_parameters
- Type:
list
- samples
Samples produced from the model by run_sampler().
- Type:
DataFrame
- context
- Type:
dict
- metadata
- Type:
dict
- event_metadata
- Type:
dict
- unconditional_model
Whether the model is unconditional, in which case it is not provided context information.
- Type:
bool
- transform_pre, transform_post
Transforms to be applied to data and parameters during inference. These are typically implemented in a subclass.
- Type:
Transform
- Parameters:
model (BasePosteriorModel)
- property context
Data on which to condition the sampler. For injections, there should be a ‘parameters’ key with truth values.
- property event_metadata
Metadata for data analyzed. Can in principle influence any post-sampling parameter transformations (e.g., sky position correction), as well as the likelihood detector positions.
- log_prob(samples: DataFrame | dict) → ndarray
Calculate the model log probability at specific sample points.
- Parameters:
samples (pd.DataFrame | dict) – Sample points at which to calculate the log probability.
- Return type:
np.array of log probabilities.
- run_sampler(num_samples: int, batch_size: int | None = None)
Generates samples and stores them in self.samples. Conditions the model on self.context if appropriate (i.e., if the model is not unconditional).
If possible, it also calculates the log_prob and saves it as a column in self.samples. When using GNPE it is not possible to obtain the log_prob due to the many Gibbs iterations. However, in the case of just one iteration, and when starting from a sampler for the proxy, the GNPESampler does calculate the log_prob.
Allows for batched sampling, e.g., if limited by GPU memory. Actual sampling for each batch is performed by _run_sampler(), which will differ for Sampler and GNPESampler.
- Parameters:
num_samples (int) – Number of samples requested.
batch_size (int, optional) – Batch size for sampler.
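Batched sampling, as described above, amounts to requesting samples in chunks and concatenating the results. The sketch below is not the dingo implementation: batch_sizes and run_batched are hypothetical helpers, and sample_fn stands in for the per-batch work done by _run_sampler().

```python
def batch_sizes(num_samples, batch_size=None):
    """Split num_samples into consecutive batch sizes; one batch if batch_size is None."""
    if batch_size is None:
        batch_size = num_samples
    full, remainder = divmod(num_samples, batch_size)
    # Full batches first, then one smaller batch for any remainder.
    return [batch_size] * full + ([remainder] if remainder else [])


def run_batched(sample_fn, num_samples, batch_size=None):
    """Call sample_fn(n) once per batch and concatenate the returned samples."""
    samples = []
    for n in batch_sizes(num_samples, batch_size):
        samples.extend(sample_fn(n))
    return samples


sizes = batch_sizes(10, 4)
samples = run_batched(lambda n: [0.0] * n, num_samples=10, batch_size=4)
```

Keeping the batch size below what fits in GPU memory bounds the peak memory use, while the total number of samples returned is unchanged.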
- to_hdf5(label='result', outdir='.')
dingo.core.transforms module
- class dingo.core.transforms.GetItem(key)
Bases:
object
- class dingo.core.transforms.RenameKey(old, new)
Bases:
object