Neural Posterior Estimation with Differentiable Simulators


Bayesian Deep Learning for Cosmology and Time Domain Astrophysics #2

June 20-24, 2022


Justine Zeghal, François Lanusse, Alexandre Boucaud, Eric Aubourg, Benjamin Remy



Context

    We want to infer the parameters that generated an observation $x_0$.

    $\to$ Bayes' theorem: $p(\theta|x_0) \propto \underbrace{p(x_0|\theta)}_{\text{likelihood}} \, \underbrace{p(\theta)}_{\text{prior}}$

    $ p(x_0|\theta) = \int p(x_0|\theta, z)p(z)dz$ $\to$ intractable



    Different ways to do Likelihood-Free Inference:


    • Likelihood Estimation
    • Ratio Estimation
    • Posterior Estimation

      $\to$ Require a large number of simulations

What if we add the gradients of the simulator's joint log-probability with respect to the input parameters to the LFI process?

Posterior Estimation Algorithm



Draw N parameters $\theta_i \sim p(\theta)$

$\downarrow$

Draw N simulations $x_i \sim p(x|\theta_i)$, $i=1..N$

$\downarrow$

Train a neural density estimator $q_{\phi}(\theta|x)$ on $(x_i,\theta_i)_{i=1..N}$

$\downarrow$

Approximate the posterior: $p(\theta|x_0) \approx q_{\phi}(\theta|x=x_0)$
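A minimal sketch of this loop in JAX (an illustration only: the toy Gaussian prior, the `simulator`, and the diagonal-Gaussian stand-in for $q_{\phi}$ are assumptions, not the talk's simulator or normalizing flow; `optax` is an assumed optimizer choice):

```python
import jax
import jax.numpy as jnp
import optax  # assumed optimizer library choice

# --- Toy prior and simulator (placeholders, not the talk's model) ---
def sample_prior(key, n):
    return jax.random.normal(key, (n, 2))              # theta ~ N(0, I)

def simulator(key, theta):
    noise = 0.1 * jax.random.normal(key, theta.shape)
    return theta ** 2 + noise                           # toy forward model

# --- Toy conditional density estimator q_phi(theta | x): diagonal Gaussian ---
# (a stand-in for the normalizing flow used in the talk)
def log_q(phi, theta, x):
    mean = x @ phi["W"] + phi["b"]                      # amortized mean
    sigma = jnp.exp(phi["log_sigma"])
    return jnp.sum(-0.5 * ((theta - mean) / sigma) ** 2
                   - phi["log_sigma"] - 0.5 * jnp.log(2 * jnp.pi), axis=-1)

def loss_fn(phi, theta, x):
    return -jnp.mean(log_q(phi, theta, x))              # negative log-likelihood

key = jax.random.PRNGKey(0)
k1, k2 = jax.random.split(key)
theta = sample_prior(k1, 1000)                          # 1. draw N parameters
x = simulator(k2, theta)                                # 2. run N simulations

phi = {"W": jnp.zeros((2, 2)), "b": jnp.zeros(2), "log_sigma": jnp.zeros(2)}
opt = optax.adam(1e-2)
opt_state = opt.init(phi)

for _ in range(500):                                    # 3. train q_phi on (x_i, theta_i)
    grads = jax.grad(loss_fn)(phi, theta, x)
    updates, opt_state = opt.update(grads, opt_state)
    phi = optax.apply_updates(phi, updates)

# 4. approximate the posterior at an observation x_0
x0 = simulator(jax.random.PRNGKey(1), jnp.array([[1.0, -0.5]]))
posterior_log_prob = lambda t: log_q(phi, t, x0)        # p(theta|x_0) ≈ q_phi(theta|x_0)
```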





The idea

    With only a few simulations, it is hard to approximate the distribution.

    $\to$ we need more simulations.

    But if we have a few simulations

    and their gradients,

    then we can already get an idea of the shape of the distribution.

$\to$ We integrate the gradients $\nabla_{\theta} \log p(\theta|x)$ into the process to reduce the number of simulations.
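As an illustration of where this score comes from, here is a minimal JAX sketch with a toy Gaussian simulator (the model and noise scale are assumptions for illustration only): because the simulator is written in an autodiff framework, $\nabla_{\theta} \log p(\theta, x)$ comes essentially for free.

```python
import jax
import jax.numpy as jnp

def log_prior(theta):
    return jnp.sum(jax.scipy.stats.norm.logpdf(theta))          # theta ~ N(0, I)

def log_likelihood(x, theta, z):
    # toy simulator: x = theta + z, observed with Gaussian noise of scale 0.1
    return jnp.sum(jax.scipy.stats.norm.logpdf(x, loc=theta + z, scale=0.1))

def log_joint(theta, x, z):
    return log_likelihood(x, theta, z) + log_prior(theta)

# The simulator is written in JAX, so its score is one autodiff call away:
score_fn = jax.grad(log_joint, argnums=0)    # gradient w.r.t. theta

theta = jnp.array([0.3, -1.2])
z = jnp.zeros(2)                             # realized latent variables
x = theta + z                                # one simulation
print(score_fn(theta, x, z))                 # score delivered alongside the simulation
```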

How can we train our neural density estimator with both simulations and the score?

Normalizing Flows (NFs) as Neural Density Estimators

    The key idea of NFs is to transform a simple density distribution $p_z(z)$ through a series of bijective functions $f_i$ to reconstruct a complex target distribution $p_x(x)$.

$z \xrightarrow{\;f_1\;} z' \xrightarrow{\;f_2\;} x$, and back via $x \xrightarrow{\;f_2^{-1}\;} z' \xrightarrow{\;f_1^{-1}\;} z$.



The density is then obtained using the change-of-variables formula:

$p_x^{\phi}(x) = p_z\left(f^{-1}_{\phi}(x)\right) \left|\det \frac{\partial f^{-1}_{\phi}(x)}{\partial x} \right|.$
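As a concrete sanity check of this formula, a minimal sketch with a single affine bijection $f_{\phi}(z) = a z + b$ (a hypothetical one-parameter flow, not the flows used in the talk):

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

# One affine bijection f_phi(z) = a * z + b, so f_phi^{-1}(x) = (x - b) / a
a, b = 2.0, 1.0                        # the flow parameters phi (illustration)

def log_px(x):
    z = (x - b) / a                    # f_phi^{-1}(x)
    log_det = -jnp.log(jnp.abs(a))     # log |det d f_phi^{-1} / dx| for a scalar affine map
    return norm.logpdf(z) + log_det    # change-of-variables formula, in log space

print(log_px(jnp.array(3.0)))          # matches the log-density of N(mean=1, std=2) at x=3
```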

How can we find the transformation parameters $\phi$ from the data $x$ so that $p_x^{\phi}$ is as close as possible to the true distribution $p(x)$?

$\to$ we need a tool to compare distributions: the Kullback-Leibler divergence.

$$\begin{array}{ll} D_{KL}(p(x)||p_x^{\phi}(x)) &= \mathbb{E}_{p(x)}\Big[ \log\left(\frac{p(x)}{p_x^{\phi}(x)}\right) \Big] \\ &= \mathbb{E}_{p(x)}\left[ \log\left(p(x)\right) \right] - \mathbb{E}_{p(x)}\left[ \log\left(p_x^{\phi}(x)\right) \right]\\ \end{array} $$

$$ \implies \text{Loss} = -\,\mathbb{E}_{p(x)}\left[ \log p_x^{\phi}(x) \right] + \text{const.} $$

i.e., we train the flow by minimizing the negative log-likelihood.

    But to train the NF, we want to use both simulations and gradients:

    $\displaystyle \mathcal{L} = -\,\mathbb{E}\left[\log p_x^{\phi}(x)\right] \;+\; \lambda\, \mathbb{E}\left[ \left\| \nabla_{x} \log p_x(x) - \nabla_x \log p_x^{\phi}(x) \right\|_2^2 \right]$
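A schematic of this combined loss in JAX, written for the conditional estimator $q_{\phi}(\theta|x)$ used in NPE (so the score is taken with respect to $\theta$, consistent with $\nabla_{\theta} \log p(\theta|x)$ above); the toy Gaussian `log_q` is a stand-in for the normalizing flow, and the simulator scores are assumed to be precomputed:

```python
import jax
import jax.numpy as jnp

# Toy conditional density q_phi(theta | x): unit-variance Gaussian centred on a
# linear function of x (a stand-in for the normalizing flow, for illustration).
def log_q(phi, theta, x):
    mean = x @ phi["W"] + phi["b"]
    return jnp.sum(-0.5 * (theta - mean) ** 2 - 0.5 * jnp.log(2 * jnp.pi))

def combined_loss(phi, theta, x, simulator_score, lam=1.0):
    # Negative log-likelihood term: fit the simulations
    per_example = jax.vmap(log_q, in_axes=(None, 0, 0))
    nll = -jnp.mean(per_example(phi, theta, x))
    # Score-matching term: fit the gradients provided by the differentiable simulator
    model_score = jax.vmap(jax.grad(log_q, argnums=1), in_axes=(None, 0, 0))(phi, theta, x)
    score_term = jnp.mean(jnp.sum((simulator_score - model_score) ** 2, axis=-1))
    return nll + lam * score_term

phi = {"W": jnp.eye(2), "b": jnp.zeros(2)}
theta = jnp.ones((8, 2)); x = jnp.ones((8, 2)); scores = jnp.zeros((8, 2))
print(combined_loss(phi, theta, x, scores))
```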

Problem: the gradients of current NFs lack expressivity.

$\to$ we use Smooth Normalizing Flows (Köhler et al. 2021).

Experiment: Lotka Volterra

  • Draw N parameters $\alpha,\beta,\gamma,\delta \sim p(\underbrace{\alpha,\beta,\gamma,\delta}_{\theta})$

  • Run N simulations $x \sim p(\underbrace{\text{Prey}, \: \text{Predator}}_{x}| \underbrace{\alpha,\beta,\gamma,\delta}_{\theta})$

  • Compress $x \in \mathbb{R}^{20}$ into $y = r(x) \in \mathbb{R}^4$

  • Train a NF on ($y_i$,$\theta_i$)$_{i=1..N}$

  • Approximate the posterior: $p(\theta|x_0) \approx q_{\phi}(\theta|r(x_0))$
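For reference, a minimal differentiable Lotka-Volterra forward model in JAX; the Euler discretization, deterministic dynamics, time grid, and the hand-written compressor `r` are illustrative assumptions, not the experiment's exact setup:

```python
import jax
import jax.numpy as jnp

def simulate(theta, n_steps=10, dt=0.1, y0=(1.0, 0.5)):
    """Euler-discretized Lotka-Volterra; returns prey/predator series in R^20."""
    alpha, beta, gamma, delta = theta

    def step(y, _):
        prey, pred = y
        d_prey = alpha * prey - beta * prey * pred
        d_pred = delta * prey * pred - gamma * pred
        y_next = jnp.array([prey + dt * d_prey, pred + dt * d_pred])
        return y_next, y_next

    _, traj = jax.lax.scan(step, jnp.array(y0), None, length=n_steps)
    return traj.reshape(-1)                       # x in R^{2 * n_steps} = R^20

def r(x):
    """Hypothetical compressor into R^4 (the experiment learns its own r)."""
    prey, pred = x[::2], x[1::2]
    return jnp.array([prey.mean(), prey.std(), pred.mean(), pred.std()])

theta = jnp.array([1.0, 0.5, 1.0, 0.5])           # (alpha, beta, gamma, delta)
x = simulate(theta)
print(r(x))                                       # summary y = r(x) in R^4

# Because the simulator is pure JAX, gradients with respect to theta come for free:
summary_of_theta = lambda t: r(simulate(t))
print(jax.jacfwd(summary_of_theta)(theta).shape)  # (4, 4)
```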

Results

$\underbrace{p(\theta|x_0)}_{\text{posterior}} \propto \underbrace{p(x_0|\theta)}_{\text{likelihood}} \, \underbrace{p(\theta)}_{\text{prior}}$

  • We tested NPE (neural posterior estimation) with only simulations

  • Then with simulations and scores

  • We compared this to NLE (neural likelihood estimation) with only simulations

  • And to SCANDAL, which is NLE with scores (Brehmer et al. 2019)

We found that, with both our method and SCANDAL, the gradients do not help in this setting.

Prior analysis: wide prior

    The prior we used in our experiment.


    The true posterior.


    The posterior is very narrow compared to the prior.

Prior analysis: tight prior

    The prior.


    The true posterior.


    We then repeated the same experiment in this setting.





With simulations only (tight prior)

With simulations and score (tight prior)

Conclusion



  • This is the first NPE method that uses the score.

  • We used Smooth Normalizing Flows to be able to train using the score.

  • The score helps reduce the number of simulations when the prior is not too wide.

  • More and more simulators are differentiable.





  • Thank you!