Flow Matching for Scalable Simulation-Based Inference

Via flow matching, continuous normalizing flows can be trained efficiently for use in Simulation-based Inference. They yield competitive results on benchmark tasks as well as on high-dimensional problems while being more flexible than discrete flows.

Neural Posterior Estimation (NPE) has become an established tool for Likelihood-free Inference, and time-discrete Normalizing Flows have proven to be the most powerful density estimation technique for NPE. However, discrete flows struggle to scale to higher-dimensional problems due to the limiting factors of their architecture.

By defining a continuous flow rather than a discrete one, these restrictive architectural choices can be avoided. Flow matching allows continuous Normalizing Flows to be trained efficiently by defining a vector field that transports a known source distribution $p_B$ to a targeted but unknown distribution $p_D$ [Lip22F]. This allows more flexibility for complex data and higher-dimensional problems.

[Wil23F] propose the use of flow matching to train a continuous Normalizing Flow for NPE. The authors show empirically that this approach is capable of outperforming discrete flows on the benchmark tasks proposed by [Lue21B], and of tackling high-dimensional problems.

Figure 1. [Wil23F], Table 1. Comparison of key aspects of different methods for Posterior Estimation. The authors highlight that flow matching Posterior Estimation (FMPE) yields a tractable density estimation without limiting the estimator’s architecture.

In a related approach borrowing ideas from diffusion models, Neural Posterior Score Estimation (NPSE) was introduced to estimate the gradient of the log posterior distribution. However, as the authors point out in Figure 1, and in contrast to flow-based methods, NPSE does not allow evaluation of the posterior density.

Flow Matching for SBI

The goal of Simulation-based Inference (SBI) is to infer the posterior distribution $p(\theta \mid x)$ of a parameter $\theta \in \mathbb{R}^d$ given observations $x \in \mathbb{R}^n$. At the core of SBI, a universal density approximator is used to yield $q_{\phi}(\theta \mid x) \approx p(\theta \mid x)$. In contrast to discrete flows, flow matching admits flexible network architectures by learning a time-dependent vector field $v_{t,x}$, generating a probability path $p_{t,x}$ that transports a source distribution $p_B$ to a target distribution $p_D$.

In the setting of SBI, the source distribution is $p_B = q_0(\theta \mid x) = q_0(\theta)$ and the target distribution is $p_D = p_1(\theta \mid x)$. For each $t\in [0,1]$ the time-dependent vector field $v_{t,x}$ defines the velocity of the sample trajectory:

$$ \begin{align} \frac{d}{dt} \Psi_{t,x}(\theta) &= v_{t,x}(\Psi_{t,x}(\theta)) \\ \Psi_{0,x}(\theta) &= \theta \end{align} $$

The individual trajectories $\Psi_{t,x}$ are obtained by integrating the ODE above. A neural network is used to approximate the vector field $v_{t,x}$, which yields the probability path $p_{t,x}$ leading to the final density:

$$ \begin{align} q(\theta \mid x) &= (\Psi_{1,x})_{\ast} q_0(\theta) \\ &= q_0(\theta) \exp \left( - \int^{1}_{0} \operatorname{div} v_{t,x}(\Psi_{t,x}(\theta)) dt \right), \end{align} $$

where $(\cdot)_{\ast}$ denotes the push-forward operator.
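To make the role of the learned vector field concrete, the following sketch shows how a trained network could be used to draw posterior samples and evaluate their density by integrating the ODE with simple Euler steps. The interface `v_net(theta, t, x)`, the standard normal base distribution and the step count are illustrative assumptions, not the setup used in the paper.

```python
import torch

def sample_and_log_prob(v_net, x, d, n_steps=100):
    """Draw samples from q(theta | x) by Euler-integrating the learned ODE and
    accumulating the divergence term of the continuous change of variables.
    Assumes v_net(theta, t, x) returns the vector field, theta: (batch, d)."""
    batch = x.shape[0]
    base = torch.distributions.Normal(torch.zeros(d), torch.ones(d))
    theta = base.sample((batch,))                      # theta_0 ~ q_0
    log_q = base.log_prob(theta).sum(-1)               # log q_0(theta_0)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((batch, 1), k * dt)
        theta = theta.detach().requires_grad_(True)
        v = v_net(theta, t, x)                         # velocity at (theta_t, t)
        # exact divergence via autograd; fine for moderate d, use Hutchinson otherwise
        div = torch.zeros(batch)
        for i in range(d):
            div = div + torch.autograd.grad(v[:, i].sum(), theta, retain_graph=True)[0][:, i]
        log_q = log_q - div.detach() * dt              # d/dt log q_t = -div v_t
        theta = (theta + v * dt).detach()              # Euler step along the trajectory
    return theta, log_q
```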

Usually, the negative log-likelihood loss is minimized to train a density estimator to approximate the posterior distribution. For continuous flows, however, this requires many network passes to solve the ODE. With flow matching, in contrast, the loss can be defined on a sample-conditional basis. Given a sample-conditional probability path $p_t(\theta \mid \theta_1)$ and a sample-conditional vector field $u_t(\theta \mid \theta_1)$ that generates it, the flow matching loss regresses $v_{t,x}$ onto $u_t(\theta \mid \theta_1)$; its gradient coincides with that of a regression onto the intractable marginal vector field generating $p_t(\theta \mid x)$:

$$ \mathcal{L}_{FM} = \mathbb{E}_{t, x, \theta_1, \theta_t} \Vert v_{t,x}(\theta_t) - u_t(\theta_t \mid \theta_1) \Vert^2, $$

where $t\sim \mathcal{U}[0,1],~ x\sim p(x),~ \theta_1 \sim p(\theta \mid x),~ \theta_t \sim p_t(\theta_t \mid \theta_1)$.
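A minimal sketch of this sample-conditional objective, assuming a standard normal base distribution and the optimal-transport conditional path of [Lip22F], $\theta_t = t\theta_1 + (1 - (1 - \sigma_{\min})t)\theta_0$ with conditional vector field $u_t = \theta_1 - (1 - \sigma_{\min})\theta_0$; the network interface `v_net(theta_t, t, x)` is a placeholder:

```python
import torch

def flow_matching_loss(v_net, x, theta1, t=None, sigma_min=1e-4):
    """Sample-conditional flow matching loss for one batch (theta1: (batch, d)).
    Uses the optimal-transport conditional path with a standard normal base."""
    if t is None:
        t = torch.rand(theta1.shape[0], 1)             # t ~ U[0, 1]
    theta0 = torch.randn_like(theta1)                   # theta_0 ~ q_0 = N(0, I)
    theta_t = t * theta1 + (1 - (1 - sigma_min) * t) * theta0
    u_t = theta1 - (1 - sigma_min) * theta0             # conditional target vector field
    v_t = v_net(theta_t, t, x)                          # learned vector field v_{t,x}
    return ((v_t - u_t) ** 2).sum(dim=-1).mean()        # Monte Carlo estimate of L_FM
```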

To utilize flow matching within the SBI framework, the authors employ Bayes' rule to change the expectation in the aforementioned loss from $\mathbb{E}_{p(x)p(\theta \mid x)}$ to $\mathbb{E}_{p(\theta)p(x \mid \theta)}$, eliminating expressions that are intractable in the SBI setting. Furthermore, the authors adapt the distribution of the time parameter $t$ in order to focus on larger values of $t$, i.e. closer to the target distribution:

$$ \mathcal{L}_{FMPE} = \mathbb{E}_{t, \theta_1, x, \theta_t} \Vert v_{t,x}(\theta_t) - u_t(\theta_t \mid \theta_1) \Vert^2, $$

where $t\sim p(t),~ \theta_1\sim p(\theta),~ x \sim p(x \mid \theta_1),~ \theta_t \sim p_t(\theta_t \mid \theta_1)$. Note that the distributions from which $x$ and $\theta_1$ are drawn differ from those in the flow matching loss above.
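In practice, the FMPE loss only changes how the training tuples are drawn: parameters come from the prior, data from the simulator, and $t$ from a distribution that favors larger values. The sketch below reuses `flow_matching_loss` from above; `prior.sample` and `simulator` are placeholders, and the power-law choice $p(t) \propto t^{\alpha}$ (sampled as $u^{1/(1+\alpha)}$) is only an illustrative stand-in for the time distribution used in the paper.

```python
import torch

def fmpe_step(v_net, prior, simulator, optimizer, batch_size=256, alpha=1.0):
    """One FMPE training step: theta_1 ~ p(theta), x ~ p(x | theta_1)."""
    theta1 = prior.sample((batch_size,))                # parameters from the prior
    x = simulator(theta1)                               # simulated observations
    # emphasize larger t: u^(1/(1+alpha)) has density (1+alpha) t^alpha on [0, 1]
    t = torch.rand(batch_size, 1) ** (1.0 / (1.0 + alpha))
    loss = flow_matching_loss(v_net, x, theta1, t=t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```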

By leveraging flow matching to train continuous normalizing flows, density estimators can be trained more efficiently and with more flexibility. To showcase this, [Wil23F] compare the architectures and trajectories of discrete and continuous flows. While the former is constrained in its architecture, the latter can put more emphasis on the observations $x$, on which the target distribution is conditioned. This is important as the dimensionality of $x$ is usually high and the data is complex. Furthermore, the proposed use of optimal transport probability paths produces simpler, straighter trajectories from $p_B$ to $p_D$.

Figure 2. [Wil23F], Figure 1. Visual comparison of time discrete and continuous flows and the respective flow trajectories. The inset on the lower right shows the base distribution $p_B$. The resulting $p_D$ is shown in the respective rows. The colors of the points indicate the efficient transportation of probability mass when using optimal transport flow matching.
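To make the architectural freedom discussed above concrete, a conditional vector-field network can be an arbitrary feed-forward model in which most of the capacity sits in an embedding network for $x$. This is only a schematic sketch, not the architecture used in the paper:

```python
import torch
import torch.nn as nn

class VectorFieldNet(nn.Module):
    """Unconstrained conditional vector field v_{t,x}(theta): unlike coupling-based
    discrete flows, no invertibility or Jacobian structure is imposed."""
    def __init__(self, theta_dim, x_dim, embed_dim=128, hidden=256):
        super().__init__()
        self.embed_x = nn.Sequential(                   # embedding network for the data x
            nn.Linear(x_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )
        self.head = nn.Sequential(                      # maps (theta_t, t, embed(x)) to v
            nn.Linear(theta_dim + 1 + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, theta_dim),
        )

    def forward(self, theta, t, x):
        # theta: (batch, theta_dim), t: (batch, 1), x: (batch, x_dim)
        return self.head(torch.cat([theta, t, self.embed_x(x)], dim=-1))
```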

In addition to the above, the authors show that the proposed flow matching formulation for SBI yields a mass-covering behavior. This means that the resulting density estimator $q_{\phi}(\theta \mid x)$ assigns non-zero probability to the full support of $p(\theta \mid x)$. This feature is desirable as it does not exclude parameter values with low probability. Such parameters are unlikely but still possible and should therefore be considered by the posterior.

The authors further show that the mean squared error between the target vector field $u$ and the learned vector field $v$ bounds the Kullback-Leibler divergence between the true and the estimated posterior, up to a constant $C > 0$:

$$ D_{\text{KL}}(p(\theta \mid x) \Vert q(\theta \mid x)) \leq C ~ \text{MSE}_p(u,v)^{\frac{1}{2}} $$

Empirical Results

In order to underline the practicality of the proposed method, the authors first demonstrate the performance of flow matching on common SBI benchmarking tasks. In a second step, the authors show that flow matching is capable of tackling high-dimensional problems by comparing FMPE to a specialized NPE method for gravitational-wave inference.

Benchmarking Results

In accordance with the benchmarks described in [Lue21B], the performance of FMPE is compared against standard NPE methods. The simulation budgets were chosen to be in $\{ 10^3,~10^4,~10^5\}$. For each benchmark task and simulation budget, the methods are compared via the Classifier Two-Sample Test (C2ST) and the Maximum Mean Discrepancy (MMD). C2ST trains a classifier to distinguish samples of the estimated posterior from reference posterior samples; an accuracy of 0.5 means the two are indistinguishable (a minimal sketch is given below Figure 3). The results are shown in the accompanying Figure 3.

Figure 3. [Wil23F], Figure 4. Comparison of the proposed FMPE method to standard NPE methods on the SBI benchmarking tasks. The results are shown for different simulation budgets.
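As a rough illustration of the C2ST metric mentioned above, here is a minimal cross-validated version using scikit-learn; this is not the benchmark suite's exact implementation, and the classifier choice is an assumption.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def c2st(samples_q, samples_p, folds=5, seed=0):
    """Classifier Two-Sample Test: cross-validated accuracy of a classifier that
    separates estimated-posterior samples from reference-posterior samples."""
    X = np.concatenate([samples_q, samples_p], axis=0)
    y = np.concatenate([np.zeros(len(samples_q)), np.ones(len(samples_p))])
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000, random_state=seed)
    return cross_val_score(clf, X, y, cv=folds, scoring="accuracy").mean()
```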

High-Dimensional Results

In order to show the performance of FMPE on high-dimensional problems, FMPE is used for gravitational-wave inference where $\theta \in \mathbb{R}^{15}$ and $x \in \mathbb{R}^{15744}$. The authors compare the performance of FMPE to a baseline NPE method. The baseline method consists of a feature reduction model mapping $x$ into a lower dimensional space $\tilde{x} \in \mathbb{R}^{128}$. A neural spline flow is then used to estimate the posterior distribution $p(\theta \mid \tilde{x})$. Both methods are trained with a simulation budget of $5 \cdot 10^6$ samples over 400 epochs. Whilst having more trainable parameters, FMPE trains faster than the baseline NPE using a neural spline flow. Both methods are compared based on the Jensen-Shannon divergence (JSD). Furthermore, a third method (GNPE) is added to the comparison, which integrates known conditional symmetries to simplify the data. The results are shown in the accompanying Figure 4.
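The JSD here compares one-dimensional posterior marginals of the different methods. A simple histogram-based estimate is sketched below, purely to illustrate the metric; the grid, bin count, and use of SciPy are assumptions, not the authors' evaluation code.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def marginal_jsd(samples_a, samples_b, bins=100):
    """Jensen-Shannon divergence (in nats) per parameter dimension, estimated
    from histograms of the two sample sets on a common grid."""
    jsd = []
    for i in range(samples_a.shape[1]):
        lo = min(samples_a[:, i].min(), samples_b[:, i].min())
        hi = max(samples_a[:, i].max(), samples_b[:, i].max())
        p, _ = np.histogram(samples_a[:, i], bins=bins, range=(lo, hi), density=True)
        q, _ = np.histogram(samples_b[:, i], bins=bins, range=(lo, hi), density=True)
        # jensenshannon returns the JS distance, i.e. the square root of the divergence
        jsd.append(jensenshannon(p, q, base=np.e) ** 2)
    return np.array(jsd)
```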

FMPE outperforms the baseline NPE method in all dimensions. Furthermore, it outperforms GNPE, a method specifically designed for this task, in terms of JSD for some of the parameter dimensions.

Figure 4. [Wil23F], Figure 5. Performance of the FMPE method compared to a baseline NPE and the GNPE method, based on JSD. On the left side, the results of GNPE are omitted due to different data settings.

FMPE demonstrates the usability of continuous normalizing flows for SBI by yielding competitive results. Thanks to flow matching, such density estimators can be trained efficiently, without relying on the restricted architectures of their discrete counterparts.

References
