# Mathematical Background

## Triangular Transport Maps

Let $$\pi$$ and $$\eta$$ be two densities on $$\mathbb{R}^d$$. In measure transport, our goal is to find a multivariate transformation $$T$$ that pushes forward $$\eta$$ to $$\pi$$, meaning that if $$\mathbf{X} \sim \eta$$, then $$T(\mathbf{X}) \sim \pi$$. Given such a map, we can easily generate samples from $$\pi$$ by pushing samples $$\mathbf{x}^i \sim \eta$$ through the map: $$T(\mathbf{x}^i) \sim \pi$$. Furthermore, for a diffeomorphic map we can express the push-forward density as $$T_{\sharp}\eta(\mathbf{x}) := \eta(T^{-1}(\mathbf{x}))\,|\det \nabla T^{-1}(\mathbf{x})|$$.

While there are infinitely many transformations that couple densities, if $$\pi$$ is absolutely continuous with respect to $$\eta$$, there exists a unique lower triangular and monotone function $$T\colon \mathbb{R}^d \rightarrow \mathbb{R}^d$$ that pushes forward $$\eta$$ to $$\pi$$ of the form

$\begin{split}T(\mathbf{x}) = \begin{bmatrix} T_1(x_1) \\ T_2(x_1,x_2) \\ \vdots \\ T_d(x_1,\dots,x_d) \end{bmatrix}.\end{split}$
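As a concrete sketch of this structure, the snippet below builds a lower triangular map that pushes a standard normal reference forward to a correlated bivariate Gaussian. The target correlation `rho` and the specific linear form of `T` are illustrative choices, not part of any particular library.

```python
import numpy as np

# Illustrative lower-triangular map: T1 depends only on x1, T2 on (x1, x2).
rho = 0.8  # assumed target correlation

def T(x):
    x1, x2 = x[..., 0], x[..., 1]
    y1 = x1                                   # T1(x1)
    y2 = rho * x1 + np.sqrt(1 - rho**2) * x2  # T2(x1, x2)
    return np.stack([y1, y2], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((200_000, 2))  # samples from eta = N(0, I)
y = T(x)                               # pushed-forward samples, y ~ pi

# Empirical covariance approaches [[1, rho], [rho, 1]]
print(np.cov(y.T))
```

Pushing cheap reference samples through the map is exactly how samples of $$\pi$$ are generated in practice.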

## Monotone Parameterizations

We represent monotone functions as the smooth transformation of an unconstrained function $$f\colon\mathbb{R}^{d} \rightarrow \mathbb{R}$$. Let $$g\colon\mathbb{R}\rightarrow \mathbb{R}_{>0}$$ be a strictly positive function, such as the softplus $$g(x) = \log(1 + \exp(x))$$. Then, the $$d$$-th map component $$T_{d}$$ is given by

(1) $T_d(\mathbf{x}_{1:d}; \mathbf{w}) = f(x_1,\ldots, x_{d-1},0; \mathbf{w}) + \int_0^{x_d} g( \partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) dt.$

Other choices for $$g$$ include the squared and exponential functions. These choices, however, have implications for the identifiability of the coefficients. If $$g$$ is bijective, then we can recover $$f$$ from $$T_d$$ as

$f(\mathbf{x}_{1:d}; \mathbf{w}) = T_d(x_1,\ldots, x_{d-1},0; \mathbf{w}) + \int_0^{x_d} g^{-1}( \partial_d T_d(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) dt.$

Using the representation for monotone functions with a bijective $$g$$, we can approximate $$T_d$$ by finding $$f$$.
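The snippet below is a minimal sketch of the parameterization in (1), assuming a simple bilinear $$f$$ and the softplus rectifier; the integral is approximated with a dense midpoint rule purely for illustration. Note that the resulting map component is increasing in $$x_2$$ even though $$f$$ itself is not.

```python
import numpy as np

def g(z):
    return np.log1p(np.exp(z))  # softplus: strictly positive

def f(x1, x2, w):
    # Illustrative unconstrained f, bilinear in its inputs
    return w[0] + w[1] * x1 + w[2] * x2 + w[3] * x1 * x2

def df_dx2(x1, t, w):
    return w[2] + w[3] * x1   # d_2 f for the bilinear f above

def T2(x1, x2, w, n=500):
    # T2 = f(x1, 0) + int_0^{x2} g(d_2 f(x1, t)) dt  (midpoint rule)
    t = (np.arange(n) + 0.5) / n * x2
    return f(x1, 0.0, w) + (x2 / n) * np.sum(g(df_dx2(x1, t, w)))

w = np.array([0.1, -0.4, -2.0, 0.3])  # w[2] < 0: f is NOT monotone in x2
vals = [T2(1.0, x2, w) for x2 in np.linspace(-2, 2, 9)]
print(np.all(np.diff(vals) > 0))  # T2 is still increasing in x2
```

Because the derivative of $$T_2$$ in $$x_2$$ is $$g(\partial_2 f) > 0$$ by construction, monotonicity holds for any coefficients $$\mathbf{w}$$.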

## Tensor Product Expansions

For a point $$\mathbf{x}\in\mathbb{R}^d$$ and coefficients $$\mathbf{w}$$, we consider expansions of the form

$f(\mathbf{x}; \mathbf{w}) = \sum_{\alpha\in \mathcal{A}} w_\alpha \Phi_\alpha(\mathbf{x}),$

where $$\alpha\in\mathbb{N}^d$$ is a multi-index, $$\mathcal{A}$$ is a multi-index set, and $$\Phi_{\mathbf{\alpha}}$$ is a multivariate function defined as a tensor product of one-dimensional functions $$\phi_{\alpha_i}\colon \mathbb{R}\rightarrow \mathbb{R}$$ through

$\Phi_\mathbf{\alpha}(\mathbf{x}) = \prod_{i=1}^{d} \phi_{\alpha_i}(x_i).$
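A minimal sketch of such an expansion, assuming monomial basis functions $$\phi_k(x) = x^k$$; the multi-index set `A` and weights `w` below are arbitrary illustrative choices.

```python
import numpy as np

def phi(k, x):
    return x**k  # assumed 1D basis: monomials

def Phi(alpha, x):
    # Tensor product over the components of the multi-index
    return np.prod([phi(k, xi) for k, xi in zip(alpha, x)])

def f(x, A, w):
    return sum(w_a * Phi(alpha, x) for alpha, w_a in zip(A, w))

A = [(0, 0), (1, 0), (0, 1), (1, 1)]  # illustrative multi-index set
w = [1.0, 2.0, -1.0, 0.5]             # coefficients w_alpha
x = np.array([2.0, 3.0])
print(f(x, A, w))  # 1 + 2*2 - 1*3 + 0.5*2*3 = 5.0
```

In practice, orthogonal polynomial families (e.g., Hermite functions) are common choices for the one-dimensional $$\phi_{\alpha_i}$$.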

## Numerical Integration

Computationally, we approximate the integral in the definition of $$T_d(\mathbf{x}_{1:d}; \mathbf{w})$$ using a quadrature rule with $$N$$ points $$\{t^{(1)}, \ldots, t^{(N)}\}$$ and corresponding weights $$\{c^{(1)}, \ldots, c^{(N)}\}$$ designed to approximate integrals over $$[0,1]$$. Note that these points and weights will often be chosen adaptively. The quadrature rule yields an approximation of the map component in (1) with the form

(2) $\tilde{T}_d(\mathbf{x}_{1:d}; \mathbf{w}) = f(x_1,\ldots, x_{d-1},0; \mathbf{w}) + x_d \sum_{i=1}^N c^{(i)} g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ),$

where the $$x_d$$ term outside the summation comes from a change of integration domains from $$[0,1]$$ to $$[0,x_d]$$.
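The quadrature approximation (2) can be sketched as follows, using a Gauss-Legendre rule rescaled from $$[-1,1]$$ to $$[0,1]$$. The choice $$f(x_1, x_2) = x_1 + x_2^3$$ and the softplus $$g$$ are illustrative stand-ins.

```python
import numpy as np

def g(z):
    return np.log1p(np.exp(z))  # softplus

def df_dx2(x1, t):
    return 3.0 * t**2           # d_2 f for f(x1, x2) = x1 + x2**3

def T2_quad(x1, x2, n_quad):
    p, c = np.polynomial.legendre.leggauss(n_quad)
    t, c = (p + 1) / 2, c / 2   # rescale nodes/weights to [0, 1]
    # Equation (2): T2 ~= f(x1, 0) + x2 * sum_i c_i g(d_2 f(x1, x2 * t_i))
    return x1 + x2 * np.sum(c * g(df_dx2(x1, x2 * t)))

coarse = T2_quad(0.5, 1.2, 5)
fine = T2_quad(0.5, 1.2, 200)
print(abs(coarse - fine))  # small: the rule converges quickly here
```

For smooth integrands like this one, even a handful of Gauss points gives high accuracy, which is why adaptive rules often need few nodes.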

## Diagonal Derivatives

We will often require derivatives of $$T_d$$ with respect to an input $$x_i$$ or the parameters $$\mathbf{w}$$. When computing these derivatives, however, we have a choice of whether to differentiate the continuous map form in (1) or the discretized map in (2). This is similar to the "discretize-then-optimize" or "optimize-then-discretize" choice in PDE-constrained optimization. When the quadrature rule is accurate, there might not be a large practical difference between these approaches. For approximate rules, however, using the continuous derivative may cause issues during optimization because the derivative will not be consistent with the discretized map: a finite difference approximation of the discretized map will not converge to the continuous derivative. In these cases, it is preferable to differentiate the discrete map in (2).

The derivative $$\partial T_d / \partial x_d$$ is particularly important when using the monotone function $$T_d$$ in a measure transformation. The continuous version of this derivative is simply

(3) $\frac{\partial T_d}{\partial x_d}(\mathbf{x}_{1:d}; \mathbf{w}) = g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ).$

The discrete derivative, on the other hand, is more complicated:

(4) $\begin{split}\frac{\partial \tilde{T}_d}{\partial x_d}(\mathbf{x}_{1:d}; \mathbf{w}) &= \frac{\partial}{\partial x_d} \left[x_d \sum_{i=1}^N c^{(i)} g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) )\right]\\ & = \sum_{i=1}^N c^{(i)} g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \\ &+ x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) .\end{split}$
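The consistency point above can be checked numerically: the discrete derivative (4) matches a finite difference of the discrete map (2) even for a deliberately coarse rule. The choices $$f(x_1, x_2) = x_1 + x_2^3$$ and softplus $$g$$ are illustrative.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # g' for softplus
g = lambda z: np.log1p(np.exp(z))             # softplus
df = lambda t: 3.0 * t**2                     # d_2 f
d2f = lambda t: 6.0 * t                       # d^2_22 f

p, c = np.polynomial.legendre.leggauss(3)     # deliberately coarse rule
t, c = (p + 1) / 2, c / 2                     # rescaled to [0, 1]

def T2_tilde(x1, x2):
    # Equation (2)
    return x1 + x2 * np.sum(c * g(df(x2 * t)))

def dT2_tilde(x1, x2):
    # Equation (4): product rule applied to the quadrature sum
    return (np.sum(c * g(df(x2 * t)))
            + x2 * np.sum(c * t * sigmoid(df(x2 * t)) * d2f(x2 * t)))

x1, x2, h = 0.5, 1.2, 1e-6
fd = (T2_tilde(x1, x2 + h) - T2_tilde(x1, x2 - h)) / (2 * h)
print(abs(fd - dT2_tilde(x1, x2)))  # agrees to finite-difference accuracy
```

With only three quadrature points, (2) is not an accurate map, yet its finite differences agree with (4), not with the continuous derivative (3); this is the discretize-then-differentiate consistency the text describes.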

## Coefficient Derivatives

In addition to computing $$\partial T_d/\partial x_d$$, we will also need the gradient of the monotone function $$T_d$$ with respect to the parameters $$\mathbf{w}$$, denoted by $$\nabla_{\mathbf{w}}T_d$$.

(5) $\begin{split}\nabla_{\mathbf{w}} T_d(\mathbf{x}_{1:d}; \mathbf{w}) &= \nabla_{\mathbf{w}} f(x_1,\ldots, x_{d-1},0; \mathbf{w})\\ &+ \int_0^{x_d} \partial g( \partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w})\right] dt \\ &\approx \nabla_{\mathbf{w}} f(x_1,\ldots, x_{d-1},0; \mathbf{w})\\ & + x_d \sum_{i=1}^N c^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})\right]\end{split}$
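A minimal sketch of the discretized coefficient gradient in (5), checked against finite differences in $$\mathbf{w}$$. The linear-in-$$\mathbf{w}$$ form $$f = w_0 + w_1 x_2 + w_2 x_1 x_2$$ is an illustrative assumption.

```python
import numpy as np

g = lambda z: np.log1p(np.exp(z))        # softplus
dg = lambda z: 1.0 / (1.0 + np.exp(-z))  # g' for softplus

p, c = np.polynomial.legendre.leggauss(10)
t, c = (p + 1) / 2, c / 2                # nodes/weights on [0, 1]

def df_dx2(x1, s, w):
    return w[1] + w[2] * x1              # d_2 f (constant in x2 for this f)

def T2_tilde(x1, x2, w):
    return w[0] + x2 * np.sum(c * g(df_dx2(x1, x2 * t, w)))

def grad_w_T2(x1, x2, w):
    # Equation (5), discretized: grad_w f(x1, 0) + x2 sum_i c_i g'(.) grad_w d_2 f
    gw_f0 = np.array([1.0, 0.0, 0.0])    # grad_w f(x1, 0; w)
    gw_df = np.array([0.0, 1.0, x1])     # grad_w d_2 f (constant in t)
    s = np.sum(c * dg(df_dx2(x1, x2 * t, w)))
    return gw_f0 + x2 * s * gw_df

w, x1, x2, h = np.array([0.3, -0.7, 1.1]), 0.5, 1.2, 1e-6
fd = np.array([(T2_tilde(x1, x2, w + h * e) - T2_tilde(x1, x2, w - h * e)) / (2 * h)
               for e in np.eye(3)])
print(np.max(np.abs(fd - grad_w_T2(x1, x2, w))))  # small
```

When $$f$$ is linear in $$\mathbf{w}$$, as in typical basis expansions, $$\nabla_{\mathbf{w}}\partial_d f$$ does not depend on $$\mathbf{w}$$, which keeps this gradient cheap to evaluate.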

It is also possible to compute the gradient of the diagonal derivative $$\nabla_{\mathbf{w}}\left( \partial T_d/\partial x_d\right)$$ with respect to the parameters, but like before, there is a question of whether the derivative of the exact map or the derivative of the quadrature-based approximate map should be used. In the case of the exact map, the mixed coefficient gradient has the simple form

$\begin{split}\nabla_{\mathbf{w}}\left[ \frac{\partial T_d}{\partial x_d}\right] & = \nabla_{\mathbf{w}}\left[ g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ) \right] \\ & = \partial g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ) \nabla_{\mathbf{w}}\left[\partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\right].\end{split}$

The gradient of the discrete derivative is more involved and takes the form

$\begin{split}\nabla_{\mathbf{w}}\left[ \frac{\partial \tilde{T}_d}{\partial x_d}\right] &= \sum_{i=1}^N c^{(i)} \nabla_{\mathbf{w}}\left[g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \right] \\ & + x_d \sum_{i=1}^N c^{(i)} t^{(i)} \nabla_{\mathbf{w}}\left[\partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\ &= \sum_{i=1}^N c^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})) \nabla_{\mathbf{w}}\left[ \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\ &+ x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial^2 g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \nabla_{\mathbf{w}}\left[ \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\ & + x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})\right].\end{split}$
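This three-term expression can also be verified numerically against finite differences of the discrete derivative (4). The quadratic form $$f = w_0 + w_1 x_2 + w_2 x_2^2$$ (so that $$\partial^2_{22} f \neq 0$$) and the softplus $$g$$ are illustrative choices.

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
g = lambda z: np.log1p(np.exp(z))           # softplus
dg = sig                                    # g'
d2g = lambda z: sig(z) * (1.0 - sig(z))     # g''

p, c = np.polynomial.legendre.leggauss(10)
t, c = (p + 1) / 2, c / 2                   # nodes/weights on [0, 1]

def dT2_tilde(x2, w):
    # Equation (4) with d_2 f = w1 + 2*w2*s and d^2_22 f = 2*w2
    z = w[1] + 2.0 * w[2] * (x2 * t)
    return np.sum(c * g(z)) + x2 * np.sum(c * t * dg(z)) * (2.0 * w[2])

def grad_w_dT2(x2, w):
    # The three sums above: g' grad d_2 f, g'' d^2_22 f grad d_2 f, g' grad d^2_22 f
    z = w[1] + 2.0 * w[2] * (x2 * t)
    gz = np.stack([np.zeros_like(t), np.ones_like(t), 2.0 * x2 * t])  # grad_w d_2 f
    g2 = np.array([0.0, 0.0, 2.0])                                    # grad_w d^2_22 f
    term1 = (c * dg(z) * gz).sum(axis=1)
    term2 = x2 * (c * t * d2g(z) * (2.0 * w[2]) * gz).sum(axis=1)
    term3 = x2 * np.sum(c * t * dg(z)) * g2
    return term1 + term2 + term3

w, x2, h = np.array([0.3, -0.7, 0.4]), 1.2, 1e-6
fd = np.array([(dT2_tilde(x2, w + h * e) - dT2_tilde(x2, w - h * e)) / (2 * h)
               for e in np.eye(3)])
print(np.max(np.abs(fd - grad_w_dT2(x2, w))))  # small
```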