# Mathematical Background

## Triangular Transport Maps

Let $$\pi$$ and $$\eta$$ be two densities on $$\mathbb{R}^d$$. In measure transport, our goal is to find a multivariate transformation $$T$$ that pushes forward $$\eta$$ to $$\pi$$, meaning that if $$\mathbf{X} \sim \eta$$, then $$T(\mathbf{X}) \sim \pi$$. Given such a map, we can easily generate samples from $$\pi$$ by pushing samples $$\mathbf{x}^i \sim \eta$$ through the map: $$T(\mathbf{x}^i) \sim \pi$$. Furthermore, for a diffeomorphic map we can express the push-forward density as $$T_{\sharp}\eta(\mathbf{x}) := \eta(T^{-1}(\mathbf{x}))\,|\det \nabla T^{-1}(\mathbf{x})|$$.

While there are infinitely many transformations that couple densities, if $$\pi$$ is absolutely continuous with respect to $$\eta$$, there exists a unique lower triangular and monotone function $$T\colon \mathbb{R}^d \rightarrow \mathbb{R}^d$$ that pushes forward $$\eta$$ to $$\pi$$ of the form

$\begin{split}T(\mathbf{x}) = \begin{bmatrix} T_1(x_1) \\ T_2(x_1,x_2) \\ \vdots \\ T_d(x_1,\dots,x_d) \end{bmatrix}.\end{split}$
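As a concrete sketch of this structure, the snippet below builds a lower triangular map that pushes a standard normal reference forward to a correlated bivariate Gaussian. The target correlation `rho` and the specific linear form of `T` are illustrative choices, not part of any particular library.

```python
import numpy as np

# Illustrative lower-triangular map: T1 depends only on x1, T2 on (x1, x2).
rho = 0.8  # assumed target correlation

def T(x):
    x1, x2 = x[..., 0], x[..., 1]
    y1 = x1                                   # T1(x1)
    y2 = rho * x1 + np.sqrt(1 - rho**2) * x2  # T2(x1, x2)
    return np.stack([y1, y2], axis=-1)

rng = np.random.default_rng(0)
x = rng.standard_normal((200_000, 2))  # samples from eta = N(0, I)
y = T(x)                               # pushed-forward samples, y ~ pi

# Empirical covariance approaches [[1, rho], [rho, 1]]
print(np.cov(y.T))
```

Pushing cheap reference samples through the map is exactly how samples of $$\pi$$ are generated in practice.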

## Monotone Parameterizations

We represent monotone functions as the smooth transformation of an unconstrained function $$f\colon\mathbb{R}^{d} \rightarrow \mathbb{R}$$. Let $$g\colon\mathbb{R}\rightarrow \mathbb{R}_{>0}$$ be a strictly positive function, such as the softplus $$g(x) = \log(1 + \exp(x))$$. Then, the $$d$$-th map component $$T_{d}$$ is given by

(1) $T_d(\mathbf{x}_{1:d}; \mathbf{w}) = f(x_1,\ldots, x_{d-1},0; \mathbf{w}) + \int_0^{x_d} g( \partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) dt.$

Other choices for $$g$$ include the squared and exponential functions. These choices, however, have implications for the identifiability of the coefficients. If $$g$$ is bijective, then we can recover $$f$$ from $$T_d$$ as

$f(\mathbf{x}_{1:d}; \mathbf{w}) = T_d(x_1,\ldots, x_{d-1},0; \mathbf{w}) + \int_0^{x_d} g^{-1}( \partial_d T_d(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) dt.$

Using the representation for monotone functions with a bijective $$g$$, we can approximate $$T_d$$ by finding $$f$$.
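The snippet below is a minimal sketch of the parameterization in (1), assuming a simple bilinear $$f$$ and the softplus rectifier; the integral is approximated with a dense midpoint rule purely for illustration. Note that the resulting map component is increasing in $$x_2$$ even though $$f$$ itself is not.

```python
import numpy as np

def g(z):
    return np.log1p(np.exp(z))  # softplus: strictly positive

def f(x1, x2, w):
    # Illustrative unconstrained f, bilinear in its inputs
    return w[0] + w[1] * x1 + w[2] * x2 + w[3] * x1 * x2

def df_dx2(x1, t, w):
    return w[2] + w[3] * x1   # d_2 f for the bilinear f above

def T2(x1, x2, w, n=500):
    # T2 = f(x1, 0) + int_0^{x2} g(d_2 f(x1, t)) dt  (midpoint rule)
    t = (np.arange(n) + 0.5) / n * x2
    return f(x1, 0.0, w) + (x2 / n) * np.sum(g(df_dx2(x1, t, w)))

w = np.array([0.1, -0.4, -2.0, 0.3])  # w[2] < 0: f is NOT monotone in x2
vals = [T2(1.0, x2, w) for x2 in np.linspace(-2, 2, 9)]
print(np.all(np.diff(vals) > 0))  # T2 is still increasing in x2
```

Because the derivative of $$T_2$$ in $$x_2$$ is $$g(\partial_2 f) > 0$$ by construction, monotonicity holds for any coefficients $$\mathbf{w}$$.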

## Tensor Product Expansions

For a point $$\mathbf{x}\in\mathbb{R}^d$$ and coefficients $$\mathbf{w}$$, we consider expansions of the form

$f(\mathbf{x}; \mathbf{w}) = \sum_{\alpha\in \mathcal{A}} w_\alpha \Phi_\alpha(\mathbf{x}),$

where $$\alpha\in\mathbb{N}^d$$ is a multi-index, $$\mathcal{A}$$ is a multi-index set, and $$\Phi_{\mathbf{\alpha}}$$ is a multivariate function defined as a tensor product of one-dimensional functions $$\phi_{\alpha_i}\colon \mathbb{R}\rightarrow \mathbb{R}$$ through

$\Phi_\mathbf{\alpha}(\mathbf{x}) = \prod_{i=1}^{d} \phi_{\alpha_i}(x_i).$
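A minimal sketch of such an expansion, assuming monomial basis functions $$\phi_k(x) = x^k$$; the multi-index set `A` and weights `w` below are arbitrary illustrative choices.

```python
import numpy as np

def phi(k, x):
    return x**k  # assumed 1D basis: monomials

def Phi(alpha, x):
    # Tensor product over the components of the multi-index
    return np.prod([phi(k, xi) for k, xi in zip(alpha, x)])

def f(x, A, w):
    return sum(w_a * Phi(alpha, x) for alpha, w_a in zip(A, w))

A = [(0, 0), (1, 0), (0, 1), (1, 1)]  # illustrative multi-index set
w = [1.0, 2.0, -1.0, 0.5]             # coefficients w_alpha
x = np.array([2.0, 3.0])
print(f(x, A, w))  # 1 + 2*2 - 1*3 + 0.5*2*3 = 5.0
```

In practice, orthogonal polynomial families (e.g., Hermite functions) are common choices for the one-dimensional $$\phi_{\alpha_i}$$.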

## Numerical Integration

Computationally, we approximate the integral in the definition of $$T_d(\mathbf{x}_{1:d}; \mathbf{w})$$ using a quadrature rule with $$N$$ points $$\{t^{(1)}, \ldots, t^{(N)}\}$$ and corresponding weights $$\{c^{(1)}, \ldots, c^{(N)}\}$$ designed to approximate integrals over $$[0,1]$$. Note that these points and weights will often be chosen adaptively. The quadrature rule yields an approximation of the map component in (1) with the form

(2) $\tilde{T}_d(\mathbf{x}_{1:d}; \mathbf{w}) = f(x_1,\ldots, x_{d-1},0; \mathbf{w}) + x_d \sum_{i=1}^N c^{(i)} g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ),$

where the $$x_d$$ term outside the summation comes from a change of integration domains from $$[0,1]$$ to $$[0,x_d]$$.
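The quadrature approximation (2) can be sketched as follows, using a Gauss-Legendre rule rescaled from $$[-1,1]$$ to $$[0,1]$$. The choice $$f(x_1, x_2) = x_1 + x_2^3$$ and the softplus $$g$$ are illustrative stand-ins.

```python
import numpy as np

def g(z):
    return np.log1p(np.exp(z))  # softplus

def df_dx2(x1, t):
    return 3.0 * t**2           # d_2 f for f(x1, x2) = x1 + x2**3

def T2_quad(x1, x2, n_quad):
    p, c = np.polynomial.legendre.leggauss(n_quad)
    t, c = (p + 1) / 2, c / 2   # rescale nodes/weights to [0, 1]
    # Equation (2): T2 ~= f(x1, 0) + x2 * sum_i c_i g(d_2 f(x1, x2 * t_i))
    return x1 + x2 * np.sum(c * g(df_dx2(x1, x2 * t)))

coarse = T2_quad(0.5, 1.2, 5)
fine = T2_quad(0.5, 1.2, 200)
print(abs(coarse - fine))  # small: the rule converges quickly here
```

For smooth integrands like this one, even a handful of Gauss points gives high accuracy, which is why adaptive rules often need few nodes.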

## Diagonal Derivatives

We will often require derivatives of $$T_d$$ with respect to an input $$x_i$$ or the parameters $$\mathbf{w}$$. When computing these derivatives, however, we have a choice of whether to differentiate the continuous map form in (1) or the discretized map in (2). This is similar to the "discretize-then-optimize" or "optimize-then-discretize" choice in PDE-constrained optimization. When the quadrature rule is accurate, there might not be a large practical difference between these approaches. For approximate rules, however, using the continuous derivative may cause issues during optimization because the derivative will not be consistent with the discretized map: a finite difference approximation of the discretized map will not converge to the continuous derivative. In these cases, it is preferable to differentiate the discrete map in (2).

The derivative $$\partial T_d / \partial x_d$$ is particularly important when using the monotone function $$T_d$$ in a measure transformation. The continuous version of this derivative is simply

(3) $\frac{\partial T_d}{\partial x_d}(\mathbf{x}_{1:d}; \mathbf{w}) = g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ).$

The discrete derivative, on the other hand, is more complicated:

(4) $\begin{split}\frac{\partial \tilde{T}_d}{\partial x_d}(\mathbf{x}_{1:d}; \mathbf{w}) &= \frac{\partial}{\partial x_d} \left[x_d \sum_{i=1}^N c^{(i)} g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) )\right]\\ & = \sum_{i=1}^N c^{(i)} g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \\ &+ x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) .\end{split}$
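The consistency point above can be checked numerically: the discrete derivative (4) matches a finite difference of the discrete map (2) even for a deliberately coarse rule. The choices $$f(x_1, x_2) = x_1 + x_2^3$$ and softplus $$g$$ are illustrative.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))  # g' for softplus
g = lambda z: np.log1p(np.exp(z))             # softplus
df = lambda t: 3.0 * t**2                     # d_2 f
d2f = lambda t: 6.0 * t                       # d^2_22 f

p, c = np.polynomial.legendre.leggauss(3)     # deliberately coarse rule
t, c = (p + 1) / 2, c / 2                     # rescaled to [0, 1]

def T2_tilde(x1, x2):
    # Equation (2)
    return x1 + x2 * np.sum(c * g(df(x2 * t)))

def dT2_tilde(x1, x2):
    # Equation (4): product rule applied to the quadrature sum
    return (np.sum(c * g(df(x2 * t)))
            + x2 * np.sum(c * t * sigmoid(df(x2 * t)) * d2f(x2 * t)))

x1, x2, h = 0.5, 1.2, 1e-6
fd = (T2_tilde(x1, x2 + h) - T2_tilde(x1, x2 - h)) / (2 * h)
print(abs(fd - dT2_tilde(x1, x2)))  # agrees to finite-difference accuracy
```

With only three quadrature points, (2) is not an accurate map, yet its finite differences agree with (4), not with the continuous derivative (3); this is the discretize-then-differentiate consistency the text describes.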

## Coefficient Derivatives

In addition to computing $$\partial T_d/\partial x_d$$, we will also need the gradient of the monotone function $$T_d$$ with respect to the parameters $$\mathbf{w}$$, denoted by $$\nabla_{\mathbf{w}}T_d$$.

(5) $\begin{split}\nabla_{\mathbf{w}} T_d(\mathbf{x}_{1:d}; \mathbf{w}) &= \nabla_{\mathbf{w}} f(x_1,\ldots, x_{d-1},0; \mathbf{w})\\ &+ \int_0^{x_d} \partial g( \partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w})\right] dt \\ &\approx \nabla_{\mathbf{w}} f(x_1,\ldots, x_{d-1},0; \mathbf{w})\\ & + x_d \sum_{i=1}^N c^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})\right]\end{split}$
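A minimal sketch of the discretized coefficient gradient in (5), checked against finite differences in $$\mathbf{w}$$. The linear-in-$$\mathbf{w}$$ form $$f = w_0 + w_1 x_2 + w_2 x_1 x_2$$ is an illustrative assumption.

```python
import numpy as np

g = lambda z: np.log1p(np.exp(z))        # softplus
dg = lambda z: 1.0 / (1.0 + np.exp(-z))  # g' for softplus

p, c = np.polynomial.legendre.leggauss(10)
t, c = (p + 1) / 2, c / 2                # nodes/weights on [0, 1]

def df_dx2(x1, s, w):
    return w[1] + w[2] * x1              # d_2 f (constant in x2 for this f)

def T2_tilde(x1, x2, w):
    return w[0] + x2 * np.sum(c * g(df_dx2(x1, x2 * t, w)))

def grad_w_T2(x1, x2, w):
    # Equation (5), discretized: grad_w f(x1, 0) + x2 sum_i c_i g'(.) grad_w d_2 f
    gw_f0 = np.array([1.0, 0.0, 0.0])    # grad_w f(x1, 0; w)
    gw_df = np.array([0.0, 1.0, x1])     # grad_w d_2 f (constant in t)
    s = np.sum(c * dg(df_dx2(x1, x2 * t, w)))
    return gw_f0 + x2 * s * gw_df

w, x1, x2, h = np.array([0.3, -0.7, 1.1]), 0.5, 1.2, 1e-6
fd = np.array([(T2_tilde(x1, x2, w + h * e) - T2_tilde(x1, x2, w - h * e)) / (2 * h)
               for e in np.eye(3)])
print(np.max(np.abs(fd - grad_w_T2(x1, x2, w))))  # small
```

When $$f$$ is linear in $$\mathbf{w}$$, as in typical basis expansions, $$\nabla_{\mathbf{w}}\partial_d f$$ does not depend on $$\mathbf{w}$$, which keeps this gradient cheap to evaluate.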

It is also possible to compute the gradient of the diagonal derivative $$\nabla_{\mathbf{w}}\left( \partial T_d/\partial x_d\right)$$ with respect to the parameters, but like before, there is a question of whether the derivative of the exact map or the derivative of the quadrature-based approximate map should be used. In the case of the exact map, the mixed coefficient gradient has the simple form

$\begin{split}\nabla_{\mathbf{w}}\left[ \frac{\partial T_d}{\partial x_d}\right] & = \nabla_{\mathbf{w}}\left[ g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ) \right] \\ & = \partial g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ) \nabla_{\mathbf{w}}\left[\partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\right].\end{split}$

The gradient of the discrete derivative is more involved and takes the form

$\begin{split}\nabla_{\mathbf{w}}\left[ \frac{\partial \tilde{T}_d}{\partial x_d}\right] &= \sum_{i=1}^N c^{(i)} \nabla_{\mathbf{w}}\left[g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \right] \\ & + x_d \sum_{i=1}^N c^{(i)} t^{(i)} \nabla_{\mathbf{w}}\left[\partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\ &= \sum_{i=1}^N c^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})) \nabla_{\mathbf{w}}\left[ \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\ &+ x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial^2 g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \nabla_{\mathbf{w}}\left[ \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\ & + x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})\right].\end{split}$
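This three-term expression can also be verified numerically against finite differences of the discrete derivative (4). The quadratic form $$f = w_0 + w_1 x_2 + w_2 x_2^2$$ (so that $$\partial^2_{22} f \neq 0$$) and the softplus $$g$$ are illustrative choices.

```python
import numpy as np

sig = lambda z: 1.0 / (1.0 + np.exp(-z))
g = lambda z: np.log1p(np.exp(z))           # softplus
dg = sig                                    # g'
d2g = lambda z: sig(z) * (1.0 - sig(z))     # g''

p, c = np.polynomial.legendre.leggauss(10)
t, c = (p + 1) / 2, c / 2                   # nodes/weights on [0, 1]

def dT2_tilde(x2, w):
    # Equation (4) with d_2 f = w1 + 2*w2*s and d^2_22 f = 2*w2
    z = w[1] + 2.0 * w[2] * (x2 * t)
    return np.sum(c * g(z)) + x2 * np.sum(c * t * dg(z)) * (2.0 * w[2])

def grad_w_dT2(x2, w):
    # The three sums above: g' grad d_2 f, g'' d^2_22 f grad d_2 f, g' grad d^2_22 f
    z = w[1] + 2.0 * w[2] * (x2 * t)
    gz = np.stack([np.zeros_like(t), np.ones_like(t), 2.0 * x2 * t])  # grad_w d_2 f
    g2 = np.array([0.0, 0.0, 2.0])                                    # grad_w d^2_22 f
    term1 = (c * dg(z) * gz).sum(axis=1)
    term2 = x2 * (c * t * d2g(z) * (2.0 * w[2]) * gz).sum(axis=1)
    term3 = x2 * np.sum(c * t * dg(z)) * g2
    return term1 + term2 + term3

w, x2, h = np.array([0.3, -0.7, 0.4]), 1.2, 1e-6
fd = np.array([(dT2_tilde(x2, w + h * e) - dT2_tilde(x2, w - h * e)) / (2 * h)
               for e in np.eye(3)])
print(np.max(np.abs(fd - grad_w_dT2(x2, w))))  # small
```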