# Mathematical Background#

## Triangular Transport Maps#

Let \(\pi\) and \(\eta\) be two densities on \(\mathbb{R}^d\). In measure transport, our goal is to find a multivariate transformation \(T\) that pushes forward \(\eta\) to \(\pi\), meaning that if \(\mathbf{X} \sim \eta\), then \(T(\mathbf{X}) \sim \pi\). Given such a map, we can easily generate samples from \(\eta\) by pushing samples \(\mathbf{x}^i \sim \eta\) through the map \(T(\mathbf{x}^i) \sim \pi\). Furthermore, we can express the push-forward density of a diffeomorphic map by \(T_{\sharp}\eta(\mathbf{x}) := \eta(T^{-1}(\mathbf{x}))|\nabla T^{-1}(\mathbf{x})|\).

While there are infinitely many transformations that couple densities, if \(\pi\) is absolutely continuous with respect to \(\eta\), there exists a unique lower triangular and monotone function \(T\colon \mathbb{R}^d \rightarrow \mathbb{R}^d\) that pushes forward \(\pi\) to \(\eta\) of the form

## Monotone Parameterizations#

We represent monotone functions as the smooth transformation of an unconstrained function \(f\colon\mathbb{R}^{d} \rightarrow \mathbb{R}\). Let \(g\colon\mathbb{R}\rightarrow \mathbb{R}_{>0}\) be a strictly positive function, such as the softplus \(g(x) = \log(1 + \exp(x))\). Then, the d-th map component \(T_{d}\) is given by

Other choices for the \(g\) include the squared and exponential functions. These choices, however, have implications for the identifiability of the coefficients. If \(g\) is bijective, then we can recover \(f\) from \(T_d\) as

Using the representation for monotone functions with a bijective \(g\), we can approximate \(T_d\) by finding \(f\).

## Tensor Product Expansions#

For a point \(\mathbf{x}\in\mathbb{R}^d\) and coefficients \(\mathbf{w}\), we consider expansions of the form

where \(\alpha\in\mathbb{N}^d\) is a multi-index, \(\mathcal{A}\) is a multiindex set, and \(\Phi_{\mathbf{\alpha}}\) is a multivariate function defined as a tensor product of one-dimensional functions \(\phi_{\alpha_i}\colon \mathbb{R}\rightarrow \mathbb{R}\) through

## Numerical Integration#

Computationally, we approximate the integral in the definition of \(T_d(\mathbf{x}_{1:d}; \mathbf{w})\) using a quadrature rule with \(N\) points \(\{t^{(1)}, \ldots, t^{(N)}\}\) and corresponding weights \(\{c^{(1)}, \ldots, c^{(N)}\}\) designed to approximate integrals over \([0,1]\). Note that these points and weights will often be chosen adaptively. The quadrature rule yields an approximation of the map component in (1) with the form

where the \(x_d\) term outside the summation comes from a change of integration domains from \([0,1]\) to \([0,x_d]\).

## Diagonal Derivatives#

We will often require derivatives of \(T_d\) with respect to an input \(x_i\) or the parameters \(\mathbf{w}\). When computing these derivatives however, we have a choice of whether to differentiate the continuous map form in (1) or the discretized map in (2). This is similar to the “discretize-then-optimize” or “optimize-then-discretize” choice in PDE-constrained optimization. When the quadrature rule is accurate, there might not be a large practical difference in these approaches. For approximate rules however, using the continuous derivative may cause issues during optimization because the derivative will not be consistent with the discreteized map: a finite difference approximation will not converge to the continuous derivative. In these cases, it is preferrable to differentiate the discrete map in (2).

The derivative \(\partial T_d / \partial x_d\) is particularly important when using the monotone function \(T_d\) in a measure transformation. The continuous version of this derivative is simply

The discrete derivative on the other hand is more complicated:

## Coefficient Derivatives#

In addition to computing \(\partial T_d/\partial d\), we will also need the gradient of the monotone function \(T_d\) with respect to the parameters \(\mathbf{w}\), denoted by \(\nabla_{\mathbf{w}}T_d\).

If is also possible to compute the gradient of the diagonal derivative \(\nabla_{\mathbf{w}}\left( \partial T_d/\partial d\right)\) with respect to the parameters, but like before, there is a question of whether the derivative of the exact map or the derivative of the quadrature-based approximate map should be used. In the case of the exact map, the mixed coefficient gradient has the simple form

The gradient of the discrete derivative is more expansive and takes the form