.. _mathematics:

Mathematical Background
=========================

Triangular Transport Maps
--------------------------

Let :math:`\pi` and :math:`\eta` be two densities on :math:`\mathbb{R}^d`. In measure transport, our goal is to find a multivariate transformation :math:`T` that pushes forward :math:`\eta` to :math:`\pi`, meaning that if :math:`\mathbf{X} \sim \eta`, then :math:`T(\mathbf{X}) \sim \pi`. Given such a map, we can easily generate samples from :math:`\pi` by pushing samples :math:`\mathbf{x}^i \sim \eta` through the map, so that :math:`T(\mathbf{x}^i) \sim \pi`. Furthermore, when :math:`T` is a diffeomorphism, the push-forward density is given by the change-of-variables formula :math:`T_{\sharp}\eta(\mathbf{x}) := \eta(T^{-1}(\mathbf{x}))\,|\det \nabla T^{-1}(\mathbf{x})|`.
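
The change-of-variables formula can be checked numerically in one dimension. The following is a minimal sketch, assuming a standard normal reference :math:`\eta` and a hypothetical affine map :math:`T(x) = \mu + \sigma x`; the values of ``mu`` and ``sigma`` are illustrative and do not come from the text above.

.. code-block:: python

    import numpy as np

    def eta(x):
        """Standard normal reference density (an assumption of this sketch)."""
        return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

    mu, sigma = 1.5, 0.5  # hypothetical parameters of the affine map

    def T(x):
        """Affine, strictly increasing map T(x) = mu + sigma * x."""
        return mu + sigma * x

    def T_inv(y):
        """Inverse of the affine map."""
        return (y - mu) / sigma

    def pushforward_density(y):
        """Evaluate eta(T^{-1}(y)) * |d T^{-1} / dy|."""
        return eta(T_inv(y)) * abs(1.0 / sigma)

    def target_density(y):
        """Density of N(mu, sigma^2), the known push-forward of eta under T."""
        return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    # The two densities should agree pointwise up to round-off.
    y = np.linspace(-1.0, 4.0, 11)
    max_error = np.max(np.abs(pushforward_density(y) - target_density(y)))

    # Sampling the push-forward only requires pushing reference samples through T.
    rng = np.random.default_rng(0)
    samples = T(rng.standard_normal(100_000))

Here ``max_error`` compares the formula against the known Gaussian density, and ``samples`` should have mean close to ``mu`` and standard deviation close to ``sigma``.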

While there are infinitely many transformations that couple two densities, if :math:`\pi` is absolutely continuous with respect to :math:`\eta`, there exists a unique lower triangular and monotone function :math:`T\colon \mathbb{R}^d \rightarrow \mathbb{R}^d` that pushes forward :math:`\eta` to :math:`\pi` and has the form

.. math::
    T(\mathbf{x}) = \begin{bmatrix} T_1(x_1) \\ T_2(x_1,x_2) \\ \vdots \\ T_d(x_1,\dots,x_d) \end{bmatrix}.
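
As a concrete illustration, consider a standard normal reference and a Gaussian target; both choices are assumptions made only for this sketch. In that setting the affine map :math:`T(\mathbf{x}) = \boldsymbol{\mu} + L\mathbf{x}`, with :math:`L` the lower triangular Cholesky factor of the target covariance, is lower triangular and monotone in each :math:`x_k`. The values of ``mu`` and ``Sigma`` below are hypothetical.

.. code-block:: python

    import numpy as np

    # Hypothetical Gaussian target N(mu, Sigma).
    mu = np.array([1.0, -2.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])

    # Cholesky factor: lower triangular with a strictly positive diagonal,
    # so component k of T depends only on x_1..x_k and increases in x_k.
    L = np.linalg.cholesky(Sigma)

    def T(x):
        """Apply T(x) = mu + L x to each row of x."""
        return mu + x @ L.T

    rng = np.random.default_rng(0)
    x = rng.standard_normal((200_000, 2))  # samples from the reference
    y = T(x)                               # samples from the target

    sample_mean = y.mean(axis=0)
    sample_cov = np.cov(y, rowvar=False)

The sample mean and covariance of ``y`` should be close to ``mu`` and ``Sigma``. General non-Gaussian targets require nonlinear components :math:`T_k`, which is what the parameterization in the following sections provides.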


Monotone Parameterizations
--------------------------

We represent monotone functions as the smooth transformation of an unconstrained function :math:`f\colon\mathbb{R}^{d} \rightarrow \mathbb{R}`. Let :math:`g\colon\mathbb{R}\rightarrow \mathbb{R}_{>0}` be a strictly positive function, such as the softplus :math:`g(x) = \log(1 + \exp(x))`, and let :math:`\epsilon \geq 0` be a non-negative constant. Then, the d-th map component :math:`T_{d}` is given by

.. math::
    :label: cont_map 

    T_d(\mathbf{x}_{1:d}; \mathbf{w}) = f(x_1,\ldots, x_{d-1},0; \mathbf{w}) + \int_0^{x_d} \left[ g( \partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) + \epsilon \right] dt.

Notice that the additive "nugget" :math:`\epsilon` is a lower bound on the diagonal derivative :math:`\partial T_d / \partial x_d`. Other choices for :math:`g` include the squared and exponential functions. These choices, however, have implications for the identifiability of the coefficients. For example, the squared function is not bijective, because :math:`z` and :math:`-z` produce the same value. If :math:`g` is bijective and :math:`\epsilon=0`, then we can recover :math:`f` from :math:`T_d` as

.. math::
    f(\mathbf{x}_{1:d}; \mathbf{w}) = T_d(x_1,\ldots, x_{d-1},0; \mathbf{w}) + \int_0^{x_d} g^{-1}( \partial_d T_d(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) dt.

Using the representation for monotone functions with a bijective :math:`g`, we can approximate :math:`T_d` by finding :math:`f`.


Tensor Product Expansions 
--------------------------

For a point :math:`\mathbf{x}\in\mathbb{R}^d` and coefficients :math:`\mathbf{w}`, we consider expansions of the form 

.. math::

    f(\mathbf{x}; \mathbf{w}) = \sum_{\alpha\in \mathcal{A}} w_\alpha \Phi_\alpha(\mathbf{x}),

where :math:`\alpha = (\alpha_1,\ldots,\alpha_d)\in\mathbb{N}^d` is a multi-index, :math:`\mathcal{A}` is a multi-index set, and :math:`\Phi_{\alpha}` is a
multivariate function defined as a tensor product of one-dimensional functions :math:`\phi_{\alpha_i}\colon  \mathbb{R}\rightarrow \mathbb{R}`
through

.. math::

    \Phi_\alpha(\mathbf{x}) = \prod_{i=1}^{d} \phi_{\alpha_i}(x_i).
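
The expansion can be evaluated directly from its definition. The sketch below assumes monomials :math:`\phi_k(x) = x^k` as the one-dimensional basis and a total-order multi-index set; both are illustrative choices, since the text above does not fix a particular basis or index set. The helper names are hypothetical.

.. code-block:: python

    import itertools
    import numpy as np

    def total_order_multi_indices(dim, max_order):
        """All multi-indices alpha in N^dim with alpha_1 + ... + alpha_dim <= max_order."""
        return [alpha
                for alpha in itertools.product(range(max_order + 1), repeat=dim)
                if sum(alpha) <= max_order]

    def phi(k, x):
        """One-dimensional basis function; monomials are an assumption of this sketch."""
        return x ** k

    def evaluate_expansion(x, w, multi_indices):
        """Evaluate f(x; w) = sum_alpha w_alpha * prod_i phi_{alpha_i}(x_i)."""
        total = 0.0
        for w_alpha, alpha in zip(w, multi_indices):
            basis_value = 1.0
            for i, alpha_i in enumerate(alpha):
                basis_value *= phi(alpha_i, x[i])
            total += w_alpha * basis_value
        return total

    multi_indices = total_order_multi_indices(dim=2, max_order=2)
    w = np.arange(1.0, len(multi_indices) + 1.0)  # arbitrary coefficients 1, 2, ...
    x = np.array([2.0, 3.0])
    value = evaluate_expansion(x, w, multi_indices)

With two inputs and total order two, the set contains six multi-indices, so ``w`` has six entries.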


Numerical Integration
----------------------------

Computationally, we approximate the integral in the definition of :math:`T_d(\mathbf{x}_{1:d}; \mathbf{w})` using a quadrature rule with :math:`N` points :math:`\{t^{(1)}, \ldots, t^{(N)}\}` and corresponding weights :math:`\{c^{(1)}, \ldots, c^{(N)}\}` designed to approximate integrals over :math:`[0,1]`. Note that these points and weights will often be chosen adaptively. The quadrature rule yields an approximation of the map component in :eq:`cont_map` of the form

.. math::
    :label: discr_map 

    \tilde{T}_d(\mathbf{x}_{1:d}; \mathbf{w}) = f(x_1,\ldots, x_{d-1},0; \mathbf{w}) + x_d \sum_{i=1}^N c^{(i)} \left[g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) + \epsilon \right],

where the :math:`x_d` term outside the summation comes from a change of integration domains from :math:`[0,1]` to :math:`[0,x_d]`. 
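
The discretized map can be sketched in a few lines. The example below assumes a fixed Gauss-Legendre rule rescaled to :math:`[0,1]`, a softplus :math:`g`, and a hypothetical two-dimensional polynomial :math:`f`; the coefficients ``w``, the nugget ``eps``, and all function names are illustrative rather than part of any library API.

.. code-block:: python

    import numpy as np

    def softplus(z):
        """Numerically stable log(1 + exp(z))."""
        return np.logaddexp(0.0, z)

    def f(x1, x2, w):
        """Hypothetical unconstrained function, linear in the coefficients w."""
        return w[0] + w[1] * x1 + w[2] * x2 + w[3] * x1 * x2 + w[4] * x2**2

    def d2_f(x1, x2, w):
        """Partial derivative of f with respect to its last input x2."""
        return w[2] + w[3] * x1 + 2.0 * w[4] * x2

    def quadrature_rule(n):
        """Gauss-Legendre points and weights rescaled from [-1, 1] to [0, 1]."""
        nodes, weights = np.polynomial.legendre.leggauss(n)
        return 0.5 * (nodes + 1.0), 0.5 * weights

    def T_discrete(x1, x2, w, eps=1e-3, n=20):
        """Quadrature approximation of the monotone map component."""
        t, c = quadrature_rule(n)
        integrand = softplus(d2_f(x1, x2 * t, w)) + eps
        # The factor x2 rescales the rule from [0, 1] to [0, x2].
        return f(x1, 0.0, w) + x2 * np.sum(c * integrand)

    w = np.array([0.1, -0.5, 1.0, 0.3, -2.0])
    x2_grid = np.linspace(-3.0, 3.0, 61)
    values = np.array([T_discrete(0.7, x2, w) for x2 in x2_grid])

With a sufficiently accurate rule, ``values`` should increase along ``x2_grid`` even though :math:`f` itself is not monotone in :math:`x_2`.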

.. _diag_deriv_section:

Diagonal Derivatives
----------------------------

We will often require derivatives of :math:`T_d` with respect to an input :math:`x_i` or the parameters :math:`\mathbf{w}`. When computing these derivatives, however, we have a choice of whether to differentiate the continuous map in :eq:`cont_map` or the discretized map in :eq:`discr_map`. This is similar to the "discretize-then-optimize" versus "optimize-then-discretize" choice in PDE-constrained optimization. When the quadrature rule is accurate, there might not be a large practical difference between these approaches. For approximate rules, however, using the continuous derivative may cause issues during optimization because the derivative will not be consistent with the discretized map: a finite difference approximation of the discretized map will not converge to the continuous derivative. In these cases, it is preferable to differentiate the discrete map in :eq:`discr_map`.

The derivative :math:`\partial T_d / \partial x_d` is particularly important when using the monotone function :math:`T_d` in a measure transformation.   The continuous version of this derivative is simply 

.. math::
    :label: cont_deriv 

    \frac{\partial T_d}{\partial x_d}(\mathbf{x}_{1:d}; \mathbf{w}) = g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ) + \epsilon.

The discrete derivative on the other hand is more complicated: 

.. math::
    :label: discr_deriv 

    \frac{\partial \tilde{T}_d}{\partial x_d}(\mathbf{x}_{1:d}; \mathbf{w}) &= \frac{\partial}{\partial x_d} \left[x_d \sum_{i=1}^N c^{(i)} \left(g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) + \epsilon\right)\right]\\
    & = \sum_{i=1}^N c^{(i)} \left(g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) )+\epsilon\right) \\
    &+ x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) .
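
The consistency issue can be seen numerically. The sketch below assumes a softplus :math:`g` (whose derivative is the logistic sigmoid), a hypothetical polynomial :math:`f`, and a deliberately coarse two-point Gauss-Legendre rule so that the gap between the continuous and discrete derivatives is visible; all names and values are illustrative.

.. code-block:: python

    import numpy as np

    def softplus(z):
        return np.logaddexp(0.0, z)

    def sigmoid(z):
        """Derivative of the softplus function."""
        return 1.0 / (1.0 + np.exp(-z))

    W = np.array([0.1, -0.5, 1.0, 0.3, -2.0])  # hypothetical coefficients
    EPS = 1e-3                                 # hypothetical nugget

    def f(x1, x2):
        return W[0] + W[1] * x1 + W[2] * x2 + W[3] * x1 * x2 + W[4] * x2**2

    def d2_f(x1, x2):
        """First derivative of f in x2."""
        return W[2] + W[3] * x1 + 2.0 * W[4] * x2

    def d22_f(x1, x2):
        """Second derivative of f in x2 (constant for this quadratic f)."""
        return 2.0 * W[4] + 0.0 * x2

    # Coarse two-point rule on [0, 1], chosen to exaggerate quadrature error.
    nodes, weights = np.polynomial.legendre.leggauss(2)
    T_PTS, C_WTS = 0.5 * (nodes + 1.0), 0.5 * weights

    def T_discrete(x1, x2):
        s = x2 * T_PTS
        return f(x1, 0.0) + x2 * np.sum(C_WTS * (softplus(d2_f(x1, s)) + EPS))

    def continuous_derivative(x1, x2):
        """Derivative of the exact map."""
        return softplus(d2_f(x1, x2)) + EPS

    def discrete_derivative(x1, x2):
        """Derivative of the quadrature-based map."""
        s = x2 * T_PTS
        first = np.sum(C_WTS * (softplus(d2_f(x1, s)) + EPS))
        second = x2 * np.sum(C_WTS * T_PTS * sigmoid(d2_f(x1, s)) * d22_f(x1, s))
        return first + second

    x1, x2, h = 0.7, 1.5, 1e-6
    finite_difference = (T_discrete(x1, x2 + h) - T_discrete(x1, x2 - h)) / (2.0 * h)
    gap_discrete = abs(finite_difference - discrete_derivative(x1, x2))
    gap_continuous = abs(finite_difference - continuous_derivative(x1, x2))

In this setup ``gap_discrete`` is at the level of finite difference error, while ``gap_continuous`` reflects the quadrature error and does not shrink as ``h`` is reduced.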


Coefficient Derivatives 
----------------------------
In addition to computing :math:`\partial T_d/\partial x_d`, we will also need the gradient of the monotone function :math:`T_d` with respect to the parameters :math:`\mathbf{w}`, denoted by :math:`\nabla_{\mathbf{w}}T_d`:

.. math::
    :label: coeff_deriv 

    \nabla_{\mathbf{w}} T_d(\mathbf{x}_{1:d}; \mathbf{w}) &= \nabla_{\mathbf{w}} f(x_1,\ldots, x_{d-1},0; \mathbf{w})\\
     &+ \int_0^{x_d} \partial g( \partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial_d f(x_1,\ldots, x_{d-1},t; \mathbf{w})\right] dt \\
     &\approx \nabla_{\mathbf{w}} f(x_1,\ldots, x_{d-1},0; \mathbf{w})\\
    & + x_d \sum_{i=1}^N c^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \nabla_{\mathbf{w}}\left[\partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})\right].
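
When :math:`f` is linear in :math:`\mathbf{w}`, the gradients :math:`\nabla_{\mathbf{w}} f` and :math:`\nabla_{\mathbf{w}}\partial_d f` are simply vectors of basis function values and derivatives. The sketch below uses that fact to evaluate the quadrature form of the coefficient gradient and compare it with finite differences. The basis, the softplus :math:`g`, and all names and values are assumptions made for illustration.

.. code-block:: python

    import numpy as np

    def softplus(z):
        return np.logaddexp(0.0, z)

    def sigmoid(z):
        """Derivative of the softplus function."""
        return 1.0 / (1.0 + np.exp(-z))

    def basis(x1, x2):
        """Hypothetical basis, so that f(x; w) = basis(x) @ w."""
        return np.array([1.0, x1, x2, x1 * x2, x2**2])

    def d2_basis(x1, x2):
        """Derivative of each basis function with respect to x2."""
        return np.array([0.0, 0.0, 1.0, x1, 2.0 * x2])

    EPS = 1e-3
    nodes, weights = np.polynomial.legendre.leggauss(10)
    T_PTS, C_WTS = 0.5 * (nodes + 1.0), 0.5 * weights

    def T_discrete(x1, x2, w):
        total = basis(x1, 0.0) @ w
        for t, c in zip(T_PTS, C_WTS):
            total += x2 * c * (softplus(d2_basis(x1, x2 * t) @ w) + EPS)
        return total

    def coeff_gradient(x1, x2, w):
        """Gradient of the discretized map with respect to w."""
        grad = basis(x1, 0.0).copy()
        for t, c in zip(T_PTS, C_WTS):
            d2_phi = d2_basis(x1, x2 * t)
            grad += x2 * c * sigmoid(d2_phi @ w) * d2_phi
        return grad

    w = np.array([0.1, -0.5, 1.0, 0.3, -2.0])
    x1, x2, h = 0.7, 1.5, 1e-6
    fd_grad = np.array([
        (T_discrete(x1, x2, w + h * e) - T_discrete(x1, x2, w - h * e)) / (2.0 * h)
        for e in np.eye(len(w))
    ])
    grad_error = np.max(np.abs(fd_grad - coeff_gradient(x1, x2, w)))

Because ``coeff_gradient`` differentiates the same quadrature sum that ``T_discrete`` evaluates, the two agree up to finite difference error regardless of how accurate the quadrature rule is.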


It is also possible to compute the gradient of the diagonal derivative :math:`\nabla_{\mathbf{w}}\left( \partial T_d/\partial x_d\right)` with respect to the parameters, but as before, there is a question of whether the derivative of the exact map or the derivative of the quadrature-based approximate map should be used. In the case of the exact map, the mixed coefficient gradient has the simple form

.. math::

    \nabla_{\mathbf{w}}\left[ \frac{\partial T_d}{\partial x_d}\right] & = \nabla_{\mathbf{w}}\left[ g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, ) + \epsilon \right] \\
    & = \partial g(\, \partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\, )  \nabla_{\mathbf{w}}\left[\partial_d f(\mathbf{x}_{1:d}; \mathbf{w})\right].


The gradient of the discrete derivative is more involved and takes the form

.. math::

    \nabla_{\mathbf{w}}\left[ \frac{\partial \tilde{T}_d}{\partial x_d}\right] &= \sum_{i=1}^N c^{(i)} \nabla_{\mathbf{w}}\left[g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \right] \\
    & + x_d \sum_{i=1}^N c^{(i)} t^{(i)} \nabla_{\mathbf{w}}\left[\partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\
    &= \sum_{i=1}^N c^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})) \nabla_{\mathbf{w}}\left[ \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right] \\
    &+ x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial^2 g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) ) \partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \nabla_{\mathbf{w}}\left[ \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) \right]  \\
    &  + x_d \sum_{i=1}^N c^{(i)} t^{(i)} \partial g( \partial_d f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w}) )  \nabla_{\mathbf{w}}\left[\partial^2_{dd}f(x_1,\ldots, x_{d-1},x_d t^{(i)}; \mathbf{w})\right].
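
The three sums in the final expression can be assembled term by term. The following sketch assumes a softplus :math:`g`, for which :math:`\partial g` is the logistic sigmoid :math:`s` and :math:`\partial^2 g = s(1-s)`, together with a hypothetical basis in which :math:`f` is linear in :math:`\mathbf{w}`. All names and values are illustrative.

.. code-block:: python

    import numpy as np

    def softplus(z):
        return np.logaddexp(0.0, z)

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def d2_basis(x1, x2):
        """x2-derivative of the hypothetical basis [1, x1, x2, x1*x2, x2^2]."""
        return np.array([0.0, 0.0, 1.0, x1, 2.0 * x2])

    # Second x2-derivative of the same basis (constant for this basis).
    D22_BASIS = np.array([0.0, 0.0, 0.0, 0.0, 2.0])

    EPS = 1e-3
    nodes, weights = np.polynomial.legendre.leggauss(4)
    T_PTS, C_WTS = 0.5 * (nodes + 1.0), 0.5 * weights

    def discrete_diag_derivative(x1, x2, w):
        """Derivative of the discretized map with respect to x2."""
        total = 0.0
        for t, c in zip(T_PTS, C_WTS):
            u = d2_basis(x1, x2 * t) @ w
            total += c * (softplus(u) + EPS)
            total += x2 * c * t * sigmoid(u) * (D22_BASIS @ w)
        return total

    def grad_discrete_diag_derivative(x1, x2, w):
        """Gradient of the discrete diagonal derivative with respect to w."""
        grad = np.zeros_like(w)
        for t, c in zip(T_PTS, C_WTS):
            d2_phi = d2_basis(x1, x2 * t)
            s = sigmoid(d2_phi @ w)
            grad += c * s * d2_phi                                           # first sum
            grad += x2 * c * t * s * (1.0 - s) * (D22_BASIS @ w) * d2_phi    # second sum
            grad += x2 * c * t * s * D22_BASIS                               # third sum
        return grad

    w = np.array([0.1, -0.5, 1.0, 0.3, -2.0])
    x1, x2, h = 0.7, 1.5, 1e-6
    fd_grad = np.array([
        (discrete_diag_derivative(x1, x2, w + h * e)
         - discrete_diag_derivative(x1, x2, w - h * e)) / (2.0 * h)
        for e in np.eye(len(w))
    ])
    mixed_grad_error = np.max(np.abs(fd_grad - grad_discrete_diag_derivative(x1, x2, w)))

The finite difference comparison is only a sanity check that the three sums are consistent with the discrete diagonal derivative they are meant to differentiate.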