A prelude to statistics arising from optimal transport theory

Hung T. Nguyen can be contacted at: hunguyen@nmsu.edu

Purpose

This paper aims to offer a tutorial/introduction to new statistics arising from the theory of optimal transport to empirical researchers in econometrics and machine learning.

Design/methodology/approach

The material is presented in a tutorial/survey lecture style to help practitioners with the theoretical background.

Findings

The tutorial survey of some main statistical tools (arising from optimal transport theory) should help practitioners to understand the theoretical background in order to conduct empirical research meaningfully.

Originality/value

This study is an original presentation useful for newcomers to the field.

1. Introduction

A significant contribution from optimal transport theory to statistics in general, and to econometrics and machine learning in particular, has surfaced recently. It is therefore about time for practitioners to become aware of it and to apply it to real-world problems, especially in econometrics, to improve the credibility of empirical findings. That is precisely the purpose of this prelude.

This paper is organized as follows. Although this is a prelude where detailed technical material is not spelled out, the main purpose is to call practitioners' attention to new improved statistical tools arising from optimal transport theory, and as such, optimal transport in a nutshell will be presented in Section 2. Section 3 is about the most significant new tool in statistical analysis, namely the notion of multivariate quantiles. Section 4 is devoted to the elaboration of another new tool for statistics, namely the Wasserstein metrics. In Section 5 we elaborate on the interesting connection between partial identification and random set statistics, also thanks to optimal transport.

2. Optimal transport in a nutshell

Monge (1781) was concerned with the problem of finding the cheapest way to transport, say, soil from a collection of mines to a collection of construction sites.

In mathematical language, the problem is formulated as follows. Let $P (X), P (Y)$ denote the spaces of probability measures on $X, Y \subseteq R^{n}$ respectively. Suppose we are given $μ \in P (X)$, $ν \in P (Y)$, and a (cost) function c(., .): $X \times Y \to R^{+}$. A transport map is a (measurable) function T(.): $X \to Y$ such that ν(.) = μ ○T⁻¹, in symbols ν = T#μ (T pushes μ forward to ν). The transport cost of T is

\int_{X} c (x, T (x)) d μ (x)

Monge's problem is to find an optimal transport map T*, i.e.

T^{*} = \arg\min \{\int_{X} c (x, T (x)) d μ (x) : T \# μ = ν\}

This functional optimization problem might not have a solution in general, e.g. when μ is a Dirac probability measure whereas ν is not (any map T pushes a Dirac measure δ_a forward to the Dirac measure δ_{T(a)}, so no transport map exists). But more importantly, with the optimization variable being T, the objective function $T \to \int_{X} c (x, T (x)) d μ (x)$ is not linear. Also, the constraint set {T : T#μ = ν} is not convex. As such, the computation of a solution is difficult.

Because of these issues, Monge's problem remained unsolved until Kantorovich (1942) reformulated it in a setting that avoids the two main difficulties mentioned above.

It is interesting to note that Monge's difficulties have analogies in mathematics. When a quadratic equation does not have real solutions, we enlarge its solution domain (the real line $R$ ⁠) to the complex plane so that the equation has complex solutions. Similarly, we consider mixed (random) strategies in non-cooperative games to establish the existence of Nash equilibria for any such games.

The same methodology could be used here to “solve” Monge's problem. And that is exactly what Kantorovich has done.

If $T (.) : X \to Y$ is a transport map, and $I_{X} (.) : X \to X$ is the identity map $I_{X} (x) = x$, then $(I_{X} \times T) (.) : X \to X \times Y$, $(I_{X} \times T) (x) = (x, T (x))$, pushes μ forward to a joint probability measure $μ ○ {(I_{X} \times T)}^{- 1}$ on $X \times Y$ having μ, ν as marginal probability measures. Thus the space of joint probability measures on $X \times Y$ having μ, ν as marginal probability measures, denoted as Π(μ, ν), contains the set of all (Monge) transport maps (by identification). Elements of Π(μ, ν) are referred to as transport plans. Hence, by enlarging transport maps to transport plans, Kantorovich reformulated Monge's problem as follows. Given μ on $X$, ν on $Y$, and $c (., .) : X \times Y \to R^{+}$, find a transport plan λ* ∈ Π(μ, ν) such that

λ^{*} = arg min \{\int_{X \times Y} c (x, y) d λ (x, y) : λ \in Π (μ, ν)\}

The difficulties in Monge's problem are avoided: Kantorovich's problem is always feasible since μ ⊗ ν ∈ Π(μ, ν), and it admits a solution under mild conditions on the cost c (e.g. lower semicontinuity); with the optimization variable $λ \in Π (μ, ν)$, the objective function $λ \to \int_{X \times Y} c (x, y) d λ (x, y)$ is linear, and the constraint set {λ : λ ∈ Π(μ, ν)} is convex, so that we are in the domain of convex optimization!
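For discrete measures, Kantorovich's problem is a finite linear program, which can make the reformulation concrete. The following is a minimal sketch, not a recommended solver: the supports, the weights and the quadratic cost are hypothetical toy values, and SciPy's general-purpose `linprog` is used only for illustration.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy data: three "mines" and two "construction sites" on the real line.
x = np.array([0.0, 1.0, 2.0])      # support of mu
y = np.array([0.5, 1.5])           # support of nu
mu = np.array([0.2, 0.5, 0.3])     # weights of mu (sum to 1)
nu = np.array([0.6, 0.4])          # weights of nu (sum to 1)

# Cost matrix c(x_i, y_j) = |x_i - y_j|^2 (an assumed cost function).
C = (x[:, None] - y[None, :]) ** 2
m, n = C.shape

# A plan lam is an m x n matrix, flattened row by row.
# Marginal constraints: row sums equal mu, column sums equal nu.
A_rows = np.kron(np.eye(m), np.ones((1, n)))   # sum_j lam[i, j] = mu[i]
A_cols = np.kron(np.ones((1, m)), np.eye(n))   # sum_i lam[i, j] = nu[j]
A_eq = np.vstack([A_rows, A_cols])
b_eq = np.concatenate([mu, nu])

# Linear objective, convex (polyhedral) constraint set.
res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
plan = res.x.reshape(m, n)
optimal_cost = res.fun

# The product measure mu x nu is always feasible, so its cost is an upper bound.
independent_cost = float(np.sum(np.outer(mu, nu) * C))
print(plan)
print(optimal_cost, independent_cost)
```

The product plan is feasible but generally not optimal; the linear program searches the whole convex set Π(μ, ν) for a cheaper plan.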

3. Multivariate quantiles

The focus on (univariate) quantile functions as a basis for statistical analysis has been advocated by Parzen (1979). In fact, in the comment to Breiman's paper (2001), Parzen even suggested that there are many possible “cultures” for statistical modeling where “quantile culture” could be one of them.

Without digging into whether Parzen's quantile culture is a culture in Breiman's sense, we could view the use of quantile functions as part of standard statistical analysis in which, instead of distribution functions, we focus on quantile functions. The two cultures elaborated in Breiman's paper (2001), namely data modeling and algorithmic modeling, might not be really disjoint, i.e. they could be combined to form a new culture. That was precisely what Daoud and Dubhashi (2023) suggested in “Statistical Modeling: The Three Cultures” as a hybrid modeling culture! Perhaps it could be so, as we witness at present the interest of econometricians in machine (or statistical) learning.

Anyway, the point we want to make is this. It is true that the use of quantile functions, such as in quantile regression, provides more information than that of mean regression. But why did Parzen's “quantile culture” not get off the ground, say, in multivariate analysis?

The answer could be twofold. First, while mean (linear) regression, including its multivariate version, is the bread-and-butter tool in statistics, quantile regression, introduced by Koenker and Bassett (1978), is only for one dimension. The second reason is crucial: there is no quantile counterpart of multivariate mean regression, and this is because of the lack of a “correct” notion of multivariate (vector) quantile functions, let alone its associated regression analysis.

Specifically, the mathematical problem of how to generalize the familiar notion of a univariate quantile function to higher dimensions is difficult because the explicit definition of a quantile function on the real line $R$ is based on the total order relation $\leq$ of $R$, whereas there is no such order relation on $R^{n}$ with n > 1.

In the literature, among various attempts to “solve” the problem, e.g. Hallin et al. (2010) and Serfling and Zuo (2010), one attempt was to consider the partial order relation of $R^{n}$ when n > 1, exemplified by Belloni and Winkler (2011), leading to the notion of “partial multivariate quantiles”. This is typical of an approach that works around the lack of a total order relation on $R^{n}$, but does not really address the original problem, i.e. partial vector quantile functions are not generalizations of univariate quantile functions. They are just substitutes.

Finally, as Koenker (2017) acknowledged, the happy ending arrived in 2016 with the works of Carlier et al. (2016, 2017), which were inspired by the Theory of Optimal Transport, Villani (2003).

This Section aims at elaborating a bit on the notion of vector (multivariate) quantile functions correctly generalizing the familiar notion of univariate quantile functions.

Let X be a real-valued random variable (the name of some quantity), i.e. a measurable map from a probability space $(Ω, A, P)$, its source of uncertainty, to the measurable space $(R, B (R))$, its sampling space. Its law is the probability measure P_X on $B (R)$ obtained by pushing forward P by X, i.e. P_X(.)= P ○ X⁻¹, in symbols P_X = X#P. By the Lebesgue-Stieltjes theorem, P_X = dF where $F (.) : R \to [0,1]$ is the distribution function of X. The distribution function F of X contains all information about the random evolution of X. If we know F, can we create data from X? This is the problem known as simulation. Yes, but not directly by using F. Instead, we consider its “pseudo inverse” function F^[−1], known as its (univariate) quantile function, defined explicitly as $F^{[- 1]} (.) : (0,1) \to R$,

F^{[- 1]} (u) = \inf \{x \in R : F (x) \geq u\}

and one shows that F^[−1] pushes forward the uniform probability measure du on (0,1) to dF (F^[−1]#du = dF, i.e. $d F (.) = d u ○ {(F^{[- 1]})}^{- 1}$) so that $X \overset{D}{=} F^{[- 1]} (U)$ (equal in distribution) where U denotes the (uniform) random variable on (0,1) with law du.
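The pushforward property F^[−1]#du = dF is what makes simulation by inversion work. As a minimal sketch, assuming an exponential law with rate 2 (a choice made here only because its quantile function is explicit), one can draw uniform variates, apply the quantile function, and compare the empirical distribution function with F:

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_quantile(u, rate=1.0):
    """Quantile function F^[-1](u) = -log(1 - u) / rate of the exponential law."""
    return -np.log1p(-u) / rate

u = rng.uniform(size=100_000)      # draws from du on (0, 1)
x = exp_quantile(u, rate=2.0)      # draws distributed as dF

# Compare the empirical distribution function of x with F at a few points.
grid = np.array([0.1, 0.5, 1.0, 2.0])
empirical_cdf = np.array([(x <= t).mean() for t in grid])
true_cdf = 1.0 - np.exp(-2.0 * grid)
max_gap = np.max(np.abs(empirical_cdf - true_cdf))
print(max_gap)
```

The gap shrinks as the number of draws grows, in line with $X \overset{D}{=} F^{[- 1]} (U)$.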

While the pseudo inverse F^[−1](.) provides a reasonable mathematical definition for quantiles, its explicit definition involves the total order relation $\leq$ of the real line $R$, and that is the difficulty in extending it to higher dimensions, say, as a function $G (.) : {(0,1)}^{n} \to R^{n}$ with n > 1.

A short story of this extension problem seems interesting to note. Traditionally, the extension of a concept in one dimension to several dimensions could be done componentwise, such as the concept of the mean of a random vector. But defining a vector quantile function componentwise does not work since the property G#du = dF, for du as uniform law on (0,1)ⁿ, and dF as law of a random vector on $R^{n}$ (for n > 1) is not satisfied.
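To see the failure of the componentwise construction concretely, here is a small sketch. The target law (a bivariate Gaussian with correlation 0.8) is an assumption chosen for illustration. Applying the marginal quantile functions coordinate by coordinate to the uniform law on (0,1)² reproduces the marginals but yields independent components, so the pushforward is not the target law.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
rho = 0.8                                   # assumed correlation of the target law dF

# Uniform law on (0,1)^2 (endpoints avoided so that the quantiles stay finite).
u = rng.uniform(1e-12, 1.0 - 1e-12, size=(50_000, 2))

# Componentwise map G(u1, u2) = (Phi^{-1}(u1), Phi^{-1}(u2)).
g_u = norm.ppf(u)

# The marginals are standard normal, but the dependence is lost.
corr_pushed = np.corrcoef(g_u.T)[0, 1]
print(corr_pushed, "versus target correlation", rho)
```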

Remark.

Of course F(.) is characterized by F^[−1](.), since if $Q (.) : (0,1) \to R$ is monotone non decreasing and left continuous then there exists a unique distribution function F(.) such that Q(.)= F^[−1](.). However, such a characterization of F^[−1](.) does not extend to higher dimensions.

Also, traditionally, if we cannot directly use an established concept in one setting to extend it to another setting, we look for a possible equivalent concept (a characterization of the established concept) that can be generalized. For example, to generalize ordinary sets to fuzzy sets, we use the indicator function of an ordinary set (as its membership function) as a characterization of the set from which to extend to the new setting. Here the question is: what is a characterization of F^[−1], i.e. another equivalent way to define it?

Perhaps, previous attempts to generalize univariate quantile functions to vector quantile functions did not ask this question. It turns out that the answer is hidden in plain sight! Besides the property F^[−1]#du = dF, the (explicitly defined) function $F^{[- 1]} (.) : (0,1) \to R$ is monotone non decreasing, and these two properties provide a characterization for F^[−1]. Specifically, F^[−1](.): $(0,1) \to R$ is the unique function that is monotone non decreasing and satisfies G#du = dF :

Lemma.

If $G (.) : (0,1) \to R$ is monotone non decreasing and satisfies G#du = dF, then G(.)= F^[−1](.), du − almost everywhere.

Proof. By monotonicity of G, we have

(- \infty, x] \subseteq G^{- 1} ((- \infty, G (x)])

so that

F_{d u} (x) = d u (- \infty, x] \leq d u (G^{- 1} ((- \infty, G (x)])) = d F (- \infty, G (x)] = F (G (x))

and G(x) ≥ F^[−1](x)

Consider the points x such that G(x) > F^[−1](x). This means that there exists ɛ_o > 0 such that F(G(x) − ɛ) ≥ F_du(x) for every ɛ ∈ [0, ɛ_o]. Also, since $G^{- 1} ((- \infty, G (x) - ε)) \subseteq (- \infty, x)$, we have F(G(x) − ɛ) ≤ F_du(x). Thus, F(G(x) − ɛ) = F_du(x) for any ɛ ∈ [0, ɛ_o]. Note that F(G(x) − ɛ) is then a value which F takes on an interval where it is constant. But there are at most countably many such intervals, so that the values y_j of F on these intervals are also countable. Therefore, the points x where G(x) > F^[−1](x) are contained in ∪_j{x : F_du(x) = y_j} which is du − negligible (since du is atomless). As a consequence, G(x) = F^[−1](x), du − almost everywhere. Q.E.D.

As a consequence, the above characterization of F^[−1] can be used to obtain its counterpart in higher dimensions since on $R^{n}$ , with n > 1, the property G#du = dF makes sense and the monotone non decreasing property for $G (.) : {(0,1)}^{n} \to R^{n}$ is equivalent to

\langle u - v, G (u) - G (v) \rangle \geq 0

where $\langle ., . \rangle$ denotes the scalar product on $R^{n}$.

Remark.

The characterization of F^[−1] brings out the fact that the total order relation on $R$ does not play an essential role in defining it.

The point is this. If $G (.) : {(0,1)}^{n} \to R^{n}$ (for n > 1) is going to be an extension of $F^{[- 1]} (.) : (0,1) \to R$ ⁠, G(.) has to be monotone non decreasing and pushing forward du to dF (in dimension n > 1).

The upshot is that, for n ≥ 1, these two properties are characteristic for the notion of quantiles, in the sense that there is uniquely one such function $G (.) : {(0,1)}^{n} \to R^{n}$ ⁠, so that, for n = 1, it coincides with F^[−1](.). Thus, in dimension 1, the familiar univariate quantile function F^[−1] can be defined without using explicitly the total order relation of $R$ !

This upshot was discovered in the context of Optimal Transport, see Villani (2003), Brenier (1991), McCann (1995), Carlier et al. (2017) and Galichon (2016), where an (n-dimensional) vector quantile function is the unique monotone non decreasing function $G (.) : {(0,1)}^{n} \to R^{n}$ such that G#du = dF.

Clearly, the upshot tells us that the familiar univariate quantile function can be generalized to higher dimensions rigorously. However, except in dimension 1, the vector quantile functions so determined are not obtained in closed form. Practitioners should consult the literature for computational works.

Remark.

The following notes could give a flavor of optimal transport in getting, finally, the correct notion of multivariate quantiles.

In the setting of optimal transport, F^[−1](.) is characterized by a unique “transport map” $T^{*} (.) : (0,1) \to R$ ⁠, monotone non decreasing and T*#du = dF, where

T^{*} = arg min \{\int_{0}^{1} \frac{1}{2} | u - T (u) |^{2} d u : T # d u = d F\}

i.e. the solution of Monge's problem with cost function $c (., .) : (0,1) \times R \to R^{+} : (u, x) \to \frac{1}{2} | u - x |^{2}$. On the other hand, the function $φ (.) : (0,1) \to R$

φ (u) = \int_{0}^{u} F^{[- 1]} (v) d v

is convex, so that F^[−1](.) is the derivative of the convex function φ(.) on (0,1).
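As a worked illustration (the exponential law is our choice, not taken from the references), let F(x) = 1 − e^{−x} for x ≥ 0. Then the quantile function, the potential φ and its derivatives can be written in closed form, and convexity can be read off the second derivative:

```latex
F^{[-1]}(u) = -\log(1-u), \qquad
φ(u) = \int_{0}^{u} -\log(1-v)\, dv = u + (1-u)\log(1-u),

φ'(u) = -\log(1-u) = F^{[-1]}(u), \qquad
φ''(u) = \frac{1}{1-u} > 0 \quad \text{on } (0,1).
```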

In dimension n > 1, the above leads to the notion of multivariate quantile function by McCann's theorem (1995): Let $F (.) : R^{n} \to [0,1]$ be a multivariate distribution function, then there exists a unique gradient $\nabla φ (.) : {(0,1)}^{n} \to R^{n}$ of some convex function $φ (.) : {(0,1)}^{n} \to R$ (φ is not unique, but ∇φ is unique) such that ∇φ#du = dF, where du is the uniform law on (0,1)ⁿ.
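Since no closed form is available for n > 1, a discrete stand-in can give some intuition. The sketch below is not the method of the cited authors: it simply matches n reference points drawn from the uniform law on (0,1)² with n data points by solving an optimal assignment problem with quadratic cost (via SciPy's `linear_sum_assignment`), and then checks that the resulting empirical map is monotone in the sense ⟨u − v, G(u) − G(v)⟩ ≥ 0. The Gaussian sample and the sample size are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n = 200
u = rng.uniform(size=(n, 2))                       # reference points in (0,1)^2
cov = np.array([[1.0, 0.8], [0.8, 1.0]])           # assumed covariance of the data
x = rng.multivariate_normal([0.0, 0.0], cov, n)    # sample from dF

# Quadratic cost (1/2)|u_i - x_j|^2 between every reference point and data point.
cost = 0.5 * ((u[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)
rows, cols = linear_sum_assignment(cost)           # optimal one-to-one matching
g = x[cols]                                        # g[i] plays the role of G(u[i])

# Monotonicity check: <u_i - u_j, G(u_i) - G(u_j)> >= 0 for all pairs (i, j).
du = u[:, None, :] - u[None, :, :]
dg = g[:, None, :] - g[None, :, :]
min_inner = (du * dg).sum(axis=2).min()
print(min_inner)
```

The matching pushes the empirical uniform measure exactly onto the empirical data measure, and optimality for the quadratic cost rules out any pair whose swap would lower the cost, which is the pairwise monotonicity being checked.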

4. Wasserstein metrics

We elaborate now upon a new improved type of metrics on spaces of probability measures arising from optimal transport theory. The main improvement seems to be that these new metrics, called Wasserstein metrics, take into account the geometry of the underlying sample space. Their construction surfaces naturally in the setting of optimal transport theory. Such metrics are useful, e.g. for machine learning.

Recall that, in applications of statistics, we often use a divergence D(., .) on a space of probability measures to “measure” the difference between two probability measures. Such a divergence is used to compare probability measures; for example, D(μ, ν) is the difference between a model μ and the data ν.

The most well-known divergence is the Kullback-Leibler divergence on probability measures on $(R, B (R))$ ⁠:

KL (μ \| ν) = \int_{R} f (x) \log (\frac{f (x)}{g (x)}) d γ (x)

where f(.), g(.) are the probability densities of μ, ν respectively (with respect to some dominating measure γ on $B (R)$). The KL divergence is not a distance since it is not symmetric, but it does have analogous properties which could be used to substitute for a metric, such as the Total Variation metric

TV (μ, ν) = \sup \{| μ (A) - ν (A) | : A \in B (R)\}

The KL divergence appears in the model selection criterion AIC.
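A small numerical sketch may help contrast the two quantities. It assumes two hypothetical densities on a common three-point support (with counting measure as the dominating measure γ), and uses the standard fact that on a finite space the supremum defining TV equals half the L1 distance between the densities.

```python
import numpy as np

def kl_divergence(f, g):
    """KL(mu || nu) for discrete densities f, g on a common finite support.

    Assumes g > 0 wherever f > 0; terms with f(x) = 0 contribute 0.
    """
    f, g = np.asarray(f, float), np.asarray(g, float)
    mask = f > 0
    return float(np.sum(f[mask] * np.log(f[mask] / g[mask])))

def total_variation(f, g):
    """sup_A |mu(A) - nu(A)|, computed as half the L1 distance of the densities."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return 0.5 * float(np.sum(np.abs(f - g)))

# Hypothetical densities chosen for illustration.
f = [0.5, 0.4, 0.1]
g = [0.2, 0.3, 0.5]

kl_fg, kl_gf = kl_divergence(f, g), kl_divergence(g, f)
tv_fg, tv_gf = total_variation(f, g), total_variation(g, f)
print(kl_fg, kl_gf)   # the two KL values differ: KL is not symmetric
print(tv_fg, tv_gf)   # the two TV values coincide: TV is symmetric
```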

Metrics on spaces of probability measures are viewed as special divergences. Divergences abound. The choice of a divergence or a metric for comparing probability measures depends on its usefulness for the problem at hand. For example, the Kullback-Leibler divergence is used in AIC because of its relation to Maximum Likelihood Estimation.

Consider the case where $X = Y \subseteq R^{n}$. We are interested in the following Wasserstein divergence (a priori) on the subset $P_{p} (X)$ of the set $P (X)$ of all (Borel) probability measures on $X$, where

P_{p} (X) = \{μ \in P (X) : \int_{X} ‖ x ‖^{p} d μ (x) < \infty\}

namely, W_p(., .): $P_{p} (X) \times P_{p} (X) \to [0, \infty)$

W_{p} (μ, ν) = \inf_{λ \in Π (μ, ν)} {[\int_{X \times X} ‖ x - y ‖^{p} d λ (x, y)]}^{\frac{1}{p}}

Specifically, we are going to show that the Wasserstein divergence W_p(., .) is in fact a bona fide metric on $P_{p} (X)$ ⁠, a well-known fact in the literature.
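On the real line, W_p between two empirical measures with the same number of atoms can be computed by sorting, because the monotone coupling of sorted samples is optimal for the cost |x − y|^p with p ≥ 1 (a standard fact that is not proved here). The sketch below relies on that fact; the two Gaussian samples are assumptions, and SciPy's `wasserstein_distance` (which computes the one-dimensional W_1) is used only as a cross-check.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def wasserstein_1d(x, y, p=1.0):
    """W_p between two empirical measures on R with the same number of atoms."""
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=500)      # hypothetical sample from mu
b = rng.normal(1.0, 2.0, size=500)      # hypothetical sample from nu

w1 = wasserstein_1d(a, b, p=1)
w2 = wasserstein_1d(a, b, p=2)
w1_scipy = wasserstein_distance(a, b)   # SciPy's 1-D W_1, used as a cross-check
print(w1, w1_scipy, w2)
```

Unlike the KL divergence, the value depends on how far apart the samples lie in the sample space, not only on how their densities overlap.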

We will carry out the complete proof that Wasserstein divergence is in fact a bona fide metric to emphasize the interesting notion of disintegration (of measures).

Disintegration is a process of extracting a conditional probability measure from a joint probability measure on a product space.

To be concrete, let $X, Y \subseteq R^{n}$, and $(X, B (X))$, $(Y, B (Y))$, $(X \times Y, B (X \times Y))$ be (Borel) measurable spaces. We denote by $P (X)$, $P (Y)$, $P (X \times Y)$ the sets of all probability measures on these spaces.

For $λ \in P (X \times Y)$ ⁠, its marginal probability measure on $X$ is $μ \in P (X)$ ⁠, defined as, for any $A \in B (X)$ ⁠, $μ (A) = λ (A \times Y)$ ⁠.

A disintegration of λ with respect to μ is a family of probability measures $ν_{x} \in P (Y)$ ⁠, for any $x \in X$ ⁠, such that, for $A \in B (X)$ and $B \in B (Y)$ ⁠, we have

λ (A \times B) = \int_{A} ν_{x} (B) d μ (x)

Symbolically,

λ = \int_{X} (δ_{x} \otimes ν_{x}) d μ (x)

where δ_x is the Dirac probability measure on $X$ at $x \in X$, and δ_x ⊗ ν_x denotes the product measure (δ_x ⊗ ν_x)(A × B) = δ_x(A)ν_x(B).

The representation of λ is so written since

λ (A \times B) = \int_{X} (δ_{x} \otimes ν_{x}) (A \times B) d μ (x) =

\int_{X} δ_{x} (A) ν_{x} (B) d μ (x) = \int_{A} ν_{x} (B) d μ (x)
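On finite spaces, disintegration reduces to dividing each row of the joint probability table by its row sum. A minimal sketch, with a hypothetical joint law λ on a 3 × 2 product space, checks the defining identity λ(A × B) = ∫_A ν_x(B) dμ(x):

```python
import numpy as np

# Hypothetical joint law lambda on X x Y with X = {0, 1, 2} and Y = {0, 1}.
lam = np.array([[0.10, 0.20],
                [0.05, 0.25],
                [0.30, 0.10]])

mu = lam.sum(axis=1)            # marginal on X: mu(A) = lambda(A x Y)
nu_x = lam / mu[:, None]        # row x holds the conditional law nu_x on Y

A = [0, 2]                      # a subset of X
B = [1]                         # a subset of Y

lhs = lam[np.ix_(A, B)].sum()                       # lambda(A x B)
rhs = sum(nu_x[x, B].sum() * mu[x] for x in A)      # integral over A of nu_x(B) d mu(x)
print(lhs, rhs)
```

This construction requires μ(x) > 0 for every point x, which holds in the toy table above.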

The above is a brief tutorial on disintegration, just enough for using it in proving the triangle inequality for Wasserstein metrics. A reference could be Dudley (2003) or Graf and Mauldin (1989).

Now, for $X = R^{n}$, with norm ‖.‖, and p ≥ 1, the pth Wasserstein metric is

W_{p}^{p} (d F, d G) = \inf \{E_{π} ‖ X - Y ‖^{p} : X \sim d F, Y \sim d G, π \in Π (d F, d G)\}

where F, G are n − dimensional distribution functions of X, Y, respectively, and π has 2n − dimensional distribution function H with F, G as marginals, i.e.

H (x_{1}, \dots, x_{n}, \infty, \dots, \infty) = F (x_{1}, \dots, x_{n}), H (\infty, \dots, \infty, y_{1}, \dots, y_{n}) = G (y_{1}, \dots, y_{n})

More generally, Wasserstein distance is a metric on spaces of probability measures. Let $(X, ρ)$ be a metric space. Consider the situation where we are interested in probability measures governing the random evolution of random elements taking values in $X$ (i.e. their “laws” operating on Borel σ − field $B (X)$ ⁠). Comparisons of probability measures are standard concerns in applications, such as in the so-called empirical processes.

For μ, ν two probability measures on $(X, B (X))$ ⁠, consider the nonnegative quantity

W (μ, ν) = \inf \{\int_{X \times X} ρ (x, y) d π (x, y)\} \leq + \infty

where the infimum is taken over all joint probability measures π with marginals (projections) μ, ν.

We will denote by Π(μ, ν) the set of probability measures π on the product space $X \times X$ having μ, ν as marginal measures, i.e. $μ (.) = π (. \times X), ν (.) = π (X \times .)$ ⁠.

Note that the above quantity can be written as:

W (μ, ν) = \inf \{E ρ (X, Y) : X \sim μ, Y \sim ν\}

i.e. the infimum is taken over all random variables X, Y with values in $(X, B (X))$ ⁠, and X, Y are distributed as μ, ν, respectively.

On any subset of $P (X)$ on which W(μ, ν) < ∞ for all μ, ν in it, W(., .) is a bona fide metric.

We come now to the main investigation of Wasserstein divergences (a priori) on a metric space $X$ for which disintegration exists, such as $R^{n}$ or a Polish space.

Let $P (X)$ denote the set of all (Borel) probability measures on $B (X)$. For p ≥ 1, let $P_{p} (X) \subseteq P (X)$ be the subset of probability measures with finite p − moment, i.e.

P_{p} (X) = \{μ \in P (X) : \int_{X} ‖ x ‖^{p} d μ (x) < \infty\}

where ‖.‖ is the Euclidean norm of $R^{n}$ ⁠.

Consider the Wasserstein divergence on $P_{p} (X)$ ⁠: For $μ, ν \in P_{p} (X)$ ⁠, and p ≥ 1, let

W_{p} (μ, ν) = \inf_{λ \in Π (μ, ν)} {[\int_{X \times X} ‖ x - y ‖^{p} d λ (x, y)]}^{\frac{1}{p}}

This is just an exercise to verify that W_p(., .) does satisfy the axioms of a metric, i.e. $W_{p} (., .) : P_{p} (X) \times P_{p} (X) \to R^{+} = [0, \infty)$ is such that

(i) W_p(μ, ν) = W_p(ν, μ)
(ii) W_p(μ, ν) = 0⇔μ = ν
(iii) For any $μ, ν, γ \in P_{p} (X)$, W_p(μ, ν) ≤ W_p(μ, γ) + W_p(γ, ν)

First, for $x, y \in R^{n}$, we have ‖x − y‖^p ≤ c(‖x‖^p + ‖y‖^p) for a constant c depending only on p (e.g. c = 2^{p−1}), so that, for $μ, ν \in P_{p} (R^{n})$, taking the product measure μ ⊗ ν ∈ Π(μ, ν) as a transport plan, we have

W_{p}^{p} (μ, ν) \leq c [\int_{X} ‖ x ‖^{p} d μ (x) + \int_{X} ‖ x ‖^{p} d ν (x)] < \infty

Property (i) is obvious (since the function (x, y) → ‖x − y‖^p is symmetric, and Π(μ, ν) ≃Π(ν, μ)), and (ii) can be seen as follows.

For ν = μ, the identity map I(x) = x is an optimal transport map $T : X \to X$, so that the optimal transport plan λ = (I, I)#μ is concentrated on {(x, y) : x = y} and hence

W_{p}^{p} (μ, μ) = \int_{X \times X} ‖ x - y ‖^{p} d λ (x, y) = 0

Conversely, if W_p(μ, ν) = 0, then, since

\inf_{λ \in Π (μ, ν)} {[\int_{X \times X} ‖ x - y ‖^{p} d λ (x, y)]}^{\frac{1}{p}}

is attained, there exists λ ∈ Π(μ, ν) such that $\int_{X \times X} ‖ x - y ‖^{p} d λ (x, y) = 0$ so that λ is concentrated on {(x, y) : x = y} which, in turn, implies that, for any $A \in B (R^{n})$ ⁠,

μ (A) = λ (A \times R^{n}) = λ (A \times A) = λ (R^{n} \times A) = ν (A)

i.e. μ = ν.

Remark.

For p ≤ q, we have W_p(μ, ν) ≤ W_q(μ, ν), since, by Jensen's inequality (with respect to the convex function $t \to t^{\frac{q}{p}}$), for any λ ∈ Π(μ, ν),

W_{p}^{q} (μ, ν) \leq {[\int_{X \times X} ‖ x - y ‖^{p} d λ (x, y)]}^{\frac{q}{p}} \leq \int_{X \times X} ‖ x - y ‖^{q} d λ (x, y)

and taking the infimum over λ ∈ Π(μ, ν) on the right-hand side gives $W_{p}^{q} (μ, ν) \leq W_{q}^{q} (μ, ν)$.

However, the triangle inequality (iii) is not so obvious!

Interestingly, it is the notion of disintegration which will provide a method to verify it.

We wish to show that, for any μ_j, j = 1, 2, 3 in $P_{p} (R^{n})$ ⁠, for p ≥ 1, with support $X_{j} \subseteq R^{n}$ ⁠, j = 1, 2, 3, respectively, we should have

W_{p} (μ_{1}, μ_{2}) \leq W_{p} (μ_{1}, μ_{3}) + W_{p} (μ_{3}, μ_{2})

For this, we follow Villani (2003).

Lemma.

Let μ_j, j = 1, 2, 3 in $P_{p} (R^{n})$ ⁠, for p ≥ 1, with support $X_{j} \subseteq R^{n}$ ⁠, j = 1, 2, 3, respectively. Let λ₁₂ ∈ Π(μ₁, μ₂), and λ₂₃ ∈ Π(μ₂, μ₃).

Then there exists a probability measure λ on $X_{1} \times X_{2} \times X_{3}$ having marginals λ₁₂ and λ₂₃ on $X_{1} \times X_{2}$ ⁠, and $X_{2} \times X_{3}$ ⁠, respectively.

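Proof. Disintegrate both λ₁₂ and λ₂₃ with respect to their common marginal μ₂, and denote their disintegrations as $ν_{x_{1}}, γ_{x_{3}}$, respectively (both families are indexed by $x_{2} \in X_{2}$; the subscripts x₁, x₃ only indicate that the measures live on $X_{1}$ and $X_{3}$), so that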

λ_{12} = \int_{X_{2}} (ν_{x_{1}} \otimes δ_{x_{2}}) d μ_{2} (x_{2})

λ_{23} = \int_{X_{2}} (δ_{x_{2}} \otimes γ_{x_{3}}) d μ_{2} (x_{2})

Define $λ \in P (X_{1} \times X_{2} \times X_{3})$ by

λ = \int_{X_{2}} (ν_{x_{1}} \otimes δ_{x_{2}} \otimes γ_{x_{3}}) d μ_{2} (x_{2})

Then, for $A_{1} \in B (X_{1})$ ⁠, $A_{2} \in B (X_{2})$ ⁠, $A_{3} \in B (X_{3})$ ⁠, we have

λ (A_{1} \times A_{2} \times X_{3}) = \int_{X_{2}} (ν_{x_{1}} \otimes δ_{x_{2}} \otimes γ_{x_{3}}) (A_{1} \times A_{2} \times X_{3}) d μ_{2} (x_{2}) =

\int_{X_{2}} ν_{x_{1}} (A_{1}) δ_{x_{2}} (A_{2}) γ_{x_{3}} (X_{3}) d μ_{2} (x_{2}) = \int_{X_{2}} ν_{x_{1}} (A_{1}) δ_{x_{2}} (A_{2}) d μ_{2} (x_{2}) =

\int_{X_{2}} (ν_{x_{1}} \otimes δ_{x_{2}}) (A_{1} \times A_{2}) d μ_{2} (x_{2}) = λ_{12} (A_{1} \times A_{2})

Similarly for

λ (X_{1} \times A_{2} \times A_{3}) = λ_{23} (A_{2} \times A_{3})

QED.
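The construction in the proof (often called gluing) is easy to check on finite spaces, where it reads λ(i, j, k) = ν_j(i) γ_j(k) μ₂(j). The following sketch uses randomly generated, hypothetical plans λ₁₂ and λ₂₃ that share the marginal μ₂, and verifies the two marginal identities of the Lemma.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical joint law lam12 on X1 x X2, with |X1| = 3 and |X2| = 4.
lam12 = rng.uniform(0.1, 1.0, size=(3, 4))
lam12 /= lam12.sum()
mu2 = lam12.sum(axis=0)                        # common marginal on X2

# Hypothetical conditional laws gamma on X3 given x2 (rows sum to 1), |X3| = 2.
kernel = rng.uniform(0.1, 1.0, size=(4, 2))
kernel /= kernel.sum(axis=1, keepdims=True)
lam23 = mu2[:, None] * kernel                  # joint law on X2 x X3 with marginal mu2

# Disintegrate lam12 with respect to mu2: nu[:, j] is the law on X1 given x2 = j.
nu = lam12 / mu2[None, :]

# Glue: lam[i, j, k] = nu[i, j] * kernel[j, k] * mu2[j].
lam = nu[:, :, None] * kernel[None, :, :] * mu2[None, :, None]

marg12 = lam.sum(axis=2)                       # marginal on X1 x X2
marg23 = lam.sum(axis=0)                       # marginal on X2 x X3
print(np.abs(marg12 - lam12).max(), np.abs(marg23 - lam23).max())
```

Given x₂, the glued measure makes the first and third coordinates conditionally independent; this is one choice among possibly many measures with the prescribed marginals.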

Then the proof of the triangle inequality for Wasserstein metrics follows:

Let μ_j, j = 1, 2, 3 in $P_{p} (R^{n})$ ⁠, for p ≥ 1, with support $X_{j} \subseteq R^{n}$ ⁠, j = 1, 2, 3, respectively.

Note that, from OT theory (existence of solutions of Kantorovich's problem),

W_{p} (μ_{i}, μ_{j}) = \inf_{λ \in Π (μ_{i}, μ_{j})} {[\int_{X_{i} \times X_{j}} ‖ x_{i} - x_{j} ‖^{p} d λ (x_{i}, x_{j})]}^{\frac{1}{p}}

is attained with some optimal transport plan λ_ij ∈ Π(μ_i, μ_j). Thus,

W_{p} (μ_{i}, μ_{j}) = {[\int_{X_{i} \times X_{j}} ‖ x_{i} - x_{j} ‖^{p} d λ_{i j} (x_{i}, x_{j})]}^{\frac{1}{p}}

Now let λ be the probability measure on $X_{1} \times X_{2} \times X_{3}$ given by the above Lemma applied to the optimal plans λ₁₂ and λ₂₃, i.e. having marginals λ₁₂ and λ₂₃ on $X_{1} \times X_{2}$ and $X_{2} \times X_{3}$, respectively.

Since the marginal of λ on $X_{1} \times X_{3}$ belongs to Π(μ₁, μ₃) (although it need not be optimal), the triangle inequality for ‖.‖ and Minkowski's inequality in $L^{p} (λ)$ give

W_{p} (μ_{1}, μ_{3}) \leq {[\int_{X_{1} \times X_{2} \times X_{3}} ‖ x_{1} - x_{3} ‖^{p} d λ (x_{1}, x_{2}, x_{3})]}^{\frac{1}{p}} \leq

{[\int_{X_{1} \times X_{2} \times X_{3}} {(‖ x_{1} - x_{2} ‖ + ‖ x_{2} - x_{3} ‖)}^{p} d λ (x_{1}, x_{2}, x_{3})]}^{\frac{1}{p}} \leq

{[\int_{X_{1} \times X_{2} \times X_{3}} ‖ x_{1} - x_{2} ‖^{p} d λ (x_{1}, x_{2}, x_{3})]}^{\frac{1}{p}} + {[\int_{X_{1} \times X_{2} \times X_{3}} ‖ x_{2} - x_{3} ‖^{p} d λ (x_{1}, x_{2}, x_{3})]}^{\frac{1}{p}} =

{[\int_{X_{1} \times X_{2}} ‖ x_{1} - x_{2} ‖^{p} d λ_{12} (x_{1}, x_{2})]}^{\frac{1}{p}} + {[\int_{X_{2} \times X_{3}} ‖ x_{2} - x_{3} ‖^{p} d λ_{23} (x_{2}, x_{3})]}^{\frac{1}{p}} =

W_{p} (μ_{1}, μ_{2}) + W_{p} (μ_{2}, μ_{3})

Q.E.D.
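A quick numerical sanity check of the triangle inequality can be run on the real line. The sketch below relies on the standard one-dimensional fact that, for empirical measures with the same number of atoms, W_p is obtained by matching sorted samples; the three samples are hypothetical.

```python
import numpy as np

def w_p(x, y, p):
    """W_p between equal-size empirical measures on R (sorted-sample formula)."""
    return float(np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1.0 / p))

rng = np.random.default_rng(5)
s1 = rng.normal(0.0, 1.0, 300)       # sample standing in for mu_1
s2 = rng.exponential(1.0, 300)       # sample standing in for mu_2
s3 = rng.uniform(-2.0, 2.0, 300)     # sample standing in for mu_3

# Slack in W_p(mu_1, mu_3) <= W_p(mu_1, mu_2) + W_p(mu_2, mu_3) for several p.
gaps = {p: w_p(s1, s2, p) + w_p(s2, s3, p) - w_p(s1, s3, p) for p in (1, 2, 3)}
print(gaps)
```

Nonnegative slack for every p is what the proof above guarantees; the check is an illustration, not a substitute for the argument.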

5. Connection with random set statistics

One more useful statistical methodology arising from optimal transport theory is the unexpected connection between the current topic of partial identification (of statistical models) and random set statistics via optimal transport, as pointed out by Galichon (2016).

First, this seems a good place to spell out briefly what statistics is and how statisticians should conduct it! Roughly speaking, statistics is about finding the truth from data, and statistical works should be credible.

Unlike physical science, we need models to conduct statistics. Based upon observed data, statisticians propose models. A model is a subjective (stochastic) equation together with a set of assumptions supporting it. Of course each model contains unknown “parameters” which need to be estimated (from data) to specify it for, e.g. prediction and decision-making.

As we all know (since we “follow” the traditional approach without any hesitation), the maintained assumptions (whether they are justified or not) are there to allow us to use available data to consistently estimate the model parameters, noting that estimability of parameters in this sense is related to the notion of identification.

In order to justify our statistical estimation of our model parameter, say, in the model {F_θ : θ ∈ Θ}, we impose assumptions to make the true (but unknown) parameter θ_o identifiable (i.e. point identifiable) in the sense that the map θ → F_θ is injective. A well-known example is the linear supply and demand model in microeconomics. General supply and demand models are provided by economic theory, but when a textbook advises us to use a linear model (for simplicity?), it puts down assumptions without justification to make sure that the model parameter of interest is point identifiable.

If the maintained assumptions are not plausible, the map θ → F_θ might not be injective, i.e. there are θ′ ≠ θ such that F_θ = F_θ′ (θ and θ′ are said to be observationally equivalent) so that the model parameter is not point-identifiable, such as in games with multiple Nash equilibria. In such a situation, should we give up the analysis or the empirical attempt? No, as Manski (e.g. 2007) put it, we could live with it and look for a new way to estimate the model parameter, not as a point but as a subset of the parameter space, called the identified set. Thus, estimating an identified set is the main goal for partially identified statistical models.

In this improved statistical setting, we are facing partially identified models where point estimation becomes set estimation. But when the estimation target is a set, the identified set (i.e. the set of observationally equivalent parameters), its estimator is a random set (a set-valued function of the data). Thus, we are facing a natural extension of classical statistics, namely statistics with random sets rather than with random points.

Now, the general theory of probability supporting statistical analysis should cover the theory of random sets (as an extension of random vectors) which are well defined random elements. See Matheron (1975) or Nguyen (2006). In other words, in view of credible statistics, statistics of random sets should take a central stage in empirical research. However, the statistical theory of set-valued statistics is still young. In some contexts, e.g. estimating the level sets of an unknown probability density function, the estimation method is Hartigan's (1987) excess mass which is the counterpart of maximum likelihood method in traditional statistics. See also Nguyen (2006).

What is “interesting” is that some partial identification problems can be formulated as an optimal transport problem which in turn provides a connection with random sets useful for computational purposes. See Galichon (2016) for details. Here we elaborate a bit on the theory of random sets since after all as the identified set is a set, its estimator will be a random set statistic, and we need to investigate its properties just like the special case of random vector statistics. The point is this. While random set statistics is the natural approach to inference about set parameters in partially identified models, the context in which these partially identified models can be formulated as optimal transport problems brings out specific ways for conducting inference.

Now, in spirit, partial identification setting is somewhat similar to statistics with coarse data where the data from the desired DGP (an unknown distribution) are not observable, but instead the data from a random set containing it are observed, i.e. the latent random variable of interest is an almost sure selector of the observed random set. As such, it is related to the estimation of the identified set from a random set viewpoint.

In Galichon's analysis (2016) the focus is the identification of an identified set of a partially identified model, and the connection with random sets is based upon a result of Artstein (1983) which is generalized by Norberg (1992) as follows.

First of all, capacity functionals play the role of probability laws of random (closed) sets on $R^{d}$ by Choquet's Theorem (the counterpart of the Lebesgue-Stieltjes Theorem for random vectors), see Nguyen (2006) for an introduction. Two capacity functionals T₁, T₂ are said to form an ordered coupling if there exists a common probability space $(Ω, A, P)$ on which are defined two random closed sets S₁, S₂, having T₁, T₂ as capacity functionals, respectively, such that S₂ ⊆ S₁ P-almost surely. When the random set S₂ is single-valued, a special case which is identified with a random vector, it becomes an a.s. selector of S₁, i.e. P(S₂ ∈ S₁) = 1. This special case corresponds to the situation in coarse data analysis as well as in partial identification estimation (of identified sets). A useful result from random set theory for it is the following, which allows us to characterize an identified set as the core of a capacity functional of a random set.

Theorem (Norberg, 1992). Let μ be a probability measure on $B (R^{d})$ and T a capacity functional. Then the following are equivalent:

μ ≤ T on compact sets of $R^{d}$.
There exists a common probability space $(Ω, A, P)$ on which are defined a random closed set S with capacity functional T and a random vector X with law μ such that X is an a.s. selector of S.

We elaborate a bit on the essentials of random set theory to introduce statisticians to random set statistics.

As is standard when starting a probability theory for statistical applications, we first consider the simple situation where random quantities take values in a finite set.

Let U be a finite set with n elements. The power set of U is denoted as 2^U (set of functions U → {0, 1}). For A ⊆ U, #(A) denotes the number of elements of the subset A.

The source of uncertainty is a probability space $(Ω, A, P)$. A map X(.) : Ω → 2^U is a finite random set (a set obtained at random).

The law of X is the probability measure P_X on the power set of 2^U, where P_X(.) = P ○ X⁻¹ (the pushforward of P by X).

As in the case of finite random variables, P_X is completely determined by the probability density of X, namely f(.) : 2^U → [0, 1] where f(A) = P(X = A). Alternatively, P_X is characterized by the distribution function F(.) : 2^U → [0, 1], where F(A) = P(X ⊆ A). The counterpart of the characterization of distribution functions of random variables is this.

A set-function F(.) : 2^U → [0, 1] is a distribution function of a (finite) random set X if it satisfies the following conditions:

F(∅)= 0, F(U) = 1
For any k ≥ 2, and A_i, i = 1, 2, …, k, subsets of U,

$F (\bigcup_{i = 1}^{k} A_{i}) \geq \sum_{\emptyset \neq I \subseteq \{1, 2, \dots, k\}} (- 1)^{\# (I) + 1} F (\bigcap_{i \in I} A_{i})$

Alternatively, since T(A) = P(X ∩ A ≠ ∅) = 1 − F(A^c), the law of X can be also characterized by the set function T(.) : 2^U → [0, 1], called the capacity functional of X.
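The relations between the density f, the distribution function F and the capacity functional T can be checked on a small example. The following is a minimal sketch, assuming a hypothetical three-point set U and a hypothetical density f (neither comes from the text). It also uses the standard Möbius inversion formula, $f (A) = \sum_{B \subseteq A} (- 1)^{\# (A \setminus B)} F (B)$, to illustrate why F determines the law of X.

```python
from itertools import combinations


def subsets(universe):
    """Return all subsets of `universe` as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


U = frozenset({"a", "b", "c"})

# Hypothetical density f(A) = P(X = A) of a non-empty random set X.
f = {
    frozenset({"a"}): 0.2,
    frozenset({"a", "b"}): 0.5,
    frozenset({"b", "c"}): 0.3,
}


def density(A):
    """f(A) = P(X = A); zero for subsets not listed in f."""
    return f.get(frozenset(A), 0.0)


def F(A):
    """Distribution function F(A) = P(X is a subset of A)."""
    A = frozenset(A)
    return sum(p for B, p in f.items() if B <= A)


def T(A):
    """Capacity functional T(A) = P(X intersects A)."""
    A = frozenset(A)
    return sum(p for B, p in f.items() if B & A)


def mobius_density(A):
    """Recover f(A) from F by Moebius inversion over the subsets of A."""
    A = frozenset(A)
    return sum((-1) ** len(A - B) * F(B) for B in subsets(A))


for A in subsets(U):
    print(sorted(A), round(F(A), 3), round(T(A), 3), round(mobius_density(A), 3))
```

For every subset A the printed values satisfy T(A) = 1 − F(A^c), and the last column reproduces the density f.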

Axiomatically, a capacity functional is a function T(.) : 2^U → [0, 1] satisfying the following:

T(∅)= 0, T(U) = 1
For any k ≥ 2, and A_i, i = 1, 2, …, k, subsets of U, we have

$T (\bigcap_{i = 1}^{k} A_{i}) \leq \sum_{\emptyset \neq I \subseteq \{1, 2, \dots, k\}} (- 1)^{\# (I) + 1} T (\bigcup_{i \in I} A_{i})$
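As a quick illustration of the two families of inequalities, the case k = 2 for the distribution function F and for the capacity functional T reads as follows. Both hold with equality when the set function is a probability measure, by the inclusion-exclusion formula.

$F (A_{1} \cup A_{2}) \geq F (A_{1}) + F (A_{2}) - F (A_{1} \cap A_{2})$

$T (A_{1} \cap A_{2}) \leq T (A_{1}) + T (A_{2}) - T (A_{1} \cup A_{2})$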

Now let X be a non-empty random set on the finite set U (i.e. P(X = ∅) = 0). The core $C (T)$ of its capacity functional T is the set of probability measures μ on U such that μ(.) ≤ T(.).
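In the finite case, membership in the core can be checked by brute force over all subsets of U. The following is a minimal sketch, assuming a hypothetical random set that takes the values {a, b} and {b, c} with probability 1/2 each. The helper names `capacity` and `in_core` are illustrative only.

```python
from itertools import combinations


def nonempty_subsets(universe):
    """Return all non-empty subsets of `universe` as frozensets."""
    items = sorted(universe)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def capacity(f, A):
    """T(A) = P(X intersects A) for a random set X with density f."""
    return sum(p for B, p in f.items() if B & A)


def in_core(mu, f, universe, tol=1e-12):
    """Check mu(A) <= T(A) for every non-empty subset A of the universe."""
    return all(
        sum(mu.get(u, 0.0) for u in A) <= capacity(f, A) + tol
        for A in nonempty_subsets(universe)
    )


U = {"a", "b", "c"}

# Hypothetical density: X = {a, b} or X = {b, c}, each with probability 1/2.
f = {frozenset({"a", "b"}): 0.5, frozenset({"b", "c"}): 0.5}

# Law of the selector choosing a from {a, b} and c from {b, c}.
mu_selector = {"a": 0.5, "b": 0.0, "c": 0.5}
# Law of the selector that always chooses b.
mu_all_b = {"a": 0.0, "b": 1.0, "c": 0.0}
# Not the law of any selector: mu({a}) = 0.7 exceeds T({a}) = 0.5.
mu_outside = {"a": 0.7, "b": 0.0, "c": 0.3}

for name, mu in [("selector", mu_selector), ("all_b", mu_all_b),
                 ("outside", mu_outside)]:
    print(name, in_core(mu, f, U))
```

The first two measures are laws of selectors of X and lie in the core. The third violates μ ≤ T on the singleton {a}.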

We now extend all of the above to the case where the sample space is $U = R^{d}$.

Recall that, for random vectors, i.e. random elements taking values in $R^{d}$, the probabilistic background is based on the theory of measures on the Borel measurable space $(R^{d}, B (R^{d}))$, where the Borel σ-field $B (R^{d})$ is constructed using the topology of $R^{d}$. For random sets taking subsets of $R^{d}$ as values, we need a topology on $2^{R^{d}}$. Now a random vector is identified with a random set taking singletons as values, and each {x} is a closed set of $R^{d}$. Thus, following Matheron (1975), we consider random sets taking values in the class of closed subsets of $R^{d}$, denoted as $F (R^{d})$, on which a “hit-or-miss topology” is established to obtain its Borel σ-field, denoted as $B (F)$.

A random closed set, defined on a probability space $(Ω, A, P)$ (its source of uncertainty), is a map $X (.) : Ω \to F (R^{d})$ which is $A$-$B (F)$-measurable. Its probability law is the probability P_X on $B (F)$ obtained as P_X(.) = P ○ X⁻¹(.).

The notion of capacity functionals in the finite case is extended as follows. Let $K (R^{d})$ denote the set of compact subsets of $R^{d}$. Then $T (.) : K (R^{d}) \to R$ is called a capacity functional if it satisfies:

0 ≤ T(.) ≤ 1, T(∅)= 0
For any k ≥ 2, and A_i, i = 1, 2, …, k, compact subsets of $R^{d}$, we have

$T (\bigcap_{i = 1}^{k} A_{i}) \leq \sum_{\emptyset \neq I \subseteq \{1, 2, \dots, k\}} (- 1)^{\# (I) + 1} T (\bigcup_{i \in I} A_{i})$

If $K_{n} \in K (R^{d})$ and $K_{n} ↘ K \in K (R^{d})$ then T(K_n) ↘ T(K).

The counterpart of the Lebesgue-Stieltjes Theorem is Choquet's Theorem: If $T (.) : K (R^{d}) \to R$ is a capacity functional, then there exists a unique probability measure Q on $B (F)$ such that, for all $K \in K (R^{d})$, $Q (F_{K}) = T (K)$, where $F_{K} = \{A \in F (R^{d}) : A \cap K \neq \emptyset\}$.

In other words, the capacity functional characterizes the probability law of a random closed set. The core of a capacity functional is the set of probability measures μ on $(R^{d}, B (R^{d}))$ such that μ ≤ T on $K (R^{d})$. Norberg's Theorem (1992) is valid for $R^{d}$, so that the core of a capacity functional (of a random closed set on $R^{d}$) is related to identified sets in partially identified statistical models.
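The domination μ ≤ T on compact sets can be illustrated by simulation in the coarse data setting with d = 1. The following is a minimal sketch, assuming a hypothetical model in which a latent standard normal X is only observed through a random interval S = [lo, hi] containing it. The model, the sample size and the function names `mu_hat` and `T_hat` are assumptions made for illustration.

```python
import random

random.seed(0)
n = 20_000

# Hypothetical coarse-data model: latent X, observed interval S = [lo, hi]
# constructed so that X is a selector of S (lo <= X <= hi always holds).
sample = []
for _ in range(n):
    x = random.gauss(0.0, 1.0)
    lo = x - random.random()
    hi = x + random.random()
    sample.append((x, lo, hi))


def mu_hat(a, b):
    """Empirical frequency of the event X in K, for K = [a, b]."""
    return sum(a <= x <= b for x, _, _ in sample) / n


def T_hat(a, b):
    """Empirical frequency of the event S hits K, for K = [a, b]."""
    return sum(lo <= b and hi >= a for _, lo, hi in sample) / n


for a, b in [(-1.0, 0.0), (0.0, 0.5), (1.0, 3.0)]:
    print(f"K=[{a}, {b}]  mu_hat={mu_hat(a, b):.3f}  T_hat={T_hat(a, b):.3f}")
```

Since X ∈ K implies S ∩ K ≠ ∅ for every draw, the empirical frequency of the selector never exceeds the empirical capacity. In practice only the intervals would be observed, and T_hat would bound the unobservable μ from above.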

References

Artstein (1983), “Distributions of random sets and random selections”, Israel Journal of Mathematics, Vol. , pp. 313-324.
