This paper aims to offer a tutorial/introduction to new statistics arising from the theory of optimal transport to empirical researchers in econometrics and machine learning.
Presenting in a tutorial/survey lecture style to help practitioners with the theoretical material.
The tutorial survey of some main statistical tools (arising from optimal transport theory) should help practitioners to understand the theoretical background in order to conduct empirical research meaningfully.
This study is an original presentation useful for new comers to the field.
1. Introduction
A significant contribution to statistics in general and to econometrics and machine learning in particular from optimal transport theory has surfaced recently. As such it is about time for practitioners to be aware of it to apply it to real-world problems, especially in econometrics, to improve credibility of empirical findings. That is precisely the purpose of this prelude.
This paper is organized as follows. Although this is a prelude where detailed technical material is not spelled out, the main purpose is to call practitioners' attention to new improved statistical tools arising from optimal transport theory, and as such, optimal transport in a nutshell will be presented in Section 2. Section 3 is about the most significant new tool in statistical analysis, namely the notion of multivariate quantiles. Section 4 is devoted to the elaboration of another new tool for statistics, namely the Wasserstein metrics. In Section 5 we elaborate on the interesting connection between partial identification and random set statistics, also thanks to optimal transport.
2. Optimal transport in a nutshell
Monge (1781) was concerned with the problem of finding the cheapest way to transport, say, soil from a collection of mines to a collection of construction sites.
In mathematical language, the problem is formulated as follows. Let denote the spaces of probability measures on respectively. Given , , and a (cost) function c(., .): . A transport map is a (measurable) function T(.): such that ν(.) = μ ○T−1, in symbol ν = T#μ (T pushes μ forward to ν). The transport cost of T is
The Monge's problem is to find an optimal transport map T*, i.e.
This functional optimization problem might not have a solution in general, e.g. when μ is a Dirac probability measure whereas ν is not. But more importantly, with the optimization variable being T, the objective function is not linear. Also, the constraint set {T : T#μ = ν} is not convex. As such, the computation of a solution is difficult.
Because of these issues, Monge's problem was unsolved until Kantorovich (1942) who reformulated Monge's problem to a setting avoiding the two main difficulties mentioned above.
It is interesting to note that Monge's difficulties have analogies in mathematics. When a quadratic equation does not have real solutions, we enlarge its solution domain (the real line ) to the complex plane so that the equation has complex solutions. Similarly, we consider mixed (random) strategies in non-cooperative games to establish the existence of Nash equilibria for any such games.
The same methodology could be used here to “solve” Monge's problem. And that is exactly what Kantorovich has done.
If is a transport map, and is the identity map , then , pushes μ forward to a joint probability measure on having μ, ν as marginal probability measures. Thus the space of joint probability measures on having μ, ν as marginal probability measures, denoted as Π(μ, ν), contains the set of all (Monge) transport maps (by identification). Elements of Π(μ, ν) are referred to as transport plans. Hence, by enlarging transport maps to transport plans, Kantorovich reformulated Monge's problem as follows. Given μ on , ν on , and , find a transport plan λ* ∈ Π(μ, ν) such that
The difficulties in Monge's problem are avoided: Kantorovich's problem always has solutions since μ ⊗ ν ∈ Π(μ, ν); with the optimization variable , the objective function is linear, and the constraint set {λ : λ ∈ Π(μ, ν)} is convex, so that we are in the domain of convex optimization!
3. Multivariate quantiles
The focus on (univariate) quantile functions as a basis for statistical analysis has been advocated by Parzen (1979). In fact, in the comment to Breiman's paper (2001), Parzen even suggested that there are many possible “cultures” for statistical modeling where “quantile culture” could be one of them.
Without digging into whether Parzen's quantile culture is a culture in Breiman's sense, we could view that the use of quantile functions is part of the standard statistical analysis in which, instead of distribution functions, we focus on quantile functions. The two cultures elaborated in Breiman's paper (2001), namely the data modeling and algorithmic modeling, might be not really disjoint, i.e. they could be combined to form a new culture. That was precisely suggested in “Statistical Modeling: The Three cultures” in 2023 by Daoud and Dubhashi (2023) as a hybrid modeling culture! Perhaps, it could be so as we witness at present the interests of Econometricians in Machine (or Statistical) Learning?
Anyway, the point we want to make is this. It is true that the use of quantile functions, such as in quantile regression, provides more information than that of mean regression. But why Parzen's “quantile culture” did not get off the ground or ring the bell, say, in multivariate analysis?
The answer could be twofold. As mean (linear) regression, in multivariate analysis, is the bread-and-butter tool in statistics, quantile regression, introduced by Koenker and Basett (1978), is also only for one dimension. The second reason is crucial: there is no counterpart of multivariate mean regression, and this is because of the lack of a “correct” notion of multivariate (vector) quantile functions, let alone its associated regression analysis.
Specifically, the mathematical problem of how to generalize the familiar notion of an univariate quantile function to higher dimensions is difficult because the explicit definition of a quantile function on the real line is based on the total order relation of , whereas there is no such order relation on with n > 1.
In the literature, among various attempts to “solve” the problem, e.g. Hallin et al. (2010) and Serfling and Zuo (2010), an attempt was to consider the partial order relation of when n > 1, exemplified by Belloni and Winkler (2011), leading to the notion of “partial multivariate quantiles”. This is typical of an approach to avoid the lack of the total order relation on , but does not really address the original problem, i.e. partial vector quantile functions are not generalizations of univariate quantile functions. They are just substitutes.
Finally, as Koenker (2017) acknowledged, the happy ending arrived in 2016 with the works of Carlier et al. (2016, 2017), and that was inspired from the Theory of Optimal Transport, Villani (2003).
This Section aims at elaborating a bit on the notion of vector (multivariate) quantile functions correctly generalizing the familiar notion of univariate quantile functions.
Let X be a real-valued random variable (the name of some quantity), i.e. a measurable map from a probability space , its source of uncertainty, to the measurable space , its sampling space. Its law is the probability measure PX on obtained by pushing forward P by X, i.e. PX(.)= P ○ X−1, in symbol PX = X#P. By Lebesgue-Stieltjes theorem, PX = dF where is the distribution function of X. The distribution function F of X contains all information about the random evolution of X. If we know F, can we create the data from X? This is the problem known as simulations. Yes, but not directly by using F. Instead, we consider its “pseudo inverse” function F[−1] known as its (univariate) quantile function defined explicitly as ,
and show that F[−1] will push forward the uniform probability measure du on (0,1) to dF (F[−1]#du = dF, i.e. ) so that (equal in distribution) where U denotes the (uniform) random variable on (0,1) with law du.
While the pseudo inverse F[−1](.) provides a reasonable mathematical definition for quantiles, its explicit definition involves the total order relation of the real line and that is the difficulty to extend it to higher dimensions, say, as a function with n > 1.
A short story of this extension problem seems interesting to note. Traditionally, the extension of a concept in one dimension to several dimensions could be done componentwise, such as the concept of the mean of a random vector. But defining a vector quantile function componentwise does not work since the property G#du = dF, for du as uniform law on (0,1)n, and dF as law of a random vector on (for n > 1) is not satisfied.
Of course F(.) is characterized by F[−1](.), since if is monotone non decreasing and left continuous then there exists a unique distribution function F(.) such that Q(.)= F[−1](.). However, such a characterization of F[−1](.) does not extend to higher dimensions.
Also, traditionally, if we cannot use directly an established concept in one setting to extend it to another setting, we look for a possible equivalent concept (a characterization of the established concept) that can be generalized. For example, to generalize ordinary sets to fuzzy sets, we use the indicator function of an ordinary set (as its membership function) as a characterization of the set from which to extend to the new setting. Here the question is: what is a characterization of F[−1], i.e. another equivalent way to define it.
Perhaps, previous attempts to generalize univariate quantile functions to vector quantile functions did not ask this question. It turns out that the answer is hidden in plain sight! Besides the property F[−1]#du = dF, the (explicitly defined) function is monotone non decreasing, and these two properties provide a characterization for F[−1]. Specifically, F[−1](.): is the unique function that is monotone non decreasing and satisfies G#du = dF :
If is monotone non decreasing and satisfies G#du = dF, then G(.)= F[−1], i.e.
Proof. By monotonicity of G, we have
Consider the points x such that G(x) > F[−1](x). This means that there exists ɛo > 0 such that F(G(x) − ɛ) ≥ Fdu(x) for every ɛ ∈ [0, ɛo]. Also, since , we have F(G(x) − ɛ) < Fdu(x). Thus, F(G(x) − ɛ) = Fdu(x) for any ɛ ∈ [0, ɛo]. Note that F(G(x) − ɛ) is the value of F which F takes on an interval where it is constant. But these intervals are a countable quantity, so that the values yj of F on these intervals are also countable. Therefore, the points x where G(x) > F[−1](x) are contained in ∪j{x : Fdu(x) = yj} which is du − negligible (since du is atomless). As a consequence, G(x) = F[−1](x), du − almost everywhere. Q.E.D.
As a consequence, the above characterization of F[−1] can be used to obtain its counterpart in higher dimensions since on , with n > 1, the property G#du = dF makes sense and the monotone non decreasing property for is equivalent to
where denotes the scalar product on .
The characterization of F[−1] brings out the fact that the total order relation on does not play an essential role in defining it.
The point is this. If (for n > 1) is going to be an extension of , G(.) has to be monotone non decreasing and pushing forward du to dF (in dimension n > 1).
The upshot is that, for n ≥ 1, these two properties are characteristic for the notion of quantiles, in the sense that there is uniquely one such function , so that, for n = 1, it coincides with F[−1](.). Thus, in dimension 1, the familiar univariate quantile function F[−1] can be defined without using explicitly the total order relation of !
This upshot was discovered in the context of Optimal Transport, see Villani (2003), Brenier (1991), McCam (1995), Carlier et al. (2017) and Galichon (2016), where a (n-dimensional) vector quantile function is the unique monotone noncreasing function such that G#du = dF.
Clearly, the upshot tells us that the familiar univariate quantile function can be generalized to higher dimensions rigorously. However, except in dimension 1, the vector quantile functions so determined are not obtained in a close form. Practitioners should consult the literature for computational works.
The following notes could give a flavor of optimal transport in getting, finally, the correct notion of multivariate quantiles.
In the setting of optimal transport, F[−1](.) is characterized by a unique “transport map” , monotone non decreasing and T*#du = dF, where
i.e. the solution of Monge's problem with cost function . On the other hand, the function
is convex, so that F[−1](.) is the derivative of the convex function φ(.) on (0,1).
In dimension n > 1, the above leads to the notion of multivariate quantile function by McCam's theorem (1995): Let be a multivariate distribution function, then there exists a unique gradient of some convex function (φ is not unique, but ∇φ is unique) such that ∇φ#du = dF, where du is the uniform law on (0,1)n.
4. Wasserstein metrics
We elaborate now upon a new improved type of metrics on spaces of probability measures arising from optimal transport theory. The main improvement seems to be that these new metrics, called Wasserstein metrics, do take into account of the geometry of the underlying sample space. Their construction surfaces naturally in the setting of optimal transport theory. Such metrics are useful, e.g. for machine learning.
Recall that, in applications of statistics, we often use a divergence D(., .) on a space of probability measures to “measure” of the difference between two probability measures. Such a divergence is used to compare probability measures, for example D(μ, ν) is the difference between a model μ and a data ν.
The most well-known divergence is the Kullback-Leibler divergence on probability measures on :
where f(.), g(.) are probability density of μ, ν respectively (with respect to some dominating measure γ on ). The KL divergence is not a distance since it is not symmetric, but it does have analogous properties which could be used to substitute for a metric, such as the Total Variation metric
The KL divergence appears in the model selection criterion AIC.
Metrics on spaces of probability measures are viewed as special divergences. Divergences abound. The choice of a divergence or a metric for comparing probability measures depends on its usefulness for the problem at hand. For example, the Kullback-Leibler divergence is used in AIC because of its relation to Maximum Likelihood Estimation.
Consider the case where , we are interested in the following Wasserstein divergence (a priori) on the subset of the set of all (Borel) probability measures on , where
namely, Wp(., .):
Specifically, we are going to show that the Wasserstein divergence Wp(., .) is in fact a bona fide metric on , a well-known fact in the literature.
We will carry out the complete proof that Wasserstein divergence is in fact a bona fide metric to emphasize the interesting notion of disintegration (of measures).
Disintegration is a process of extracting a conditional probability measure from a joint probability measure on a product space.
To be concrete, let , and , , , be (Borel) measurable spaces. We denote by , , the set of all probability measures on these spaces.
For , its marginal probability measure on is , defined as, for any , .
A disintegration of λ with respect to μ is a family of probability measures , for any , such that, for and , we have
Symbolically,
where δx is the Diract probability measure on , at , and δx ⊗ νx denotes the product measure (δx ⊗ νx)(A × B) = δx(A)νx(B).
The representation of λ is so written since
Below is a tutorial on disintegration, just enough for using it in proving the triangle inequality for Wasserstein metrics. A reference could be Dudley (2003) or Graf and Mauldin (1989).
Now, for , with norm ‖.‖, and p ≥ 1, the pth- Wasserstein metric is
where F, G are n − dimensional distribution functions of X, Y, respectively, and π has 2n − dimensional distribution function H with F, G as marginals, i.e.
More generally, Wasserstein distance is a metric on spaces of probability measures. Let be a metric space. Consider the situation where we are interested in probability measures governing the random evolution of random elements taking values in (i.e. their “laws” operating on Borel σ − field ). Comparisons of probability measures are standard concerns in applications, such as in the so-called empirical processes.
For μ, ν two probability measures on , consider the nonnegative quantity
where the infimum is taken over all joint probability measure π with marginals (projections) μ, ν.
We will denote by Π(μ, ν) the set of probability measures π on the product space having μ, ν as marginal measures, i.e. .
Note that the above quantity can be written as:
i.e. the infimum is taken over all random variables X, Y with values in , and X, Y are distributed as μ, ν, respectively.
On a subset of where W(μ, ν) < ∞, for μ, ν in it, W(., .) is a bona fide metric.
We come now to the main investigation of Wasserstein divergences (a priori) on a metric space for which disintegration exists, such as or a polish space.
Let denotes the set of all (Borel) probability measures on . For p ≥ 1, let , be the subset of probability measures with finite p − moment, i.e.
where ‖.‖ is the Euclidean norm of .
Consider the Wasserstein divergence on : For , and p ≥ 1, let
This is just an exercise to verify that Wp(., .) does satisfy the axioms of a metric, i.e. is such that
Wp(μ, ν) = Wp(ν, μ)
Wp(μ, ν) = 0⇔μ = ν
For any , Wp(μ, ν) ≤ Wp(μ, γ) + Wp(γ, ν)
First, since, for , ‖x − y‖p ≤ c(‖x‖p + ‖y‖p), so that , for , we have
While (i) is obvious (since the function (x, y) → ‖x − y‖p is symmetric, and Π(μ, ν) ≃Π(ν, μ)) and (ii) can be seen as follows.
For ν = μ , the optimal transport map is the identity map I(x) = x, so that the optimal transport plan is λ = (I, I)#μ concentrated on {(x, y) : x = y} and hence
Conversely, if Wp(μ, ν) = 0, then, since
is attained, there exists λ ∈ Π(μ, ν) such that so that λ is concentrated on {(x, y) : x = y} which, in turn, implies that, for any ,
i.e. μ = ν.
For p ≤ q, we have Wp(μ, ν) ≤ Wq(μ, ν), since, by Jensen's inequality (with respect to the convex function ), for any λ ∈ Π(μ, ν),
However, the triangle inequality (iii) is not so obvious!
Interestingly, it is the notion of disintegration which will provide a method to verify it.
We wish to show that, for any μj, j = 1, 2, 3 in , for p ≥ 1, with support , j = 1, 2, 3, respectively, we should have
For this, we follow Villani (2003).
Let μj, j = 1, 2, 3 in , for p ≥ 1, with support , j = 1, 2, 3, respectively. Let λ12 ∈ Π(μ1, μ2), and λ23 ∈ Π(μ2, μ3).
Then there exists a probability measure λ on having marginals λ12 and λ23 on , and , respectively.
Proof. Disintegrate both λ12 and λ23 with respect to their common marginal μ2, and denote their disintegrations as , respectively, so that
Then constructed as
Then, for , , , we have
Similarly for
QED.
Then the proof of the triangle inequality for Wasserstein metrics follows:
Let μj, j = 1, 2, 3 in , for p ≥ 1, with support , j = 1, 2, 3, respectively.
Note that, from OT theory (existence of solutions of Kantorovich's problem),
is attained with some optimal transport plan λij ∈ Π(μi, μj). Thus,
Now, let λ, in the above Lemma corresponding to μj, j = 1, 2, 3 , be the probability measure on having marginals λ12 and λ23 on , and , respectively.
We then have
Q.E.D.
5. Connection with random set statistics
One more useful statistical methodology arising from optimal transport theory was the unexpected connection between the current topic of partial identification (of statistical models) and random set statistics via optimal transport, as pointed out by Galichon (2016).
First, it seems here is a good place to spell out briefly what is statistics and how statisticians should conduct statistics! Roughly speaking, statistics is about finding the truth from data, and statistical works should be credible.
Unlike physical science, we need models to conduct statistics. Based upon observed data, statisticians propose models. A model is a subjective (stochastic) equation together with a set of assumptions supporting it. Of course each model contains unknown “parameters” which need to be estimated (from data) to specify it for, e.g. prediction and decision-making.
As we all know (since we “follow” the traditional approach to without any hesitation) that the maintained assumptions (whether they are justified or not) are there to allow us to use available data to consistently estimate the model parameters, noting that estimability of parameters in this sense is related to the notion of identification.
In order to justify our statistical estimation of our model parameter, say, in the model {Fθ θ ∈ Θ}, we impose assumptions to make the true (but unknown) parameter θo identifiable (i.e. point identifiable) in the sense that the map θ → Fθ is injective. A well-known example for all is the linear supply and demand model in microeconomics. General supply and demand models are provided by economic theory, but when a text book advises us to use a linear model (for simplicity?), it puts down assumptions without justifications to make sure that the model parameter of interest is point identifiable.
If the maintained assumptions are not plausible, the map θ → Fθ might not be injective, i.e. there are θ′ ≠ θ such that Fθ = Fθ′ (θ and θ′ are said to be observationally equivalent) so that the model parameter is not point-identifiable, such as in games with multiple Nash equilibria. In such a situation, should we give up the analysis or the empirical attempt ? No, as Manski (e.g. 2007) put it, we could live with it and look for a new way to estimate the model parameter, not as a point but as a subset of the parameter space, called the identified set. Thus, estimating an identified set is the main goal for partially identified statistical models.
In this improved statistical setting, we are facing partially identified models where point estimation becomes set estimation. But when the estimation target is a set, the identified set (i.e. set of observationally equivalent parameters), its estimated set is a random set (a set-valued function of the data). Thus, we are facing a natural extension of classical statistics, namely statistics with random sets rather with random points.
Now, the general theory of probability supporting statistical analysis should cover the theory of random sets (as an extension of random vectors) which are well defined random elements. See Matheron (1975) or Nguyen (2006). In other words, in view of credible statistics, statistics of random sets should take a central stage in empirical research. However, the statistical theory of set-valued statistics is still young. In some contexts, e.g. estimating the level sets of an unknown probability density function, the estimation method is Hartigan's (1987) excess mass which is the counterpart of maximum likelihood method in traditional statistics. See also Nguyen (2006).
What is “interesting” is that some partial identification problems can be formulated as an optimal transport problem which in turn provides a connection with random sets useful for computational purposes. See Galichon (2016) for details. Here we elaborate a bit on the theory of random sets since after all as the identified set is a set, its estimator will be a random set statistic, and we need to investigate its properties just like the special case of random vector statistics. The point is this. While random set statistics is the natural approach to inference about set parameters in partially identified models, the context in which these partially identified models can be formulated as optimal transport problems brings out specific ways for conducting inference.
Now, in spirit, partial identification setting is somewhat similar to statistics with coarse data where the data from the desired DGP (an unknown distribution) are not observable, but instead the data from a random set containing it are observed, i.e. the latent random variable of interest is an almost sure selector of the observed random set. As such, it is related to the estimation of the identified set from a random set viewpoint.
In Galichon's analysis (2016) the focus is the identification of an identified set of a partially identified model, and the connection with random set is based upon a result of Artstein (1983) which is generalized by Norberg (1992) as follows.
First of all, capacity functionals play the role of probability laws of random (closed) sets on by Choquet's Theorem (the counterpart of Lebesgue- Stieltjes Theorem for random vectors), see Nguyen (2006) for an introduction. Two capacity functionals T1, T2 are said to form an ordered coupling if there exists a common probability space on which are defined two random closed sets S1, S2 such that S2 ⊆ S1 P, i.e. where S1, S2 have T1, T2 as capacity functionals, respectively. When the random set S2 is single-valued, a special case which is identified with a random vector, it becomes an a.s. selector of S1, i.e. P(S2 ∈ S1) = 1. This special case corresponds to the situation in coarse data analysis as well as in partial identification estimation (of identified sets). An useful result from random set theory for it is the following which allows us to characterize an identified set as the core of a capacity functional of a random set.
Theorem (Norberg, 1992). Let μ be a probability measure on and T be a capacity functional, then the following are equivalent:
μ ≤ T on compact sets of
There exists a common probability space on which are defined a random closed set S with capacity functional T and a random vector X with law μ, and which is an a.s. selector of S.
We elaborate a bit on the essentials of random set theory to introduce statisticians to random set statistics.
Just like traditional or standard way to start a probability theory for statistical applications, we consider the simple situation where random quantities take values in a finite set.
Let U be a finite set with n elements. The power set of U is denoted as 2U (set of functions U → {0, 1}). For A ⊆ U, #(A) denotes the number of elements of the subset A.
The source of uncertainty is a probability space . A map X(.) : Ω → 2U is a finite random set (a set obtained at random).
The law of X is the probability measure PX on the power set of 2U , where PX(.)= P ○ X−1 (the pushforward of P by X).
As in the case of finite random variables, PX is completely determined by the probability density of X, namely f(.) : 2U → [0, 1] where f(A) = P(X = A). Alternatively, PX is characterized by the distribution function F(.) : 2U → [0, 1], where F(A) = P(X ⊆ A). The counterpart of the characterization of distribution functions of random variables is this.
A set-function F(.) : 2U → [0, 1] is a distribution function of a (finite) random set X if it satisfies the following conditions:
F(∅)= 0, F(U) = 1
For any k ≥ 2, and Ai, i = 1, 2, …, k, subsets of U,
Alternatively, since T(A) = P(X ∩ A ≠ ∅) = 1 − F(Ac), the law of X can be also characterized by the set function T(.) : 2U → [0, 1], called the capacity functional of X.
Axiomatically, a capacity function is a function T(.) : 2U → [0, 1] satisfying the following:
T(∅)= 0, T(U) = 1
For any k ≥ 2, and Ai, i = 1, 2, …, k, subsets of U, we have
Now let X be a non-empty random set on the finite set U (i.e. P(X = ∅) = 0). The core of its capacity functional T is the set of probability measures μ on U such that μ(.) ≤ T(.).
We extend all the above to the case where the sampling space .
Remember that, for random vectors, i.e. random elements taking values in , their probabilistic background was based on the theory of measures on the Borel measurable space where the Borel σ − field is constructed using the topology of . For random sets taking values as subsets of , we need a topology on . Now a random vector is identified as a random set taking singletons as values. But each {x} is a closed set of . Thus, following Matheron (1975), we consider random sets taking values as closed subsets of , denoted as on which a “hit-or-miss topology” is established to obtain its Borel σ − field, denoted as .
A random closed set, defined on a probability space (its source of uncertainty), is a map , - measurable. Its probability law is the probability PX on obtained as PX(.)= P○X−1(.).
The notion of capacity functionals in the finite case is extended as follows. Let denote the set of compact subsets of . Then is called a capacity functional if it satisfies:
0 ≤ T(.) ≤ 1, T(∅)= 0
For any k ≥ 2, and Ai, i = 1, 2, …, k, subsets of U, we have
If and then T(Kn) ↘ T(K).
The counterpart of Lebesgue-Stieltjes Theorem is the Choquet's Theorem: If is a capacity functional, then there exists a unique probability measure Q on such that, for all , , where .
In other words, the capacity functional characterizes the probability law of a random closed set. The core of a capacity functional is the set of probability measures μ on such that μ ≤ T on . Norberg's Theorem (1992) is valid for so that the core of a capacity functional (of a random closed set on ) is related to identified sets in partially identified statistical models.
