Skip to Main Content
Purpose

This paper aims to offer a tutorial/introduction to new statistics arising from the theory of optimal transport to empirical researchers in econometrics and machine learning.

Design/methodology/approach

Presenting in a tutorial/survey lecture style to help practitioners with the theoretical material.

Findings

The tutorial survey of some main statistical tools (arising from optimal transport theory) should help practitioners to understand the theoretical background in order to conduct empirical research meaningfully.

Originality/value

This study is an original presentation useful for new comers to the field.

A significant contribution to statistics in general and to econometrics and machine learning in particular from optimal transport theory has surfaced recently. As such it is about time for practitioners to be aware of it to apply it to real-world problems, especially in econometrics, to improve credibility of empirical findings. That is precisely the purpose of this prelude.

This paper is organized as follows. Although this is a prelude where detailed technical material is not spelled out, the main purpose is to call practitioners' attention to new improved statistical tools arising from optimal transport theory, and as such, optimal transport in a nutshell will be presented in Section 2. Section 3 is about the most significant new tool in statistical analysis, namely the notion of multivariate quantiles. Section 4 is devoted to the elaboration of another new tool for statistics, namely the Wasserstein metrics. In Section 5 we elaborate on the interesting connection between partial identification and random set statistics, also thanks to optimal transport.

Monge (1781) was concerned with the problem of finding the cheapest way to transport, say, soil from a collection of mines to a collection of construction sites.

In mathematical language, the problem is formulated as follows. Let P(X),P(Y) denote the spaces of probability measures on X,YRn respectively. Given μP(X), νP(Y), and a (cost) function c(., .): X×YR+. A transport map is a (measurable) function T(.): XY such that ν(.) = μT−1, in symbol ν = T#μ (T pushes μ forward to ν). The transport cost of T is

The Monge's problem is to find an optimal transport map T*, i.e.

This functional optimization problem might not have a solution in general, e.g. when μ is a Dirac probability measure whereas ν is not. But more importantly, with the optimization variable being T, the objective function TXc(x,T(x))dμ(x) is not linear. Also, the constraint set {T : T#μ = ν} is not convex. As such, the computation of a solution is difficult.

Because of these issues, Monge's problem was unsolved until Kantorovich (1942) who reformulated Monge's problem to a setting avoiding the two main difficulties mentioned above.

It is interesting to note that Monge's difficulties have analogies in mathematics. When a quadratic equation does not have real solutions, we enlarge its solution domain (the real line R) to the complex plane so that the equation has complex solutions. Similarly, we consider mixed (random) strategies in non-cooperative games to establish the existence of Nash equilibria for any such games.

The same methodology could be used here to “solve” Monge's problem. And that is exactly what Kantorovich has done.

If T(.):X×YR+ is a transport map, and IX(.):XX is the identity map IX(x)=x, then (IX×T)(.):XX×Y, (IX×T)(x)=(x,T(x)) pushes μ forward to a joint probability measure μ(IX×T)1 on X×Y having μ, ν as marginal probability measures. Thus the space of joint probability measures on X×Y having μ, ν as marginal probability measures, denoted as Π(μ, ν), contains the set of all (Monge) transport maps (by identification). Elements of Π(μ, ν) are referred to as transport plans. Hence, by enlarging transport maps to transport plans, Kantorovich reformulated Monge's problem as follows. Given μ on X, ν on Y, and c(.,.):X×YR+, find a transport plan λ* ∈ Π(μ, ν) such that

The difficulties in Monge's problem are avoided: Kantorovich's problem always has solutions since μν ∈ Π(μ, ν); with the optimization variable λΠ(μ,ν), the objective function λX×Yc(x,y)dλ(x,y) is linear, and the constraint set {λ : λ ∈ Π(μ, ν)} is convex, so that we are in the domain of convex optimization!

The focus on (univariate) quantile functions as a basis for statistical analysis has been advocated by Parzen (1979). In fact, in the comment to Breiman's paper (2001), Parzen even suggested that there are many possible “cultures” for statistical modeling where “quantile culture” could be one of them.

Without digging into whether Parzen's quantile culture is a culture in Breiman's sense, we could view that the use of quantile functions is part of the standard statistical analysis in which, instead of distribution functions, we focus on quantile functions. The two cultures elaborated in Breiman's paper (2001), namely the data modeling and algorithmic modeling, might be not really disjoint, i.e. they could be combined to form a new culture. That was precisely suggested in “Statistical Modeling: The Three cultures” in 2023 by Daoud and Dubhashi (2023) as a hybrid modeling culture! Perhaps, it could be so as we witness at present the interests of Econometricians in Machine (or Statistical) Learning?

Anyway, the point we want to make is this. It is true that the use of quantile functions, such as in quantile regression, provides more information than that of mean regression. But why Parzen's “quantile culture” did not get off the ground or ring the bell, say, in multivariate analysis?

The answer could be twofold. As mean (linear) regression, in multivariate analysis, is the bread-and-butter tool in statistics, quantile regression, introduced by Koenker and Basett (1978), is also only for one dimension. The second reason is crucial: there is no counterpart of multivariate mean regression, and this is because of the lack of a “correct” notion of multivariate (vector) quantile functions, let alone its associated regression analysis.

Specifically, the mathematical problem of how to generalize the familiar notion of an univariate quantile function to higher dimensions is difficult because the explicit definition of a quantile function on the real line R is based on the total order relation of R, whereas there is no such order relation on Rn with n > 1.

In the literature, among various attempts to “solve” the problem, e.g. Hallin et al. (2010) and Serfling and Zuo (2010), an attempt was to consider the partial order relation of Rn when n > 1, exemplified by Belloni and Winkler (2011), leading to the notion of “partial multivariate quantiles”. This is typical of an approach to avoid the lack of the total order relation on R, but does not really address the original problem, i.e. partial vector quantile functions are not generalizations of univariate quantile functions. They are just substitutes.

Finally, as Koenker (2017) acknowledged, the happy ending arrived in 2016 with the works of Carlier et al. (2016, 2017), and that was inspired from the Theory of Optimal Transport, Villani (2003).

This Section aims at elaborating a bit on the notion of vector (multivariate) quantile functions correctly generalizing the familiar notion of univariate quantile functions.

Let X be a real-valued random variable (the name of some quantity), i.e. a measurable map from a probability space (Ω,A,P), its source of uncertainty, to the measurable space (R,B(R)), its sampling space. Its law is the probability measure PX on B(R) obtained by pushing forward P by X, i.e. PX(.)= PX−1, in symbol PX = X#P. By Lebesgue-Stieltjes theorem, PX = dF where F(.):R[0,1] is the distribution function of X. The distribution function F of X contains all information about the random evolution of X. If we know F, can we create the data from X? This is the problem known as simulations. Yes, but not directly by using F. Instead, we consider its “pseudo inverse” function F[−1] known as its (univariate) quantile function defined explicitly as F[1](.):(0,1)R,

and show that F[−1] will push forward the uniform probability measure du on (0,1) to dF (F[−1]#du = dF, i.e. dF(.)=du(F[1])1) so that X=DF[1](U) (equal in distribution) where U denotes the (uniform) random variable on (0,1) with law du.

While the pseudo inverse F[−1](.) provides a reasonable mathematical definition for quantiles, its explicit definition involves the total order relation of the real line R and that is the difficulty to extend it to higher dimensions, say, as a function G(.):(0,1)nRn with n > 1.

A short story of this extension problem seems interesting to note. Traditionally, the extension of a concept in one dimension to several dimensions could be done componentwise, such as the concept of the mean of a random vector. But defining a vector quantile function componentwise does not work since the property G#du = dF, for du as uniform law on (0,1)n, and dF as law of a random vector on Rn (for n > 1) is not satisfied.

Remark.

Of course F(.) is characterized by F[−1](.), since if Q(.):(0,1)R is monotone non decreasing and left continuous then there exists a unique distribution function F(.) such that Q(.)= F[−1](.). However, such a characterization of F[−1](.) does not extend to higher dimensions.

Also, traditionally, if we cannot use directly an established concept in one setting to extend it to another setting, we look for a possible equivalent concept (a characterization of the established concept) that can be generalized. For example, to generalize ordinary sets to fuzzy sets, we use the indicator function of an ordinary set (as its membership function) as a characterization of the set from which to extend to the new setting. Here the question is: what is a characterization of F[−1], i.e. another equivalent way to define it.

Perhaps, previous attempts to generalize univariate quantile functions to vector quantile functions did not ask this question. It turns out that the answer is hidden in plain sight! Besides the property F[−1]#du = dF, the (explicitly defined) function F[1](.):(0,1)R is monotone non decreasing, and these two properties provide a characterization for F[−1]. Specifically, F[−1](.): (0,1)R is the unique function that is monotone non decreasing and satisfies G#du = dF :

Lemma.

If G(.):(0,1)R is monotone non decreasing and satisfies G#du = dF, then G(.)= F[−1], i.e.

Proof. By monotonicity of G, we have

so that
and G(x) ≥ F[−1](x)

Consider the points x such that G(x) > F[−1](x). This means that there exists ɛo > 0 such that F(G(x) − ɛ) ≥ Fdu(x) for every ɛ ∈ [0, ɛo]. Also, since G1(,G(x)ε)(,x), we have F(G(x) − ɛ) < Fdu(x). Thus, F(G(x) − ɛ) = Fdu(x) for any ɛ ∈ [0, ɛo]. Note that F(G(x) − ɛ) is the value of F which F takes on an interval where it is constant. But these intervals are a countable quantity, so that the values yj of F on these intervals are also countable. Therefore, the points x where G(x) > F[−1](x) are contained in ∪j{x : Fdu(x) = yj} which is du − negligible (since du is atomless). As a consequence, G(x) = F[−1](x), du − almost everywhere. Q.E.D.

As a consequence, the above characterization of F[−1] can be used to obtain its counterpart in higher dimensions since on Rn , with n > 1, the property G#du = dF makes sense and the monotone non decreasing property for G(.):(0,1)nRn is equivalent to

where <.,.> denotes the scalar product on Rn.

Remark.

The characterization of F[−1] brings out the fact that the total order relation on R does not play an essential role in defining it.

The point is this. If G(.):(0,1)nRn (for n > 1) is going to be an extension of F[1](.):(0,1)R, G(.) has to be monotone non decreasing and pushing forward du to dF (in dimension n > 1).

The upshot is that, for n ≥ 1, these two properties are characteristic for the notion of quantiles, in the sense that there is uniquely one such function G(.):(0,1)nRn, so that, for n = 1, it coincides with F[−1](.). Thus, in dimension 1, the familiar univariate quantile function F[−1] can be defined without using explicitly the total order relation of R !

This upshot was discovered in the context of Optimal Transport, see Villani (2003), Brenier (1991), McCam (1995), Carlier et al. (2017) and Galichon (2016), where a (n-dimensional) vector quantile function is the unique monotone noncreasing function G(.):(0,1)nRn such that G#du = dF.

Clearly, the upshot tells us that the familiar univariate quantile function can be generalized to higher dimensions rigorously. However, except in dimension 1, the vector quantile functions so determined are not obtained in a close form. Practitioners should consult the literature for computational works.

Remark.

The following notes could give a flavor of optimal transport in getting, finally, the correct notion of multivariate quantiles.

In the setting of optimal transport, F[−1](.) is characterized by a unique “transport map” T*(.):(0,1)R, monotone non decreasing and T*#du = dF, where

i.e. the solution of Monge's problem with cost function c(.,.):(0,1)×RR+:(u,x)12|uT(u)|2. On the other hand, the function φ(.):(0,1)R

is convex, so that F[−1](.) is the derivative of the convex function φ(.) on (0,1).

In dimension n > 1, the above leads to the notion of multivariate quantile function by McCam's theorem (1995): Let F(.):Rn[0,1] be a multivariate distribution function, then there exists a unique gradient φ(.):(0,1)nRn of some convex function φ(.):(0,1)nR (φ is not unique, but ∇φ is unique) such that ∇φ#du = dF, where du is the uniform law on (0,1)n.

We elaborate now upon a new improved type of metrics on spaces of probability measures arising from optimal transport theory. The main improvement seems to be that these new metrics, called Wasserstein metrics, do take into account of the geometry of the underlying sample space. Their construction surfaces naturally in the setting of optimal transport theory. Such metrics are useful, e.g. for machine learning.

Recall that, in applications of statistics, we often use a divergence D(., .) on a space of probability measures to “measure” of the difference between two probability measures. Such a divergence is used to compare probability measures, for example D(μ, ν) is the difference between a model μ and a data ν.

The most well-known divergence is the Kullback-Leibler divergence on probability measures on (R,B(R)):

where f(.), g(.) are probability density of μ, ν respectively (with respect to some dominating measure γ on B(R)). The KL divergence is not a distance since it is not symmetric, but it does have analogous properties which could be used to substitute for a metric, such as the Total Variation metric

The KL divergence appears in the model selection criterion AIC.

Metrics on spaces of probability measures are viewed as special divergences. Divergences abound. The choice of a divergence or a metric for comparing probability measures depends on its usefulness for the problem at hand. For example, the Kullback-Leibler divergence is used in AIC because of its relation to Maximum Likelihood Estimation.

Consider the case where X=YRn, we are interested in the following Wasserstein divergence (a priori) on the subset Pp(X) of the set P(X) of all (Borel) probability measures on X, where

namely, Wp(., .): Pp(X)×Pp(X)0,

Specifically, we are going to show that the Wasserstein divergence Wp(., .) is in fact a bona fide metric on Pp(X), a well-known fact in the literature.

We will carry out the complete proof that Wasserstein divergence is in fact a bona fide metric to emphasize the interesting notion of disintegration (of measures).

Disintegration is a process of extracting a conditional probability measure from a joint probability measure on a product space.

To be concrete, let X,YRn, and (X,B(X)), (Y,B(Y)), (X×Y,,B(X×Y)), be (Borel) measurable spaces. We denote by P(X), P(Y), P(X×Y) the set of all probability measures on these spaces.

For λP(X×Y), its marginal probability measure on X is μP(X), defined as, for any AB(X), μ(A)=λ(A×Y).

A disintegration of λ with respect to μ is a family of probability measures νxP(Y), for any xX, such that, for AB(X) and BB(Y), we have

Symbolically,

where δx is the Diract probability measure on X, at xX, and δxνx denotes the product measure (δxνx)(A × B) = δx(A)νx(B).

The representation of λ is so written since

Below is a tutorial on disintegration, just enough for using it in proving the triangle inequality for Wasserstein metrics. A reference could be Dudley (2003) or Graf and Mauldin (1989).

Now, for X=Rn, with norm ‖.‖, and p ≥ 1, the pth- Wasserstein metric is

where F, G are n − dimensional distribution functions of X, Y, respectively, and π has 2n − dimensional distribution function H with F, G as marginals, i.e.

More generally, Wasserstein distance is a metric on spaces of probability measures. Let (X,ρ) be a metric space. Consider the situation where we are interested in probability measures governing the random evolution of random elements taking values in X (i.e. their “laws” operating on Borel σ − field B(X)). Comparisons of probability measures are standard concerns in applications, such as in the so-called empirical processes.

For μ, ν two probability measures on (X,B(X)), consider the nonnegative quantity

where the infimum is taken over all joint probability measure π with marginals (projections) μ, ν.

We will denote by Π(μ, ν) the set of probability measures π on the product space X×X having μ, ν as marginal measures, i.e. μ(.)=π(.×X),ν(.)=π(X×.).

Note that the above quantity can be written as:

i.e. the infimum is taken over all random variables X, Y with values in (X,B(X)), and X, Y are distributed as μ, ν, respectively.

On a subset of P(X) where W(μ, ν) < , for μ, ν in it, W(., .) is a bona fide metric.

We come now to the main investigation of Wasserstein divergences (a priori) on a metric space X for which disintegration exists, such as Rn or a polish space.

Let P(X) denotes the set of all (Borel) probability measures on B(X). For p ≥ 1, let Pp(X)P(X), be the subset of probability measures with finite p − moment, i.e.

where ‖.‖ is the Euclidean norm of Rn.

Consider the Wasserstein divergence on Pp(X): For μ,νPp(X), and p ≥ 1, let

This is just an exercise to verify that Wp(., .) does satisfy the axioms of a metric, i.e. Wp(.,.):Pp(X)×Pp(X)R+=[0,) is such that

  1. Wp(μ, ν) = Wp(ν, μ)

  2. Wp(μ, ν) = 0⇔μ = ν

  3. For any μ,ν,γPp(X), Wp(μ, ν) ≤ Wp(μ, γ) + Wp(γ, ν)

First, since, for x,yRn, ‖x − ypc(‖xp + ‖yp), so that , for μ,νPp(Rn), we have

While (i) is obvious (since the function (x, y) → ‖x − yp is symmetric, and Π(μ, ν) ≃Π(ν, μ)) and (ii) can be seen as follows.

For ν = μ , the optimal transport map T:XX is the identity map I(x) = x, so that the optimal transport plan is λ = (I, I) concentrated on {(x, y) : x = y} and hence

Conversely, if Wp(μ, ν) = 0, then, since

is attained, there exists λ ∈ Π(μ, ν) such that X×Xxypdλ(x,y)=0 so that λ is concentrated on {(x, y) : x = y} which, in turn, implies that, for any AB(Rn),

i.e. μ = ν.

Remark.

For pq, we have Wp(μ, ν) ≤ Wq(μ, ν), since, by Jensen's inequality (with respect to the convex function ttqp), for any λ ∈ Π(μ, ν),

However, the triangle inequality (iii) is not so obvious!

Interestingly, it is the notion of disintegration which will provide a method to verify it.

We wish to show that, for any μj, j = 1, 2, 3 in Pp(Rn), for p ≥ 1, with support XjRn, j = 1, 2, 3, respectively, we should have

For this, we follow Villani (2003).

Lemma.

Let μj, j = 1, 2, 3 in Pp(Rn), for p ≥ 1, with support XjRn, j = 1, 2, 3, respectively. Let λ12 ∈ Π(μ1, μ2), and λ23 ∈ Π(μ2, μ3).

Then there exists a probability measure λ on X1×X2×X3 having marginals λ12 and λ23 on X1×X2, and X2×X3, respectively.

Proof. Disintegrate both λ12 and λ23 with respect to their common marginal μ2, and denote their disintegrations as νx1,γx3, respectively, so that

Then λP(X1×X2×X3) constructed as

Then, for A1B(X1), A2B(X2), A3B(X3), we have

Similarly for

QED.

Then the proof of the triangle inequality for Wasserstein metrics follows:

Let μj, j = 1, 2, 3 in Pp(Rn), for p ≥ 1, with support XjRn, j = 1, 2, 3, respectively.

Note that, from OT theory (existence of solutions of Kantorovich's problem),

is attained with some optimal transport plan λij ∈ Π(μi, μj). Thus,

Now, let λ, in the above Lemma corresponding to μj, j = 1, 2, 3 , be the probability measure on X1×X2×X3 having marginals λ12 and λ23 on X1×X2, and X2×X3, respectively.

We then have

Q.E.D.

One more useful statistical methodology arising from optimal transport theory was the unexpected connection between the current topic of partial identification (of statistical models) and random set statistics via optimal transport, as pointed out by Galichon (2016).

First, it seems here is a good place to spell out briefly what is statistics and how statisticians should conduct statistics! Roughly speaking, statistics is about finding the truth from data, and statistical works should be credible.

Unlike physical science, we need models to conduct statistics. Based upon observed data, statisticians propose models. A model is a subjective (stochastic) equation together with a set of assumptions supporting it. Of course each model contains unknown “parameters” which need to be estimated (from data) to specify it for, e.g. prediction and decision-making.

As we all know (since we “follow” the traditional approach to without any hesitation) that the maintained assumptions (whether they are justified or not) are there to allow us to use available data to consistently estimate the model parameters, noting that estimability of parameters in this sense is related to the notion of identification.

In order to justify our statistical estimation of our model parameter, say, in the model {Fθ  θ ∈ Θ}, we impose assumptions to make the true (but unknown) parameter θo identifiable (i.e. point identifiable) in the sense that the map θFθ is injective. A well-known example for all is the linear supply and demand model in microeconomics. General supply and demand models are provided by economic theory, but when a text book advises us to use a linear model (for simplicity?), it puts down assumptions without justifications to make sure that the model parameter of interest is point identifiable.

If the maintained assumptions are not plausible, the map θFθ might not be injective, i.e. there are θ′ ≠ θ such that Fθ = Fθ (θ and θ′ are said to be observationally equivalent) so that the model parameter is not point-identifiable, such as in games with multiple Nash equilibria. In such a situation, should we give up the analysis or the empirical attempt ? No, as Manski (e.g. 2007) put it, we could live with it and look for a new way to estimate the model parameter, not as a point but as a subset of the parameter space, called the identified set. Thus, estimating an identified set is the main goal for partially identified statistical models.

In this improved statistical setting, we are facing partially identified models where point estimation becomes set estimation. But when the estimation target is a set, the identified set (i.e. set of observationally equivalent parameters), its estimated set is a random set (a set-valued function of the data). Thus, we are facing a natural extension of classical statistics, namely statistics with random sets rather with random points.

Now, the general theory of probability supporting statistical analysis should cover the theory of random sets (as an extension of random vectors) which are well defined random elements. See Matheron (1975) or Nguyen (2006). In other words, in view of credible statistics, statistics of random sets should take a central stage in empirical research. However, the statistical theory of set-valued statistics is still young. In some contexts, e.g. estimating the level sets of an unknown probability density function, the estimation method is Hartigan's (1987) excess mass which is the counterpart of maximum likelihood method in traditional statistics. See also Nguyen (2006).

What is “interesting” is that some partial identification problems can be formulated as an optimal transport problem which in turn provides a connection with random sets useful for computational purposes. See Galichon (2016) for details. Here we elaborate a bit on the theory of random sets since after all as the identified set is a set, its estimator will be a random set statistic, and we need to investigate its properties just like the special case of random vector statistics. The point is this. While random set statistics is the natural approach to inference about set parameters in partially identified models, the context in which these partially identified models can be formulated as optimal transport problems brings out specific ways for conducting inference.

Now, in spirit, partial identification setting is somewhat similar to statistics with coarse data where the data from the desired DGP (an unknown distribution) are not observable, but instead the data from a random set containing it are observed, i.e. the latent random variable of interest is an almost sure selector of the observed random set. As such, it is related to the estimation of the identified set from a random set viewpoint.

In Galichon's analysis (2016) the focus is the identification of an identified set of a partially identified model, and the connection with random set is based upon a result of Artstein (1983) which is generalized by Norberg (1992) as follows.

First of all, capacity functionals play the role of probability laws of random (closed) sets on Rd by Choquet's Theorem (the counterpart of Lebesgue- Stieltjes Theorem for random vectors), see Nguyen (2006) for an introduction. Two capacity functionals T1, T2 are said to form an ordered coupling if there exists a common probability space (Ω,A,P) on which are defined two random closed sets S1, S2 such that S2S1P, i.e. where S1, S2 have T1, T2 as capacity functionals, respectively. When the random set S2 is single-valued, a special case which is identified with a random vector, it becomes an a.s. selector of S1, i.e. P(S2 ∈ S1) = 1. This special case corresponds to the situation in coarse data analysis as well as in partial identification estimation (of identified sets). An useful result from random set theory for it is the following which allows us to characterize an identified set as the core of a capacity functional of a random set.

Theorem (Norberg, 1992). Let μ be a probability measure on B(Rd) and T be a capacity functional, then the following are equivalent:

  1. μ ≤ T on compact sets of Rd

  2. There exists a common probability space (Ω,A,P) on which are defined a random closed set S with capacity functional T and a random vector X with law μ, and which is an a.s. selector of S.

We elaborate a bit on the essentials of random set theory to introduce statisticians to random set statistics.

Just like traditional or standard way to start a probability theory for statistical applications, we consider the simple situation where random quantities take values in a finite set.

Let U be a finite set with n elements. The power set of U is denoted as 2U (set of functions U → {0, 1}). For AU, #(A) denotes the number of elements of the subset A.

The source of uncertainty is a probability space (Ω,A,P). A map X(.) : Ω → 2U is a finite random set (a set obtained at random).

The law of X is the probability measure PX on the power set of 2U , where PX(.)= PX−1 (the pushforward of P by X).

As in the case of finite random variables, PX is completely determined by the probability density of X, namely f(.) : 2U → [0, 1] where f(A) = P(X = A). Alternatively, PX is characterized by the distribution function F(.) : 2U → [0, 1], where F(A) = P(XA). The counterpart of the characterization of distribution functions of random variables is this.

A set-function F(.) : 2U → [0, 1] is a distribution function of a (finite) random set X if it satisfies the following conditions:

  1. F()= 0, F(U) = 1

  2. For any k ≥ 2, and Ai, i = 1, 2, …, k, subsets of U,

Alternatively, since T(A) = P(XA) = 1 − F(Ac), the law of X can be also characterized by the set function T(.) : 2U → [0, 1], called the capacity functional of X.

Axiomatically, a capacity function is a function T(.) : 2U → [0, 1] satisfying the following:

  1. T()= 0, T(U) = 1

  2. For any k ≥ 2, and Ai, i = 1, 2, …, k, subsets of U, we have

Now let X be a non-empty random set on the finite set U (i.e. P(X = ) = 0). The core C(T) of its capacity functional T is the set of probability measures μ on U such that μ(.) ≤ T(.).

We extend all the above to the case where the sampling space U=Rd.

Remember that, for random vectors, i.e. random elements taking values in Rd, their probabilistic background was based on the theory of measures on the Borel measurable space (Rd,B(Rd)) where the Borel σ − field B(Rd) is constructed using the topology of Rd. For random sets taking values as subsets of Rd, we need a topology on 2Rd. Now a random vector is identified as a random set taking singletons as values. But each {x} is a closed set of Rd. Thus, following Matheron (1975), we consider random sets taking values as closed subsets of Rd, denoted as F(Rd) on which a “hit-or-miss topology” is established to obtain its Borel σ − field, denoted as B(F).

A random closed set, defined on a probability space (Ω,A,P) (its source of uncertainty), is a map X(.):ΩF(Rd), AB(F)- measurable. Its probability law is the probability PX on B(F) obtained as PX(.)= PX−1(.).

The notion of capacity functionals in the finite case is extended as follows. Let K(Rd) denote the set of compact subsets of Rd. Then T(.):K(Rd)R is called a capacity functional if it satisfies:

  1. 0 ≤ T(.) ≤ 1, T()= 0

  2. For any k ≥ 2, and Ai, i = 1, 2, …, k, subsets of U, we have

  1. If KnK(Rd) and KnKK(Rd) then T(Kn) ↘ T(K).

The counterpart of Lebesgue-Stieltjes Theorem is the Choquet's Theorem: If T(.):K(Rd)R is a capacity functional, then there exists a unique probability measure Q on B(F) such that, for all KK(Rd), Q(FK)=T(K), where FK={AF(Rd):AK}.

In other words, the capacity functional characterizes the probability law of a random closed set. The core of a capacity functional is the set of probability measures μ on (Rd,B(Rd)) such that μ ≤ T on K(Rd). Norberg's Theorem (1992) is valid for Rd so that the core of a capacity functional (of a random closed set on Rd) is related to identified sets in partially identified statistical models.

Artstein
,
Z.
(
1983
), “
Distributions of random sets and random selections
”,
Israel Journal of Mathematics
, Vol. 
46
, pp. 
313
-
324
.
Belloni
,
A.
and
Winkler
,
R.
(
2011
), “
On multivariate quantiles under partial order
”,
The Annals of Statistics
, Vol. 
39
No. 
2
, pp. 
1125
-
1179
.
Brenier
,
Y.
(
1991
), “
Polar factorization and monotone rearrangement of vector-valued functions
”,
Communications on Pure and Applied Mathematics
, Vol. 
44
No. 
4
, pp. 
375
-
417
.
Breiman
,
L.
(
2001
), “
Statistical modeling: the two cultures
”,
Statistical Science
, Vol. 
16
No. 
4
, pp. 
199
-
231
.
Carlier
,
G.
,
Chernozukov
,
V.
and
Galichon
,
A.
(
2016
), “
Vector quantile regression: an optimal transport approach
”,
The Annals of Statistics
, Vol. 
44
No. 
3
, pp. 
1165
-
1192
.
Carlier
,
G.
,
Chernozukov
,
V.
and
Galichon
,
A.
(
2017
), “
Vector quantile regression beyond the specific case
”,
Journal of Multivariate Analysis
, Vol. 
161
, pp. 
96
-
102
.
Daoud
,
A.
and
Dubhashi
,
D.
(
2023
), “
Statistical modeling: the three cultures
”. doi: .
Dudley
,
R.M.
(
2003
),
Real Analysis and Probability
,
Cambridge University Press
,
Cambridge, MA
.
Galichon
,
A.
(
2016
),
Optimal Transport Methods in Economics
,
Princeton Univ. Press.
,
Princeton, NJ
.
Graf
,
S.
and
Mauldin
,
R.D.
(
1989
), “
A classification of disintegration of measures
”, in
Measure and Measurable Dynamics
,
No 94, Contemp.Math
, pp. 
147
-
158
,
AMS
.
Hallin
,
M.
,
Paindaveine
,
D.
and
Siman
,
M.
(
2010
), “
Multivariate quantiles and multiple-output regression quantiles: from L1 optimization to half-space depth
”,
The Annals of Statistics
, Vol. 
38
No. 
2
, pp. 
635
-
669
.
Hartigan
,
J.A.
(
1987
), “
Estimation of a convex density contour in two dimensions
”,
Journal of the American Statistical Association
, Vol. 
82
No.
397
, pp. 
267
-
270
.
Kantorovich
,
L.V.
(
1942
), “
On the translocation of masses
”,
C.R. Academy of Sciences of URSS
, Vol. 
77
, pp. 
199
-
201
.
Koenker
,
R.
(
2017
), “
Quantile regression 40 years on
”,
Annual Review of Economics
, Vol. 
9
No. 
1
, pp. 
155
-
176
.
Koenker
,
R.
and
Basett
,
G.
(
1978
), “
Regression quantiles
”,
Econometrica
, Vol. 
46
, pp. 
33
-
50
.
Manski
,
C.
(
2007
),
Identification for Prediction and Decision
,
Harvard University Press
,
Cambridge, MA
.
Matheron
,
G.
(
1975
),
Random Sets and Integral Geometry
,
J. Wiley
,
New York
.
McCam
,
R.
(
1995
), “
Existence and uniqueness of monotone measure-preserving maps
”,
Duke Math. J.
, Vol. 
80
No. 
2
, pp. 
309
-
323
.
Monge
,
G.
(
1781
), “Memoire sur la theorie des deblais et des remblais”, in
Histoire de l’ Academie Royale des Sciences de Paris
, pp. 
666
-
704
.
Nguyen
,
H.T.
(
2006
),
A Introduction to Random Sets
,
Chapman & Hall/CRC
,
Boca Raton, FL
.
Norberg
,
T.
(
1992
), “
On the existence of ordered coupling of random sets with applications
”,
Israel Journal of Mathematics
, Vol. 
77
No.
3
, pp. 
241
-
264
.
Parzen
,
E.
(
1979
), “
Nonparametric data modeling
”,
Journal of the American Statistical Association
, Vol. 
74
No.
365
, pp. 
105
-
121
.
Serfling
,
R.
and
Zuo
,
Y.
(
2010
), “
Discussion
”,
The Annals of Statistics
, Vol. 
38
No. 
2
, pp. 
676
-
684
.
Villani
,
C.
(
2003
),
Topics in Optimal Transportation
,
American Mathematical Society
,
Providence, RI
.
Published in Asian Journal of Economics and Banking. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at http://creativecommons.org/licences/by/4.0/legalcode

or Create an Account

Close Modal
Close Modal