Strong consistency of a kernel-based rule for spatially dependent data

Ahmad Younso can be contacted at: ahyounso@yahoo.fr

We consider the kernel-based classifier proposed by Younso (2017). This nonparametric classifier allows for the classification of missing spatially dependent data. The weak consistency of the classifier has been studied by Younso (2017). The purpose of this paper is to establish strong consistency of this classifier under mild conditions. The classifier is discussed in a multi-class case. The results are illustrated with simulation studies and real applications.

1. Introduction

In many applications one needs to classify spatial data that have been collected incompletely. The problem of classifying incomplete data, in which certain features are missing from particular feature vectors, arises in a wide range of fields, including image labeling and computer vision. For example, in remote sensing, internal malfunctions of satellite sensors and poor atmospheric conditions such as thick cloud often leave the acquired images with missing information at certain pixels, and one wants to classify these pixels using the information in the nearest identified pixels. Many existing classification algorithms assume either certain parametric distributions for the data or certain forms of separating curves or surfaces. These parametric classifiers are suboptimal and of limited use in practical applications where little information about the underlying distributions is available a priori. In comparison, nonparametric classifiers are usually more flexible in accommodating different data structures, and are hence more desirable. [21] proposed a nonparametric approach that incorporates contextual features for classifying missing spatial data and investigated the consistency of the classifier under mild conditions. In nonparametric spatial estimation, existing work mainly concerns the estimation of probability density and regression functions; see the key references [2–4,15] and [14]. More recently, [5] proposed a kernel spatial density estimator allowing for the analysis of spatial clustering. In this work, we establish the strong consistency of the classifier proposed by [21] and then assess its performance with simulation studies and applications. We consider a strictly stationary random field ${(X_{i}, Y_{i})}_{i \in ℤ^{N}}$ defined on some probability space $(Ω, ℱ, ℙ)$ and taking values in $ℝ^{d} \times \{0, \dots, M\}$, for some integer $M \geq 1$.
In the classification problem, for each $i \in ℤ^{N}$, $X_{i}$ is a vector of features and $Y_{i}$ is the label (class) of $X_{i}$. A point $i = (i_{1}, \dots, i_{N}) \in ℤ^{N}$ will be referred to as a site. For $n = (n_{1}, \dots, n_{N}) \in {(ℕ^{*})}^{N}$, we define the rectangular region $I_{n}$ by $I_{n} = \{i \in ℤ^{N} : 1 \leq i_{k} \leq n_{k}, \forall k = 1, \dots, N\}$. We write $n \to \infty$ if $\min_{k = 1, \dots, N} n_{k} \to \infty$. Define $\hat{n} = n_{1} \times \dots \times n_{N} = card (I_{n})$ and assume that the random field is observed on a subset $S_{n} \subset I_{n}$ such that $I_{n} - S_{n}$ is a bounded set for $\hat{n}$ large enough. When processing a particular site, its own features are not used at all; only the features of its neighbors are considered. In other words, we wish to predict the label $Y_{j}$ of a new site $j$ based only on observations in a vicinity, say $ν_{j} \subset S_{n}$, where the set $ν_{j}$ does not contain $j$. Let $ν_{j} = j + ν$, where $ν \subset ℤ^{N}$ is a fixed bounded set of sites not containing $0$ with $card (ν) = l$ ($l$ is also the cardinality of each $ν_{j}$). We assume that $X_{(j)} = \{X_{i} : i \in ν_{j}\}$ is a random vector taking values in $ℝ^{\tilde{d}}$ with $\tilde{d} = l d$, and that the components of $X_{(j)}$ are ordered according to an arbitrary order on indices, for example the lexicographic order. The pair $(X_{(j)}, Y_{j})$ may be completely described by $μ$, the probability measure for $X_{(j)}$, and $η (x)$, the regression of $Y_{j}$ on $X_{(j)} = x$. Assume that for each $i \in ℤ^{N}$, $(X_{(i)}, Y_{i})$ has the same distribution as the pair $(X_{(1)}, Y_{1})$. We will create a classifier $g : ℝ^{\tilde{d}} \to \{0, \dots, M\}$ mapping $X_{(j)}$ into the predicted label of the site $j$. The error rate, or risk, of a rule $g$ is $L (g) = ℙ \{g (X_{(j)}) \neq Y_{j}\}$. This is minimized by the rule

g^{*} (x) = \underset{0 \leq k \leq M}{\arg \max} ℙ (Y_{j} = k | X_{(j)} = x),

(1.1)

whose error rate $L^{*} = L (g^{*})$ is called the Bayes-optimal risk and $g^{*} (x)$ is called the Bayes rule. Clearly, $g^{*} (x)$ predicts the label $Y_{j}$ of the site $j$ using only $x$, the value of $X_{(j)}$, while the feature vector $X_{j}$ does not affect the classification procedure at all. This means that $g^{*} (x)$ will work even if $X_{j}$ is completely missing. Unfortunately, we cannot use (1.1) directly because it depends on the distribution of $(X_{(j)}, Y_{j})$, which is generally unknown. So, we take $J_{n} = \{i \in S_{n} : ν_{i} \subset S_{n}\}$ and we use the training data $D_{n} = \{(X_{i}, Y_{i}) : i \in J_{n}\}$ to construct a classifier $g_{n} (x)$. We consider the classifier $g_{n} (x)$ obtained by extending the classifier of [21] to the multi-class case as follows:

g_{n} (x) = \arg \max_{0 \leq k \leq M} \sum_{i \in J_{n}} 1_{\{Y_{i} = k\}} K (\frac{x - X_{(i)}}{b_{n}}),

(1.2)

where $1_{A}$ denotes the indicator of the set $A$, the kernel $K : ℝ^{\tilde{d}} \to ℝ_{+}$ is a density function on $ℝ^{\tilde{d}}$, and $b_{n}$ is a sequence of bandwidths tending to zero as $n$ tends to infinity. On the one hand, the sum in (1.2) is taken over $J_{n}$ instead of $S_{n}$ simply to ensure that $X_{(i)}$ always exists and that the sums make sense. On the other hand, for each new site $j \notin S_{n}$, the classifier $g_{n} (x)$ predicts the missing label $Y_{j}$ independently of its feature vector $X_{j}$, which belongs neither to the training sample $D_{n}$ nor to the set of components of $X_{(j)}$. Consequently, $g_{n} (x)$ can classify $j$ even if its own feature vector $X_{j}$ is completely missing, and this is what allows our method to perform well in comparison with the classical spatial Markovian model. [6] proposes a nonparametric approach that extends the result of [2] to the non-Markovian case by using two kernels in the estimator, in order to control both the distance between observations and that between spatial locations, without using a specific vicinity for the non-observed site. This latter approach could be developed to classify spatial data, but it does not work when one wants to classify sites with missing or incomplete features. Let $L_{n} = ℙ \{g_{n} (X_{(j)}) \neq Y_{j} | D_{n}\}$ be the error probability of $g_{n} (x)$. Generally, we cannot hope to design a classifier that achieves the Bayes error probability $L^{*}$, but it is possible that the limit behavior of $L_{n}$ compares favorably to $L^{*}$. This idea is encapsulated in the notion of consistency.
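The rule (1.2) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the implementation used in the paper: the field is two-dimensional ($N = 2$) with scalar features, the vicinity `NU` is taken to be the eight nearest neighbours, the kernel is an unnormalised product Gaussian, and the toy grid and all function names are hypothetical.

```python
import math

# Hypothetical vicinity nu: the 8 nearest neighbours of the origin in Z^2,
# listed in lexicographic order and excluding the origin itself.
NU = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]


def neighbour_vector(X, site, nu=NU):
    """Collect the features of the sites site + nu into the vector X_(site)."""
    i, j = site
    return [X[i + di][j + dj] for di, dj in nu]


def gaussian_kernel(u):
    """Unnormalised product Gaussian kernel; the constant cancels in the arg max."""
    return math.exp(-0.5 * sum(c * c for c in u))


def classify(x, train_vectors, train_labels, b, n_classes):
    """Rule (1.2): return the class k with the largest kernel-weighted vote."""
    votes = [0.0] * n_classes
    for v, y in zip(train_vectors, train_labels):
        votes[y] += gaussian_kernel([(a - c) / b for a, c in zip(x, v)])
    return max(range(n_classes), key=votes.__getitem__)


# Toy field on a 7 x 7 grid (assumption): scalar features, two classes.
X = [[0.0 if j < 4 else 10.0 for j in range(7)] for i in range(7)]
Y = [[int(X[i][j] > 5) for j in range(7)] for i in range(7)]
train_sites = [(1, 1), (3, 1), (5, 1), (1, 5), (5, 5)]
train_vectors = [neighbour_vector(X, s) for s in train_sites]
train_labels = [Y[i][j] for i, j in train_sites]

# The site (3, 5) is treated as missing: only its neighbours are used.
predicted = classify(neighbour_vector(X, (3, 5)), train_vectors, train_labels, 1.0, 2)
print(predicted)
```

Note that the feature of the site being classified never enters `classify`; only the stacked neighbour vector does, which is what allows the rule to handle a completely missing $X_{j}$.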

Definition 1.1.

The classifier $g_{n} (x)$ is called weakly consistent if

E L_{n} \to L^{*} as n \to \infty

and strongly consistent if

L_{n} \to L^{*} as n \to \infty with probability one .

The classifier is called universally (weakly or strongly) consistent if it is (weakly or strongly) consistent for all distributions of $(X_{1}, Y_{1})$.

Remark 1.1.

Since $L_{n}$ is bounded, weak consistency is equivalent to the convergence of $L_{n}$ towards $L^{*}$ in probability; hence strong consistency implies weak consistency.

In this paper, we investigate the strong consistency of $g_{n} (x)$ under some mild mixing conditions.

2. Notation and general hypotheses

Let $(Ω, ℱ, ℙ)$ be a probability space and let $\mathcal{A}$ and $\mathcal{B}$ be two sub $σ$-fields of $ℱ$. The $α$-mixing coefficient between $\mathcal{A}$ and $\mathcal{B}$ is defined by

α = α (\mathcal{A}, \mathcal{B}) = \sup_{A \in \mathcal{A}, B \in \mathcal{B}} | ℙ (A \cap B) - ℙ (A) ℙ (B) |

and the $β$-mixing coefficient is defined by

β = β (\mathcal{A}, \mathcal{B}) = E \{\sup_{A \in \mathcal{A}} | ℙ (A | \mathcal{B}) - ℙ (A) |\} .
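To make the two coefficients concrete, the sketch below computes them by brute force in the simplest possible setting, which is an assumption made here for illustration only: the two $σ$-fields are generated by two discrete random variables $U$ and $V$ with a known finite joint distribution. The function names are hypothetical. In this setting the supremum inside the expectation defining $β$ is attained, for each value $v$, at the event $\{u : ℙ (U = u | V = v) > ℙ (U = u)\}$, so $β$ reduces to half the $L_{1}$ distance between the joint law and the product of the marginals.

```python
from itertools import chain, combinations


def subsets(items):
    """All subsets of a finite collection, as tuples."""
    items = list(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def mixing_coefficients(joint):
    """Return (alpha, beta) for sigma(U), sigma(V), given joint[u][v] = P(U=u, V=v)."""
    nu, nv = len(joint), len(joint[0])
    pu = [sum(row) for row in joint]
    pv = [sum(joint[u][v] for u in range(nu)) for v in range(nv)]
    # alpha: brute-force supremum over all events A in sigma(U), B in sigma(V).
    alpha = max(
        abs(sum(joint[u][v] for u in A for v in B)
            - sum(pu[u] for u in A) * sum(pv[v] for v in B))
        for A in subsets(range(nu)) for B in subsets(range(nv))
    )
    # beta: E sup_A |P(A | V) - P(A)|, i.e. half the L1 distance between
    # the joint distribution and the product of its marginals.
    beta = 0.5 * sum(abs(joint[u][v] - pu[u] * pv[v])
                     for u in range(nu) for v in range(nv))
    return alpha, beta


# A positively dependent pair of binary variables (illustrative numbers).
alpha, beta = mixing_coefficients([[0.4, 0.1], [0.1, 0.4]])
print(alpha, beta, 2 * alpha <= beta + 1e-12)
```

For an independent pair both coefficients vanish, and in this finite setting one can observe numerically that $2 α$ never exceeds $β$.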

Let ${(Z_{i})}_{i \in ℤ^{N}}$ be a random field on $(Ω, ℱ, ℙ)$ taking values in some space $(Ω^{'}, ℱ^{'})$.

Definition 2.1.

The random field ${(Z_{i})}_{i \in ℤ^{N}}$ is called strongly mixing if there exists $χ : ℝ \to ℝ^{+}$ with $χ (t) ↘ 0$ as $t \to \infty$, such that for any $E, E^{'} \subset ℤ^{N}$ with finite cardinality,

α (B (E), B (E^{'})) \leq χ (dist (E, E^{'})),

where $dist (E, E^{'})$ denotes the Euclidean distance between $E$ and $E^{'}$.

The $α$ -mixing condition is one of the most popular mixing conditions. This condition is satisfied by many spatial models. Examples can be found in [17,19] and [11].

Definition 2.2.

The random field ${(Z_{i})}_{i \in ℤ^{N}}$ is called $β$ -mixing if there exists $ϕ : ℝ \to ℝ^{+}$ with $ϕ (t) ↘ 0$ as $t \to \infty$ ⁠, and for any $E, E^{'} \subset ℤ^{N}$ with finite cardinals,

β (B (E), B (E^{'})) \leq ϕ (dist (E, E^{'})) .

Linear processes or, more generally, Markov chains may be $β$-mixing (see [9]). A similar mixing coefficient is used by [2] to establish some asymptotic properties of the kernel regression estimator in the spatial case. The two mixing coefficients $α$ and $β$ are related by the inequality $2 α \leq β$ (see [18]). This means that any $β$-mixing random field is also strongly mixing.

Now, we need some regularity assumptions.

Assumption 1.

$K$ is a regular kernel, that is, there exist $δ > 0$ and $c > 0$ such that $c 1_{B (0, δ)} (x) \leq K (x)$ for all $x \in ℝ^{\tilde{d}}$ and $\int_{ℝ^{\tilde{d}}} \sup_{u \in v + B (0, δ)} K (u) d v < \infty$, where $B (x, δ)$ is the closed ball of radius $δ > 0$ centered at $x$.

Assumption 2.

For each $i$, $X_{(i)}$ has a density $f$ with respect to Lebesgue measure and for each $i \neq j$ with $ν_{i} \cap ν_{j} = \emptyset$, $(X_{(i)}, X_{(j)})$ has a density $f_{i, j}$ such that $\sup_{u, v \in ℝ^{\tilde{d}}} | f_{i, j} (u, v) - f (u) f (v) | \leq C$, for some $C > 0$.

Assumption 3.

The random field ${(X_{i}, Y_{i})}_{i \in ℤ^{N}}$ is $β$-mixing and there exists $θ > 0$ such that $ϕ (t) = O (t^{- θ})$ for all $t \in ℝ_{+}^{*}$.

Assumption 1 is used by [8] and [7] in the i.i.d. case. It may be satisfied if $K (x) = ξ (‖ x ‖)$ where $ξ$ is a non-negative and decreasing function on $[0, + \infty)$ and $‖ . ‖$ is the Euclidean norm. Hence, the Gaussian kernel is regular. Assumption 2, used by [21] to prove weak consistency, is similar to that used by [3]. It is satisfied, for example, if $f$ and $f_{i, j}$ are uniformly bounded. Assumption 3 means that the random field is arithmetically $β$-mixing, which implies that it is also strongly mixing with $α (B (E), B (E^{'})) \leq ϕ (dist (E, E^{'}))$ since $2 α \leq β$.
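The regularity of the Gaussian kernel can be checked numerically in dimension one. The sketch below is an illustration under assumptions chosen here (one-dimensional standard Gaussian density, $δ = 0.5$, a midpoint rule on a truncated interval); the function names are hypothetical. Because the density decreases in $| u |$, its supremum over $[v - δ, v + δ]$ equals its value at $\max (| v | - δ, 0)$, so the integral in Assumption 1 has the closed form $1 + 2 δ K (0)$, and the lower bound holds with $c = K (δ)$.

```python
import math


def phi(u):
    """Standard Gaussian density on the real line."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def sup_over_ball(v, delta):
    """Supremum of phi over [v - delta, v + delta]; phi decreases in |u|."""
    return phi(max(abs(v) - delta, 0.0))


def integrated_sup(delta, half_width=12.0, steps=48_000):
    """Midpoint-rule approximation of the integral of sup_over_ball over R."""
    h = 2.0 * half_width / steps
    return h * sum(sup_over_ball(-half_width + (k + 0.5) * h, delta)
                   for k in range(steps))


delta = 0.5
numeric = integrated_sup(delta)
closed_form = 1.0 + 2.0 * delta * phi(0.0)
print(numeric, closed_form)
```

The two printed values should agree up to the discretisation error, which confirms that the integral is finite for this kernel.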

3. Preliminary lemmas

This section is a collection of technical lemmas which will be used to prove the strong consistency result stated in Theorem 4.1. Let ${‖ . ‖}_{r}$ denote the $L_{r}$ -norm for any real $r \geq 1$ ⁠. The following lemma is a direct consequence of the covariance inequality of Ibragimov [12] and the inequality $2 α \leq β$ ⁠.

Lemma 3.1.

If $r$ , $s$ and $t$ are strictly positive reals such that $r^{- 1} + s^{- 1} + t^{- 1} = 1$ and $Z_{1}$ and $Z_{2}$ are two $ℝ$ -valued random variables such that ${‖ Z_{1} ‖}_{s} < \infty$ and ${‖ Z_{2} ‖}_{t} < \infty$ , then

| cov (Z_{1}, Z_{2}) | \leq 2 {β (σ (Z_{1}), σ (Z_{2}))}^{1 / r} {‖ Z_{1} ‖}_{s} {‖ Z_{2} ‖}_{t},

where $σ (Z_{i})$ is the $σ$-field generated by $Z_{i}$ for $i = 1, 2$.

For any sub $σ$ -fields $A$ and $B$ of $ℱ$ ⁠, we denote by $A \lor B$ the $σ$ -field generated by $A \cup B$ ⁠. The following coupling lemma of Berbee [1] will be needed to establish the asymptotic results.

Lemma 3.2.

Let $Z$ be a random variable on $(Ω, ℱ, ℙ)$ with values in some Polish space $Ω^{'}$ and $M$ a sub $σ$-field of $ℱ$. Assume that there exists a random variable $U$ uniformly distributed over $[0, 1]$, independent of $σ (Z) \lor M$. Then, there exists a random variable $\tilde{Z}$ measurable with respect to $σ (U) \lor σ (Z) \lor M$, distributed as $Z$ and independent of $M$, such that

ℙ (Z \neq \tilde{Z}) = β (M, σ (Z)) .

Remark 3.1.

We recall that a Polish space $Ω^{'}$ is a topological space which is separable and completely metrizable (see [13]) and that most of the familiar objects of study in analysis involve Polish spaces. For example, $ℝ^{d}$, for each integer $d \geq 1$, is Polish with the usual topology and $\{0, 1, \dots, n\}$, for all $n \in ℕ$, is Polish with the discrete topology. We also recall that a countable product of Polish spaces is Polish.

The following covering lemma can be found in [8].

Lemma 3.3.

Let $K$ be a regular kernel on $ℝ^{\tilde{d}}$ and $b_{n}$ be a sequence of bandwidths. Denote $K_{n} (x) = b_{n}^{- \tilde{d}} K (x / b_{n})$ . Then, for any probability measure $μ$ ,

\sup_{u \in ℝ^{\tilde{d}}} \int_{ℝ^{\tilde{d}}} \frac{K_{n} (x - u)}{E K_{n} (x - X_{(1)})} μ (d x) < ρ,

for some $ρ > 0$ depending only on $K$.

The proof of the following lemma is in [4] (see also [21]).

Lemma 3.4.

Let $ζ = - N - ϵ + (1 - γ) N a^{- 1}$ for some $0 < a < 1 / 2$ , with $γ$ and $ϵ$ being small positive numbers such that $a^{- 1} - (N + ϵ) {(1 - γ)}^{- 1} N^{- 1} > 1$ . If Assumption 3 holds for some $θ > 2 N$ , then for any $δ > 0$ ,

\sum_{‖ i ‖ \geq δ} {‖ i ‖}^{ζ} {ϕ (‖ i ‖)}^{1 - γ} < \infty .

The proof of the following lemma follows from the reverse triangle inequality.

Lemma 3.5.

For each $i, j \in J_{n}$, $dist (ν_{i}, ν_{j}) \geq \max \{‖ i - j ‖ - \tilde{r}, 0\}$, where $\tilde{r} = \max \{‖ i - j ‖ : i, j \in ν\}$ is the diameter of $ν \subset ℤ^{N}$.
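The inequality of Lemma 3.5 can be verified by brute force on a small example. The sketch below is only an illustration: it assumes $N = 2$, takes the vicinity $ν$ to be the eight nearest neighbours of the origin (a hypothetical choice), and checks every pair of sites in an $8 \times 8$ window.

```python
import math
from itertools import product

# Hypothetical vicinity: the 8 nearest neighbours of the origin in Z^2.
NU = [(a, b) for a, b in product((-1, 0, 1), repeat=2) if (a, b) != (0, 0)]


def norm(u):
    """Euclidean norm of a two-dimensional vector."""
    return math.hypot(*u)


def dist_sets(E, F):
    """Euclidean distance between two finite sets of sites."""
    return min(norm((e[0] - f[0], e[1] - f[1])) for e in E for f in F)


def vicinity(i, nu=NU):
    """The translated vicinity nu_i = i + nu."""
    return [(i[0] + a, i[1] + b) for a, b in nu]


# Diameter of nu: the largest distance between two of its points.
r_tilde = max(norm((u[0] - v[0], u[1] - v[1])) for u in NU for v in NU)

sites = list(product(range(8), repeat=2))
lemma_holds = all(
    dist_sets(vicinity(i), vicinity(j))
    >= max(norm((i[0] - j[0], i[1] - j[1])) - r_tilde, 0.0) - 1e-12
    for i in sites for j in sites
)
print(r_tilde, lemma_holds)
```

For this vicinity the diameter is $2 \sqrt{2}$, the distance between two opposite corners of the $3 \times 3$ window.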

4. Main result

The weak consistency of the classifier (1.2) has been established by [21]. In this section we study the strong consistency of (1.2). The following theorem states the strong consistency under mild conditions.

Theorem 4.1.

Assume that Assumptions 1–3 hold for some $θ > 2 N$ ⁠. If $\hat{n} b_{n}^{\tilde{d}} \to \infty$ as $n \to \infty$ ⁠, then

L_{n} \to L^{*} as n \to \infty with probability one .
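As a simple illustration of the bandwidth condition (a sketch assuming a polynomial bandwidth, which the theorem does not require), take $b_{n} = {\hat{n}}^{- a}$ for some $a > 0$. Then

\hat{n} b_{n}^{\tilde{d}} = {\hat{n}}^{1 - a \tilde{d}},

so that $b_{n} \to 0$ and $\hat{n} b_{n}^{\tilde{d}} \to \infty$ hold together exactly when $0 < a < 1 / \tilde{d}$. For instance, with scalar features and an eight-site vicinity ($\tilde{d} = 8$), the choice $a = 1 / 16$ gives $\hat{n} b_{n}^{\tilde{d}} = {\hat{n}}^{1 / 2}$.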

Remark 4.1.

Note that the assumption on the bandwidth, used by [21] to prove weak consistency, is similar to the classical assumption used by [7] and [8] in the independent case. In addition, the condition on $b_{n}$ is minimal compared with that used by [4] and [3], since they studied the rate of uniform convergence of the estimators. However, the restrictive constraints on the bandwidth in [4] and [3] are related to $θ$, and one has to let $θ \to \infty$ in order to attain the classical assumption.

5. Simulation study including comparison with the classical kernel rule

Our aim in this section is to look at how the classifier (1.2) behaves on simulated samples by comparing it with the classical kernel rule. We use the R statistical programming environment to run a simulation study for $N = 2$ ⁠. Let ${(X_{(i, j)}, Y_{(i, j)})}$ be the field of interest and suppose that the simulated data are observed on the area $I_{(n, n)} = {(i, j) \in ℤ^{2} : 1 \leq i, j \leq n}$ ⁠. Let

J_{(n, n)} = I_{(n, n)} \setminus ( \bigcup_{(i, j) \in M} (ν_{(i, j)} \cup \{(i, j)\}) \cup \{(1, j), (k, 1), (n, l), (m, n) : 1 \leq j, k, l, m \leq n\} ),

where $M = \{(2 k, 2 l), 1 \leq k, l \leq 10\}$ is the set of non-observed sites which need to be classified. In this particular case, the vicinity of any missing site $(i, j)$ may be taken as in Figure 1.

It is important to note that the vicinity $ν_{(i, j)}$ may be designed depending on the location of the missing site (see some typical examples in Figure 2) and that samples with larger size give more freedom to design vicinities.

Figure 2 shows some examples of vicinities that can be used when the missing sites are not completely surrounded by already labeled sites (located at the edges of $S_{n}$ for example).

We suppose that the simulated fields have the covariance function

C (u) = 4 {‖ u ‖}^{- 4.5} for each u \in ℝ^{*^{2}} .

We use the classifier (1.2) with $K (x) = \prod_{i = 1}^{8} K_{i} (x_{i})$ for $x = (x_{1}, \dots, x_{8}) \in ℝ^{8}$ where $K_{i} (x_{i})$ is the standard Gaussian density (Gaussian kernel). We suppose that ${X_{(i, j)}, 1 \leq i, j \leq n}$ are observations of a Gaussian mixture model:

π_{0} \mathcal{N} (μ_{0}, σ_{0}^{2}) + π_{1} \mathcal{N} (μ_{1}, σ_{1}^{2}) + π_{2} \mathcal{N} (μ_{2}, σ_{2}^{2}),

with $μ_{0} < μ_{1} < μ_{2}$ and $π_{0} + π_{1} + π_{2} = 1$. In order to illustrate the fact that our method works in the multi-class case, the data set $\{X_{(i, j)}, 1 \leq i, j \leq n\}$ is partitioned into three clusters as follows:

class (Y_{(i, j)} = 0) : X_{(i, j)} < (μ_{0} + μ_{1}) / 2

class (Y_{(i, j)} = 1) : (μ_{0} + μ_{1}) / 2 \leq X_{(i, j)} \leq (μ_{1} + μ_{2}) / 2

class (Y_{(i, j)} = 2) : X_{(i, j)} > (μ_{1} + μ_{2}) / 2 .

For each $n = 50,75,100$ ⁠, we generate $100$ samples on the region $I_{(n, n)}$ with $μ_{0} = 5$ ⁠, $μ_{1} = 15$ ⁠, $μ_{2} = 25$ ⁠, $π_{0} = π_{1} = π_{2} = 1 / 3$ and $σ_{0}^{2} = σ_{1}^{2} = σ_{2}^{2} = 4$ ⁠. In each replication, we use the classifier (1.2), constructed on the basis of the training data observed on $J_{(n, n)}$ ⁠, to re-predict the labels of sites in the test set $M$ ⁠. Figure 3 displays one replication for $n = 50$ ⁠.
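The labelling rule of this design, with the parameter values just given, can be sketched as follows. This is only an illustration: the function name `label` is hypothetical, and the draws below are independent samples from the mixture used to exercise the rule, so the spatial dependence of the simulated fields is not reproduced here.

```python
import random

MU = (5.0, 15.0, 25.0)      # component means mu_0 < mu_1 < mu_2
SIGMA = 2.0                 # common standard deviation (variance 4)
CUTS = ((MU[0] + MU[1]) / 2, (MU[1] + MU[2]) / 2)   # thresholds 10 and 20


def label(x, cuts=CUTS):
    """Class 0 below the first cut, class 2 above the second, class 1 otherwise."""
    if x < cuts[0]:
        return 0
    if x > cuts[1]:
        return 2
    return 1


# Independent draws from the equal-weight mixture, used only to exercise
# label(); they are NOT a spatially dependent field.
rng = random.Random(0)
draws = [rng.gauss(rng.choice(MU), SIGMA) for _ in range(9)]
labels = [label(x) for x in draws]
print(labels)
```

Both boundary values $10$ and $20$ fall in class 1, in agreement with the non-strict inequalities in the definition of that class.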

The optimal bandwidth ${\hat{b}}_{opt}$ is obtained by minimizing the cross-validation criterion on a training sample and the misclassification error rate (ER) is evaluated on the associated test sample. The average error rate (AER) is obtained by averaging the error rates associated with the corresponding $100$ test samples.
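The text does not spell out the cross-validation criterion. One common choice, assumed here purely for illustration, is the leave-one-out misclassification rate minimised over a finite grid of candidate bandwidths. The sketch below uses an unnormalised Gaussian kernel, hypothetical function names and toy one-dimensional data; it also ignores the spatial dependence between training sites, which a careful implementation might want to account for.

```python
import math


def kernel_votes(x, vectors, labels, b, n_classes, skip=None):
    """Kernel-weighted votes of rule (1.2), optionally leaving one sample out."""
    votes = [0.0] * n_classes
    for idx, (v, y) in enumerate(zip(vectors, labels)):
        if idx == skip:
            continue
        sq = sum(((a - c) / b) ** 2 for a, c in zip(x, v))
        votes[y] += math.exp(-0.5 * sq)
    return votes


def loo_error(vectors, labels, b, n_classes):
    """Leave-one-out misclassification rate for bandwidth b."""
    errors = 0
    for idx, (x, y) in enumerate(zip(vectors, labels)):
        votes = kernel_votes(x, vectors, labels, b, n_classes, skip=idx)
        errors += max(range(n_classes), key=votes.__getitem__) != y
    return errors / len(labels)


def select_bandwidth(vectors, labels, grid, n_classes):
    """Return the bandwidth on the grid with the smallest leave-one-out error."""
    return min(grid, key=lambda b: loo_error(vectors, labels, b, n_classes))


# Toy training sample (assumption): two well-separated classes on the line.
toy_vectors = [[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]]
toy_labels = [0, 0, 0, 1, 1, 1]
best = select_bandwidth(toy_vectors, toy_labels, [0.5, 100.0], 2)
print(best)
```

With a very large bandwidth all weights are nearly equal, so each left-out point is outvoted by the other class; the small bandwidth is therefore selected on this toy sample.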

Table 1 shows that the estimated optimal bandwidth and the average error rate decrease as the training sample size increases. This means that the practical results in the simulation study are in line with the theoretical results. Now, let us compare the average error rate (AER) obtained with the proposed classifier with that obtained with the classical kernel rule.

5.1 Comparison with the classical kernel rule

The classical kernel rule is given, for any unlabeled site $j$ with $X_{j} = x$ ⁠, by

{\tilde{g}}_{n} (x) = \arg \max_{0 \leq k \leq M} \sum_{i \in I_{n}} 1_{{Y_{i} = k}} \tilde{K} (\frac{x - X_{i}}{h_{n}}) .

where $\tilde{K} : ℝ^{d} \to ℝ_{+}$ is a kernel on $ℝ^{d}$ (the Gaussian kernel is considered here), and $h_{n}$ is a sequence of bandwidths. For the classical kernel classifier to be usable in our setting, we adjust it slightly by taking the sum over $I_{n} - M$ instead of $I_{n}$, i.e., for each $j \in M$ with $X_{j} = x$,

{\tilde{g}}_{n} (x) = \arg \max_{0 \leq k \leq M} \sum_{i \in I_{n} - M} 1_{{Y_{i} = k}} \tilde{K} (\frac{x - X_{i}}{h_{n}}) .

From the theoretical point of view, this is justified by the fact that ${\tilde{g}}_{n}$ has the same asymptotic behavior on $I_{n}$ as on $I_{n} - M$ since $M$ is bounded. In this classical kernel method, we assume that the feature vector $X_{j}$ of each element $j$ of $M$ is known and we use $x$, the value of $X_{j}$, to predict its class, whereas the classifier (1.2) needs only the observations at nearby sites to predict the label of $j$. We apply the classical kernel classifier to re-classify the elements of $M$ using the same training samples generated above and taking into account all the replications for each size $n = 50, 75, 100$. As in the application of (1.2), the optimal bandwidth ${\hat{h}}_{opt}$ is chosen by minimizing the cross-validation criterion on a training sample and the misclassification error rate (ER) is evaluated on the associated test sample. Table 2 reports the average error rate (AER), obtained by averaging the error rates associated with the corresponding $100$ test samples.

By comparing Tables 1 and 2, we observe that the corresponding error values in the two tables begin to be close as $n$ increases. This supports the possibility of using the classifier (1.2) as an alternative to the classical kernel classifier when we have to classify sites with missing features.

6. Application to a real data

A digital image is nothing more than an array of numbers indicating the variation of red, green and blue (RGB) at each location on a grid of pixels. An RGB color value is specified as rgb(red, green, blue). Each parameter (red, green, blue) defines the intensity of the color as an integer between $0$ and $255$. For example, rgb(0, 0, 255) is rendered as blue, because the blue parameter is set to its highest value $255$ and the others are set to $0$. One can divide RGB color values by $255$ in order to obtain values in the interval $[0, 1]$. Consider an image of the Eiffel Tower with $100$ missing pixels, as in Figure 3.

We use the R package jpeg to convert a jpg image into a 3-d array of numbers. The package jpeg offers the readJPEG() function, which reads raster graphics (consisting of “pixel matrices”) in jpg format into R. It returns either a single matrix with gray values in $[0, 1]$ or a 3-d array with the RGB values in $[0, 1]$, say $E$. In our example of Figure 3, the dimensions of $E$ are $306 \times 165 \times 3$. Thus, the elements of $E [, , j]$ represent the intensities of the color $j$, for $j =$ “red”, “green” or “blue”, at all pixels of the grid $I_{(306,165)}$. For example, the matrix $E [55 : 60, 1 : 6, 1]$ displays the intensities of red in each pixel of the region:

{(i, j), 55 \leq i \leq 60,1 \leq j \leq 6} .

Let $X_{(i, j)} = (X_{(i, j)}^{(1)}, X_{(i, j)}^{(2)}, X_{(i, j)}^{(3)})$ where $X_{(i, j)}^{(k)}$ is the intensity of the color $k$ at the pixel $(i, j)$. Since our purpose is to classify new sites with completely missing features, we set an arbitrary threshold of 0.4 and we define labels as follows:

Y_{(i, j)} = \begin{cases} 1, & \text{if } \min_{1 \leq k \leq 3} X_{(i, j)}^{(k)} > 0.4, \\ 0, & \text{otherwise} . \end{cases}
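The labelling rule above amounts to a one-line test per pixel. The sketch below illustrates it in Python on a hypothetical $2 \times 2$ image stored as nested lists of (red, green, blue) triples scaled to $[0, 1]$; the function names and the toy image are assumptions made for illustration, not part of the R workflow described in the text.

```python
THRESHOLD = 0.4  # the arbitrary threshold from the text


def pixel_label(rgb, threshold=THRESHOLD):
    """Label 1 if every colour intensity exceeds the threshold, else 0."""
    return int(min(rgb) > threshold)


def label_image(E):
    """E[i][j] is the (red, green, blue) triple of pixel (i, j), scaled to [0, 1]."""
    return [[pixel_label(px) for px in row] for row in E]


# Hypothetical 2 x 2 image used only to exercise the rule.
E = [[(0.9, 0.8, 0.7), (0.9, 0.3, 0.7)],
     [(0.41, 0.41, 0.41), (0.4, 0.9, 0.9)]]
image_labels = label_image(E)
print(image_labels)
```

A pixel whose smallest intensity equals the threshold exactly receives label 0, because the inequality in the definition is strict.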

The set of $100$ missing pixels is taken as a test set, say $M$. We use the classifier (1.2) (see (7.1) for the binary version) to classify each element of $M$ based on its eight neighbors. The optimal bandwidth is obtained by minimizing the cross-validation criterion on the known sites, which gives ${\hat{b}}_{opt} \approx 0.72$. The misclassification error rate (ER) is evaluated on $M$, where we obtain $ER = 0.04$, which indicates that only four of the $100$ classified cases are misclassified (see Figure 4).

Now let us use the support vector machine (SVM) classifier to re-classify the elements of $M$. In this case we have to suppose that the RGB value is known for each element of $M$. To implement the support vector machine in R, we use the package e1071. With this classifier, we get a misclassification error of $ER = 0.11$, which allows us to conclude that our kernel classifier performs well in this example compared with the SVM procedure.

7. Proof of Theorem 4.1

Without loss of generality, we prove the theorem in the binary case where $Y_{j}$ takes values in $\{0, 1\}$, since no additional argument is required to prove it in the multi-class case. In the binary case, the Bayes classifier (1.1) is given by

g^{*} (x) = \begin{cases} 0 & \text{if } ℙ \{Y_{j} = 0 | X_{(j)} = x\} \geq ℙ \{Y_{j} = 1 | X_{(j)} = x\}, \\ 1 & \text{otherwise,} \end{cases}

and the classifier (1.2) is given by

g_{n} (x) = \begin{cases} 0 & \text{if } \sum_{i \in J_{n}} 1_{\{Y_{i} = 0\}} K (\frac{x - X_{(i)}}{b_{n}}) \geq \sum_{i \in J_{n}} 1_{\{Y_{i} = 1\}} K (\frac{x - X_{(i)}}{b_{n}}), \\ 1 & \text{otherwise} . \end{cases}

(7.1)

Define

η_{n} (x) = \frac{\sum_{i \in J_{n}} Y_{i} K_{n} (x - X_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})} .

Consequently, the classifier (7.1) can be written as

g_{n} (x) = \begin{cases} 0 & \text{if } η_{n} (x) \leq \frac{\sum_{i \in J_{n}} (1 - Y_{i}) K_{n} (x - X_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})}, \\ 1 & \text{otherwise} . \end{cases}

By Theorem 2.3 in [7], the consistency will be proved if we show that

\int_{ℝ^{\tilde{d}}} | η (x) - η_{n} (x) | μ (d x) \to 0 as n \to \infty with probability one .

(7.2)

But

| η (x) - η_{n} (x) | \leq | η (x) - E η_{n} (x) | + | η_{n} (x) - E η_{n} (x) |, \forall x \in ℝ^{\tilde{d}} .

Hence, in order to prove (7.2), it suffices to show that

\int_{ℝ^{\tilde{d}}} | η (x) - E η_{n} (x) | μ (d x) \to 0 as n \to \infty

(7.3)

and

\int_{ℝ^{\tilde{d}}} | η_{n} (x) - E η_{n} (x) | μ (d x) \to 0 as n \to \infty with probability one .

(7.4)

The proof of (7.3) is the same as in the i.i.d. case (see [7], pp. 156–157 ). So, it suffices to prove (7.4). To do that, we will employ the blocking technique used in [4]. Let $p = p_{n} = [{\hat{n}}^{γ}]$ for some $1 / θ < γ < 1 / (2 N)$ (where $[.]$ stands for the integer part). Without loss of generality, we suppose that there exists a positive integer $q_{k}$ such that $n_{k} = 2 p q_{k}$ for each $k = 1, \dots, N$ ⁠. Let

J_{q} = \{j = (j_{1}, \dots, j_{N}) \in ℕ^{N} : 0 \leq j_{k} \leq q_{k} - 1, \forall k = 1, \dots, N\} .

We define blocks as follows: for each $j \in J_{q}$,

S_{j}^{(1)} = \{i \in I_{n} : 2 j_{k} p + 1 \leq i_{k} \leq (2 j_{k} + 1) p, k = 1, \dots, N\},

S_{j}^{(2)} = \{i \in I_{n} : 2 j_{k} p + 1 \leq i_{k} \leq (2 j_{k} + 1) p, k = 1, \dots, N - 1, \text{ and } (2 j_{N} + 1) p + 1 \leq i_{N} \leq 2 (j_{N} + 1) p\},

\dots

S_{j}^{(2^{N} - 1)} = \{i \in I_{n} : (2 j_{k} + 1) p + 1 \leq i_{k} \leq 2 (j_{k} + 1) p, k = 1, \dots, N - 1, \text{ and } 2 j_{N} p + 1 \leq i_{N} \leq (2 j_{N} + 1) p\},

S_{j}^{(2^{N})} = \{i \in I_{n} : (2 j_{k} + 1) p + 1 \leq i_{k} \leq 2 (j_{k} + 1) p, k = 1, \dots, N\} .

As a consequence, we have $I_{n} = ⋃_{k = 1}^{2^{N}} ⋃_{j \in J_{q}} S_{j}^{(k)}$, and for each $k = 1, \dots, 2^{N}$, $card (S_{j}^{(k)}) = p^{N}$ and $dist (S_{j}^{(k)}, S_{j^{'}}^{(k)}) \geq p$ for any $j \neq j^{'}$. Let $Γ_{j}^{(k)} = \{i \in S_{j}^{(k)} : ν_{i} \subset S_{n}\}$, for each $k = 1, \dots, 2^{N}$ and $j \in J_{q}$. Hence, for a fixed $k$, we have $dist (Γ_{j}^{(k)}, Γ_{j^{'}}^{(k)}) \geq p$ for any $j \neq j^{'}$, $card (Γ_{j}^{(k)}) \leq card (S_{j}^{(k)}) = p^{N}$ and

J_{n} = ⋃_{k = 1}^{2^{N}} ⋃_{j \in J_{q}} Γ_{j}^{(k)} .

(7.5)
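The block construction can be checked on a small grid. The sketch below is an illustration for $N = 2$ only, with hypothetical values $p = 2$ and $(q_{1}, q_{2}) = (2, 3)$; instead of the numbering $k = 1, \dots, 2^{N}$ it indexes the block type by a pair `shift` in $\{0, 1\}^{2}$, which records, coordinate by coordinate, whether the first or the second half of the interval of length $2 p$ is used. It verifies that the blocks partition $I_{n}$, that each contains $p^{N}$ sites, and that two distinct blocks of the same type are at distance at least $p$.

```python
from itertools import product


def build_blocks(p, q):
    """Blocks S_j^(k) for N = 2 on I_n with n_k = 2 * p * q[k]; sites are 1-based."""
    blocks = {}
    for j in product(range(q[0]), range(q[1])):
        # shift[k] = 0 selects [2 j_k p + 1, (2 j_k + 1) p],
        # shift[k] = 1 selects [(2 j_k + 1) p + 1, 2 (j_k + 1) p].
        for shift in product((0, 1), repeat=2):
            ranges = [range((2 * j[k] + shift[k]) * p + 1,
                            (2 * j[k] + shift[k] + 1) * p + 1) for k in range(2)]
            blocks[(shift, j)] = set(product(*ranges))
    return blocks


def set_distance(E, F):
    """Euclidean distance between two finite sets of sites."""
    return min(((a - c) ** 2 + (b - d) ** 2) ** 0.5 for a, b in E for c, d in F)


p, q = 2, (2, 3)
blocks = build_blocks(p, q)
grid = set(product(range(1, 2 * p * q[0] + 1), range(1, 2 * p * q[1] + 1)))

covers = set().union(*blocks.values()) == grid
disjoint = sum(len(b) for b in blocks.values()) == len(grid)
separated = all(
    set_distance(blocks[(s, j)], blocks[(t, jj)]) >= p
    for (s, j) in blocks for (t, jj) in blocks if s == t and j != jj
)
print(covers, disjoint, separated)
```

Two neighbouring blocks of the same type are separated by a strip of $p$ sites belonging to blocks of other types, which is why their distance is at least $p$.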

Let ${(X_{(i)}^{*}, Y_{i}^{*})}_{i \in I_{n} - J_{n}}$ be a family of independent and identically distributed random vectors, independent of ${(X_{(i)}, Y_{i})}_{i \in J_{n}}$, such that each $(X_{(i)}^{*}, Y_{i}^{*})$ has the same distribution as $(X_{(1)}, Y_{1})$. For the blocking technique to make sense, we define random vectors as follows: for each $i \in I_{n}$,

({\bar{X}}_{(i)}, {\bar{Y}}_{i}) = \begin{cases} (X_{(i)}, Y_{i}) & \text{if } ν_{i} \subset S_{n}, \\ (X_{(i)}^{*}, Y_{i}^{*}) & \text{if } ν_{i} ⊄ S_{n} . \end{cases}

It is clear that $\{({\bar{X}}_{(i)}, {\bar{Y}}_{i}), i \in J_{n}\} = \{(X_{(i)}, Y_{i}), i \in J_{n}\}$ and $\{({\bar{X}}_{(i)}, {\bar{Y}}_{i}), i \in Γ_{j}^{(k)}\} = \{(X_{(i)}, Y_{i}), i \in Γ_{j}^{(k)}\}$. Now, for a fixed $k$ and each $j \in J_{q}$, let $W_{j}^{(k)} = \{({\bar{X}}_{(i)}, {\bar{Y}}_{i}), i \in S_{j}^{(k)}\}$ be a vector whose components are ordered according to a given order on indices. Applying Lemma 3.2 together with the blocks decomposition introduced by [10] (see also [20]) to the family of vectors $\{W_{j}^{(k)}, j \in J_{q}\}$, we can generate copies $\{{\tilde{W}}_{j}^{(k)}, j \in J_{q}\}$ that are mutually independent and such that, for each $j \in J_{q}$, ${\tilde{W}}_{j}^{(k)} = \{({\tilde{\bar{X}}}_{(i)}, {\tilde{\bar{Y}}}_{i}), i \in S_{j}^{(k)}\}$ has the same distribution as $W_{j}^{(k)} = \{({\bar{X}}_{(i)}, {\bar{Y}}_{i}), i \in S_{j}^{(k)}\}$. Furthermore, by Lemma 3.5, we have $P (W_{j}^{(k)} \neq {\tilde{W}}_{j}^{(k)}) \leq ϕ (p - \tilde{r})$ since $p \geq \tilde{r}$ for $\hat{n}$ large enough. Thus, the two vectors $({\tilde{\bar{X}}}_{(i)}, {\tilde{\bar{Y}}}_{i})$ and $({\tilde{\bar{X}}}_{(i')}, {\tilde{\bar{Y}}}_{i'})$ are independent for each $i \in S_{j}^{(k)}$ and $i' \in S_{j'}^{(k)}$ with $j \neq j'$. Now, for each $i \in J_{n}$, there exists $j \in J_{q}$ such that $\{(X_{(i)}, Y_{i}) \neq ({\tilde{\bar{X}}}_{(i)}, {\tilde{\bar{Y}}}_{i})\} \subseteq (W_{j}^{(k)} \neq {\tilde{W}}_{j}^{(k)})$. Since $({\bar{X}}_{(i)}, {\bar{Y}}_{i}) = (X_{(i)}, Y_{i})$ for each $i \in J_{n}$, denote $({\tilde{X}}_{(i)}, {\tilde{Y}}_{i}) = ({\tilde{\bar{X}}}_{(i)}, {\tilde{\bar{Y}}}_{i})$, for each $i \in J_{n}$ (or $i \in Γ_{j}^{(k)}$). As a consequence

P {(X_{(i)}, Y_{i}) \neq ({\tilde{X}}_{(i)}, {\tilde{Y}}_{i})} \leq ϕ (p - \tilde{r}), for each i \in J_{n} .

(7.6)

By (7.5), we can write

\sum_{i \in J_{n}} {\tilde{Y}}_{i} K_{n} (x - {\tilde{X}}_{(i)}) = \sum_{k = 1}^{2^{N}} \sum_{j \in J_{q}} \sum_{i \in Γ_{j}^{(k)}} {\tilde{Y}}_{i} K_{n} (x - {\tilde{X}}_{(i)}) .

If we denote

{\tilde{η}}_{n} (x) = \frac{\sum_{i \in J_{n}} {\tilde{Y}}_{i} K_{n} (x - {\tilde{X}}_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})} and {\tilde{η}}_{n, k} (x) = \frac{\sum_{j \in J_{q}} \sum_{i \in Γ_{j}^{(k)}} {\tilde{Y}}_{i} K_{n} (x - {\tilde{X}}_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})},

(7.7)

then

{\tilde{η}}_{n} (x) = \sum_{k = 1}^{2^{N}} {\tilde{η}}_{n, k} (x) .

(7.8)

Using Markov’s inequality and Lemma 3.3 together with (7.7), we have for any $ϵ > 0$ ⁠,

\begin{array}{l} ℙ (| \int_{ℝ^{\tilde{d}}} | η_{n} (x) - E η_{n} (x) | μ (d x) - \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - E {\tilde{η}}_{n} (x) | μ (d x) | > ϵ) \\ \leq ϵ^{- 1} E | \int_{ℝ^{\tilde{d}}} | η_{n} (x) - E η_{n} (x) | μ (d x) - \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - E {\tilde{η}}_{n} (x) | μ (d x) | \\ \leq ϵ^{- 1} E (\int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - η_{n} (x) | μ (d x) + E \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - η_{n} (x) | μ (d x)) \\ = 2 ϵ^{- 1} E \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - η_{n} (x) | μ (d x) \\ = 2 ϵ^{- 1} E \int_{ℝ^{\tilde{d}}} | \frac{\sum_{i \in J_{n}} {\tilde{Y}}_{i} K_{n} (x - {\tilde{X}}_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})} - \frac{\sum_{i \in J_{n}} Y_{i} K_{n} (x - X_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})} | μ (d x) \\ \leq 4 ϵ^{- 1} \sum_{i \in J_{n}} E 1_{{({\tilde{X}}_{(i)}, {\tilde{Y}}_{i}) \neq (X_{(i)}, Y_{i})}} \sup_{u \in ℝ^{\tilde{d}}} \int_{ℝ^{\tilde{d}}} \frac{K_{n} (x - u)}{\hat{n} E K_{n} (x - X_{(1)})} μ (d x) \\ \leq 4 {(ϵ \hat{n})}^{- 1} ρ \sum_{i \in J_{n}} E 1_{{({\tilde{X}}_{(i)}, {\tilde{Y}}_{i}) \neq (X_{(i)}, Y_{i})}} \leq 4 ϵ^{- 1} ρ ϕ (p - \tilde{r}), \end{array}

where $ρ > 0$ is the constant defined in Lemma 3.3. Since $\tilde{r}$ is bounded and $p \to \infty$ as $n \to \infty$, we have $p - \tilde{r} \geq p / 2$ for $\hat{n}$ large enough. Therefore, we get

\begin{array}{l} ℙ (| \int_{ℝ^{\tilde{d}}} | η_{n} (x) - E η_{n} (x) | μ (d x) - \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - E {\tilde{η}}_{n} (x) | μ (d x) | > ϵ) \\ \leq 4 ϵ^{- 1} ρ ϕ (p / 2) \leq C ϵ^{- 1} ρ {\hat{n}}^{- γ θ}, \end{array}

for some generic positive constant $C > 0$ ⁠. Since $γ θ > 1$ ⁠, by Borel–Cantelli lemma, we have

\int_{ℝ^{\tilde{d}}} | η_{n} (x) - E η_{n} (x) | μ (d x) - \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - E {\tilde{η}}_{n} (x) | μ (d x) \to 0,

(7.9)

with probability one. Now, we will show that

\int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - E {\tilde{η}}_{n} (x) | μ (d x) \to 0 with  probability  one .

(7.10)

By (7.7) and (7.8), we have

\int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n} (x) - E {\tilde{η}}_{n} (x) | μ (d x) \leq \sum_{k = 1}^{2^{N}} \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n, k} (x) - E {\tilde{η}}_{n, k} (x) | μ (d x) .

(7.11)

Consequently, in order to establish (7.10), it is sufficient to show that

\int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n, k} (x) - E {\tilde{η}}_{n, k} (x) | μ (d x) \to 0 as n \to \infty with probability one ,

(7.12)

for each $1 \leq k \leq 2^{N}$. Without loss of generality, we show (7.12) for $k = 1$. If the elements of $J_{q}$ are enumerated in an arbitrary manner, we can write $J_{q} = \{1, \dots, m\}$ with $m = card (J_{q}) = \prod_{k = 1}^{N} q_{k}$. Denote ${\tilde{Z}}_{j} = \{({\tilde{X}}_{(i)}, {\tilde{Y}}_{i}), i \in S_{j}^{(1)}\}$, for each $j = 1, \dots, m$, where the components of ${\tilde{Z}}_{j}$ are ordered according to an arbitrary order on indices. Recall that $({\tilde{X}}_{(i)}, {\tilde{Y}}_{i})$ has been defined for $i \in Γ_{j}^{(1)}$, and suppose that $({\tilde{X}}_{(i)}, {\tilde{Y}}_{i})$ is replaced by $(0_{\tilde{d}}, 0)$ if $i \notin Γ_{j}^{(1)}$, where $0_{\tilde{d}} = (0, \dots, 0) \in ℝ^{\tilde{d}}$. Hence, by the blocks decomposition, the random vectors ${\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{m}$ are independent. Let $F : {({(ℝ^{\tilde{d}} \times \{0, 1\})}^{p^{N}})}^{m} \to ℝ$ be a real function defined as follows

\begin{array}{l} F ({\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{m}) = \int_{ℝ^{\tilde{d}}} | \sum_{j = 1}^{m} \sum_{i \in S_{j}^{(1)}} (\frac{{\tilde{Y}}_{i} K_{n} (x - {\tilde{X}}_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})} - \frac{E {\tilde{Y}}_{1} K_{n} (x - {\tilde{X}}_{(1)})}{\hat{n} E K_{n} (x - X_{(1)})}) | μ (d x) \\ = \int_{ℝ^{\tilde{d}}} | \sum_{j = 1}^{m} \sum_{i \in Γ_{j}^{(1)}} (\frac{{\tilde{Y}}_{i} K_{n} (x - {\tilde{X}}_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})} - \frac{E {\tilde{Y}}_{1} K_{n} (x - {\tilde{X}}_{(1)})}{\hat{n} E K_{n} (x - X_{(1)})}) | μ (d x) \\ = \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n, 1} (x) - E {\tilde{η}}_{n, 1} (x) | μ (d x) . \end{array}

For ${\tilde{z}}_{j} \neq {\tilde{z}}_{j}^{'}$, where ${\tilde{z}}_{j} = \{({\tilde{x}}_{(i)}, {\tilde{y}}_{i}), i \in S_{j}^{(1)}\}$ and ${\tilde{z}}_{j}^{'} = \{({\tilde{x}}_{(i)}^{'}, {\tilde{y}}_{i}^{'}), i \in S_{j}^{(1)}\}$ belong to ${(ℝ^{\tilde{d}} \times \{0, 1\})}^{p^{N}}$ and $({\tilde{x}}_{(i)}, {\tilde{y}}_{i}) = ({\tilde{x}}_{(i)}^{'}, {\tilde{y}}_{i}^{'}) = (0_{\tilde{d}}, 0)$ for each $i \notin Γ_{j}^{(1)}$, using Lemma 3.3, we have

\begin{array}{l} | F ({\tilde{Z}}_{1}, \dots, {\tilde{z}}_{j}, \dots, {\tilde{Z}}_{m}) - F ({\tilde{Z}}_{1}, \dots, {\tilde{z}}_{j}^{'}, \dots, {\tilde{Z}}_{m}) | \\ \leq \int_{ℝ^{\tilde{d}}} | \sum_{i \in Γ_{j}^{(1)}} \frac{{\tilde{y}}_{i} K_{n} (x - {\tilde{x}}_{(i)})}{\hat{n} E K_{n} (x - X_{(1)})} - \sum_{i \in Γ_{j}^{(1)}} \frac{{\tilde{y}}_{i}^{'} K_{n} (x - {\tilde{x}}_{(i)}^{'})}{\hat{n} E K_{n} (x - X_{(1)})} | μ (d x) \\ \leq 2 p^{N} \sup_{u \in ℝ^{\tilde{d}}} \int_{ℝ^{\tilde{d}}} \frac{K_{n} (x - u)}{\hat{n} E K_{n} (x - X_{(1)})} μ (d x) \leq 2 ρ p^{N} {\hat{n}}^{- 1} . \end{array}
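
The chain of inequalities above compresses a few elementary steps. The following sketch spells them out. It reads Lemma 3.3 as the bound $\sup_{u} \int K_{n} (x - u) / E K_{n} (x - X_{(1)}) μ (d x) \leq ρ$, which is what the last inequality suggests (an assumption, since the lemma is not restated here), and it assumes $K_{n} \geq 0$ and $Γ_{j}^{(1)} \subseteq S_{j}^{(1)}$. The symbols $a (x)$ and $a^{'} (x)$ are auxiliary notation introduced only for this sketch: they denote the integrand of $F$, before taking the absolute value, when the $j$-th argument equals ${\tilde{z}}_{j}$ and ${\tilde{z}}_{j}^{'}$ respectively.

```latex
% a(x), a'(x): sketch notation for the two integrands (block j set to z_j, z_j')
\begin{aligned}
\Bigl| \int |a(x)| \, \mu(dx) - \int |a'(x)| \, \mu(dx) \Bigr|
  &\leq \int |a(x) - a'(x)| \, \mu(dx)
  && \text{(reverse triangle inequality)} \\
  &\leq \sum_{i \in \Gamma_j^{(1)}} \int
     \frac{\tilde{y}_i K_n(x - \tilde{x}_{(i)}) + \tilde{y}_i' K_n(x - \tilde{x}_{(i)}')}
          {\hat{n} \, E K_n(x - X_{(1)})} \, \mu(dx)
  && \text{(only block } j \text{ differs; assumes } K_n \geq 0\text{)} \\
  &\leq \frac{2 \operatorname{card}(\Gamma_j^{(1)})}{\hat{n}}
     \sup_{u} \int \frac{K_n(x - u)}{E K_n(x - X_{(1)})} \, \mu(dx)
  && \text{(since } \tilde{y}_i, \tilde{y}_i' \in \{0, 1\}\text{)} \\
  &\leq \frac{2 \rho \, p^{N}}{\hat{n}}
  && \text{(} \operatorname{card}(\Gamma_j^{(1)}) \leq p^{N} \text{ and the assumed form of Lemma 3.3)}
\end{aligned}
```

The centering terms cancel in $a (x) - a^{'} (x)$ because they do not depend on the $j$-th argument, which is why only the sums over $Γ_{j}^{(1)}$ remain.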

Hence, since $\hat{n} = 2^{N} p^{N} m$ with $m = \prod_{k = 1}^{N} q_{k}$, by McDiarmid’s inequality [16], we have, for every $ϵ > 0$,

ℙ (| F ({\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{m}) - E F ({\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{m}) | > ϵ) \leq 2 \exp (- \frac{2^{N - 1} ϵ^{2} \hat{n}}{ρ^{2} p^{N}}) .
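
For readers who wish to check the constant in the exponent, here is a short derivation. It assumes the standard form of McDiarmid's inequality, $ℙ (| F - E F | > ϵ) \leq 2 \exp (- 2 ϵ^{2} / \sum_{j} c_{j}^{2})$, with bounded-difference constants $c_{j} = 2 ρ p^{N} {\hat{n}}^{- 1}$ taken from the preceding display. The symbol $c_{j}$ does not appear in the original argument and is used here only for illustration.

```latex
% c_j = 2 \rho p^N / \hat{n} for every j, and m = \hat{n} / (2^N p^N)
\begin{aligned}
\sum_{j=1}^{m} c_j^{2}
  &= m \cdot \frac{4 \rho^{2} p^{2N}}{\hat{n}^{2}}
   = \frac{\hat{n}}{2^{N} p^{N}} \cdot \frac{4 \rho^{2} p^{2N}}{\hat{n}^{2}}
   = \frac{4 \rho^{2} p^{N}}{2^{N} \hat{n}}, \\
\frac{2 \epsilon^{2}}{\sum_{j=1}^{m} c_j^{2}}
  &= \frac{2 \epsilon^{2} \cdot 2^{N} \hat{n}}{4 \rho^{2} p^{N}}
   = \frac{2^{N-1} \epsilon^{2} \hat{n}}{\rho^{2} p^{N}} .
\end{aligned}
```

The second line is exactly the exponent appearing in the bound above.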

Since $p = [{\hat{n}}^{γ}]$ with $1 / θ < γ < 1 / (2 N)$, we have ${\hat{n}}^{1 - γ N} / \log (\hat{n}) \to \infty$, and the Borel–Cantelli lemma yields

F ({\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{m}) - E F ({\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{m}) \to 0 \quad \text{with probability one} .
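
The Borel–Cantelli step can be made explicit as follows. This is only a sketch: it assumes that $\hat{n} = n_{1} \cdots n_{N}$ and that the probabilities are summed over $n \in ℕ^{N}$, neither of which is restated in this part of the proof, and the threshold $n_{0}$ is hypothetical notation introduced for the sketch.

```latex
% Uses p = [\hat{n}^{\gamma}] \leq \hat{n}^{\gamma} and \gamma N < 1/2
\begin{aligned}
p^{N} \leq \hat{n}^{\gamma N}
  &\;\Longrightarrow\;
  2 \exp\Bigl( - \frac{2^{N-1} \epsilon^{2} \hat{n}}{\rho^{2} p^{N}} \Bigr)
  \leq 2 \exp\Bigl( - \frac{2^{N-1} \epsilon^{2}}{\rho^{2}} \, \hat{n}^{1 - \gamma N} \Bigr), \\
\frac{\hat{n}^{1 - \gamma N}}{\log \hat{n}} \to \infty
  &\;\Longrightarrow\;
  \frac{2^{N-1} \epsilon^{2}}{\rho^{2}} \, \hat{n}^{1 - \gamma N} \geq 2 \log \hat{n}
  \quad \text{for all } \hat{n} \geq n_{0}, \\
\sum_{n \,:\, \hat{n} \geq n_{0}}
  2 \exp\Bigl( - \frac{2^{N-1} \epsilon^{2} \hat{n}}{\rho^{2} p^{N}} \Bigr)
  &\leq 2 \sum_{n \in \mathbb{N}^{N}} \hat{n}^{-2}
  = 2 \Bigl( \sum_{k=1}^{\infty} k^{-2} \Bigr)^{N} < \infty .
\end{aligned}
```

Only finitely many $n$ satisfy $\hat{n} < n_{0}$, so the full series of probabilities converges for every $ϵ > 0$, which is what the Borel–Cantelli lemma requires.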

As a consequence

\int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n, 1} (x) - E {\tilde{η}}_{n, 1} (x) | μ (d x) - E \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n, 1} (x) - E {\tilde{η}}_{n, 1} (x) | μ (d x) \to 0

(7.13)

with probability one. In order to complete the proof of (7.12) for $k = 1$, it remains to show that

E F ({\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{m}) = E \int_{ℝ^{\tilde{d}}} | {\tilde{η}}_{n, 1} (x) - E {\tilde{η}}_{n, 1} (x) | μ (d x) \to 0 .

(7.14)

The proof of (7.14) follows by the same arguments as in [21, Section 5], together with Lemmas 3.1, 3.4 and 3.5. Combining (7.9), (7.10) and (7.12)–(7.14), we get (7.4). Finally, (7.3) and (7.4) yield (7.2), and the proof is complete. □

The author would like to thank the anonymous referees whose valuable comments led to an improved version of the paper.

The publisher wishes to inform readers that the article “Strong consistency of a kernel-based rule for spatially dependent data” was originally published by the previous publisher of the Arab Journal of Mathematical Sciences and the pagination of this article has been subsequently changed. There has been no change to the content of the article. This change was necessary for the journal to transition from the previous publisher to the new one. The publisher sincerely apologises for any inconvenience caused. To access and cite this article, please use: Younso, A., Kanaya, Z., Azhari, N. (2019), “Strong consistency of a kernel-based rule for spatially dependent data”, Arab Journal of Mathematical Sciences, Vol. 26 No. 1/2, pp. 211-225. The original publication date for this paper was 13/11/2019.

References

[1] H.C.P. Berbee, Random Walks with Stationary Increments and Renewal Theory, Math. Cent. Tract., Amsterdam, 1979.

$n$	50	75	100
${\hat{b}}_{o p t}$	2.04	1.93	1.77
$A E R$	28.1%	21.2%	14.8%

$n$	25	50	80
${\hat{h}}_{o p t}$	1.85	1.72	1.69
$A E R$	23.4%	18.7%	13.2%
