The application of machine learning to image and video data often yields a high dimensional feature space. Effective feature selection techniques identify a discriminant feature subspace that lowers computational and modeling costs with little performance degradation. A novel supervised feature selection methodology is proposed for machine learning decisions in this work. The resulting tests are called the discriminant feature test (DFT) and the relevant feature test (RFT) for the classification and regression problems, respectively. The DFT and RFT procedures are described in detail. Furthermore, we compare the effectiveness of DFT and RFT with several classic feature selection methods. To this end, we use deep features obtained by LeNet-5 for MNIST and Fashion-MNIST datasets as illustrative examples. Other datasets with handcrafted and gene expressions features are also included for performance evaluation. It is shown by experimental results that DFT and RFT can select a lower dimensional feature subspace distinctly and robustly while maintaining high decision performance.
1 Introduction
Traditional machine learning algorithms are susceptible to the curse of feature dimensionality [18]. Their computational complexity increases with high dimensional features. Redundant features may not be helpful in discriminating classes or reducing regression error, and they should be removed. Sometimes, redundant features may even produce negative effects as their number grows. Their detrimental impact should be minimized or controlled. To deal with these problems, feature selection techniques [29, 37, 39] are commonly applied as a data pre-processing step or part of the data analysis to simplify the complexity of the model. Feature selection techniques involve the identification of a subspace of discriminant features from the input, which describe the input data efficiently, reduce effects from noise or irrelevant features, and provide good prediction results [16].
For machine learning with image/video data, the deep learning technology, which adopts a pre-defined network architecture and optimizes the network parameters using an end-to-end optimization procedure, is dominating nowadays. Yet, an alternative that returns to the traditional pattern recognition paradigm based on feature extraction and classification two modules in cascade has also been studied, e.g., [8−10, 23, 24, 27, 28, 33, 41, 42]. The feature extraction module contains two steps: unsupervised representation learning and supervised feature selection. Examples of unsupervised representation learning include multi-stage Saab [24] and Saak transforms [10]. Here, we focus on the second step; namely, supervised feature selection from a high dimensional feature space.
Inspired by information theory and the decision tree, a novel supervised feature selection method is proposed in this work. The resulting tests are called the discriminant feature test (DFT) and the relevant feature test (RFT), respectively, for the classification and regression problems. The DFT and RFT procedures are described in detail. We compare the effectiveness of DFT and RFT with several classic feature selection methods. Experimental results show that DFT and RFT can select a significantly lower dimensional feature subspace distinctly and robustly while maintaining high decision performance.
2 Review of Previous Work
Feature selection methods can be categorized into unsupervised [5, 30, 32, 36], semi-supervised [35, 43], and supervised [20] three types. Unsupervised methods focus on the statistics of input features while ignoring the target class or value. Straightforward unsupervised methods can be fast, e.g., removing redundant features using correlation, removing features of low variance. However, their power is limited and less effective than supervised methods. More advanced unsupervised methods adopt clustering. Examples include [1, 19, 26]. Their complexity is higher, their behavior is not well understood, and their performance is not easy to evaluate systematically. Overall, this is an open research field.
Existing semi-supervised and supervised feature selection methods can be classified into wrapper, filter and embedded three classes [35]. Wrapper methods [22] create multiple models with different subsets of input features and select the model containing the features that yield the best performance. One example is recursive feature elimination [17]. This process can be computationally expensive. Filter methods involve evaluating the relationship between input and target variables using statistics and selecting those variables that have the strongest relation with the target ones. One example is the analysis of variance (ANOVA) [34]. This approach is computationally efficient with robust performance. Another example is feature selection based on linear discriminant analysis (LDA). It finds the most separable projection directions. The objective function of LDA is used to select discriminant features from the existing feature dimensions by measuring the ratio between the between-class scatter matrix and the within-class scatter matrix. It can be generalized from the 2-class problem to the multi-class problem. Embedded methods perform feature selection in the process of training and are usually specific to a single learner. One example is “feature importance” (FI) obtained from the training process of the XGBoost classifier/regressor [7], which is also known as “feature selection from model.”
Inspired by information theory and the decision tree, a novel supervised feature selection methodology is proposed in this work. The resulting tests are called the DFT and the RFT for classification and regression tasks, respectively. Our proposed methods belong to the filter methods, which give a score to each dimension and select features based on feature ranking. The scores are measured by the weighted entropy and the weighted MSE for DFT and RFT, which reflect the discriminant power and relevance degree to classification and regression targets, respectively.
To demonstrate the power of DFT and RFT, we conduct performance benchmarking between DFT/RFT, ANOVA and FI from XGBoost in the experimental section. To this end, we use deep features obtained by LeNet-5 for MNIST and Fashion-MNIST datasets as illustrative examples. Other datasets with handcrafted features and gene expressions features are also used for performance benchmarking. Comparison with the minimal-redundancy-maximal-relevance (mRMR) criterion [13, 31], which is a more advanced feature selection method, is also conducted. It is shown by experimental results that DFT and RFT can select a lower dimensional feature subspace distinctly and robustly while maintaining high decision performance.
3 Proposed Feature Selection Methods
Being motivated by the feature selection process in the decision tree classifier, we propose two feature selection methods, DFT and RFT, in this section, as illustrated in Figure 1. They will be detailed in Sections 3.1 and 3.2, respectively. Finally, robustness of DFT and RFT will be discussed in Section 3.3.
An overview of the proposed feature selection methods: DFT and RFT. For the i-th feature, DFT measures the class distribution in and to compute the weighted entropy as the DFT loss, while RFT measures the weighted estimated regression MSE in both sets as the RFT loss.
An overview of the proposed feature selection methods: DFT and RFT. For the i-th feature, DFT measures the class distribution in and to compute the weighted entropy as the DFT loss, while RFT measures the weighted estimated regression MSE in both sets as the RFT loss.
3.1 Discriminant Feature Test
Consider a classification problem with N data samples, P features and C classes. Let fi, 1 ≤ i ≤ P, be a feature dimension and its minimum and maximum are and , respectively. DFT is used to measure the discriminant power of each feature dimension out of a P-dimensional feature space independently. If feature fi is a discriminant one, we expect data samples projected to it should be classified more easily. To check it, one idea is to partition into M nonoverlapping subintervals and adopt the maximum likelihood rule to assign the class label to samples inside each subinterval. Then, we can compute the percentage of correct predictions. The higher the prediction accuracy, the higher the discriminant power. Although prediction accuracy may serve as an indicator for purity, it does not tell the distribution of the remaining C – 1 classes if C > 2. Thus, it is desired to consider other purity measures.
In our design, we use the weighted entropy of the left and right subsets as the DFT loss to measure the discriminant power of each dimension. The reason of choosing the weighted entropy as the cost is that it considers the probability distribution of all classes instead of the maximum likelihood rule in prediction accuracy as stated above. A lower entropy value is obtained from a more biased distribution of classes, indicating the subinterval is dominated by fewer classes.
By following the practice of a binary decision tree, we consider the case, M = 2, as shown in the left subfigure of Figure 1, where denotes the threshold position of two sub-intervals. If a sample with its ith dimension, , it goes to the subset associated with the left subinterval. Otherwise, it will go to the subset associated with the right subinterval. Formally, the procedure of DFT consists of three steps for each dimension as detailed below.
3.1.1 Training Sample Partitioning
For the ith feature, fi, we need to search for the optimal threshold, , between and partition training samples into two subsets and via
where represents the i-th feature of the n-th training sample xn, and is selected automatically to optimize a certain purity measure. To limit the search space of , we partition the entire feature range, , into B uniform segments and search the optimal threshold among the following B – 1 candidates:
where B = 2j, j = 1, 2 ·, is examined in Section 3.3.
3.1.2 DFT Loss Measured by Entropy
Samples of different classes belong to or . Without loss of generality, the following discussion is based on the assumption that each class has the same number of samples in the full training set; namely ⋃ . To measure the purity of subset corresponding to the partition point , we use the following entropy metric:
where is the probability of class c in . Similarly, we can compute entropy for subset . Then, the entropy of the full training set against partition is the weighted average of HL,t and HR,t in form of
where and are the sample numbers in subsets and , respectively, and is the total number of training samples. The optimized entropy for the i-th feature is given by
where T indicates the set of partition points.
3.1.3 Feature Selection Based on Optimized Loss
We conduct search for optimized entropy values, , of all feature dimensions, ƒi, 1 ≤ i ≤ P and order the values of from the smallest to the largest ones. The lower the value, the more discriminant the ith-dimensional feature, ƒi. Then, we select the top K features with the lowest entropy values as discriminant features. To choose the value of K with little ambiguity, it is critical the rank-ordered curve of should satisfy one important criterion. That is, it should have a distinct and narrow elbow region. We will show that this is indeed the case in Section 4.
3.2 Relevant Feature Test
For regression tasks, the mapping between an input feature and a target scalar function can be more efficiently built if the feature dimension has the ability to separate samples into segments with smaller variances. This is because the regressor can use the mean of each segment as the target value, and its corresponding variance indicates the prediction mean squared-error (MSE) of the segment. Motivated by this observation and the binary decision tree, RFT partitions a feature dimension into left and right two segments and evaluates the total MSE from them. We use this approximation error as the RFT loss function. The smaller the RFT loss, the better the feature dimension. Again, the RFT loss depends on the threshold . The process of selecting more powerful feature dimensions for regression is named RFT. Similar to DFT, RFT has three steps. They are elaborated below. Here, we adopt the same notations as those in Section 3.1.
3.2.1 Training Sample Partitioning
By following the first step in DFT, we search for the optimal threshold, , between and partition training samples into two subsets and for the ith feature, ƒi. Again, we partition the feature range, , into B uniform segments and search the optimal threshold among the following B – 1 candidates as given in Equation (3).
3.2.2 RFT Loss Measured by Estimated Regression MSE
We use y to denote the regression target value. For the ith feature dimension, ƒi we partition the sample space into two disjoint ones and . Let and be the mean of target values in and , and we use and as the estimated regression value of all samples in and , respectively. Then, the RFT loss is defined as the sum of estimated regression MSEs of and . Mathematically, we have
where and denote the sample numbers and the estimated regression MSEs in subsets and , respectively. Feature ƒi is characterized by its optimized estimated regression MSE over the set, T, of candidate partition points:
3.2.3 Feature Selection Based on Optimized Loss
We order the optimized estimated regression MSE value, across all feature dimensions, ƒi, 1 ≤ i ≤ P, from the smallest to the largest ones. The lower the value, the more relevant the ith-dimensional feature, ƒi. Afterwards, we select the top K features with the lowest estimated regression MSE values as relevant features.
3.3 Robustness Against Bin Numbers
For smooth DFT/RFT loss curves with a sufficiently large bin number (say, B ≥ 16), the optimized loss value does not vary much by increasing B furthermore as shown in Figure 2. Figures 2(a) and (b) show the DFT and RFT loss functions for an exemplary feature, ƒi, under two binning schemes; i.e., B =16 and B = 64, respectively. We see that the binning B =16 is fine enough to locate the optimal partition . If B = 2j, j = 1,2, ·, the set of partition points in a small B value is a subset of those of a larger B value. Generally, we have the following observations. The difference of the DFT/RFT loss between adjacent candidate points changes smoothly. Since the global minimum has a flat bottom, the loss function is low for a range of partition thresholds. The feature will achieve a similar loss level with multiple binning schemes. For example, Figure 2(a) shows that B =16 reaches the global minimum at fi = 5.21 while B = 64 reaches the global minimum at fi = 5.78. The difference is about 3% of the full dynamic range of fi. Similar characteristics are observed for all feature dimensions in DFT/RFT, indicating the robustness of DFT/RFT. For lower computational complexity and avoiding overfitting, we typically choose B = 16 or B = 32.
Comparison of two binning schemes with B = 16 and B = 64: (a) DFT and (b) RFT.
4 Experimental Results
4.1 Image Datasets with High Dimensional Feature Space
To demonstrate the power of DFT and RFT, we consider several classical datasets. They include MNIST [25], Fashion-MNIST [40], the Multiple Features (MultiFeat) dataset [4, 21, 38], the Arrhythmia (ARR) dataset [15] from the UCI machine learning archive [14], and the Colon cancer dataset [2]. The latter three are used to measure DFT in the classification problem setting.
Classification test accuracy (%) of LeNet-5 on MNIST and Fashion-MNIST.
| . | Clean . | Noisy . |
|---|---|---|
| MNIST | 99.18 | 98.85 |
| Fashion-MNIST | 90.19 | 86.95 |
| . | Clean . | Noisy . |
|---|---|---|
| MNIST | 99.18 | 98.85 |
| Fashion-MNIST | 90.19 | 86.95 |
Dataset-1: MNIST and Fashion-MNIST. Both datasets contain grayscale images of resolution 28 × 28, with 60K training and 10K test images. MNIST has 10 classes of hand-written digits (from 0 to 9) while Fashion-MNIST has 10 classes of fashion products. In order to get deep features for each dataset, we train the LeNet-5 network [25] for the two corresponding classification problems and adopt the 400-D feature vector before the two FC layers as raw features to evaluate several feature selection methods. Besides original clean training images, we add additive zero-mean Gaussian noise with different standard deviation values to evaluate the robustness of feature selection methods against noisy data. The LeNet-5 network is re-trained for these noisy images and the corresponding deep features are extracted. For the performance benchmarking purpose, we list the test classification accuracy of the trained LeNet-5 for MNIST and Fashion-MNIST in Table 1 to illustrate the quality of the deep features.
Dataset-2: MultiFeat. This dataset contains features of hand-written digits (from 0 to 9) extracted from a collection of Dutch utility maps [14], including 649 dimensional features for 200 images per class. Different from the deep features in Dataset-1, the 649 features are extracted from six perspectives such as Fourier coefficients of character shapes and morphological features. Since the number of samples is small, we use 10-fold cross-validation and compute the mean accuracy to evaluate the classification performance.
Dataset-3: Colon. This gene expression dataset contains 62 samples with 2000 features each. It has a binary classification label; namely, the normal tissue or the cancerous tissue. There are 22 normal tissue and 40 cancer tissue samples. Considering its small sample size, we use the leave-one-out validation to get the classification predictions for each sample.
Dataset-4: ARR. This cardiac arrhythmia dataset has binary labels for 237 normal and 183 abnormal samples. Each sample contains 278 features. The 10-fold cross-validation is adopted to evaluate the classification performance.
4.2 DFT for Classification Problems
We compare the effectiveness of four feature selection methods: (1) F scores from ANOVA (ANOVA F Scores), (2) absolute correlation coefficient w.r.t the class labels (Abs. Corr. Coeff.), (3) feature importance (Feat. Imp.) from a pre-trained XGBoost classifier, and (4) DFT. We adopt four classifiers to validate the classification performance. They are the Logistic Regression (LR) classifier [12], the Support Vector Machine (SVM) classifier [11], the Random Forest (RF) classifier [3], and the XGBoost classifier [7]. We have the following two observations.
4.2.1 DFT Offers an Obvious Elbow Point
Figure 3 compares the ranked scores of four feature selection methods on Fashion-MNIST dataset. The lower DFT loss, the higher importance of a feature. The other three have a reversed relation, namely, the higher the score, the higher the importance. Thus, we search for the elbow point for DFT but the knee point for the other methods. Clearly, the feature importance curve from the pre-trained XGBoost classifier has a clearer knee point and the DFT curve has a more obvious elbow point. In contrast, ANOVA and correlation-coefficient-based methods are not as effective in selecting discriminant features since their knee points are less obvious.
4.2.2 Features Selected by DFT Achieves Comparable and Stable Classification Performance
Tables 2, 3, 4, and 5 summarize the classification accuracy using four classifiers at two reduced dimensions selected by the DFT loss curve based on early and late elbow points on Dataset-1. The RBF kernel is used for SVM. We see that DFT can achieve comparable (or even the best) performance among the four methods at the same selected feature dimension. The accuracy gap between the late elbow point and the full feature set (400-D) is very small. They are 0.62% and 0.94% using XGBoost classifier for clean MNIST and Fashion-MNIST, respectively. The late elbow point only uses 25-35% of the full feature set. The gaps in classification accuracy on noisy images are 0.58% and 1.6% for MNIST and Fashion-MNIST, respectively, indicating the robustness of the DFT feature selection method against input perturbation.
Comparison of distinct feature selection capability among four feature selection methods for classification task on the Fashion-MNIST dataset.
Comparison of distinct feature selection capability among four feature selection methods for classification task on the Fashion-MNIST dataset.
Comparison of relevant feature selection capability among four feature selection methods for regression task on the Fashion-MNIST dataset.
Comparison of relevant feature selection capability among four feature selection methods for regression task on the Fashion-MNIST dataset.
Comparison of classification performance (%) on Clean MNIST between different feature selection methods.
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 94.21 | 95.07 | 95.77 | 96.58 | |
| Early elbow point (30-D) | Corr. | 88.73 | 92.47 | 94.04 | 95.11 |
| Feat. imp. | 92.61 | 93.55 | 94.89 | 95.71 | |
| DFT (Ours) | 94.49 | 95.45 | 96.29 | 96.92 | |
| ANOVA | 98.24 | 98.22 | 97.98 | 98.66 | |
| Late elbow point (100-D) | Corr. | 97.61 | 97.78 | 97.35 | 98.57 |
| Feat. imp. | 98.24 | 98.15 | 98.18 | 98.78 | |
| DFT (Ours) | 97.93 | 97.83 | 97.81 | 98.52 | |
| Full set (400-D) | 98.89 | 98.77 | 98.61 | 99.14 |
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 94.21 | 95.07 | 95.77 | 96.58 | |
| Early elbow point (30-D) | Corr. | 88.73 | 92.47 | 94.04 | 95.11 |
| Feat. imp. | 92.61 | 93.55 | 94.89 | 95.71 | |
| DFT (Ours) | 94.49 | 95.45 | 96.29 | 96.92 | |
| ANOVA | 98.24 | 98.22 | 97.98 | 98.66 | |
| Late elbow point (100-D) | Corr. | 97.61 | 97.78 | 97.35 | 98.57 |
| Feat. imp. | 98.24 | 98.15 | 98.18 | 98.78 | |
| DFT (Ours) | 97.93 | 97.83 | 97.81 | 98.52 | |
| Full set (400-D) | 98.89 | 98.77 | 98.61 | 99.14 |
Comparison of classification performance (%) on Noisy MNIST between different feature selection methods.
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 94.22 | 94.62 | 95.60 | 96.04 | |
| Early elbow point (40-D) | Corr. | 90.97 | 93.06 | 93.64 | 95.21 |
| Feat. imp. | 92.59 | 93.35 | 94.48 | 95.34 | |
| DFT (Ours) | 94.03 | 95.22 | 95.78 | 96.59 | |
| ANOVA | 96.81 | 96.87 | 97.16 | 97.99 | |
| Late elbow point (100-D) | Corr. | 96.87 | 97.13 | 96.83 | 97.93 |
| Feat. imp. | 97.22 | 97.2 | 97.36 | 97.97 | |
| DFT (Ours) | 97.08 | 97.36 | 97.49 | 98.18 | |
| Full set (400-D) | 98.04 | 98.17 | 98.15 | 98.76 |
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 94.22 | 94.62 | 95.60 | 96.04 | |
| Early elbow point (40-D) | Corr. | 90.97 | 93.06 | 93.64 | 95.21 |
| Feat. imp. | 92.59 | 93.35 | 94.48 | 95.34 | |
| DFT (Ours) | 94.03 | 95.22 | 95.78 | 96.59 | |
| ANOVA | 96.81 | 96.87 | 97.16 | 97.99 | |
| Late elbow point (100-D) | Corr. | 96.87 | 97.13 | 96.83 | 97.93 |
| Feat. imp. | 97.22 | 97.2 | 97.36 | 97.97 | |
| DFT (Ours) | 97.08 | 97.36 | 97.49 | 98.18 | |
| Full set (400-D) | 98.04 | 98.17 | 98.15 | 98.76 |
Comparison of classification performance (%) on Clean Fashion-MNIST between different feature selection methods.
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 78.85 | 80.44 | 83.33 | 83.11 | |
| Early elbow point (30-D) | Corr. | 76.57 | 80.16 | 82.69 | 83.04 |
| Feat. imp. | 78.96 | 80.49 | 82.99 | 82.96 | |
| DFT (Ours) | 79.59 | 81.48 | 84.03 | 84.09 | |
| ANOVA | 87.06 | 86.61 | 87.69 | 89.08 | |
| Late elbow point (150-D) | Corr. | 86.99 | 86.96 | 87.36 | 88.81 |
| Feat. imp. | 87.47 | 87.62 | 88.28 | 89.33 | |
| DFT (Ours) | 87.60 | 87.02 | 87.71 | 89.13 | |
| Full set (400-D) | 89.05 | 88.18 | 88.74 | 90.07 |
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 78.85 | 80.44 | 83.33 | 83.11 | |
| Early elbow point (30-D) | Corr. | 76.57 | 80.16 | 82.69 | 83.04 |
| Feat. imp. | 78.96 | 80.49 | 82.99 | 82.96 | |
| DFT (Ours) | 79.59 | 81.48 | 84.03 | 84.09 | |
| ANOVA | 87.06 | 86.61 | 87.69 | 89.08 | |
| Late elbow point (150-D) | Corr. | 86.99 | 86.96 | 87.36 | 88.81 |
| Feat. imp. | 87.47 | 87.62 | 88.28 | 89.33 | |
| DFT (Ours) | 87.60 | 87.02 | 87.71 | 89.13 | |
| Full set (400-D) | 89.05 | 88.18 | 88.74 | 90.07 |
Table 6 summarizes the classification performance for the MultiFeat dataset on two early elbow points (10-D and 20-D) and one late elbow point (100-D). The elbow points are selected based on the sorted DFT loss curve. DFT can achieve comparable or even the best accuracy on early and late elbow points using different classifiers. The performance gap between 100 selected features and all 649 features is very small, which are 0.15% and 0.1% for LR and SVM, respectively. The classification accuracies even improve by 0.1% and 0.05% using RF and XGBoost, respectively. This shows that the proposed DFT can eliminate less discriminant features while maintaining or even improving the classification performance.
Comparison of classification performance (%) on Noisy Fashion-MNIST between different feature selection methods.
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 75.35 | 76.41 | 77.94 | 78.62 | |
| Early elbow point (40-D) | Corr. | 75.55 | 77.94 | 79.22 | 80.50 |
| Feat. imp. | 75.73 | 77.1 | 78.06 | 78.63 | |
| DFT (Ours) | 76.35 | 77.92 | 79.23 | 79.69 | |
| ANOVA | 81.84 | 81.98 | 82.42 | 84.10 | |
| Late elbow point (150-D) | Corr. | 82.26 | 82.9 | 82.59 | 84.72 |
| Feat. imp. | 83.19 | 83.43 | 83.54 | 84.91 | |
| DFT (Ours) | 82.08 | 82.40 | 82.61 | 84.31 | |
| Full set (400-D) | 84.35 | 84.24 | 84.23 | 85.91 |
| Selected dimension . | Method . | LR . | SVM . | RF . | XGBoost . |
|---|---|---|---|---|---|
| ANOVA | 75.35 | 76.41 | 77.94 | 78.62 | |
| Early elbow point (40-D) | Corr. | 75.55 | 77.94 | 79.22 | 80.50 |
| Feat. imp. | 75.73 | 77.1 | 78.06 | 78.63 | |
| DFT (Ours) | 76.35 | 77.92 | 79.23 | 79.69 | |
| ANOVA | 81.84 | 81.98 | 82.42 | 84.10 | |
| Late elbow point (150-D) | Corr. | 82.26 | 82.9 | 82.59 | 84.72 |
| Feat. imp. | 83.19 | 83.43 | 83.54 | 84.91 | |
| DFT (Ours) | 82.08 | 82.40 | 82.61 | 84.31 | |
| Full set (400-D) | 84.35 | 84.24 | 84.23 | 85.91 |
Comparison of classification performance (%) on MultiFeat between different feature selection methods.
| Classifier . | Method . | 10-D . | 20-D . | 100-D . | All features . |
|---|---|---|---|---|---|
| LR | ANOVA | 93.90 | 96.00 | 98.55 | 98.75 |
| Corr. | 84.70 | 91.65 | 98.50 | ||
| Feat. Imp. | 86.35 | 97.35 | 98.90 | ||
| DFT (Ours) | 92.80 | 96.65 | 98.60 | ||
| SVM | ANOVA | 93.90 | 96.20 | 98.55 | 98.45 |
| Corr. | 88.15 | 93.75 | 98.70 | ||
| Feat. Imp. | 89.65 | 97.60 | 98.80 | ||
| DFT (Ours) | 93.70 | 96.70 | 98.35 | ||
| RF | ANOVA | 93.75 | 96.20 | 98.90 | 98.60 |
| Corr. | 85.35 | 92.30 | 98.55 | ||
| Feat. Imp. | 87.50 | 97.15 | 99.05 | ||
| DFT (Ours) | 92.70 | 96.75 | 98.70 | ||
| XGBoost | ANOVA | 94.00 | 96.45 | 98.40 | 98.45 |
| Corr. | 86.80 | 93.00 | 98.45 | ||
| Feat. Imp. | 88.80 | 96.95 | 98.60 | ||
| DFT (Ours) | 93.55 | 96.40 | 98.50 |
| Classifier . | Method . | 10-D . | 20-D . | 100-D . | All features . |
|---|---|---|---|---|---|
| LR | ANOVA | 93.90 | 96.00 | 98.55 | 98.75 |
| Corr. | 84.70 | 91.65 | 98.50 | ||
| Feat. Imp. | 86.35 | 97.35 | 98.90 | ||
| DFT (Ours) | 92.80 | 96.65 | 98.60 | ||
| SVM | ANOVA | 93.90 | 96.20 | 98.55 | 98.45 |
| Corr. | 88.15 | 93.75 | 98.70 | ||
| Feat. Imp. | 89.65 | 97.60 | 98.80 | ||
| DFT (Ours) | 93.70 | 96.70 | 98.35 | ||
| RF | ANOVA | 93.75 | 96.20 | 98.90 | 98.60 |
| Corr. | 85.35 | 92.30 | 98.55 | ||
| Feat. Imp. | 87.50 | 97.15 | 99.05 | ||
| DFT (Ours) | 92.70 | 96.75 | 98.70 | ||
| XGBoost | ANOVA | 94.00 | 96.45 | 98.40 | 98.45 |
| Corr. | 86.80 | 93.00 | 98.45 | ||
| Feat. Imp. | 88.80 | 96.95 | 98.60 | ||
| DFT (Ours) | 93.55 | 96.40 | 98.50 |
We show the classification performance on the Colon dataset using LR and SVM in Table 7, where the linear kernel is used in SVM. DFT has the minimum or a comparable number of errors in leave-one-out validation. Furthermore, DFT can always achieve fewer errors as compared to the setting of using all 2000 features.
Comparison of number of errors on Colon cancer dataset between different feature selection methods.
| Classifier . | Method . | Selected dimension . | Full set 2000-D . | |||||
|---|---|---|---|---|---|---|---|---|
| 5 . | 10 . | 20 . | 50 . | 80 . | 100 . | |||
| LR | ANOVA | 6 | 9 | 10 | 10 | 7 | 7 | 10 |
| Corr. | 6 | 9 | 10 | 10 | 7 | 7 | ||
| Feat. Imp. | 8 | 6 | 5 | 5 | 9 | 9 | ||
| DFT (Ours) | 6 | 7 | 9 | 8 | 7 | 7 | ||
| SVM | ANOVA | 6 | 7 | 7 | 9 | 8 | 9 | 9 |
| Corr. | 6 | 7 | 7 | 9 | 8 | 9 | ||
| Feat. Imp. | 6 | 6 | 5 | 6 | 9 | 8 | ||
| DFT (Ours) | 6 | 6 | 8 | 8 | 6 | 6 | ||
| Classifier . | Method . | Selected dimension . | Full set 2000-D . | |||||
|---|---|---|---|---|---|---|---|---|
| 5 . | 10 . | 20 . | 50 . | 80 . | 100 . | |||
| LR | ANOVA | 6 | 9 | 10 | 10 | 7 | 7 | 10 |
| Corr. | 6 | 9 | 10 | 10 | 7 | 7 | ||
| Feat. Imp. | 8 | 6 | 5 | 5 | 9 | 9 | ||
| DFT (Ours) | 6 | 7 | 9 | 8 | 7 | 7 | ||
| SVM | ANOVA | 6 | 7 | 7 | 9 | 8 | 9 | 9 |
| Corr. | 6 | 7 | 7 | 9 | 8 | 9 | ||
| Feat. Imp. | 6 | 6 | 5 | 6 | 9 | 8 | ||
| DFT (Ours) | 6 | 6 | 8 | 8 | 6 | 6 | ||
4.2.3 Comparison between DFT and mRMR
The minimal-redundancy-maximal-relevance (mRMR) [13, 31] aims at finding a feature set with high relevance to the class while keeping the selected features with small redundancy, leading to an efficient but effective subset of features. It combines constraints measured by the mutual information of both relevance to the class and redundancy between selected features and treats it as an optimization problem. In this experiment, we compare DFT with mRMR using its incremental selection scheme.
Figures 5, 6 and 7 compare the performance of mRMR and DFT on MultiFeat, Colon and ARR datasets with SVM and XGBoost classifiers. For MultiFeat and Colon, DFT can achieve very competitive performance with mRMR. Specifically, the most discriminant feature of MultiFeat selected by DFT is identical to the first feature selected by mRMR. The error rate with the top 5 features selected by mRMR is smaller than that of DFT. Yet, the performance gap is substantially narrowed after selecting more than 10 features out of the total 649 features. Overall, the error rate of DFT and mRMR converges at similar reduced dimensions, as shown in Figures 5 and 7. On the other hand, the error rate of DFT on the ARR dataset is much lower than that of mRMR, with around 2.5% and 5% gap at 100 selected features for SVM and XGBoost, respectively, as shown in Figure 7.
Error rate comparison on the MultiFeat dataset between mRMR and DFT.
Comparison of the number of errors on the Colon dataset between mRMR and DFT.
4.2.4 DFT Requires Less Running Time
We compare time efficiency of DFT, ANOVA, mRMR and feature importance from the XGBoost model. Table 8 summarizes the running time for MultiFeat, Colon and ARR datasets. All methods are run on the same CPU. The pretrained XGBoost classifier uses the maximum depth of one with 300 trees. For filter methods such as ANOVA and DFT, the time is evaluated on all features without parallel computing. For mRMR, we set the maximum to 100 for incremental selection, which is smaller than the full feature set. ANOVA is the fastest and DFT method ranks the second on all three datasets. The running time of DFT with B = 16 is about ×2.6, ×7.3 and ×23.1 times faster than mRMR on MultiFeat, Colon and ARR, respectively. To further reduce the running time, our proposed DFT can be easily improved by adopting parallel computing since it processes each feature independently before the feature ranking.
Running time (sec.) comparison of different feature selection methods.
| . | ANOVA . | Feat. Imp. . | mRMR . | DFT (B = 8) . | DFT (B = 16) . |
|---|---|---|---|---|---|
| MultiFeat | 0.011 | 363.39 | 15.19 | 2.74 | 5.78 |
| Colon | 0.003 | 58.99 | 23.15 | 1.55 | 3.19 |
| ARR | 0.002 | 55.34 | 10.64 | 0.23 | 0.46 |
| . | ANOVA . | Feat. Imp. . | mRMR . | DFT (B = 8) . | DFT (B = 16) . |
|---|---|---|---|---|---|
| MultiFeat | 0.011 | 363.39 | 15.19 | 2.74 | 5.78 |
| Colon | 0.003 | 58.99 | 23.15 | 1.55 | 3.19 |
| ARR | 0.002 | 55.34 | 10.64 | 0.23 | 0.46 |
4.2.5 DFT with Feature Pre-processing
DFT assigns a score to each feature and selects a subset without any preprocessing. Yet, there might be correlation between features so that a redundant feature subset might be selected based on feature ranking [6]. Instead of adding redundancy measure to the DFT loss, we study the effect of combining DFT with feature pre-processing, such as PCA for feature decorrelation. We choose clean MNIST and Fashion-MNIST datasets as examples and perform PCA on the 400-D deep features without energy truncation. The DFT loss is calculated for each of 400 PCA-decorrelated features. After feature selection, the XGBoost classifier is applied. Figure 8 compares the test accuracy under different selected dimensions for each setting. We see that PCA pre-processing improves the classification performance with the same selected dimension.
Furthermore, PCA pre-processing allows a smaller feature dimension for the same performance. For example, the accuracy on Fashion-MNIST saturates at around 15-D and 30-D with and without pre-processing, respectively. This can be explained by the energy compaction capability of PCA. Figure 9 shows the histogram of energy ranking of the feature subset selected by DFT with and without PCA preprocessing. The raw features are first sorted by decreasing energy (variance) prior to feature selection. We see that the selected subset tends to gather in the first 20 to 50 principal components with PCA pre-preprocessing while the selected features are more widely distributed without PCA.
Performance comparison of DFT feature selection with and without PCA feature pre-processing.
Performance comparison of DFT feature selection with and without PCA feature pre-processing.
Histogram comparison of feature indices ranked by the energy with 20 and 40 selected feature numbers before and after PCA pre-processing. The smaller the ranking index in the x-axis, the higher the feature energy.
Histogram comparison of feature indices ranked by the energy with 20 and 40 selected feature numbers before and after PCA pre-processing. The smaller the ranking index in the x-axis, the higher the feature energy.
4.3 RFT for Regression Problems
We convert the discrete class labels arising from the classification problem to floating numbers so as to formulate a regression problem. We compare effectiveness of four feature selection methods: (1) variance (Var.), (2) absolute correlation coefficient w.r.t the regression target (Abs. Corr. Coeff.), (3) feature importance (Feat. Imp.) from a pre-trained XGBoost regressor (of 50 trees), and (4) RFT. Again, we can draw two conclusions.
4.3.1 RFT Offers a More Obvious Elbow Point
Figure 4 compares the ranked scores for different feature selection methods. The lower RFT loss, the higher feature importance while the other three have a reversed relation. RFT has a more obvious elbow point than the knee points of Variance and correlation-based methods. The feature importance from the pre-trained XGBoost regressor saturates very fast (up to 24-D) and the difference among the remaining features is small. In contrast, RFT has a more distinct and reasonable elbow point, ensuring the performance after dimension reduction. A larger XGBoost model with more trees can help increase the feature number of higher importance. Yet, it is not clear what model size would be suitable for a particular regression problem.
4.3.2 Features Selected by RFT Achieve Comparable and Stable Performance
Tables 9 and 10 summarize the regression MSE at two reduced dimensions selected by the RFT loss curves using early and late elbow points. The proposed RFT can achieve comparable (or even the best) performance among the four methods at the same selected feature dimension regardless of whether the input images are clean or noisy. By employing only 25−37.5% of the total feature dimensions, the regression MSEs obtained by the late elbow point of RFT are 20−30% and 5−10% higher than those of the full feature set for MNIST and Fashion MNIST, respectively. This demonstrates the effectiveness of the RFT feature selection method.
Regression MSE comparison for MNIST (clean/noisy) images with features selected by four methods.
| Method . | Early elbow point 30-D/50-D . | Late elbow point 100-D/100-D . | Full set 400-D . |
|---|---|---|---|
| Var. | 1.45/1.23 | 0.90/0.99 | 0.70/0.83 |
| Abs. Corr. Coeff. | 1.43/1.37 | 0.90/1.06 | |
| Feat. Imp. | 1.55/1.47 | 1.04/1.23 | |
| RFT (Ours) | 1.37/1.36 | 0.91/1.04 |
| Method . | Early elbow point 30-D/50-D . | Late elbow point 100-D/100-D . | Full set 400-D . |
|---|---|---|---|
| Var. | 1.45/1.23 | 0.90/0.99 | 0.70/0.83 |
| Abs. Corr. Coeff. | 1.43/1.37 | 0.90/1.06 | |
| Feat. Imp. | 1.55/1.47 | 1.04/1.23 | |
| RFT (Ours) | 1.37/1.36 | 0.91/1.04 |
Regression MSE comparison for Fashion-MNIST (clean/noisy) images with features selected by four methods.
| Method . | Early elbow point 30-D/50-D . | Late elbow point 150-D/150-D . | Full set 400-D . |
|---|---|---|---|
| Var. | 2.08/1.98 | 1.46/1.73 | 1.35/1.62 |
| Abs. Corr. Coeff. | 1.95/1.96 | 1.49/1.75 | |
| Feat. Imp. | 2.00/2.06 | 1.62/1.86 | |
| RFT (Ours) | 1.97/1.96 | 1.48/1.73 |
| Method . | Early elbow point 30-D/50-D . | Late elbow point 150-D/150-D . | Full set 400-D . |
|---|---|---|---|
| Var. | 2.08/1.98 | 1.46/1.73 | 1.35/1.62 |
| Abs. Corr. Coeff. | 1.95/1.96 | 1.49/1.75 | |
| Feat. Imp. | 2.00/2.06 | 1.62/1.86 | |
| RFT (Ours) | 1.97/1.96 | 1.48/1.73 |
5 Conclusion and Future Work
Two feature selection methods, DFT and RFT, were proposed for general classification and regression tasks in this work. As compared with other existing feature selection methods, DFT and RFT are effective in finding distinct feature subspaces by offering obvious elbow regions in DFT/RFT curves. They provide feature subspaces of significantly lower dimensions while maintaining near optimal classification/regression performance. They are computationally efficient. They are also robust to noisy input data.
Recently, there is an emerging research direction that targets unsupervised representation learning [8−10, 27, 28, 41, 42]. Through this process, it is easy to get high dimensional feature spaces (say, 1000-D or higher). We plan to apply DFT/RFT to them and find discriminant/relevant feature subspaces for specific tasks.
The authors acknowledge the Center for Advanced Research Computing (CARC) at the University of Southern California for providing computing resources that have contributed to the research results reported within this publication.
Biographies
Yijing Yang received her Bachelor’s degree from Tianjin University, China, in June 2016, and Master’s degree in Electrical Engineering from the University of Southern California (USC), Los Angeles, USA, in 2018. Currently, she is pursuing the Ph.D. degree in Media Communications Lab at USC, supervised by Prof. C.-C. Jay Kuo. Her research interests include image processing, computer vision and medical image analysis.
Wei Wang received her Bachelor’s in Applied Physics from Northeastern University (CN), and her MS degree in Materials Engineering from the University of Southern California in 2014 and 2016, respectively. She is currently a Ph.D. student in Electrical and Computer Engineering in Multimedia Communication Lab, advised by Prof. C.-C. Jay Kuo. Her research interests include image processing and machine learning.
Hongyu Fu received his Bachelor’s degree in Electrical Engineering from Peking University, Beijing, China in July 2017. As a Ph.D. student, he joined Media Communication Lab at University of Southern California, supervised by Prof. C.-C. Jay Kuo. His research interests include image processing, computer vision, and machine learning.
C.-C. Jay Kuo (F’99) received the B.S. degree in Electrical Engineering from the National Taiwan University, Taipei, Taiwan, in 1980, and the M.S. and Ph.D. degrees in Electrical Engineering from the Massachusetts Institute of Technology, Cambridge, in 1985 and 1987, respectively. He is currently the holder of William M. Hogue Professorship, the Director of the Multimedia Communications Laboratory and a Distinguished Professor of electrical engineering and computer science at the University of Southern California, Los Angeles. His research interests include digital image/video analysis and modeling, multimedia data compression, communication and networking, and biological signal/image processing. He is a co-author of 320 journal papers, 980 conference papers, 30 patents, and 15 books. Dr. Kuo is a Fellow of the American Association for the Advancement of Science (AAAS), the Institute of Electrical and Electronics Engineers (IEEE), the National Academy of Inventors (NAI) and the International Society for Optical Engineers (SPIE).










