Demosaicing is crucial in digital imaging, reconstructing full-color images from partial red, green and blue sensor data with two-thirds of the pixel information missing. Although deep learning has improved performance, its large models and high computational demands are unsuitable for resource-limited edge devices. This work presents a green U-shaped image demosaicing (GUSID) method, a novel approach grounded in green learning (GL) principles. GUSID addresses the limitations of traditional methods by offering a lightweight, transparent and efficient solution. Unlike neural network-based methods, GUSID completely avoids deep learning. Instead, GUSID uses unsupervised representation learning for effective feature extraction and supervised feature learning to enhance computational efficiency and ensure high-quality performance. GUSID’s compact design minimizes computational overhead while maintaining competitive accuracy. Its support for parallelized training also enables fast execution, making it an ideal choice for real-time vision applications on resource-constrained devices.
1. Introduction
In digital imaging, each pixel on an image sensor is usually designed to capture only one color channel: red, green or blue. Digital cameras use a color filter array (CFA), which places microscopic filters over pixels to sample different colors to capture red, green and blue (RGB) images. The most common CFA is the Bayer filter. The Bayer pattern reflects the human eye’s greater sensitivity to green light by assigning two green pixels for every red and blue pixel in each 2 × 2 pixel block. This arrangement results in only partial color information per pixel, with each pixel recording data for just one color.
This inherent limitation necessitates sophisticated processing algorithms to reconstruct a full-color image that accurately reflects the scene.
A process called demosaicing is used to create a complete color image. Demosaicing algorithms estimate the missing color values by interpolating from neighboring pixels. Early approaches relied on basic interpolation techniques, such as bilinear and bicubic methods. Over time, more sophisticated techniques were developed, incorporating heuristics and leveraging image statistics to improve accuracy. Some advanced algorithms further separate chrominance (color) from luminance (brightness), allowing for better edge preservation and reduced color bleeding, enhancing the overall quality of the final image (Adams, 1995).
Nonadaptive interpolation techniques generally perform well in smooth regions of an image, where the intensity of the pixels changes gradually. In these cases, the absence of sharp transitions allows these more straightforward methods to produce visually acceptable results without introducing significant artifacts. However, these algorithms often fail when applied to more complex areas of an image, such as those with high-frequency textures, intricate patterns or sharp edges. Their inability to adapt to the varying structures in such regions leads to unwanted effects such as blurring, aliasing or distortion of edges, ultimately compromising image quality. Adaptive directional interpolation methods enhance image reconstruction by aligning the interpolation with dominant edges. Using gradients (Laroche, 1994), Laplacian operators (Kiku et al., 2014) or edge detection mechanisms (Chung et al., 2008), adaptive methods identify edge orientation and strength, ensuring that interpolation follows natural image contours. These edge-aware approaches improve image quality, particularly in areas with sharp transitions or strong textures.
Nevertheless, the effectiveness of interpolation-based techniques often diminishes under real-world imaging conditions, such as sensor noise or motion blur, which introduce additional complexity into the reconstruction process.
Deep learning techniques are increasingly prominent in image demosaicing, setting new performance benchmarks. By using architectures like convolutional neural networks (CNNs) and other advanced deep learning frameworks, these methods have achieved remarkable accuracy in reconstructing high-quality images from raw sensor data. However, their complexity often demands significant computational power, making it challenging to implement them on edge devices like smartphones, which have limited processing capabilities, memory and energy resources compared to cloud-based systems with powerful GPUs or TPUs.
These large models require significant computational resources, leading to slow performance and increased latency in real-time applications like video streaming or autonomous systems on edge devices. In addition, running them on mobile phones quickly drains battery life, a critical concern for portable devices with limited power. Balancing performance with energy efficiency is essential for the broader adoption of deep learning-based demosaicing on edge platforms. Another challenge is the substantial size of the models. Often over-parameterized to improve performance, they demand significant storage, which is problematic for edge devices with limited capacity. Reducing model complexity can degrade performance, while offloading tasks to the cloud introduces delays and higher energy consumption due to data transfer.
Therefore, there is a growing need for demosaicing methods that are both computationally efficient and effective in terms of image quality, especially under constrained hardware environments.
A new approach, termed green learning (GL) (Kuo and Madni, 2023), has been introduced to address some of the limitations seen in deep learning over recent years. GL functions without relying on neural networks, distinguishing it from conventional deep learning approaches. It emphasizes simplicity, interpretability and energy efficiency, making it especially well-suited for edge and mobile devices where traditional deep models may not be practical.
This paper introduces an innovative demosaicing technique called the Green U-shaped Image Demosaicing (GUSID) method. GUSID leverages the strengths of GL and incorporates a novel U-shaped architecture tailored for efficient and accurate reconstruction. Our approach bridges the gap between traditional methods and deep models by offering a lightweight yet competitive alternative.
The rest of the paper is organized as follows. Section 2 reviews related research on image demosaicing and summarizes GL fundamentals. The proposed GUSID method is detailed in Section 3. The experimental results are discussed in Section 4. Finally, Section 5 concludes the paper.
2. Related work
2.1 Demosaicing
Demosaicing remains a critical and well-explored topic in image processing, essential for reconstructing full-color images from raw data captured by single-sensor cameras. Traditional methods commonly apply a uniform filtering approach across all pixels, effectively handling flat or low-texture areas where interpolation is less complex. However, despite their simplicity, these methods often struggle with visual artifacts such as Moiré patterns, color artifacts and over-smoothing, even in regions where they typically perform best. To mitigate these issues, gradient-based techniques that estimate the edge directions to refine interpolation have been introduced. By adapting filtering along the detected edges, these methods aim to preserve structural details and reduce artifacts, improving the visual quality in more complex image regions. These methods determine line or diagonal edge patterns using low-cost edge-sensing methods. Demosaicing can also be approached as an optimization problem, where image priors are incorporated and examined within optimization frameworks. Wu et al. (2016) use regression-based priors to refine demosaiced images, using spatial-spectral correlations to improve image quality. Another study uses the concept of prior based on mean shift vectors (Arjomand Bigdeli et al., 2017).
Recent techniques use CNNs for demosaicing tasks. Various studies, including Henz et al., (2018); Kokkinos and Lefkimmiatis, (2019); Tan et al., (2018); Gharbi et al., (2016); Liu et al., (2020); Ratnasingam, (2019); Lee et al., (2023); Hou et al., (2024), demonstrate the effectiveness of CNNs in enhancing demosaicing performance. Another approach explored in the literature is using multiple subnets, where several interconnected subnets address different aspects of the demosaicing process (Tan et al., 2017; Yamaguchi and Ikehara, 2019). Fine-to-coarse representation learning is also applied in computer vision tasks and has shown to be beneficial for progressively capturing both local detail and global structure in the reconstruction process (Deng et al., 2020; Ignatov et al., 2020; Jing et al., 2019). Although deep learning methods achieve state-of-the-art performance, they are computationally demanding and require large model sizes. Additionally, most existing studies do not treat demosaicing as a standalone issue; instead, they address it jointly with denoising in a combined approach.
2.2 Green learning
GL (Kuo and Madni, 2023) marks a transformative shift from conventional deep learning paradigms widely used in AI today. This innovative approach emphasizes computational efficiency, aiming to meet the increasing demands of resources posed by traditional AI models. One of the distinguishing aspects of GL is its departure from backpropagation, a core component of neural network training. Instead, it uses unsupervised feature extraction methods such as the Saab Transform (Li et al., 2020) or the channel-wise Saab Transform (Chen and Kuo, 2019). GL effectively extracts meaningful features using these techniques while avoiding the computational intensity associated with backpropagation.
To improve its overall performance, GL integrates advanced feature selection strategies, including the discriminant feature test and the relevant feature test (RFT) (Yang et al., 2022). These methods play a critical role in pinpointing the most valuable features for model training, streamlining the learning process and improving the accuracy of the resulting models. Once features are selected, GL uses advanced training algorithms such as XGBoost (Chen and Guestrin, 2016) and subspace learning machines (Fu et al., 2024). These algorithms are designed to maximize performance, allowing GL to efficiently handle diverse data sets and tasks, demonstrating its adaptability and robustness.
A significant benefit of GL lies in its efficiency, resulting from its avoidance of backpropagation and end-to-end training paradigms. A GL-based image demosaicing (GID) method is studied in Movahhedrad et al. (2024). Its lower computational overhead makes it more scalable and better suited to applications across various domains. This work presents a GL-based demosaicing method called GUSID. GUSID is entirely different from GID. The U-shaped learning is highly effective for demosaicing and other computer vision tasks because it combines multiscale feature extraction in the encoder with precise spatial detail recovery in the decoder, aided by skip connections. This design preserves fine details while integrating global context, making it ideal for pixel-level tasks where both high-level understanding and fine localization are critical. We will compare the performance of GID and GUSID in the experimental section.
3. Proposed GUSID method
In this section, we present the GUSID method. However, the U-shaped structure resembles the U-Net architecture in neural networks. Their operational principles are different. GUSID does not rely on backpropagation during training. Instead, it uses boost trees in the XGBoost regressor to lower the cost iteratively. All parameters in each GUSID processing unit are determined using a feedforward approach.
The block diagram of GUSID is depicted in Figure 1. It consists of four processing modules:
The architecture contains four modules arranged from left to right. Module 1 shows three input feature volumes labelled 32 by 32 by 3, 16 by 16 by K 2, and 4 by 4 by K 4. These correspond to levels L 1, L 2, and L 4. Module 2 contains R F T blocks for each level. Module 3 contains L N T blocks for each level. Module 4 contains a regressor for each level that outputs y with tilde subscript 1, y with tilde subscript 2, and y subscript n. Lower outputs pass through resize blocks. Circular nodes combine the resized outputs to produce the final output y. On the right, patch down sampling is indicated with progressively smaller image patches.An overview of the proposed pipeline in four modules. Module 1 involves deriving feature representations with fine-to-coarse Saab transformation. Module 2 selects relevant features. Module 3 generates complementary discriminative features. Module 4 makes decisions, where one XGBoost regressor in each level estimates the residuals based on the initial predictions of the coarse (lower) level
The architecture contains four modules arranged from left to right. Module 1 shows three input feature volumes labelled 32 by 32 by 3, 16 by 16 by K 2, and 4 by 4 by K 4. These correspond to levels L 1, L 2, and L 4. Module 2 contains R F T blocks for each level. Module 3 contains L N T blocks for each level. Module 4 contains a regressor for each level that outputs y with tilde subscript 1, y with tilde subscript 2, and y subscript n. Lower outputs pass through resize blocks. Circular nodes combine the resized outputs to produce the final output y. On the right, patch down sampling is indicated with progressively smaller image patches.An overview of the proposed pipeline in four modules. Module 1 involves deriving feature representations with fine-to-coarse Saab transformation. Module 2 selects relevant features. Module 3 generates complementary discriminative features. Module 4 makes decisions, where one XGBoost regressor in each level estimates the residuals based on the initial predictions of the coarse (lower) level
Module 1: Representation Learning. It derives the joint spatial-spectral representations from raw data.
Module 2: Relevant Feature Selection. It identifies a set of discriminative features from Module 1.
Module 3: New Feature Generation. It generates more powerful features by linearly combining discriminant features selected in Module 2.
Module 4: Decision Learning. It uses several XGBoost regressors to predict pixel values. Each level of the coarse-to-fine U-shaped pipeline incorporates these modules.
3.1 Representation learning
We transform each raw image patch into a joint spatial-spectral feature space in the representation learning module. We use a series of multi-hop Saab transforms that gradually extract and refine latent representations to accomplish this. Each hop of the Saab transform uncovers increasingly abstract feature representations by leveraging the input patches’ spatial and spectral characteristics.
In the first hop, consider a target pixel and its surrounding neighborhood, which form an input tensor . We apply a Saab filter of size , with , to this neighborhood. The Saab transform is closely related to the principal component analysis (PCA) since it projects on an orthogonal basis. Specifically, the projection is defined by a set of principal components and the corresponding bias terms. Mathematically, each projection at this hop is expressed as follows:
Here, represents the principal component associated with the highest eigenvalue, capturing the global energy of . By projecting onto , we extract a low-frequency (DC) component of the input. In contrast, the remaining principal components focus on capturing higher-frequency (AC) variations. Thus, the feature space can be decomposed into two orthogonal subspaces:
The DC subspace, , corresponds to the direction , while represents the subspace spanned by the remaining principal components. By isolating the DC projection , we remove the dominant energy component from . The residual AC component, , is then defined as:
To better understand and refine , an additional PCA step is applied to determine its eigenvectors and associated bias terms, thus ensuring an efficient representation that captures subtle variations in the data.
At each hop i, we start with an input neighborhood of size . The Saab filtering operation of size produces an output tensor , where . Since successive hops build on the previous one, the dimensionality of the features tends to grow. To manage the complexity, we apply a max-pooling operation after each hop to reduce spatial resolution before passing the features to the next hop. It ensures that the representation remains computationally tractable and informative.
In practice, we use channel-wise Saab transforms with different kernel sizes (, and ) to adaptively capture spatial patterns at multiple scales. The spatial resolutions of the input and output after each hop are as follows:
Hop 1: , .
Hop 2: , .
Hop 3: , .
Hop 4: , .
Through this progressive, multi-hop transformation process, we obtain a rich and compact latent representation that captures coarse global information and intricate local details, thereby facilitating more accurate and robust downstream tasks.
These raw features capture joint spatial-spectral representations of each input patch. However, the successive application of the Saab transform causes a rapid increase in the dimensionality of the features at each hop. This high dimensionality can lead to issues such as increased computational complexity, overfitting due to redundant or less informative features and difficulties in model interpretability. To mitigate these challenges, incorporating a feature selection module would be advantageous to manage the feature size effectively, reduce redundancy and retain the most discriminative and informative representations, ultimately improving the efficiency and generalization of the model.
3.2 Relevant feature selection
After performing the three-hop representation learning step, we obtain three sets of high-dimensional feature representations, one for each hop. Since these feature spaces are often large, it is essential to identify and retain only the most informative feature dimensions for the predictive modeling task at hand. To achieve this, we use a supervised feature selection technique analogous to the one introduced in CITE Yang, designed to filter out less relevant features efficiently. The RFT approach is adapted explicitly for regression problems. Iteratively evaluating the influence of features on the prediction error enables the extraction of the most critical dimensions from each hop’s representation space.
The RFT procedure proceeds as follows. For every feature dimension, j, we divide the training set into two subsets, and , based on a candidate threshold . Here, and consist of the samples that lie to the left and right of the threshold , respectively. For each threshold, we fit regression models to both subsets and compute the corresponding mean squared errors (MSE), denoted by and . The RFT loss at threshold t is then defined as a weighted average of these MSEs:
where N is the total number of samples and and are the sizes of the left and right subsets, respectively.
Next, the method identifies the optimal threshold for each feature dimension by selecting the threshold that minimizes the RFT loss:
where T is the set of all candidate thresholds. The resulting value represents the minimum achievable loss for the feature j, reflecting how effectively that feature can partition the data into two more predictable subsets. Thus, smaller values of correspond to features with higher predictive relevance.
Once all minimal losses are computed, we rank the features from most to least relevant by sorting the values. The top K features, characterized by the lowest scores, form the final set of selected features. This supervised selection process offers a principled and data-driven way to extract the most critical features, ensuring that the chosen dimensions can contribute robustly to regression performance.
The RFT loss curves guide the selection of key features in each hop. As illustrated in Figure 2, the RFT losses of the sorted feature indices highlight which dimensions are more critical. Features with lower RFT losses are considered more valuable. It can be observed that the first hop generally provides more stable and effective features than the later hops. Notably, the RFT-based approach is computationally efficient and requires only a modest number of training samples. A validation set is used to avoid overfitting, ensuring that the selected features perform well not only on the training data but also on previously unseen samples.
The line graph plots mean squared error on the vertical axis against rank on the horizontal axis. The rank ranges from 0 to about 650. The mean squared error ranges from about 24 to 50. Three curves are labelled Hop 1, Hop 2, and Hop 3. Hop 1 starts at about 24 at rank 0 and increases gradually to about 49 near rank 650. Hop 2 starts at about 26 at rank 0 and increases more quickly, approaching 50 by about rank 300 and remaining near 50 afterwards. Hop 3 starts at about 29 at rank 0 and rises sharply, reaching nearly 50 by about rank 100 and staying close to 50 for higher ranks.Loss of ranked features across hops
The line graph plots mean squared error on the vertical axis against rank on the horizontal axis. The rank ranges from 0 to about 650. The mean squared error ranges from about 24 to 50. Three curves are labelled Hop 1, Hop 2, and Hop 3. Hop 1 starts at about 24 at rank 0 and increases gradually to about 49 near rank 650. Hop 2 starts at about 26 at rank 0 and increases more quickly, approaching 50 by about rank 300 and remaining near 50 afterwards. Hop 3 starts at about 29 at rank 0 and rises sharply, reaching nearly 50 by about rank 100 and staying close to 50 for higher ranks.Loss of ranked features across hops
By establishing thresholds based on RFT loss curves at each hop, we define a search window that highlights the strongest feature dimensions in both the test and validation sets. Figure 3 presents the ranking of each feature in the training and validation sets, where a lower rank corresponds to higher energy and greater significance. The features with lower ranks in both sets represent the dimensions that provide the most discriminative power. Finally, we construct a cohesive and informative feature space that enhances the overall regression performance by integrating the selected features across all hops. This integrated set leverages the strengths of each hop’s representations, leading to improved generalization and prediction accuracy.
The scatter plot shows train feature ranking on the horizontal axis and validation feature ranking on the vertical axis. Both axes range from 0 to about 520. Three groups are labelled Strong, Moderate, and Weak. The Strong group contains points mainly between 0 and about 180 on both axes. These points form a dense cluster near the lower left. The Moderate group contains points mainly between about 150 and 350 on both axes. These points form a central cluster. The Weak group contains points mainly between about 250 and 520 on both axes. These points form a cluster toward the upper right.Feature ranking for training and validation sets. The intersecting features with low rank are considered strong features
The scatter plot shows train feature ranking on the horizontal axis and validation feature ranking on the vertical axis. Both axes range from 0 to about 520. Three groups are labelled Strong, Moderate, and Weak. The Strong group contains points mainly between 0 and about 180 on both axes. These points form a dense cluster near the lower left. The Moderate group contains points mainly between about 150 and 350 on both axes. These points form a central cluster. The Weak group contains points mainly between about 250 and 520 on both axes. These points form a cluster toward the upper right.Feature ranking for training and validation sets. The intersecting features with low rank are considered strong features
3.3 New feature generation
Although we have initially extracted raw features from image patches and identified those most discriminative of the underlying target structure, augmenting the feature space with additional complementary features can be highly beneficial. By introducing such supplementary representations, we enhance the overall richness of the feature pool and potentially improve the robustness and accuracy of subsequent classification tasks. We leverage the least squares normal transform (LNT) approach to derive these complementary features. This method reframes the original multi-class classification objective as a linear regression problem, enabling the computation of discriminative transformations that complement the selected features extracted in earlier stages.
Consider a data set of training samples corresponding to classes. We define an indicator matrix , where each element is tied to a superclass label configuration m and . The goal is to find a weight matrix that linearly maps the input feature space to the target space . Formally, we seek solutions to the linear least-squares regression problem:
where is a rank-one bias matrix that shifts the features without affecting their fundamental discriminative properties. By taking expectations, we isolate the bias term:
Because only changes the feature mean, it does not influence the selection of discriminative dimensions. Thus, determining is the primary challenge. From the normal equation of least squares, we have:
Once is calculated, any given input feature vector can be transformed into an of LNT-generated features:
These newly generated complementary LNT features act as an additional source of discriminative information, enriching the feature space with abstractions of higher levels. To maximize the utility of these LNT features, it is crucial to identify and isolate subsets of features that have the most significant potential to improve classification performance. In this study, we achieve this by partitioning the features into n distinct subsets and following the paths of the XGBoost decision trees within each subgroup to identify the most discriminative configurations.
We define the features generated directly from the raw input data as Level 1 features. These features capture complementary information derived from the original feature set, enhancing the discriminative power of the model. Building on this foundation, we introduce the Level 2 features, generated not directly from the raw features but from the complementary Level 1 feature set. This hierarchical generation process allows Level 2 features to capture higher-order interactions and relationships between features, effectively extending the depth and complexity of the feature space.
To validate the effectiveness of this approach, we compare the loss values associated with raw features, Level 1 features and Level 2 features. Figure 4 shows that the complementary features (Level 1 and Level 2) significantly outperform the raw feature. This improvement highlights the strength of the hierarchical feature generation process in improving the predictive capacity of the model.
The line graph plots loss on the vertical axis against ranked features on the horizontal axis. Ranked features range from 0 to about 520. Loss ranges from about 85 to 131. Three curves are labelled L N T features level 1, L N T features level 2, and R F T features. L N T features level 1 start near 86 at rank 0, rise gradually to about 91 by around rank 100, then increase sharply near rank 180 to about 130. L N T features level 2 start near 85 at rank 0, rise to about 90 by around rank 110, then increase sharply near rank 170 to about 130. R F T features start near 100 at rank 0, increase rapidly to about 129 by around rank 40, and remain close to 131 for higher ranks.Loss versus ranked features for raw and complementary LNT features
The line graph plots loss on the vertical axis against ranked features on the horizontal axis. Ranked features range from 0 to about 520. Loss ranges from about 85 to 131. Three curves are labelled L N T features level 1, L N T features level 2, and R F T features. L N T features level 1 start near 86 at rank 0, rise gradually to about 91 by around rank 100, then increase sharply near rank 180 to about 130. L N T features level 2 start near 85 at rank 0, rise to about 90 by around rank 110, then increase sharply near rank 170 to about 130. R F T features start near 100 at rank 0, increase rapidly to about 129 by around rank 40, and remain close to 131 for higher ranks.Loss versus ranked features for raw and complementary LNT features
Furthermore, complementary features generated through the LNT process are integrated with previously selected features from the RFT-based selection process. This integration creates a hybrid feature space that combines the strengths of raw feature selection with the nuanced and discriminative properties of the LNT-generated features. This enriched feature representation is subsequently fed into the decision-learning stage, input for advanced regressors. By combining complementary LNT features with RFT-selected features, this hybrid approach significantly enhances the model’s ability to capture complex patterns, thereby boosting its predictive accuracy and overall performance.
3.4 Decision learning
The Decision Learning module uses an XGBoost regressor for each level of processing. The input to this module is a feature set comprising both the selected raw features and the complementary generated features. The training output for the coarsest level is a down-sampled patch. Specifically, in this work, the coarsest level predicts the color values for a 2 × 2 patch, which is downsampled from the 16 × 16 ground truth patch from the finest level.
The coarsest level predicts the color values for the 2 × 2 patch, which are then up-sampled. Following a U-shaped architecture, the residuals between these up-sampled values and the ground-truth values at the next finer level are predicted iteratively in subsequent levels. This hierarchical approach ensures the precise refinement of predictions across levels.
We use residual learning at all levels except the coarsest level to maintain depth without performance drop. By focusing on predicting residuals at finer levels, we mitigate the risk of error propagation while ensuring consistent refinement to improve accuracy without introducing additional complexity.
We used XGBoost as it is highly efficient in handling structured data and offers robust performance with minimal tuning. Its ability to naturally integrate raw and generated features into a strong regression model aligns well with the iterative refinement process in our hierarchical framework, ensuring accuracy and computational efficiency across all levels.
When we get to the top levels of our pipeline, a significant challenge arises due to the accumulation of a large number of data points with low residuals, leading to a pronounced imbalance in the data distribution. To address this issue, we used a multi-round regression strategy. In this approach, regression is performed iteratively: at each round, data points yielding low MSE are filtered out and a new regressor is trained to predict the residuals of the remaining data. This iterative refinement enables the model to progressively focus on harder-to-predict samples by discarding easier cases with minimal residuals. As a result, each successive regressor is trained on a more balanced and informative subset of the data, which is critical for improving performance on the more challenging portions of the data set. To determine the optimal data acceptance threshold and corresponding data fraction at each regression round, we analyze the distribution of residuals and MSE across multiple candidate thresholds. At each stage, we consider several possible values, where each value defines a subset of accepted data along with its associated MSE. The subsequent stage operates on the remaining data, applying a new threshold to the residuals from the previous round.
This decision-making process is formulated as a path optimization problem over a trellis structure:
Each node in the trellis represents a candidate value at a given regression round, along with the proportion of data it retains and the MSE associated with that subset.
Each edge connects a node in one round to a node in the subsequent round, carrying forward the unselected portion of data. The edge accumulates a weighted MSE, considering both the error at the current node and the fraction of data it accounts for.
The objective is to identify a path through the trellis – i.e. a sequence of thresholds, one per round – that minimizes the total MSE across the entire data set. The total MSE is computed as the weighted sum of errors at each stage, where the weights correspond to the fraction of data handled by each regressor. This optimization ensures that the selected sequence of thresholds not only achieves low error locally at each round but also yields optimal performance globally across all rounds.
4. Experiments
4.1 Experimental setup
We used the Kodak and the McMaster (MCM) data sets to evaluate the performance of our algorithm in the demosaicing task. The Kodak data set comprises 24 high-quality images, each with a resolution of 768 × 512 pixels, offering a diverse range of natural scenes that serve as a robust benchmark for color image processing. In contrast, the MCM data set includes 18 images cropped to 500 × 500 pixels from their original high-resolution dimensions of 2310 × 1814 pixels, specifically to generate noise-free CFA images. These two data sets collectively provide a balanced assessment environment to gauge the effectiveness of our algorithm under various conditions.
We trained our model using the Waterloo Exploration Database (WED), a comprehensive repository of images that supports effective and representative training. We used XGBoost regressors with a maximum depth of three in the initial stages and four in the later refinement stages, while fixing the number of boosting rounds at 500. Unlike many models that address demosaicing and denoising simultaneously, our approach focuses solely on demosaicing, enabling a more precise evaluation of its capabilities. However, this focus makes direct comparisons with models handling noise-affected data less meaningful.
Our analysis highlights key aspects, including performance metrics, model complexity and computational efficiency. By isolating demosaicing, we provide deeper insights into its challenges and solutions, offering a clear view of our method’s strengths and limitations.
4.2 Performance analysis
We evaluated the proposed GUSID method by benchmarking it against several state-of-the-art techniques, including traditional and deep learning-based approaches, in Table 1. Among conventional methods, we compared against residual interpolation (RI) (Kiku et al., 2016), minimized Laplacian residual interpolation (MLRI) (Kiku et al., 2014), adaptive residual interpolation (ARI) (Monno et al., 2015) and regression of directional differences with efficient regression priors (DDR-ERP) (Wu et al., 2016). These techniques use sophisticated interpolation and regression strategies to enhance demosaicing performance by leveraging spatial and directional image characteristics. We also evaluated our model relative to the latest CNN-based models, including DJDD (Gharbi et al., 2016), NTSDCN (Wang et al., 2020) and Inc (Niu et al., 2023). Finally, we compare the performance with GID (Movahhedrad et al., 2024) as shown in the table.
Average CPSNR of different models
| Method | Dataset | |
|---|---|---|
| Kodak (24) | McMaster (18) | |
| RI (Kiku et al., 2016) | 39.05 | 36.79 |
| MLRI (Kiku et al., 2014) | 39.52 | 36.51 |
| ARI (Monno et al., 2015) | 39.63 | 37.42 |
| DDR-ERP (Wu et al., 2016) | 40.46 | 37.50 |
| DJDD (Gharbi et al., 2016) | 41.20 | 39.13 |
| NTSDCN (Wang et al., 2020) | 42.79 | 39.48 |
| TSNet (Wang et al., 2020) | 42.39 | 39.39 |
| InC (Niu et al., 2023) | 42.81 | 39.62 |
| GID (Movahhedrad et al., 2024) | 41.21 | 39.21 |
| GUSID (ours) | 42.19 | 39.40 |
| Method | Dataset | |
|---|---|---|
| Kodak (24) | McMaster (18) | |
| 39.05 | 36.79 | |
| 39.52 | 36.51 | |
| 39.63 | 37.42 | |
| DDR-ERP ( | 40.46 | 37.50 |
| 41.20 | 39.13 | |
| 42.79 | 39.48 | |
| TSNet ( | 42.39 | 39.39 |
| InC ( | 42.81 | 39.62 |
| 41.21 | 39.21 | |
| 42.19 | 39.40 | |
Deep learning methods are known for their ability to learn complex mappings and take advantage of local and global image features for improved results. Although GUSID is not based on deep learning, it consistently demonstrates competitive performance on two benchmarks. It underscores the strength and efficiency of our GUSID method. It achieves high-quality results through innovative algorithmic design while maintaining simplicity and computational efficiency. Our comparative analysis highlights the robust capabilities of our model in both traditional and modern contexts.
4.3 Model complexity analysis
Recent advances in neural network-based demosaicing often rely on a single integrated architecture for reconstructing the red, green and blue channels. Although this unified approach facilitates shared representations across channels, it can lead to uneven performance, with overfitting in some channels and underfitting in others. To mitigate these issues, we trained separate compact models for each channel. Specifically, the representation learning module has 0.17 million parameters.
Selecting relevant features involves maintaining a record of the partition loss for each feature stored in an array that organizes feature indices based on their associated losses. This method is computationally efficient, as the feature selection model has just 0.06 million parameters for all three channels. We used an XGBoost model with a maximum depth of 2 for feature generation, adding approximately 2,000 parameters per level.
All modules preceding the decision-learning stage are designed with low computational complexity. Most of the model’s parameters are concentrated within the decision-learning module, accounting for the dominant share of the model’s complexity. The number of parameters for an XGBoost regressor can be computed as follows. When dealing with an ensemble of T decision trees, each having a depth of D and applied to a classification task with C classes, the total number of parameters in the model is determined by considering both the decision nodes and the leaf nodes for every tree in the ensemble.
Decision nodes: A decision tree of depth D includes decision nodes. The factor of 2 arises due to the binary nature of decision-making at each node, directing the flow to either the left or right child.
Leaf nodes: Each tree comprises leaf nodes, which represent the final outcomes after all decisions have been made within the tree.
Combining these elements, the total number of parameters P in the ensemble can be calculated using the formula:
This expression gives a complete representation of the model size, accounting for all the parameters involved in the decision-making process across the ensemble. The decision learning component uses an XGBoost regressor with a depth of 4, resulting in 0.26 million parameters for all four levels. In total, our model consists of 0.55 million parameters. A comparison of our model’s size with other neural models is provided in the table.
Despite its compact architecture, our model is trained on four levels in three channels. The use of independent models enhances robustness against noise and outliers. Additionally, training models on data subsets supports parallel processing, which substantially reduces the training time required for large-scale and complex models.
4.4 Computational cost analysis
Our model has been specifically optimized for deployment in edge computing environments with highly constrained available computational resources. Demosaicing in mobile cameras is a real-time task that demands rapid processing with minimal latency. To evaluate the computational efficiency of the model, we analyze the number of floating-point operations (FLOPs) required. FLOPs represent the total count of basic mathematical operations, such as additions and multiplications, necessary for the model’s execution. This metric provides a reliable estimate of the computational effort required.
Instead of using runtime as a benchmark, we rely on FLOPs because runtime can be affected by various factors, including the complexity of the algorithm, hardware-specific parallel processing capabilities and memory access speeds. These variables may obscure the true computational requirements of the model. In contrast, FLOPs offer a consistent and hardware-agnostic measure, ensuring a fair comparison of computational demands across different platforms.
Neural networks often involve complex architectures and numerous computations due to their layered structure and interconnected nodes. These operations include large-scale matrix multiplications, additions and activation functions, contributing significantly to their overall computational load.
To estimate the FLOPs for our representation learning module, the following procedure was followed: Let the input tensor be , where c denotes the number of channels and specifies the spatial dimensions within each channel. A transformation, labeled Saab, is applied to using a kernel of size with a stride of 1. The resulting tensor is produced, where n represents the spatial resolution of the output.
This transformation can be described mathematically as follows:
Since the stride is set to 1, the spatial dimensions of are reduced from to . The value of n depends on the input dimensions p, the kernel size m and any padding used during the process.
The computational complexity of this operation, measured by the FLOPs, is given as follows:
Here, i corresponds to the iteration step (i-th application) of the transformation T, representing the number of consecutive transformations applied.
For the decision-learning module, which uses XGBoost regressors, the outlined computational framework is used.
Let the model consist of an ensemble of T decision trees, each with a maximum depth of D. For a classification task involving C classes, the computational cost is determined by the operations performed at each node of the decision trees.
Comparison Operations: Each decision tree has a depth of D and every node within the tree requires a comparison operation to determine whether to traverse the left or right subtree.
Addition Operations: A single addition operation is required to decide the following path in the tree structure at each decision node. We take these factors into account. Then, the total number of the FLOPs needed to process a single input sample across all trees in the ensemble is given by:
Our model has a remarkably low computational cost, requiring only 29,621 FLOPs per sample. We see its ability to deliver fast inference time while remaining computationally economical, even for complex tasks. A comparison of the FLOP count of our model with other deep learning methods is provided in Table 2.
Comparison of model parameters and FLOPs
| Method | Parameters (M) | FLOPs / sample (K) |
|---|---|---|
| PCNN (Yamaguchi and Ikehara, 2019) | 20.6 () | – |
| DJDD (Gharbi et al., 2016) | 36.2 () | 249.12 () |
| NTSDCN (Wang et al., 2020) | 1.2 () | 230.08 () |
| InC (Niu et al., 2023) | 1.3 () | 249.12 () |
| GID (Movahhedrad et al., 2024) | 0.7 () | 31.94 () |
| GUSID (ours) | 0.5 () | 29.62 () |
| Method | Parameters (M) | FLOPs / sample (K) |
|---|---|---|
| 20.6 ( | – | |
| 36.2 ( | 249.12 ( | |
| 1.2 ( | 230.08 ( | |
| InC ( | 1.3 ( | 249.12 ( |
| 0.7 ( | 31.94 ( | |
| 0.5 ( | 29.62 ( |
5. Conclusion
We proposed the GUSID method for image demosaicing, which delivers performance on par with state-of-the-art deep learning methods. GUSID is mathematically transparent, computationally efficient and has a tiny model size. GUSID is particularly well-suited for deployment on edge devices, offering minimal latency, low battery consumption, parallel training and reduced RAM usage.
Acknowledgements
The computational work was supported by the University of Southern California Center for Advanced Research Computing (carc.usc.edu).

