Individualized treatment effect inference of head and neck cancer with multimodal data

Wei, Yawen; Li, Zhen; Woo, Jonghye; Ouyang, Jinsong; El Fakhri, Georges; Liu, Xiaofeng

doi:10.1108/ATSIP-06-2025-0051

Estimating individualized treatment effects (ITE) is critical for personalized medicine, yet it remains a challenge due to retrospective observational data, which suffer from selection bias in clinical practice and the complexity of multimodal data used for patient status depiction. In this work, the authors develop an end-to-end deep learning (DL) framework that incorporates multimodal patient data and multiple treatments for accurate ITE inference in a retrospective head and neck cancer (HNC) study. A possible solution is concatenating the factors and adapting adversarial training, which has shown great promise on tabular data, to disentangle patient characteristics from patient status features to mitigate treatment selection bias. However, this approach suffers from instability when applied to complex multimodal patient data and multiple treatment options. For flexible and efficient treatment-conditioned information fusion, they propose a bi-stage adaptive instance normalization (Bi-AdaIN) to inject relevant factors into corresponding layers, an approach that is also robust to missing values. Furthermore, they propose to disentangle status features from the multi-treatment variable using mutual information (MI) regularization, enabling more accurate predictions of patient-specific outcomes for both factual and counterfactual data. The authors evaluated their model on the RADCURE dataset, comprising 3,346 HNC cases with CT scans and multiple clinical variables who received radiotherapy or additional chemotherapy and EGFRI. The Bias-Adjusted Treatment Effect (BATE) is substantially reduced compared to the conventional direct ITE method (which does not consider treatment bias) and to adversarial training, indicating a more robust estimation of causal effects. This work is one of the first DL-based studies to address ITE estimation using multimodal medical imaging, offering a promising approach to counterfactual reasoning in clinical oncology for decision support.

1. Introduction

Predicting the outcome of a specific treatment for an individual patient, known as individualized treatment effect (ITE) inference, is a cornerstone of precision medicine (Lei and Candès, 2021; Pearl, 2001; Prosperi et al., 2020). In complex diseases like head and neck cancer (HNC), where multiple treatment options such as radiotherapy, chemotherapy or their combinations exist, the ability to answer the question, “would this patient have lived longer had an alternative treatment been applied?” is of profound clinical importance. The increasing availability of large-scale medical data, including imaging and electronic health records (EHRs), offers an unprecedented opportunity for data-driven deep learning (DL) algorithms to learn the intricate causal relationships between patient status, treatment choices and subsequent outcomes. However, developing a DL framework for practical ITE prediction from observational data faces several significant challenges.

First, accurately predicting ITE requires a foundation built on patient status, which is characterized by a combination of multimodal data, including 3D medical images (e.g. CT) and tabular clinical variables (Welch et al., 2024). In addition to tabular EHR (Lacroix et al., 2001), a growing body of evidence suggests that imaging data holds great potential for prognosis. Effectively fusing information from heterogeneous yet complementary radiomic, geometric and demographic factors remains a technical hurdle.

Second, although ITE varies across treatments, many existing DL models for medical image-based survival prediction are direct models, as shown in Figure 1(a), that map preoperative patient status $x$ to an outcome $y$ (Kaur et al., 2022). These models either focus on a single treatment (Chaddad et al., 2016; Chang et al., 2016; Nie et al., 2019) or simply ignore the treatment $t$ (Baid et al., 2020; Bakas et al., 2018; Huang et al., 2021). Such models can predict outcomes under a single treatment regimen but cannot be used to compare different treatments for planning.

Figure 1. Causal graphs for ITE estimation. (a) A direct model ignoring treatment: status $x$ points to ITE $y$. (b) A naive model including treatment as a feature: status $x$ and treatment $t$ both point to $y$. (c) The proposed causal framework, where multimodal patient status $x$ influences both treatment $t$ and outcome $y$. To estimate the true effect of treatment, we must disentangle its influence from the learned representation of a patient’s status.

Third, observational studies, unlike randomized controlled trials, are subject to treatment selection bias (Shalit et al., 2017). For example, patients with certain characteristics (e.g. younger age, better overall health) might be systematically chosen for more aggressive treatments. A naive model, as shown in Figure 1(b), that learns a mapping from patient status $x$ and treatment $t$ to outcome $y$ without explicitly considering the causal role of treatment will confound the effect of the treatment with the baseline prognosis of the patient, failing to generalize to counterfactual scenarios (Liu et al., 2024; Shalit et al., 2017). To build a model that is generalizable to both factual (observed) and counterfactual (unobserved) patient-treatment cases, it is necessary to learn a treatment-invariant representation of the patient’s status.

Recent DL studies for ITE on tabular data (Bica et al., 2020; Curth and van der Schaar, 2021; Shalit et al., 2017; Shi et al., 2019; Xu and Yadlowsky, 2022; Yoon et al., 2018) have established that the core challenge is extracting unbiased, treatment-invariant status features $f_{θ} (x) \perp t$ (i.e. disentangling $t$ from $f_{θ} (x)$), a challenge that is underexplored for multimodal status and multiple treatments. With multiple treatments, however, disentanglement via adversarial training, a widely used strategy in binary-treatment ITE prediction (Bica et al., 2020; Curth and van der Schaar, 2021; Shalit et al., 2017; Shi et al., 2019; Xu and Yadlowsky, 2022; Yoon et al., 2018), cannot effectively disentangle the information between $f_{θ} (x)$ and multiple treatments $t$ (Liu et al., 2021a, 2021b; Roy and Boddeti, 2019). Moreover, adversarial training is notoriously unstable, which can be even more challenging for multimodal data compared to simple tabular data.

To address these challenges, we propose an end-to-end mutual information-regularized deep causal learning framework for ITE estimation in HNC. Instead of training independent models for each treatment, our approach uses all treatment data in a unified manner and explicitly considers their causal relationships. Specifically, we develop a bi-stage adaptive instance normalization (Bi-AdaIN) to inject relevant factors into their corresponding layers. This approach is robust to missing values and efficiently conditions the model on the multiple treatment factors. Our method is general and can be applied to modern DL backbones, such as the 3D Swin transformer (Liu et al., 2021c) and 3D ResNet (Turnbull, 2022). In our bi-stage injection, we fuse clinical variables after each transformer block in the Swin transformer or convolutional layer in ResNet before feature disentanglement, while conditioning on the treatment in fully connected (FC) layers after feature disentanglement. Furthermore, we propose an advanced mutual information (MI)-based disentanglement method over conventional adversarial training to ensure that factual and counterfactual patient-treatment cases share the same feature distribution. Consequently, a single model can be applied unbiasedly to both scenarios.

In summary, our work makes the following novel contributions:

We present, to our knowledge, the first DL-based study using multimodal medical imaging for ITE estimation.
We propose an efficient multimodal fusion and treatment conditioning network using Bi-AdaIN, allowing for a synergistic integration of information and causal correlation modeling.
We develop a robust deep disentanglement learning framework based on MI regularization to extract a treatment-invariant patient representation. This enables a single, unified model to predict outcomes for all treatments, accounting for both factual and unseen counterfactual scenarios.

We demonstrate the efficacy of our approach on the large-scale RADCURE clinical data set (Welch et al., 2024), showing that our model significantly reduces treatment effect bias compared to baseline approaches.

2. Methods

2.1 Causal framework for individualized treatment effect estimation

The goal of ITE estimation is to predict the potential outcomes for a patient under different treatments. Let $x$ denote the pretreatment status of a patient (e.g. CT images and clinical variables), $t \in \{t_{1}, t_{2}, \dots, t_{K}\}$ be the treatment received from a set of $K$ possible treatments and $y$ be the outcome (e.g. survival length). In an observational data set, we only observe the factual outcome $y$ for the treatment $t$ that was actually administered. The outcomes for all other treatments $t' \neq t$ are unobserved counterfactuals.

As illustrated in Figure 1(a) and (b), direct models that ignore treatment, and naive models that simply use it as an input feature, fail to address the confounding issue where patient status $x$ influences both the treatment choice $t$ and the outcome $y$. Moreover, in retrospective data sets, simply learning $\{x, t\} \to y$, as illustrated in Figure 1(b), cannot guarantee performance on counterfactuals. We only observe the factuals $\{x_{n}, t_{n}, y_{n}\}_{n = 1}^{N}$, without the counterfactuals associated with the alternative treatments $\{x_{n}, t_{n}^{\neg}, y_{n}^{\neg}\}_{n = 1}^{N}$, where $t_{n}^{\neg} \neq t_{n}$. Since the underlying treatment assignment mechanism is $p (t | x)$, the inputs for factuals $\{x_{n}, t_{n}\}_{n = 1}^{N} \sim p^{F} (x, t) = p (x) p (t | x)$ and counterfactuals $\{x_{n}, t_{n}^{\neg}\}_{n = 1}^{N} \sim p^{CF} (x, t) = p (x) p (\neg t | x)$ follow different empirical distributions (Johansson et al., 2016), i.e. there is a covariate shift $p^{F} (x, t) \neq p^{CF} (x, t)$ (Liu et al., 2022). Thus, a model learned from factuals may not perform well on counterfactuals and therefore cannot reliably compare ITEs across treatments.

Our framework follows Figure 1(c), which explicitly models this causal structure. The core idea is that to accurately estimate the effect of $t$ on $y$, we need to learn a representation of the patient status, $z = f_{ϕ} (x)$, that is disentangled from the treatment assignment, i.e. $p (z | t) = p (z)$. This ensures that the representation captures the patient’s underlying prognosis, independent of the treatment they received due to selection bias. This is an underexplored area for multimodal patient status and multi-treatment scenarios. After training, the model is expected to be unbiased with respect to treatments, and all treatments can be traversed to select the one with the best ITE.

2.2 Proposed model architecture

The overall architecture is shown in Figure 2. It has three main components: a multimodal feature encoder with Bi-AdaIN conditioning, a disentanglement module (either adversarial or MI-based) and a survival prediction head. The gradient from the $L_{adv}$ or $L_{MI}$ loss is used to update the main encoder, while the survival prediction loss trains both the encoder and the prediction head.

Figure 2. The proposed deep causal learning architecture for ITE inference of HNC. (a) 3D Swin transformer architecture and (b) 3D ResNet architecture. The main branch (top) encodes patient status (image and tabular data) into a treatment-invariant representation $z$ for survival prediction. The adversarial discriminator or MI estimator regularizes the encoder by minimizing the MI $I (z; t)$, forcing the representation $z$ to be independent of the treatment $t$.

[Figure: in both panels, the image input $x_{img}$ ($256 \times 256 \times 128$) passes through four backbone stages: patch partition, linear embedding and Swin transformer blocks with patch merging in (a); a $7 \times 7 \times 7$ convolution with max pooling followed by four layers of basic blocks in (b). The tabular input $x_{tab}$ is mapped through layers of width 128, 64 and 16 to $l_{tab}$, which drives an AdaIN module at each stage. The stage-4 features form $z$, followed by global average pooling and FC layers of width 512 and 64 that output $y$. The treatment $t$ is mapped to a 16-dimensional $l_{t}$ that drives the AdaIN modules on these layers. The loss $L_{adv}$ or $L_{MI}$ attaches to $z$ and the survival loss attaches to $y$.]

2.2.1 Multimodal feature encoder with treatment conditioning.

Patient status $x$ consists of 3D CT image volumes, which are preprocessed to $x_{img} \in \mathbb{R}^{256 \times 256 \times 128}$ with 128 slices to cover the head and neck region, together with the tabular clinical vector $x_{tab}$. HNC treatment usually involves radiotherapy (RT) as the standard procedure, with the addition of either chemotherapy (ChemoRT) or an epidermal growth factor receptor inhibitor (EGFRI). Therefore, we have three classes of treatment $t$ (RT, ChemoRT and RT+EGFRI), which are encoded as a three-dimensional one-hot vector. We aim to predict the unbiased treatment effect $y$ (e.g. survival days), which can be censored data representing the time from the CT scan to either death or the end of the study.

Without loss of generality, we adopt the widely used 3D Swin transformer (Liu et al., 2021c) and 3D ResNet (Turnbull, 2022) from the MONAI package (Cardoso et al., 2022) as the backbone image feature extractor, $E_{img}$. Specifically, we used the transformer blocks in the 3D Swin transformer or the convolutional layers of 3D ResNet18 as $E_{img}$, which includes four transformer stages or four residual layers (each with two basic ResNet blocks). The extracted patient status feature is $z \in \mathbb{R}^{8 \times 8 \times 8 \times 512}$. We enforce the independence of $z$ and $t$ with either adversarial training or MI regularization. Then, the patient status feature $z$ is conditioned on the treatment $t$ for the final ITE inference.

Thus, a critical problem is how to seamlessly inject $x_{tab}$ and $t$ into the feature extraction process before and after disentanglement, respectively. A simple solution is to concatenate the clinical variables $x_{tab}$ and the treatment $t$ with the image features in two FC layers and enforce disentanglement between them. However, directly concatenating a 1D vector with a 4D (3D spatial plus channel) convolutional feature map is challenging and inefficient, while combining $x_{tab}$ only at the end of the network in a single FC layer hardly allows the model to learn their complementary relationship (Liu et al., 2024), because few subsequent parameters remain to capture their correlation. In addition, adding more FC layers to sequentially incorporate $x_{tab}$, disentanglement and $t$ introduces many more parameters and is less efficient for learning their correlations.

Therefore, we resort to adaptive instance normalization (AdaIN) layers and propose a Bi-AdaIN to inject the relevant factors into corresponding layers. Following previous successful conditional modeling works (Karras et al., 2019; Liu et al., 2020), we adopt AdaIN (Huang and Belongie, 2017) to inject $x_{tab}$ in each transformer or convolutional block, while conditioning on $t$ in each FC layer.

Specifically, we process $x_{tab}$ or $t$ with a three-layer multi-layer perceptron (MLP) to produce a 16-dimensional vector $l_{tab} \in \mathbb{R}^{16}$ or $l_{t} \in \mathbb{R}^{16}$ in an intermediate latent space, following Liu et al. (2024). Learned affine transformations then specialize $l_{tab}$ or $l_{t}$ to subject-wise scale and bias scalars $tab_{i}^{'} = (tab_{i}^{s}, tab_{i}^{b})$ and treatment-wise scale and bias scalars $t_{i}^{'} = (t_{i}^{s}, t_{i}^{b})$ in the i-th network layer to control the AdaIN operators. Note that the dimensionality of $tab_{i}^{s}$, $tab_{i}^{b}$, $t_{i}^{s}$ or $t_{i}^{b}$ must match the number of feature maps in that layer. Specifically, we have the following AdaIN operation in each layer:

$$\mathrm{AdaIN} (x_{i}, tab_{i}^{'}) = tab_{i}^{s} \frac{x_{i} - μ (x_{i})}{σ (x_{i})} + tab_{i}^{b}, \tag{1}$$

$$\mathrm{AdaIN} (x_{i}, t_{i}^{'}) = t_{i}^{s} \frac{x_{i} - μ (x_{i})}{σ (x_{i})} + t_{i}^{b}, \tag{2}$$

where $x_{i}$ is the extracted feature map after the i-th layer, and $μ (x_{i})$ and $σ (x_{i})$ are its per-channel mean and standard deviation. As a result, $x_{i}$ is individually normalized and then scaled and biased using the corresponding scalar components from $tab_{i}^{'} = (tab_{i}^{s}, tab_{i}^{b})$ and $t_{i}^{'} = (t_{i}^{s}, t_{i}^{b})$. By doing so, a stronger inductive bias of $t$ is injected into the DL model.

The model $f$ is detailed in Figure 2. Specifically, we used four AdaIN layers for $x_{tab}$ in both the 3D Swin Transformer and 3D ResNet18 models. In addition, the treatment $t$ is injected in two FC layers.
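The AdaIN operation in equations (1) and (2) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function name `adain`, the flat-list feature layout and the stabilizing constant `eps` are assumptions made for illustration, and the scale and bias values are arbitrary stand-ins for the outputs of the learned affine transformations.

```python
import math


def adain(feature_map, scale, bias, eps=1e-5):
    """Apply AdaIN to the feature map of one sample.

    feature_map: list of channels, each a flat list of activations.
    scale, bias: one scalar per channel (stand-ins for tab_i^s / tab_i^b
    or t_i^s / t_i^b; here they are hand-picked, not learned).
    eps: small constant for numerical stability (an assumption; it does
    not appear in equations (1) and (2)).
    """
    if not (len(feature_map) == len(scale) == len(bias)):
        raise ValueError("scale and bias need one entry per channel")
    out = []
    for channel, s, b in zip(feature_map, scale, bias):
        mu = sum(channel) / len(channel)
        var = sum((v - mu) ** 2 for v in channel) / len(channel)
        sigma = math.sqrt(var + eps)
        # Normalize the channel, then apply the conditioning scale and bias.
        out.append([s * (v - mu) / sigma + b for v in channel])
    return out


# Two channels with very different statistics.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 30.0, 30.0]]
conditioned = adain(features, scale=[2.0, 0.5], bias=[1.0, -1.0])
```

After the call, each output channel has a mean equal to its bias and a standard deviation close to its scale, regardless of the input statistics, which is how the conditioning vector controls the features.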

2.2.2 Disentanglement via adversarial training or mutual information regularization.

A core principle for mitigating treatment selection bias and covariate shift in ITE is to enforce the independence of the learned representation from the treatment $t$ (Hassanpour and Greiner, 2019; Johansson et al., 2016; Shalit et al., 2017). To enforce this independence between the representation $z = f_{ϕ} (x)$ and the treatment $t$, we explore two methods.

Adversarial training. This approach introduces a discriminator network $D_{adv}$ that attempts to predict the treatment $t$ from the representation $z$. The feature encoder $f_{ϕ}$ is concurrently trained to produce representations that “fool” this discriminator, making them indistinguishable with respect to treatment. This is formulated as a min-max game. The discriminator is trained to minimize the treatment prediction loss (e.g. cross-entropy), while the encoder is trained to maximize it:

$$\max_{f_{ϕ}} \min_{D_{adv}} L_{adv} (D_{adv} (f_{ϕ} (x)), t) \tag{3}$$

In practice, this is often implemented using a gradient reversal layer (GRL) (Ganin et al., 2016). The GRL passes the input through unchanged during the forward pass but multiplies the gradient by a negative constant $(- λ)$ during the backward pass, allowing the entire system to be trained with a single optimization step.
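The effect of the GRL on the parameter updates can be written out explicitly. The following is a sketch under assumptions: $R_{λ}$ denotes the GRL, $η$ is a learning rate and $ω$ denotes the discriminator parameters (none of these symbols appear in the original text), and the adversarial loss is assumed to be applied to $D_{adv} (R_{λ} (z))$ with plain gradient descent.

```latex
% Forward pass: the GRL is the identity.
R_{\lambda}(z) = z
% Backward pass: the gradient is multiplied by -\lambda.
\frac{\partial R_{\lambda}(z)}{\partial z} = -\lambda I
% Discriminator update: ordinary descent on the treatment prediction loss.
\omega \leftarrow \omega - \eta \, \frac{\partial L_{adv}}{\partial \omega}
% Encoder update: descent on the survival loss, ascent on the adversarial loss.
\phi \leftarrow \phi - \eta \left( \frac{\partial L_{survival}}{\partial \phi}
      - \lambda \, \frac{\partial L_{adv}}{\partial \phi} \right)
```

A single backward pass therefore makes the discriminator better at predicting $t$ while pushing the encoder toward features from which $t$ is harder to predict.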

Mutual information regularization. While conceptually simple, adversarial training can be unstable and less effective for multiple treatments (Liu et al., 2021a, 2021b; Roy and Boddeti, 2019). To mitigate treatment selection bias, we regularize the encoder $f_{ϕ}$ by explicitly minimizing the MI between the representation and the treatment variable, $I (z; t)$. The MI is defined as:

$$I (z; t) = D_{KL} (P_{zt} \,\|\, P_{z} \otimes P_{t}) \tag{4}$$

where $P_{zt}$ is the joint probability distribution and $P_{z} \otimes P_{t}$ is the product of the marginals. Minimizing MI forces the joint distribution to approximate the product of marginals, rendering $z$ and $t$ statistically independent.

Calculating MI directly for high-dimensional continuous variables is intractable. Therefore, we use the mutual information neural estimator (MINE) (Belghazi et al., 2018), which provides a scalable and strongly consistent estimate of the MI by maximizing a lower bound based on the Donsker–Varadhan representation of the KL-divergence. A separate network, the MI estimator $T_{ψ} (z, t)$, is trained to provide this lower bound:

$$\hat{I} (z; t) = \sup_{ψ \in Ψ} \left( E_{P_{zt}} [T_{ψ}] - \log \left( E_{P_{z} \otimes P_{t}} [e^{T_{ψ}}] \right) \right). \tag{5}$$

The critic $T_{ψ}$ is trained to output higher values for samples from the joint distribution (paired $(z, t)$ samples) and lower values for samples from the product of marginals (shuffled, unpaired $(z, t')$ samples). The expectations are approximated using Monte Carlo sampling on mini-batches:

$$L_{MI} = \frac{1}{N} \sum_{i = 1}^{N} T_{ψ} (z_{i}, t_{i}) - \log \left( \frac{1}{N} \sum_{i = 1}^{N} e^{T_{ψ} (z_{i}, t_{i}^{'})} \right). \tag{6}$$

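This estimated MI, $L_{MI}$, serves as a regularization loss for the main encoder $f_{ϕ}$: the critic $T_{ψ}$ is updated to maximize it, which tightens the lower bound, while the encoder is updated to minimize it.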
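The mini-batch estimate in equation (6) can be sketched in a few lines of pure Python. This is an illustrative sketch rather than the authors' code: the function name `mine_lower_bound`, the `critic` callable standing in for $T_{ψ}$ and the seeded shuffle are assumptions, and no gradient-based training of the critic is shown.

```python
import math
import random


def mine_lower_bound(critic, z_batch, t_batch, rng=None):
    """Donsker-Varadhan estimate of I(z; t) on one mini-batch.

    critic: callable (z, t) -> float, a stand-in for T_psi.
    Paired (z_i, t_i) samples approximate the joint distribution;
    shuffling t within the batch approximates the product of marginals.
    """
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    n = len(z_batch)
    shuffled_t = list(t_batch)
    rng.shuffle(shuffled_t)

    joint_term = sum(critic(z, t) for z, t in zip(z_batch, t_batch)) / n
    marginal_scores = [critic(z, t) for z, t in zip(z_batch, shuffled_t)]

    # log of the mean of exp(scores), with max-subtraction for stability.
    m = max(marginal_scores)
    log_mean_exp = m + math.log(sum(math.exp(s - m) for s in marginal_scores) / n)
    return joint_term - log_mean_exp


z_batch = [0.1, -0.3, 0.8, 1.2]
t_batch = [0, 1, 2, 1]
constant_estimate = mine_lower_bound(lambda z, t: 3.0, z_batch, t_batch)
t_only_estimate = mine_lower_bound(lambda z, t: float(t), z_batch, t_batch)
```

A constant critic yields an estimate of zero, and a critic that ignores $z$ can never yield a positive estimate (by Jensen's inequality), so a positive value signals that the critic has found dependence between $z$ and $t$ that the encoder should remove.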

2.2.3 Survival prediction for censored data.

The final representation $z$ is fed into a prediction head $h_{θ}$ to estimate a risk score. Since our outcome data is right-censored, a standard regression loss like mean squared error (MSE) is inappropriate. Instead, we use a survival loss that can handle censored observations. We use the negative log partial likelihood loss from the Cox proportional hazards model (Cox, 1972).

Given a predicted risk score $\hat{y}_{i} = h_{θ} (z_{i})$ for patient i, the loss is calculated over all patients i for whom the event was observed ($δ_{i} = 1$). The loss encourages the risk score of patient i to be higher than that of any patient j who was still at risk at the time of patient i’s event (i.e. $y_{j} \geq y_{i}$):

$$L_{survival} = - \sum_{i : δ_{i} = 1} \left( \hat{y}_{i} - \log \sum_{j : y_{j} \geq y_{i}} e^{\hat{y}_{j}} \right) \tag{7}$$

This loss effectively ranks patients by their risk of an event without requiring the exact event times for censored individuals.
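Equation (7) can be illustrated with a small pure-Python sketch. The function name and the toy inputs are hypothetical; a training implementation would operate on tensors and typically normalize by the number of observed events, which is not shown here.

```python
import math


def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative log partial likelihood of the Cox model, as in equation (7).

    risk:  predicted risk scores (one per patient).
    time:  observed times (event or censoring time).
    event: 1 if the event was observed, 0 if the patient is censored.
    """
    loss = 0.0
    for i in range(len(risk)):
        if event[i] != 1:
            continue  # censored patients only appear inside risk sets
        # Risk set: everyone still at risk at patient i's event time.
        at_risk = [risk[j] for j in range(len(risk)) if time[j] >= time[i]]
        m = max(at_risk)
        log_sum_exp = m + math.log(sum(math.exp(r - m) for r in at_risk))
        loss -= risk[i] - log_sum_exp
    return loss


time = [1.0, 2.0, 3.0]
event = [1, 1, 0]  # the third patient is censored
uninformative = cox_neg_log_partial_likelihood([0.0, 0.0, 0.0], time, event)
well_ordered = cox_neg_log_partial_likelihood([2.0, 1.0, 0.0], time, event)
mis_ordered = cox_neg_log_partial_likelihood([0.0, 1.0, 2.0], time, event)
```

With equal risk scores the loss reduces to $\log 3 + \log 2 = \log 6$, and assigning the highest risk to the earliest event lowers the loss, while the reversed ordering raises it.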

2.2.4 Overall objective.

The complete model is trained end-to-end by optimizing a composite objective function:

$$L_{total} = L_{survival} + λ \cdot L_{disentangle} \tag{8}$$

where $L_{disentangle}$ is either the adversarial loss $L_{adv}$ or the MI loss $L_{MI}$ and $λ$ is a hyperparameter balancing prediction and disentanglement.

3. Experiments and results

3.1 Data preprocessing and augmentation

We used the publicly available RADCURE data set (Welch et al., 2024), an extensive collection of data from 3,346 HNC patients. This data set provides pretreatment computed tomography (CT) volumes, a comprehensive set of clinical variables and long-term survival outcomes. The patient cohort has a median age of 63 years, consists of 80% males and has a median follow-up time of 5 years.

Our study focuses on comparing three primary treatment regimens. The patient cohort is divided as follows: 1,413 patients received RT with concurrent chemotherapy (ChemoRT), 72 patients received RT with an epidermal growth factor receptor inhibitor (RT+EGFRI) and 1,861 patients received RT alone. For our model, the treatment variable $t$ is encoded as a three-dimensional one-hot vector.

A standardized preprocessing pipeline was implemented for all CT volumes using the MONAI framework (Cardoso et al., 2022). All CT volumes were resampled to an isotropic voxel spacing of $1.0 \times 1.0 \times 1.0$ mm³ using third-order spline interpolation. Voxel intensities were clipped to a window of [−1000, 1000] Hounsfield units (HU) to focus on relevant soft tissue and bone structures. Subsequently, we applied Z-score normalization by subtracting the mean and dividing by the standard deviation of the non-zero voxels within the patient’s body mask. A fixed-size volume of $256 \times 256 \times 128$ voxels was cropped from the center of each normalized volume to serve as the input for our DL models.

The tabular vector $x_{tab}$ includes age, sex, Eastern Cooperative Oncology Group performance status (ECOG PS), smoking pack-years, smoking status, disease site, tumor subsite, tumor size category (T), nodal involvement (N), metastasis status (M), clinical stage, pathological type and human papillomavirus (HPV) status. For continuous or ordinal variables, scalar encoding is used. For nominal categorical variables, one-hot encoding is applied. The detailed processing is given in Table 1.

Table 1.

Analysis of encoding and dimensionality for clinical variables

Clinical variable	Encoding	Dimension	Unique categories / note
Age	Scalar	1	N/A (continuous value)
Sex	One-hot	2	[‘Male’, ‘Female’]
ECOG PS	Scalar	1	N/A (ordinal scale 0–4)
Smoking pack-years	Scalar	1	N/A (continuous value)
Smoking status	One-hot	3	[‘Current’, ‘Former’, ‘Never’]
Disease site	One-hot	19	[‘Oropharynx’, ‘Larynx’, ‘Hypopharynx’,…]
Tumor subsite	One-hot	63	[‘Tonsil’, ‘Base of tongue’, ‘Glottis’,…]
Tumor size category (T)	One-hot	17	[‘T1’, ‘T2’, ‘T3’, ‘T4a’, ‘T4b’,…]
Nodal involvement (N)	One-hot	10	[‘N0’, ‘N1’, ‘N2a’, ‘N2b’, ‘N2c’,…]
Metastasis status (M)	One-hot	2	[‘M0’, ‘M1’]
Clinical stage	One-hot	14	[‘I’, ‘II’, ‘III’, ‘IVA’, ‘IVB’, ‘IVC’]
Pathological type	One-hot	41	[‘SCC’, ‘Adenocarcinoma’,…]
HPV status	One-hot	2	[‘Positive’, ‘Negative’]
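The encoding scheme in Table 1 can be sketched as follows. The per-variable dimensions are copied from the table; the helper `one_hot` and its treatment of unknown or missing values as an all-zero vector are assumptions for illustration, not a description of the authors' pipeline.

```python
# Per-variable dimensions as listed in Table 1.
TABLE_1_DIMS = {
    "Age": 1,
    "Sex": 2,
    "ECOG PS": 1,
    "Smoking pack-years": 1,
    "Smoking status": 3,
    "Disease site": 19,
    "Tumor subsite": 63,
    "Tumor size category (T)": 17,
    "Nodal involvement (N)": 10,
    "Metastasis status (M)": 2,
    "Clinical stage": 14,
    "Pathological type": 41,
    "HPV status": 2,
}


def one_hot(value, categories):
    """One-hot encode a nominal value.

    Unknown or missing values map to an all-zero vector (an assumption).
    """
    return [1.0 if value == category else 0.0 for category in categories]


smoking_categories = ["Current", "Former", "Never"]
former = one_hot("Former", smoking_categories)
missing = one_hot(None, smoking_categories)

# Length of the concatenated clinical vector implied by Table 1.
total_dim = sum(TABLE_1_DIMS.values())
```

Summing the dimensions in Table 1 gives a 176-dimensional concatenated clinical vector.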

To improve model generalization, we applied on-the-fly data augmentation to the training set. The augmentation techniques, also implemented with MONAI (Cardoso et al., 2022), included geometric transformations such as random rotations within a range of [−15, 15] degrees, random scaling (zooming) with a factor between [0.9, 1.1] and random flipping along the sagittal plane. We also applied intensity transformations with random adjustments to brightness and contrast and used noise injection with the addition of random Gaussian noise.

3.2 Experimental setup

The data set was split into training (70%), validation (10%) and testing (20%) sets. Notably, we selected a testing set consisting entirely of non-censored data for more reliable evaluation. We also kept the distribution of treatment types consistent across all three sets. All models were implemented in PyTorch. We used the AdamW optimizer with an initial learning rate of $1 \times 10^{- 4}$ and a weight decay of $1 \times 10^{- 5}$. A cosine annealing learning rate scheduler was used. Due to GPU memory constraints, a batch size of 4 was used. The models were trained for up to 200 epochs on NVIDIA A100 GPUs. The hyperparameter $λ$ was determined through a grid search.
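For reference, a cosine annealing schedule with the stated initial learning rate and epoch budget can be written in closed form. This is a sketch under assumptions: the minimum learning rate of 0, the per-epoch update and the absence of warm-up or restarts are not specified in the text, and the function name is hypothetical.

```python
import math


def cosine_annealed_lr(epoch, total_epochs=200, base_lr=1e-4, min_lr=0.0):
    """Closed-form cosine annealing from base_lr down to min_lr.

    min_lr=0.0 and one update per epoch are assumptions.
    """
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


lr_start = cosine_annealed_lr(0)
lr_middle = cosine_annealed_lr(100)
lr_end = cosine_annealed_lr(200)
```

Under these assumptions the learning rate starts at $1 \times 10^{-4}$, passes through half that value at epoch 100 and decays to zero at epoch 200.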

3.3 Evaluation metrics

Prediction accuracy: We measured the accuracy of factual outcome prediction using MSE and mean absolute error (MAE).

$$MSE = \frac{1}{N} \sum_{i = 1}^{N} (y_{i} - \hat{y}_{i})^{2}, \qquad MAE = \frac{1}{N} \sum_{i = 1}^{N} | y_{i} - \hat{y}_{i} |. \tag{9}$$

We use years as the unit of our survival time. Notably, these two metrics can only be applied to observed (factual) samples.

Causal effect estimation: Since ground-truth counterfactuals are unobservable, we evaluated the model’s ability to estimate causal effects using the bias-adjusted treatment effect (BATE) (Shalit et al., 2017). For any two treatments A and B, BATE measures the difference between the average treatment effect estimated by the model on the entire population and the true average effect observed in the factual data:

$$BATE = \left| E_{x} [\hat{y}_{A} - \hat{y}_{B}] - \left( E_{x | t = A} [y] - E_{x | t = B} [y] \right) \right|, \tag{10}$$

where $\hat{y}$ are the model predictions. A lower BATE indicates that the model is less biased by the treatment assignment mechanism.

For example, consider comparing ChemoRT (A) and RT alone (B). The second term, $(E_{x | t = A} [y] - E_{x | t = B} [y])$, is the observed difference in average survival between the 1,413 patients who actually received ChemoRT and the 1,861 who received RT alone. This value is biased because these two groups of patients may have been different from the start. The first term, $E_{x} [\hat{y}_{A} - \hat{y}_{B}]$, is the model’s estimated treatment effect, averaged over the entire population. To calculate this, the model takes every single patient and predicts their outcome for both treatments. BATE measures the gap between the model’s unbiased population estimate and the real-world biased observation. A small BATE means the model has successfully adjusted for the initial differences between the treatment groups.
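The computation in equation (10) can be sketched in pure Python. The function name `bate` and the four-patient toy cohort are hypothetical; the numbers are invented solely to make the arithmetic traceable and are not results from the study.

```python
def bate(pred_a, pred_b, observed_y, observed_t, a, b):
    """BATE between treatments a and b, following equation (10).

    pred_a, pred_b: model predictions for every patient under a and b.
    observed_y:     factual outcomes (one per patient).
    observed_t:     treatment each patient actually received.
    """
    n = len(pred_a)
    # First term: model-estimated effect averaged over the whole population.
    model_effect = sum(pa - pb for pa, pb in zip(pred_a, pred_b)) / n

    # Second term: observed difference between the two factual groups.
    y_a = [y for y, t in zip(observed_y, observed_t) if t == a]
    y_b = [y for y, t in zip(observed_y, observed_t) if t == b]
    observed_effect = sum(y_a) / len(y_a) - sum(y_b) / len(y_b)

    return abs(model_effect - observed_effect)


# Toy cohort (hypothetical values, survival in years).
pred_chemo_rt = [5.0, 4.0, 6.0, 3.0]
pred_rt = [4.0, 3.5, 5.0, 2.5]
observed_y = [5.5, 3.0, 6.5, 2.0]
observed_t = ["ChemoRT", "RT", "ChemoRT", "RT"]

toy_bate = bate(pred_chemo_rt, pred_rt, observed_y, observed_t, "ChemoRT", "RT")
```

In this toy cohort the model-estimated effect is 0.75 years and the observed group difference is 3.5 years, so the resulting value is 2.75.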

3.4 Results

We present a comprehensive evaluation of our proposed methods, assessing the impact of different architectural choices on both predictive accuracy and causal inference robustness. We compare our final proposed model (Bi-AdaIN + MI) against several ablative and baseline configurations using two different network backbones: 3D Swin transformer and 3D ResNet18.

3.4.1 Quantitative comparison of predictive and causal performance.

The primary results are summarized in Table 2 for the 3D Swin transformer backbone and Table 3 for the 3D ResNet18 backbone. As shown, our proposed method (“Bi-AdaIN + MI”) achieves the best causal inference performance (lowest BATE) while maintaining strong predictive accuracy. Comparing the “Baseline (Concatenation)” with “Baseline + Bi-AdaIN,” we observe that incorporating our Bi-AdaIN fusion mechanism consistently improves predictive accuracy (lower MAE/MSE). This suggests that Bi-AdaIN provides a more effective way to integrate tabular clinical data. When adding disentanglement (“+ Adversarial” or “+ MI”), we observe the expected tradeoff: a slight degradation in MAE/MSE for a significant improvement in BATE. Notably, the MI-based approach demonstrates a more significant reduction in BATE than the adversarial method and the 3D Swin transformer backbone consistently yields better results than the 3D ResNet18 backbone.

Table 2.

Performance comparison of different model configurations using the 3D Swin transformer backbone. Best BATE values are highlighted in bold

Model	MAE (years)	MSE (years²)	BATE ($↓$): ChemoRT versus RT	BATE ($↓$): ChemoRT versus RT+EGFRI	BATE ($↓$): RT versus RT+EGFRI
Baseline (concatenation)	1.75 $\pm$ 0.08	4.21 $\pm$ 0.25	1.82 $\pm$ 0.15	0.41 $\pm$ 0.09	1.63 $\pm$ 0.18
Baseline + Bi-AdaIN	1.61 $\pm$ 0.07	4.09 $\pm$ 0.23	1.79 $\pm$ 0.14	0.40 $\pm$ 0.08	1.60 $\pm$ 0.17
Baseline + adversarial	1.79 $\pm$ 0.09	5.06 $\pm$ 0.28	1.48 $\pm$ 0.18	0.15 $\pm$ 0.11	1.31 $\pm$ 0.20
Bi-AdaIN + adversarial	1.64 $\pm$ 0.08	4.26 $\pm$ 0.26	1.41 $\pm$ 0.15	0.15 $\pm$ 0.09	1.24 $\pm$ 0.16
Baseline + MI	1.80 $\pm$ 0.09	5.19 $\pm$ 0.29	0.35 $\pm$ 0.12	0.09 $\pm$ 0.07	0.35 $\pm$ 0.14
Bi-AdaIN + MI (proposed)	1.62 $\pm$ 0.07	4.13 $\pm$ 0.24	0.14 $\pm$ 0.05	0.06 $\pm$ 0.04	0.11 $\pm$ 0.06

Note(s):

The BATE values involving the RT+EGFRI treatment group should be interpreted with caution due to the smaller sample size (n = 72) in this cohort compared to the ChemoRT (n = 1,413) and RT (n = 1,861) groups

Table 3.

Performance comparison of different model configurations using the 3D ResNet18 backbone. Best BATE values are highlighted in bold

Model	Predictive accuracy		Causal effect bias (BATE) ($\downarrow$)		
	MAE	MSE	ChemoRT versus RT	ChemoRT versus RT+EGFRI	RT versus RT+EGFRI
Baseline (concatenation)	1.81 $\pm$ 0.09	5.35 $\pm$ 0.28	1.89 $\pm$ 0.17	0.65 $\pm$ 0.10	1.70 $\pm$ 0.20
Baseline + Bi-AdaIN	1.76 $\pm$ 0.08	4.22 $\pm$ 0.26	1.85 $\pm$ 0.16	0.53 $\pm$ 0.09	1.66 $\pm$ 0.19
Baseline + adversarial	1.84 $\pm$ 0.10	5.41 $\pm$ 0.31	1.59 $\pm$ 0.20	0.38 $\pm$ 0.12	1.42 $\pm$ 0.22
Bi-AdaIN + adversarial	1.69 $\pm$ 0.09	4.30 $\pm$ 0.29	1.15 $\pm$ 0.17	0.28 $\pm$ 0.10	0.95 $\pm$ 0.18
Baseline + MI	1.86 $\pm$ 0.10	5.44 $\pm$ 0.32	0.48 $\pm$ 0.15	0.23 $\pm$ 0.09	0.46 $\pm$ 0.16
Bi-AdaIN + MI (proposed)	1.68 $\pm$ 0.08	4.25 $\pm$ 0.27	0.21 $\pm$ 0.07	0.14 $\pm$ 0.06	0.18 $\pm$ 0.08

Note(s):

The BATE values involving the RT+EGFRI treatment group should be interpreted with caution due to the smaller sample size (n = 72) in this cohort compared to the ChemoRT (n = 1,413) and RT (n = 1,861) groups
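To make the size of the bias reduction concrete, the following minimal sketch computes the relative BATE reduction of “Bi-AdaIN + MI (proposed)” over “Baseline (concatenation)” from the mean values in Tables 2 and 3. It ignores the reported standard deviations, and the helper name `relative_reduction` is ours rather than part of the original analysis.

```python
# Mean BATE values copied from Tables 2 and 3 (standard deviations omitted).
# Column order: ChemoRT vs RT, ChemoRT vs RT+EGFRI, RT vs RT+EGFRI.
COMPARISONS = ["ChemoRT vs RT", "ChemoRT vs RT+EGFRI", "RT vs RT+EGFRI"]
BATE = {
    "3D Swin transformer": {
        "baseline": [1.82, 0.41, 1.63],
        "proposed": [0.14, 0.06, 0.11],
    },
    "3D ResNet18": {
        "baseline": [1.89, 0.65, 1.70],
        "proposed": [0.21, 0.14, 0.18],
    },
}


def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# Relative reduction per backbone and treatment comparison, in percent.
reductions = {
    backbone: [
        round(relative_reduction(b, p), 1)
        for b, p in zip(values["baseline"], values["proposed"])
    ]
    for backbone, values in BATE.items()
}

for backbone, values in reductions.items():
    for name, pct in zip(COMPARISONS, values):
        print(f"{backbone:20s} {name:22s} {pct:5.1f}% lower BATE")
```

On these mean values the reduction is roughly 85% to 93% with the Swin transformer and 78% to 89% with ResNet18; the comparisons involving RT+EGFRI remain subject to the small-sample caveat in the table notes.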

3.4.2 Feature importance analysis.

To further understand the interplay between modalities and how our model makes decisions, we conducted a permutation importance analysis. This technique measures a feature’s importance by evaluating the drop in model performance when that feature’s values are randomly shuffled. A larger drop in performance after shuffling a feature (reported in Figure 3 as the mean decrease in C-index) therefore indicates higher feature importance. The results for the tabular-only model and our proposed multimodal model (using the Swin transformer backbone) are presented in Figure 3.
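The permutation procedure can be sketched in a few lines. This is a generic, framework-free illustration rather than the code used in the study: the function name, the toy model and the toy score (negative mean absolute error standing in for the C-index) are all assumptions made for the example.

```python
import random


def permutation_importance(predict, score, X, y, n_repeats=10, seed=0):
    """Mean drop in `score` when each feature column is shuffled.

    predict: callable mapping a list of feature rows to predictions
    score:   callable (y_true, y_pred) -> float, where higher is better
    X:       list of rows, each a list of feature values
    """
    rng = random.Random(seed)
    base = score(y, predict(X))
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other feature intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(base - score(y, predict(X_perm)))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy check (hypothetical data): the target depends on feature 0 only.
def toy_predict(rows):
    return [row[0] for row in rows]


def negative_mae(y_true, y_pred):
    return -sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


X_toy = [[float(i), float(i % 3)] for i in range(30)]
y_toy = [row[0] for row in X_toy]
importance = permutation_importance(toy_predict, negative_mae, X_toy, y_toy)
print(importance)  # feature 0 has a positive importance, feature 1 exactly 0.0
```

In the toy check, shuffling the unused feature leaves the score unchanged, which is the behaviour expected of a feature the model has learned to ignore.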

Figure 3.

Two side-by-side horizontal bar charts, titled “Comparison of feature importance with and without image data,” show permutation importance as the mean decrease in C-index, with error bars indicating the variability around each estimate. Values below are approximate readings from the charts.

(a) Tabular data only, ranked from highest to lowest: Stage (0.36), N (0.31), T (0.28), HPV (0.24), Ds Site (0.19), ECOG PS (0.17), Age (0.11), Path (0.09), Smoking PY (0.08), Subsite (0.06), Smoking Status (0.03), Sex (0.02) and M (0.01).

(b) Tabular plus image features, ranked from highest to lowest: Image Features (0.46), HPV (0.21), ECOG PS (0.18), Age (0.12), Stage (0.10), Ds Site (0.09), Smoking PY (0.08), N (0.06), T (0.05), Path (0.05), Subsite (0.04), Sex (0.02), M (0.01) and Smoking Status (0.01).

Permutation feature importance analysis comparing the model trained on (a) tabular data only with the (b) proposed multimodal (tabular + image) model. The analysis highlights a fundamental shift in feature reliance. In the multimodal model, “image features” become dominant, while the importance of redundant tabular features like “stage”, “T” and “N” decreases. Complementary features such as “HPV status” and “ECOG PS” retain their significance.

For the model trained only on tabular data [Figure 3(a)], the feature importance largely aligns with clinical intuition. Clinical staging information (“Stage,” “N,” “T”) and the biological marker “HPV Status” are the most predictive features. Stage, T and N are the cornerstones of the TNM staging system, which is the universal standard for classifying the anatomical extent of cancer and is a primary determinant of prognosis. HPV status is a critical molecular biomarker, as HPV-associated oropharyngeal cancers have a distinct biology and a significantly better prognosis than HPV-negative cancers. ECOG Performance Status measures a patient’s functional well-being and ability to tolerate aggressive treatments.

However, a dramatic shift occurs in the multimodal model [Figure 3(b)]. The “Image Features,” extracted by the deep network, emerge as the single most important predictor. Consequently, the importance of the tabular features derived from imaging, such as “Stage,” “T,” and “N,” decreases substantially. This indicates that the model learns to extract richer, more nuanced prognostic information directly from the CT scans, making the manually annotated staging information partially redundant. Conversely, complementary, non-visual features like “HPV Status,” “ECOG PS,” and “Age” retain their high importance.

3.4.3 Ablation study on the depth of bi-stage adaptive instance normalization.

We also performed an ablation study on the 3D Swin Transformer backbone to compare three strategies for injecting tabular data via AdaIN:

All stages (proposed): AdaIN layers applied after all four transformer stages.
Late stages only: AdaIN layers applied only after the last two stages (stage 3 and stage 4).
Final stage only: an AdaIN layer applied only after the final stage (stage 4).

The results are reported in Table 4.

Table 4.

Ablation study on the depth of Bi-AdaIN integration for tabular data in the 3D Swin transformer backbone

Model configuration	MAE ($\downarrow$)	MSE ($\downarrow$)	Average BATE ($\downarrow$)
Bi-AdaIN (all stages)	1.62 $\pm$ 0.11	4.13 $\pm$ 0.24	0.10 $\pm$ 0.05
Bi-AdaIN (late stages)	1.65 $\pm$ 0.10	4.19 $\pm$ 0.20	0.19 $\pm$ 0.08
Bi-AdaIN (final stage)	1.69 $\pm$ 0.08	4.28 $\pm$ 0.26	0.31 $\pm$ 0.11

The findings confirm that while late-stage fusion is beneficial, deeper integration of tabular data across all stages performs best: the average BATE falls from 0.31 (final stage only) to 0.19 (late stages) and 0.10 (all stages), and MAE and MSE improve in the same order. Our rationale is that early fusion allows the network to learn a hierarchical co-adaptation of features. For instance, tabular variables like tumor stage (T) and nodal status (N) have direct visual correlates that are best captured in the early-to-mid layers of the network. By modulating these features from the first stage onward, Bi-AdaIN lets both modalities inform the representation throughout the entire feature extraction hierarchy, which in turn leads to a larger reduction in the causal effect bias (BATE).
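The three injection strategies compared in Table 4 can be illustrated with a small, framework-free sketch. It uses the generic AdaIN operation (Huang and Belongie, 2017), in which each channel is normalized and then rescaled and shifted by values predicted from the conditioning input. It is not the authors’ Bi-AdaIN implementation: the linear map `affine_from_tabular`, the parameter layout and the placeholder stages are all assumptions made for illustration.

```python
import math


def adain(channels, gamma, beta, eps=1e-5):
    """Adaptive instance normalization for a single sample.

    channels: list of channels, each a flat list of activations
    gamma, beta: per-channel scale and shift predicted from tabular data
    """
    out = []
    for values, g, b in zip(channels, gamma, beta):
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        std = math.sqrt(var + eps)
        out.append([g * (v - mean) / std + b for v in values])
    return out


def affine_from_tabular(tabular, weights, bias):
    """Hypothetical linear map from the tabular vector to per-channel values."""
    return [
        sum(w * t for w, t in zip(row, tabular)) + b
        for row, b in zip(weights, bias)
    ]


# Backbone stages (1 to 4) that receive AdaIN under each ablation setting.
INJECTION_STAGES = {
    "all_stages": {1, 2, 3, 4},
    "late_stages": {3, 4},
    "final_stage": {4},
}


def forward(features, tabular, stage_params, strategy):
    """Pass features through four placeholder stages, modulating the chosen ones."""
    for stage in (1, 2, 3, 4):
        # A real backbone stage would transform `features` here; the sketch
        # leaves them unchanged so that only the injection logic is shown.
        if stage in INJECTION_STAGES[strategy]:
            p = stage_params[stage]
            gamma = affine_from_tabular(tabular, p["w_gamma"], p["b_gamma"])
            beta = affine_from_tabular(tabular, p["w_beta"], p["b_beta"])
            features = adain(features, gamma, beta)
    return features


# Toy usage with made-up weights: two channels, two tabular variables.
params = {
    "w_gamma": [[2.0, 0.0], [0.0, 0.0]], "b_gamma": [0.0, 1.0],
    "w_beta": [[0.0, 0.0], [0.0, 0.0]], "b_beta": [0.5, -0.5],
}
stage_params = {stage: params for stage in (1, 2, 3, 4)}
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 10.0, 20.0, 30.0]]
modulated = forward(features, [1.0, 0.0], stage_params, "final_stage")
```

After modulation each channel has a mean equal to its predicted shift and a standard deviation close to its predicted scale, which is how the tabular variables steer the image features at every stage where AdaIN is applied.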

3.4.4 Sensitivity analysis of hyperparameter $λ$.

The hyperparameter $λ$ controls the tradeoff between the survival prediction loss and the disentanglement regularization loss. To analyze its impact, we trained our proposed “Bi-AdaIN + MI” model with $λ$ varied from 0.01 to 10.0, as shown in Figure 4. As $λ$ increases, the BATE consistently decreases, indicating more effective bias removal. Conversely, the predictive accuracy tends to degrade. We selected $λ = 1.0$ for the Swin transformer and $λ = 0.8$ for the ResNet18, as these values offered the best balance between low causal bias and high predictive performance on the validation set.
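The role of $λ$ and one possible way of choosing it can be sketched as follows. The text states only that the selected values offered the best balance on the validation set, so the selection rule below (lowest BATE among settings whose MAE stays within a tolerance of the best MAE) is an assumption, and the sweep values are made-up placeholders rather than measurements from the study.

```python
def total_loss(survival_loss, regularization_loss, lam):
    """Weighted objective: prediction term plus lam times the disentanglement term."""
    return survival_loss + lam * regularization_loss


def select_lambda(sweep, mae_tolerance):
    """Hypothetical rule: lowest BATE among settings whose validation MAE
    is within `mae_tolerance` of the best MAE observed in the sweep."""
    best_mae = min(run["mae"] for run in sweep)
    admissible = [run for run in sweep if run["mae"] <= best_mae + mae_tolerance]
    return min(admissible, key=lambda run: run["bate"])["lam"]


# Placeholder validation results (illustrative only, not from the paper).
toy_sweep = [
    {"lam": 0.01, "bate": 0.9, "mae": 1.00},
    {"lam": 0.1, "bate": 0.5, "mae": 1.01},
    {"lam": 1.0, "bate": 0.2, "mae": 1.03},
    {"lam": 10.0, "bate": 0.1, "mae": 1.20},
]

chosen = select_lambda(toy_sweep, mae_tolerance=0.05)
print(chosen)  # 1.0: the largest bias reduction that keeps MAE within tolerance
```

A looser tolerance moves the choice toward larger $λ$ and lower bias at the cost of predictive error, which mirrors the tradeoff described above.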

Figure 4.

Two side-by-side line charts, titled “Swin Transformer Sensitivity” and “ResNet18 Sensitivity,” plot causal effect bias (BATE, left axis) and predictive error (MAE, right axis) against the regularization parameter $λ$ on a log scale from 0.01 to 10. For the Swin transformer, BATE decreases from about 1.78 at $λ = 0.01$ to about 0.10 at $λ = 10$, while MAE stays near 1.71 at low $λ$ and rises after $λ = 1$ to about 1.86 at $λ = 10$; a vertical reference line marks $λ = 1$ as optimal. For ResNet18, BATE decreases from about 1.88 at $λ = 0.01$ to about 0.16 at $λ = 10$, while MAE increases steadily from about 1.76 to about 1.95; a vertical reference line marks $λ = 0.8$ as optimal.

Sensitivity analysis of the regularization parameter $λ$ on the validation set for the (left) Swin transformer model and (right) 3D ResNet18 model. As $λ$ increases, the causal bias (averaged BATE) decreases while the predictive error (MAE) tends to increase, illustrating the fundamental tradeoff.

4. Discussion

In this work, we developed and validated a deep causal learning framework to estimate individualized treatment effects for HNC patients using multimodal data. Our primary finding is that the synergistic combination of our proposed Bi-AdaIN fusion mechanism and a MI regularization strategy can significantly mitigate treatment selection bias, leading to more robust ITE predictions compared to conventional DL models and those using adversarial disentanglement.

This synergistic effect is vividly illustrated by our feature importance analysis (Figure 3). The analysis reveals that when the model gains access to the raw image data, the deep-learned “Image Features” become the dominant predictor. Consequently, the importance of tabular features that are themselves derived from imaging, such as “Stage,” “T” and “N,” is substantially diminished. This is not because they are clinically unimportant, but because their predictive information is now more richly and directly captured by the model from the raw pixels. This explicitly demonstrates the model’s ability to handle information redundancy.

Conversely, non-visual, complementary features like “HPV Status” and “ECOG PS” retain their high importance in the multimodal model. This confirms that our framework learns to intelligently weigh and integrate different sources of information, leveraging tabular data for systemic biological and performance status while relying on the imaging data for detailed morphological and anatomical assessment. This moves beyond simple correlations to model the complex, interdependent relationships required for accurate, personalized prognosis.

A key observation from our results is the tradeoff between predictive accuracy on factual data (MAE/MSE) and causal inference robustness (BATE). This outcome, while seemingly counterintuitive, is a strong indicator that our causal framework is operating correctly. It demonstrates that the baseline models were leveraging the spurious correlations inherent in the biased observational data to improve predictions. By forcing the model to learn treatment-invariant representations, we remove its reliance on these confounding factors, which, while slightly hurting its ability to predict a biased reality, vastly improves its capacity to generalize and make accurate counterfactual predictions – the ultimate goal of ITE estimation.

Our technical innovations directly address two fundamental challenges in this domain. First, the Bi-AdaIN module proved superior to simple concatenation for integrating clinical variables with 3D imaging data. By dynamically modulating feature maps at multiple stages of the network, Bi-AdaIN facilitates a more profound and interactive fusion, which translates to improved predictive performance as seen in our ablation studies. Second, our results confirm that MI regularization is a more stable and effective approach for disentanglement than adversarial training in this complex, multi-treatment, multimodal setting.

The clinical implications of this work are significant. A reliable ITE estimation tool could revolutionize treatment planning in oncology. For any given HNC patient, our model could provide clinicians with predicted outcomes for each available treatment option (e.g. “predicted survival with ChemoRT is 3 years, versus 2 years with RT alone”). This would empower clinicians to move beyond population-level guidelines and make more personalized, data-driven decisions. The stronger overall performance of the Swin transformer backbone also suggests that advanced architectures capable of capturing long-range spatial dependencies are crucial for extracting subtle prognostic biomarkers from medical images.

The emergence of image features as the dominant predictor suggests that the model extracts nuanced prognostic information (related to tumor morphology, infiltration and volume) directly from the CT scan, thereby capturing the information contained in T, N and stage in a more comprehensive, data-driven manner. The retained importance of complementary, non-visual features such as HPV Status (a systemic viral marker) and ECOG PS (a measure of patient fitness) in turn highlights the framework’s ability to integrate different, synergistic sources of information.

Despite these promising results, we acknowledge several limitations. First, our study is retrospective and relies on the RADCURE data set. While large, this data set lacks granularity on treatment protocols, such as specific radiation dosage or chemotherapy agents. The small sample size of the RT+EGFRI group also limits the statistical power of comparisons involving this treatment. Second, while our causal framework is principled, the feature extraction process within the deep neural network remains a “black box,” which can be a barrier to clinical trust. Finally, the model requires external validation on independent data sets to ensure its generalizability before any clinical application can be considered.

Future work will proceed along several key paths. To address model interpretability, we plan to integrate visualization techniques like SHapley Additive exPlanations (SHAP) score (Lundberg and Lee, 2017) and gradient-weighted class activation mapping (Grad-CAM) (He et al., 2019) to highlight the specific anatomical regions in the CT scans that are most influential in predicting the effects of different treatments. Furthermore, we aim to extend the framework to incorporate uncertainty quantification, providing confidence intervals alongside ITE predictions to better inform clinical decision-making. Finally, the ultimate goal is to validate this framework in prospective studies, assessing its real-world impact on treatment selection and patient outcomes.

5. Conclusion

In this work, we introduced a novel deep causal learning framework for estimating individualized treatment effects in HNC from multimodal observational data. By leveraging a proposed Bi-AdaIN method for effective data fusion and a robust MI regularization strategy for disentanglement, our model successfully overcomes the challenge of treatment selection bias. Our comprehensive experiments demonstrate that this approach yields significantly more accurate and reliable estimates of causal treatment effects than baseline and adversarial methods. This study represents a critical step toward developing trustworthy AI-driven decision support tools for personalized treatment planning in oncology, paving the way for more precise counterfactual reasoning in clinical practice.

Authors’ contribution

Yawen Wei and Zhen Li contributed equally.

References

Baid, U., Rane, S.U., Talbar, S., Gupta, S., Thakur, M.H., Moiyadi, A. and Mahajan, A. (2020), “Overall survival prediction in glioblastoma with radiomic features using machine learning”, Frontiers in Computational Neuroscience, Vol. 14, p. 61.

Bakas, S., Reyes, M., Jakab, A., Bauer, S., Rempfler, M., Crimi, A., Shinohara, R.T., Berger, C., Ha, S.M., Rozycki, M., et al. (2018), “Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment and overall survival prediction in the BRATS challenge”, arXiv preprint arXiv:1811.02629.

Belghazi, M.I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A. and Hjelm, R.D. (2018), “Mine: mutual information neural estimation”, arXiv preprint arXiv:1801.04062.

Bica, I., Alaa, A.M., Jordon, J. and van der Schaar, M. (2020), “Estimating counterfactual treatment outcomes over time through adversarially balanced representations”, arXiv preprint arXiv:2002.04083.

Cardoso, M.J., Li, W., Brown, R., Ma, N., Kerfoot, E., Wang, Y., Murrey, B., Myronenko, A., Zhao, C., Yang, D., et al. (2022), “Monai: an open-source framework for deep learning in healthcare”, arXiv preprint arXiv:2211.02701.

Chaddad, A., Desrosiers, C., Hassan, L. and Tanougast, C. (2016), “A quantitative study of shape descriptors from glioblastoma multiforme phenotypes for predicting survival outcome”, The British Journal of Radiology, Vol. 89 No. 1068, p. 20160575.

Chang, K., Zhang, B., Guo, X., Zong, M., Rahman, R., Sanchez, D., Winder, N., Reardon, D.A., Zhao, B., Wen, P.Y. and Huang, R.Y. (2016), “Multimodal imaging patterns predict survival in recurrent glioblastoma patients treated with bevacizumab”, Neuro-oncology, Vol. 18 No. 12, pp. 1680-1687.

Cox, D.R. (1972), “Regression models and life-tables”, Journal of the Royal Statistical Society: Series B (Methodological), Vol. 34 No. 2, pp. 187-202.

Curth, A. and van der Schaar, M. (2021), “On inductive biases for heterogeneous treatment effect estimation”, Advances in Neural Information Processing Systems, Vol. 34, pp. 15883-15894.

Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M. and Lempitsky, V. (2016), “Domain-adversarial training of neural networks”, Journal of Machine Learning Research, Vol. 17 No. 59, pp. 1-35.

Hassanpour, N. and Greiner, R. (2019), “Learning disentangled representations for counterfactual regression”, in International Conference on Learning Representations.

He, T., Guo, J., Chen, N., Xu, X., Wang, Z., Fu, K., Liu, L. and Yi, Z. (2019), “MediMLP: using grad-cam to extract crucial variables for lung cancer postoperative complication prediction”, IEEE Journal of Biomedical and Health Informatics, Vol. 24 No. 6, pp. 1762-1771.

Huang, H., Zhang, W., Fang, Y., Hong, J., Su, S. and Lai, X. (2021), “Overall survival prediction for gliomas using a novel compound approach”, Frontiers in Oncology, Vol. 11, p. 724191.

Huang, X. and Belongie, S. (2017), “Arbitrary style transfer in real-time with adaptive instance normalization”, in Proceedings of the IEEE International Conference on Computer Vision, pp. 1501-1510.

Johansson, F., Shalit, U. and Sontag, D. (2016), “Learning representations for counterfactual inference”, in International Conference on Machine Learning, PMLR, pp. 3020-3029.

Karras, T., Laine, S. and Aila, T. (2019), “A style-based generator architecture for generative adversarial networks”, in IEEE/CVF Computer Vision and Pattern Recognition Conference, pp. 4401-4410.

Kaur, G., Rana, P.S. and Arora, V. (2022), “State-of-the-art techniques using pre-operative brain MRI scans for survival prediction of glioblastoma multiforme patients and future research directions”, Clinical and Translational Imaging, Vol. 10 No. 4, pp. 355-389.

Lacroix, M., Abi-Said, D., Fourney, D.R., Gokaslan, Z.L., Shi, W., DeMonte, F., Lang, F.F., McCutcheon, I.E., Hassenbusch, S.J., Holland, E., Hess, K., Michael, C., Miller, D. and Sawaya, R. (2001), “A multivariate analysis of 416 patients with glioblastoma multiforme: prognosis, extent of resection and survival”, Journal of Neurosurgery, Vol. 95 No. 2, pp. 190-198.

Lei, L. and Candès, E.J. (2021), “Conformal inference of counterfactuals and individual treatment effects”, Journal of the Royal Statistical Society: Series B (Statistical Methodology), Vol. 83 No. 5, pp. 911-938.

Liu, X., Yoo, C., Xing, F., Oh, H., El Fakhri, G., Kang, J. and Woo, J. (2022), “Deep unsupervised domain adaptation: a review of recent advances and perspectives”, APSIPA Transactions on Signal and Information Processing, Vol. 11 No. 1.

Liu, X., Jin, L., Han, X. and You, J. (2021b), “Mutual information regularized identity-aware facial expression recognition in compressed video”, Pattern Recognition, Vol. 119, p. 108105.

Liu, X., Shusharina, N., Shih, H.A., Kuo, C.-C.J., El Fakhri, G. and Woo, J. (2024), “Treatment-wise glioblastoma survival inference with multi-parametric preoperative MRI”, in Medical Imaging 2024: Computer-Aided Diagnosis, Vol. 12927, SPIE, pp. 584-588.

Liu, X., Che, T., Lu, Y., Yang, C., Li, S. and You, J. (2020), “Auto3D: novel view synthesis through unsupervisely learned variational viewpoint and global 3D representation”, in European Conference on Computer Vision, Springer, pp. 52-71.

Liu, X., Chao, Y., You, J.J., Kuo, C.-C.J. and Vijayakumar, B. (2021a), “Mutual information regularized feature-level Frankenstein for discriminative recognition”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44 No. 9, pp. 5243-5260.

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S. and Guo, B. (2021c), “Swin transformer: hierarchical vision transformer using shifted windows”, in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012-10022.

Lundberg, S.M. and Lee, S.-I. (2017), “A unified approach to interpreting model predictions”, Advances in Neural Information Processing Systems, Vol. 30.

Nie, D., Lu, J., Zhang, H., Adeli, E., Wang, J., Yu, Z., Liu, L., Wang, Q., Wu, J. and Shen, D. (2019), “Multi-channel 3D deep feature learning for survival time prediction of brain tumor patients using multi-modal neuroimages”, Scientific Reports, Vol. 9 No. 1, pp. 1-14.

Pearl, J. (2001), “Causal and counterfactual inference in the health sciences: a conceptual introduction”, Health Services and Outcomes Research Methodology, Vol. 2 Nos 3-4, pp. 189-220.

Prosperi, M., Guo, Y., Sperrin, M., Koopman, J.S., Min, J.S., He, X., Rich, S., Wang, M., Buchan, I.E. and Bian, J. (2020), “Causal inference and counterfactual prediction in machine learning for actionable healthcare”, Nature Machine Intelligence, Vol. 2 No. 7, pp. 369-375.

Roy, P.C. and Boddeti, V.N. (2019), “Mitigating information leakage in image representations: a maximum entropy approach”, in IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2586-2594.

Shalit, U., Johansson, F.D. and Sontag, D. (2017), “Estimating individual treatment effect: generalization bounds and algorithms”, in International Conference on Machine Learning, PMLR, pp. 3076-3085.

Shi, C., Blei, D. and Veitch, V. (2019), “Adapting neural networks for the estimation of treatment effects”, Advances in Neural Information Processing Systems, Vol. 32.

Turnbull, R. (2022), “Using a 3D ResNet for detecting the presence and severity of COVID-19 from CT scans”, in European Conference on Computer Vision, Springer, pp. 663-676.

Welch, M.L., Kim, S., Hope, A.J., Huang, S.H., Lu, Z., Marsilla, J., Kazmierski, M., Rey-McIntyre, K., Patel, T., O’Sullivan, B., Waldron, J., Bratman, S., Haibe-Kains, B. and Tadic, T. (2024), “RADCURE: an open-source head and neck cancer CT dataset for clinical radiation therapy insights”, Medical Physics, Vol. 51 No. 4, pp. 3101-3109.

Xu, Y. and Yadlowsky, S. (2022), “Calibration error for heterogeneous treatment effects”, in International Conference on Artificial Intelligence and Statistics, PMLR, pp. 9280-9303.

Yoon, J., Jordon, J. and Van Der Schaar, M. (2018), “GANITE: estimation of individualized treatment effects using generative adversarial nets”, in International Conference on Learning Representations.

2026

Yawen Wei, Zhen Li, Jonghye Woo, Jinsong Ouyang, Georges El Fakhri and Xiaofeng Liu

Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence are set out in the CC BY 4.0 licence.
