Image-text retrieval via green explainable multi-modal alignment (GEMMA)

Yang, Tsung-Shan; Wang, Yun-Cheng; Wei, Chengwei; You, Suya; Kuo, C.-C. Jay

doi:10.1561/116.20250009

Image-text retrieval is a fundamental task in image understanding. Given an image or a text query, the algorithm fetches the most relevant counterpart in the other modality. Large visual-language models are trained on paired image and text data to extract joint representations. However, they are computationally expensive and not explainable regarding how the data from different modalities are aligned. To this end, we propose an efficient and stage-wise alignment for image and text representations, called the Green Explainable Multi-Modal Alignment (GEMMA). GEMMA is computationally efficient: it reduces the trainable parameters to 3% of those required to fine-tune all image and text encoders. The intermediate clustering results demonstrate the explainability of the alignment mechanism in our model. Experiments show that GEMMA outperforms state-of-the-art retrieval models in text-to-image and image-to-text retrieval tasks on the Flickr30k and MS-COCO datasets. GEMMA can also be generalized to unseen image-text pairs from separately pre-trained visual and text encoders.

1. Introduction

Image-text retrieval links textual and visual information and is a foundational image understanding application in computer vision. The goal of the task is to link textual descriptions and pixels in image arrays that represent similar concepts or semantics. The image-text retrieval task aims to find the most relevant information from the candidate sets in the counterpart modality. That is, when an image is given, the model needs to extract related captions by ranking them with higher scores and vice versa. Figure 1 shows an example of an image and its paired textual descriptions.

Figure 1.

A construction scene shows a worker raised on a blue lifting platform beside a tall, worn building, with demolition or repair activity underway.

The image shows the side of a tall, damaged building with peeling exterior surfaces and exposed concrete. In the foreground, a blue industrial lifting platform extends upward from the ground to the building wall. A single worker stands on the raised platform close to the structure, appearing to carry out construction or maintenance work. The platform arm reaches diagonally upward, positioning the worker several storeys above ground level. Adjacent buildings with windows are visible in the background, indicating an urban construction or renovation site. The scene focuses on elevated work using heavy machinery alongside an ageing building façade.

An example of image-to-text retrieval. Given an image, the model retrieves the paired captions from the candidate set.

Image-text retrieval can provide the information for visual-textual applications, including visual question answering (Nam et al., 2017), image captioning (Anderson et al., 2018), visual grounding (Wang et al., 2019), and visual common sense reasoning (Zellers et al., 2019). With the thriving development of deep learning and computational resources, neural networks dominate the current research trend. Jointly trained neural network-based image and text encoders transform the input text and image into vectors in a common latent space. The two encoders are trained under metric learning schemes, which compare the cosine similarity between the paired and unpaired image and text samples. For example, an intuitive solution to representing the image and text in a joint latent space is optimizing two encoder models by minimizing contrastive loss (Chen et al., 2020a, 2020b). The loss function can gather the paired information but repel the unpaired data in the latent space.

Although end-to-end solutions perform astonishingly, explainability is crucial for image-understanding applications. In the multi-modal application scenario, humans expect a complete reasoning procedure instead of a magic answer from the model. However, neural networks obscure the reasoning process within the joint latent space through complex floating-point operations, e.g., calculating cosine similarities between vectors. The nonlinearities in the model make the whole inference process a black box. To this end, we propose a multi-stage methodology, dividing the retrieval process into three stages:

Global alignment;
Image cluster alignment; and
Text cluster alignment.

Each alignment stage consists of three modules:

alignment;
subdomain clustering; and
subdomain feature selection.

More fine-grained information is revealed in the module’s feature selection process.

The availability of paired image and text data is another challenge when training multi-modal models. Most datasets contain only high-quality data in a single modality. For example, ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014) contain diverse images but lack sentence-level textual descriptions associated with the images. In contrast, textual datasets such as the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) contain well-structured paragraphs, yet without corresponding images. Collecting paired images and captions is expensive and labor-intensive. Due to the subjectiveness of caption labels, it is impractical to assume consistent captions for one image. However, the quality of collected pairs in both domains significantly impacts the performance of the jointly trained multi-modal encoders. To relieve the data scarcity, we adopt pre-trained encoders in the image and text domains instead of jointly training text and image encoders from scratch. Then, we propose a green learning alignment process to deal with the lack of paired information.

We propose a new Green Explainable Multi-Modal Alignment (GEMMA) scheme to deal with paired data scarcity and explainability. The method utilizes the frozen image and text encoder models and aligns the representations using the proposed alignment process. Our contributions are summarized as follows:

We reduce the number of parameters to around 3% compared to fine-tuning the whole encoders. Instead of fine-tuning the pre-trained encoders, we propose an alignment scheme from two pretrained encoders, making the pipeline computationally efficient.
In order to achieve pipeline transparency, we narrow the set of candidates in a stage-wise manner. The modular design divides the entire dataset into subsets. We can statistically understand the retrieval process and the crucial tokens by the feature selection modules in the sub-domain clustering.
We provide bidirectional retrieval in the proposed pipeline. The alignment modules consist of linear projections without incorporating any nonlinearity. Thus, the alignment process can be easily reversed from one to another.
We conduct extensive experiments on two public multi-modal datasets. The results demonstrate that our method can significantly improve the performance in text-to-image retrieval.

2. Related work

The existing methods can be classified into 1) cross-modal retrieval and 2) visual-language models (VLMs). Cross-modal models consist of a convolutional neural network (CNN) to extract features from images and a recurrent neural network (RNN) to process text data. The joint representations of the convolutional and recurrent backbones are optimized by metric learning. On the other hand, VLMs employ Large Language Models (LLMs) that work in tandem with Vision Transformers (ViTs) for optimal performance. The VLM optimization can be performed by contrastive learning, masked token filling, and generative matching.

2.1 Cross-modal retrieval

The cross-modal retrieval algorithms consist of representation matching and feature extraction. Metric learning schemes measure the similarity between the samples and predict the matching scores. Hadsell et al. (2006) propose the idea of contrastive learning. The loss formulation aims to reduce the distance in the latent space for similar samples and to increase the distance for different samples. Triplet loss (Schroff et al., 2015), lifted structure loss (Oh Song et al., 2016), and N-Pair loss (Sohn, 2016) construct the joint latent space by sampling training data. The sampling schemes form positive and negative pairs, and the losses gather the positive pairs while repelling the negative ones. Thus, optimization can be improved by the hard sampling process (Robinson et al., 2020; Xuan et al., 2020). With the thriving development of self-supervised applications, SimCSE (Gao et al., 2021) and SimCLR (Chen et al., 2020a, 2020b) provide metrics to reinforce the representations. The losses map the representations of the original and augmented images (crop, rotate, color distort, etc.) onto the same latent space.

Frome et al. (2013) first proposed the concept of joint image and text embedding in the ImageNet (Deng et al., 2009) classification. The pipeline utilizes the textual information from the label to construct a lookup table from the nearby concepts as the target embedding, leading to a hierarchical classification. Zheng et al. (2020) adopt deep CNNs as the basis for extracting the image and text features. The instance loss optimizes the two feature extractors, which can project the representations from different modalities onto the joint latent space. Lee et al. (2018) utilize a bottom-up attention object detector (Anderson et al., 2018) to obtain semantic representations of images and to perform word-level matching in the captions. The bottom-up detector can provide the modifiers with the noun, matching the corresponding sentence with the details.

Liu et al. (2020) formulate the information as a graph and adopt structural matching to retrieve the closest subgraph. The object detector obtains the visual graph. The node features are the region of interest (ROI) feature of the model, and the vertices are constructed by the Multi-Layer Perceptron (MLP). The textual graph is the Part-Of-Speech (POS) prediction from the Gated Recurrent Unit (GRU) Networks. Wang et al. (2018) adopt instance-wise matching for the subgraphs. The overall matching score aggregates the partial graph similarities in a bottom-up manner.

To further exploit the information in the query image, Cheng et al. (2022) adopt an optical character recognition (OCR) module to extract semantic information such as text embeddings of the scene. The model fuses the image token and the scene text for the joint representation. Diao et al. (2023) build the image tokens from ROI by the object detector and bidirectional GRU textual tokens. The cross-modal attention module is used for the token-wise matching process. Jawade et al. (2023) construct the visual and textual tokens from the pre-trained model. However, the research merges the cross-modal information by cross-attention (Vaswani et al., 2017) modules and manages the retrieval task with the transformer structures.

2.2 Visual-language model

Transformers (Vaswani et al., 2017) have achieved significant results in natural language processing and computer vision tasks. The image-text encoders can share similar architectures. Wang et al. (2022) crop the input images into patches and use the patches as visual tokens to formulate the images as a novel language. The jointly trained visual and text encoders (Chen et al., 2023; Zeng et al., 2023) are optimized end-to-end. Visual language models (VLMs) can be categorized into three families (Pennington et al., 2014) by the optimization process:

contrast-based VLMs;
VLMs with masking objectives; and
generative-based VLMs.

Contrast-based VLMs (Radford et al., 2021) are trained on paired multi-modal data with a contrastive loss as the objective. VLMs with masking objectives (Kim et al., 2021; Kwon et al., 2022; Singh et al., 2022) are obtained by self-supervised learning, where the model needs to predict the masked visual and textual tokens. Generative-based VLMs (Liu et al., 2024; Yu et al., 2022, 2023) take advantage of the great success of AI chatbots and are trained on visual question answering, image captioning, and other downstream tasks.

CLIP (Radford et al., 2021) demonstrates impressive visual representations trained together with paired text descriptions. The transformer encoder takes the nonoverlapping patches and the words as input and utilizes the pooled encoded tokens to represent the images and sentences. The model uses a contrastive learning scheme to project image and text representations onto a shared latent space. This shared space allows for a better understanding of the relationship between the two modalities. The dual (image-text) encoder architecture is prevalent in multi-modal applications.

Kim et al. (2021) utilize the masked tokens in self-supervised learning in transformers (Devlin, 2018) for natural language processing. The model takes tokenized sentences and image patches as input. Training tasks include paired classification and masked token filling. Kwon et al. (2022) propose the uniform transformer with two pre-training objectives, including masked vision and language modeling, and multi-modal alignment. Singh et al. (2022) propose the multi-modal encoder with visual and text encoders. The multi-modal encoder aligns the features from the two encoders with global contrastive learning and masked multi-modal modeling.

In addition to representation learning, the large language model provides incredible performance on text generation tasks. Yu et al. (2022) optimize the visual encoder with image captioning as a downstream task. With a jointly trained visual encoder and language decoder, the model provides unified text and visual representations for the transformer. Yu et al. (2023) employ the diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015) for image generation and reinforce cross-modal representations. Liu et al. (2024) combine the visual encoder with the LLM. The given image tokens are used as instructions for the detailed LLM responses. However, the training process requires large-scale paired images and texts, which is computationally expensive.

Despite achieving state-of-the-art performance, large visual-language pre-trained models still have shortcomings in inference. The matching process is not transparent, and humans cannot understand the decision-making within fully connected layers because they lack semantic meanings. In addition to the lack of explainability, the fine-tuning process is computationally expensive. These models have billions of trainable parameters, and high-quality image-text pairs are required for tuning.

2.3 Green learning

To handle the computationally intensive fine-tuning process and expand the image-text encoder using unpaired data, we introduce the Green Learning Alignment algorithm, which uses separately pre-trained image and text encoders. The idea of Green Learning was proposed by Kuo and Madni (2023) and aims to reduce the computational cost of backpropagation while providing a theoretically explainable learning process for various applications. The modular designs can divide the problem into subproblems, which can be solved using transparent algorithms.

3. Proposed GEMMA method

The GEMMA algorithm can be divided into three stages:

Global alignment;
Image cluster alignment; and
Text cluster alignment.

We adopt the multi-stage approach to approximate the complicated decision-making process rather than building a single large visual-language foundation model from scratch to ensure model efficiency. Starting from the pre-trained image and text feature extractors, we keep the pre-trained model frozen to maintain its ability to generalize with unpaired data in the matching process. We align the representations by training additional single-layer adapter matrices to project the representations onto the joint latent space. Specifically, the alignment process consists of three modules:

alignment;
clustering in subdomains; and
selection of subdomain features, where clustering and feature selection are performed in both the image and text domains, as shown in Figure 2.

Figure 2.

A multi stage framework shows global alignment of paired image text data, followed by image domain and text domain clustering, feature selection, and subgroup alignment.

The diagram presents a structured framework divided into paired data, image domain, and text domain sections. Stage 1 shows global alignment of paired images and captions. In the image domain, stage 2a clusters similar images, stage 2b selects discriminant tokens, and stage 2c performs image domain subgroup alignment. In the text domain, captions are clustered in stage 3a, keywords are selected in stage 3b, and subgroup alignment is performed in stage 3c. Each stage is shown in separate labelled panels, with icons and example images or documents indicating the processing flow.

The overall design of the GEMMA alignment algorithm. The first stage is the global alignment. The second and third stages include fine-grained clustering and feature selection in the image and text domains, respectively.

3.1 Alignment

In the alignment process, we do not fine-tune the pre-trained encoders. We train a lightweight linear transformation in the visual and textual domains to align the two representation spaces. The alignment module is illustrated in Figure 3. The visual and text embeddings can be formulated as:

$$\begin{array}{l} e_{vis} = F(\mathrm{Image}) \in \mathbb{R}^{d_{vis}} \\ e_{txt} = G(\mathrm{Caption}) \in \mathbb{R}^{d_{txt}}, \end{array}$$

(1)

Figure 3.

A dual stream model shows visual and text embeddings aligned into joint features, with inversion modules and reconstruction, cross reconstruction, and contrastive losses.

The diagram shows a dual pathway architecture for visual and text data. On the left, visual embedding d vis and text embedding d txt feed into visual alignment and text alignment modules, which map both into joint features d joint on the right. From the joint visual feature, a reconstruction loss connects back through visual inversion to the visual embedding. From the joint text feature, a reconstruction loss connects back through text inversion to the text embedding. Cross text inversion and cross visual inversion connect joint features back to the opposite modality, forming a cross reconstruction loss between visual and text embeddings. A contrastive loss links the joint visual feature and joint text feature. Blue boxes represent features, yellow boxes represent alignment modules, and red boxes represent trainable inversion parameters.

The illustration of the alignment process. The blue boxes are the features extracted by the frozen encoders. The orange boxes are the trainable transformation matrices. The red boxes are the auxiliary matrices for constraining the representations in the joint space

where $e_{vis}, e_{txt}$ are the image and text embeddings, $F, G$ are the frozen image and text encoder models, and $d_{vis}, d_{txt}$ are the dimensions of the image and text representations. With the deterministic representations, the matching process can be denoted as:

$$\mathrm{sim}(A e_{vis}, B e_{txt}) = \mathrm{sim}(z_{vis}, z_{txt}),$$

(2)

where $A \in \mathbb{R}^{d_{joint} \times d_{vis}}$ and $B \in \mathbb{R}^{d_{joint} \times d_{txt}}$ represent the trainable image-text alignment matrices, $z \in \mathbb{R}^{d_{joint}}$ represents the vector in the joint space, and $\mathrm{sim}(\cdot, \cdot)$ represents the similarity metric. We adopt cosine similarity as the similarity metric, namely $\mathrm{sim}(u, v) = \frac{u \cdot v}{\| u \| \| v \|}$. We can further optimize the trainable parameters with the contrastive learning loss function (Chen et al., 2020a, 2020b):

$$L_{con} = -\log \frac{\exp(\mathrm{sim}(z_{i}, z_{j}) / \tau)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_{i}, z_{k}) / \tau)}.$$

(3)

Here, $(i, j)$ denotes the paired image and sentence in the sampled batch, $N$ denotes the batch size, and $\tau \in \mathbb{R}$ denotes the temperature hyperparameter. $\mathbb{1}_{[k \neq i]} \in \{0, 1\}$ is an indicator function whose value is one when $k \neq i$ and zero otherwise. The objective function maximizes the similarity of relevant image-text pairs while preventing negative image-text pairs from being embedded closely in the latent space.
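The linear alignment and the contrastive loss can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names (`align`, `contrastive_loss`), the default temperature, and the choice to stack the image and caption features into one batch of 2n anchors (the SimCLR convention, so that the sum over $k \neq i$ covers every other sample in the batch) are assumptions.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def align(e, W):
    """Project embeddings e of shape (n, d_in) with a matrix W of shape (d_joint, d_in)."""
    return e @ W.T

def contrastive_loss(z_vis, z_txt, tau=0.07):
    """Contrastive loss averaged over all 2n anchors (n images and n captions).

    Row i of z_vis is assumed to be paired with row i of z_txt.
    """
    n = z_vis.shape[0]
    z = l2_normalize(np.concatenate([z_vis, z_txt], axis=0))   # (2n, d_joint)
    logits = (z @ z.T) / tau                                    # cosine similarity / tau
    np.fill_diagonal(logits, -np.inf)                           # indicator: drop k == i
    # Index of the positive partner of each anchor in the stacked batch.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # Numerically stable log-sum-exp over the remaining samples.
    row_max = logits.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_num = logits[np.arange(2 * n), pos]
    return float(np.mean(log_denom - log_num))

# Toy usage with random stand-ins for the frozen encoder outputs.
rng = np.random.default_rng(0)
e_vis, e_txt = rng.normal(size=(4, 6)), rng.normal(size=(4, 5))
A, B = rng.normal(size=(3, 6)), rng.normal(size=(3, 5))        # d_joint = 3
loss = contrastive_loss(align(e_vis, A), align(e_txt, B))
```

Only `A` and `B` would be updated by an optimizer in such a setup; the encoder outputs `e_vis` and `e_txt` stay fixed.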

Hence, the problem can be formulated as an optimization problem, and all transformations are linear. We can define the inverse projection in the joint latent space without nonlinearities:

$$\begin{array}{l} e_{vis} = A^{-1} z_{vis} \\ e_{txt} = B^{-1} z_{txt}, \end{array}$$

(4)

where $A^{-1} \in \mathbb{R}^{d_{vis} \times d_{joint}}$ and $B^{-1} \in \mathbb{R}^{d_{txt} \times d_{joint}}$ represent the inverse transformations from the joint space to the original image and text representations. We define the reconstruction loss for both the image and text modalities:

$$L_{recon} = \| A^{-1} z_{vis} - e_{vis} \|_{2} + \| B^{-1} z_{txt} - e_{txt} \|_{2}.$$

(5)

Furthermore, we use the auxiliary matrices to constrain the joint representations and define the loss of cross-modality reconstruction as:

$$L_{cross\text{-}recon} = \| C z_{txt} - e_{vis} \|_{2} + \| D z_{vis} - e_{txt} \|_{2},$$

(6)

where $C \in \mathbb{R}^{d_{vis} \times d_{joint}}$ and $D \in \mathbb{R}^{d_{txt} \times d_{joint}}$ are the auxiliary transformation matrices from the joint space onto the image and text modalities, respectively. In addition, $z_{vis}$ and $z_{txt}$ are obtained from the corresponding paired image and caption data, $e_{vis}$ and $e_{txt}$. However, the $C$ and $D$ matrices are not used during inference. The alignment process is a linear transformation carried out by matrices $A$ and $B$. The objective function can be written as:

$$L = \alpha L_{con} + \beta L_{recon} + \gamma L_{cross\text{-}recon},$$

(7)

where $\alpha$, $\beta$, and $\gamma \in \mathbb{R}$ represent hyperparameters in training. Linear alignment provides an invertible transformation from the image-text modality to the joint latent space and vice versa. However, the single-layered alignment is too simple to match all the samples. Thus, we cluster the data to form sub-datasets and utilize the stage-wise alignments for the detailed decision.
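The reconstruction terms and the weighted objective can be sketched as below. The sketch makes several assumptions that the text does not fix: the inverse maps are held as separate matrices (`A_inv`, `B_inv`) rather than computed from $A$ and $B$, each loss is averaged over the batch, and the default weights of 1.0 are placeholders. All function names are hypothetical.

```python
import numpy as np

def recon_loss(z_vis, z_txt, e_vis, e_txt, A_inv, B_inv):
    """Reconstruction loss: map joint features back to their own modality."""
    r_vis = np.linalg.norm(z_vis @ A_inv.T - e_vis, axis=1)
    r_txt = np.linalg.norm(z_txt @ B_inv.T - e_txt, axis=1)
    return float(np.mean(r_vis + r_txt))

def cross_recon_loss(z_vis, z_txt, e_vis, e_txt, C, D):
    """Cross-modality reconstruction: map joint features to the opposite modality."""
    c_vis = np.linalg.norm(z_txt @ C.T - e_vis, axis=1)   # caption feature -> image embedding
    c_txt = np.linalg.norm(z_vis @ D.T - e_txt, axis=1)   # image feature -> caption embedding
    return float(np.mean(c_vis + c_txt))

def total_loss(l_con, l_recon, l_cross, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the contrastive, reconstruction, and cross-reconstruction terms."""
    return alpha * l_con + beta * l_recon + gamma * l_cross
```

When the joint dimension is at least as large as the embedding dimension and the alignment matrix has full column rank, a pseudo-inverse (`np.linalg.pinv`) reconstructs the embedding exactly, which is one way to sanity-check the reconstruction term.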

3.2 Sub-domain clustering

With the alignment process, we can find similar representations by linear transformations. However, the transformation can only take global representations, which means that images or captions are represented as $d_{vis}$- or $d_{txt}$-dimensional vectors. The image and sentence representations are the pooled output of the tokens in the prevailing transformer models. It can be inferred from previous research that fine-grained information is also crucial in information-matching tasks.

Due to the complexity of the fine-grained token representations, it is challenging to train the token-wise alignment in a brute-force manner. Thus, we adopt the clustering algorithms and use the clustering results to obtain crucial tokens. The crucial token selection will be introduced in Section 3.3. We can reduce the feature dimension from the number of tokens and perform a second-stage alignment.

We adopt frequency analysis and statistical approaches to construct a transparent and human-sensible intermediate structure. The clustering is conducted through (1) concept aggregation and (2) representation aggregation.

3.2.1 Concept aggregation.

We extract the concrete concepts for the candidate sentences by the Part-of-speech (POS) tagger (Wei et al., 2024). We collect the nouns as anchors and calculate the Term Frequency-Inverse Document Frequency (TF-IDF) to select the representative terms. As shown in Figure 4, the concepts lie in a long-tailed distribution, leading to a biased probability estimation. Hence, we aggregate the high-frequency terms based on the detector results and divide the candidate set into subsets for better-detailed alignments.

Figure 4.

A bar chart shows concept frequency by label, with a steep decline from the highest values and the top 100 concepts highlighted on the left.

The bar chart titled concept frequency plots labels on the horizontal axis and frequencies on the vertical axis, ranging from 0 to about 2000. Bars are ordered from highest to lowest frequency, forming a long tail distribution. The leftmost section is enclosed by a highlighted box and labelled top 100, showing the most frequent concepts. These bars start close to 2000 and decrease rapidly to around 1000 within the highlighted region. Beyond the top 100, frequencies continue to decline gradually across many labels, with most concepts appearing far less often than the highest ranked ones.

The frequency bar chart of the extracted corpus concepts. The top twenty concepts and the corresponding counts are (‘man’, 36743), (‘woman’, 23845), (‘people’, 12810), (‘shirt’, 12743), (‘girl’, 10035), (‘dog’, 10030), (‘boy’, 9393), (‘men’, 8005), (‘child’, 7746), (‘street’, 7435), (‘group’, 6959), (‘front’, 6857), (‘water’, 5489), (‘hat’, 4075), (‘person’, 3810), (‘ball’, 3679), (‘jacket’, 3365), (‘building’, 3334), (‘hand’, 3113), and (‘player’, 3099).

We construct the co-occurrence matrix of the POS tagging and object detection results in the training set. As shown in Figure 5, concepts have a significant relationship with detection results. Hence, we can group the concepts tagged with POS with the probability conditional on the detection results. To visualize the physical meaning of the clusters, we can use the word clouds to show the high-frequency concepts in each cluster, shown in Figure 6.
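A co-occurrence matrix of this kind, and the conditional probabilities derived from it, can be sketched as follows. The input format (one list of concept indices and one list of detected object indices per training sample), the function names, and the rule that assigns each concept to its most strongly associated detector class are illustrative assumptions, not details given in the text.

```python
import numpy as np

def cooccurrence(concept_ids, object_ids, n_concepts, n_objects):
    """Count samples in which concept c (from captions) and object o (from the detector) co-occur."""
    counts = np.zeros((n_concepts, n_objects), dtype=np.int64)
    for concepts, objects in zip(concept_ids, object_ids):
        for c in set(concepts):
            for o in set(objects):
                counts[c, o] += 1
    return counts

def concept_given_object(counts):
    """Normalize each column of the count matrix: counts[c, o] / sum over c of counts[c, o]."""
    col_sum = counts.sum(axis=0, keepdims=True)
    return counts / np.maximum(col_sum, 1)      # guard against objects that never occur

def group_concepts(counts):
    """Assumed grouping rule: assign each concept to the object with the highest conditional value."""
    return np.argmax(concept_given_object(counts), axis=1)

# Toy usage: 3 samples, 3 concepts, 2 detector classes.
counts = cooccurrence([[0, 1], [0], [2]], [[0], [0, 1], [1]], n_concepts=3, n_objects=2)
groups = group_concepts(counts)
```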

Figure 5.

A heatmap shows a co occurrence matrix between detector objects and P O S concepts, with sparse high value points across a largely low intensity background.

The heatmap titled co occurrence matrix plots detector objects on the horizontal axis from 0 to about 80 and P O S concepts on the vertical axis from 0 to about 90. Most cells show very low co occurrence values, forming a dark background. Scattered brighter cells appear at specific intersections, indicating higher co occurrence between certain detector objects and P O S concepts. Vertical streaks at some detector indices suggest repeated associations across multiple P O S concepts. The overall pattern shows sparse but structured co occurrence rather than uniform distribution.

The co-occurrence matrix of POS tagging concepts and the detection results. The x-axis shows the 80 object classes of the detector pre-trained on the MS-COCO (Lin et al., 2014) dataset. The y-axis shows the top 100 concepts from the POS tagger.

Figure 6.

A word cloud shows clusters of frequently occurring visual concepts such as man, street, boat, bike, ball, dog, table, and glass, each surrounded by related terms.

The word cloud presents several clusters of concepts grouped by co occurrence. One cluster centres on man, with related words including woman, people, boy, girl, and shirt. Another cluster focuses on street, surrounded by person, city, building, road, park, car, and sidewalk. A boat cluster includes river, wave, surf, surfer, dock, canoe, wetsuit, and fishing. A bike cluster contains bicycle, dirt, race, rider, helmet, track, and motorcycle. A ball cluster groups player, soccer, basketball, baseball, football, game, team, and uniform. A dog cluster includes field, beach, grass, sand, toy, snow, and jump. A table cluster shows baby, chair, pool, band, stage, room, and microphone. A glass cluster includes food, drink, kitchen, bar, cup, bottle, fruit, beer, and apron.

The visualization results of the clustering. The font size denotes the frequency of the word in the corpus

Figure 7.

A scatter plot shows data points along a feature dimension, divided by multiple dashed partition points, with a central solid line marking the optimal partition.

The diagram titled optimal partition shows circular data points distributed along a horizontal feature dimension. Vertical dashed lines indicate candidate partition points that divide the feature space into segments. A single solid vertical line at the centre marks the selected optimal partition. Data points appear on both sides of each partition, with clusters forming within segments. The layout illustrates how different partition choices split the data, highlighting the central partition as the optimal separation among the available partition points.

Visualization of DFT. Red and orange dots represent the binary labels. The partition metric is the weighted sum of the left and right binary cross-entropy. Dashed lines denote the potential partition points

3.2.2 Representation aggregation.

The clustering is based on the K-means algorithm. To ensure consistency between alignment and clustering, we use the squared $\ell_{2}$ distance between normalized representations as the distance metric:

$$\| \tilde{u} - \tilde{v} \|_{2}^{2} = \| \tilde{u} \|_{2}^{2} + \| \tilde{v} \|_{2}^{2} - 2 \tilde{u}^{\top} \tilde{v} = 2 - 2\, \mathrm{sim}(u, v),$$

(8)

where $\tilde{u}$ and $\tilde{v}$ are the normalized representations, namely $\tilde{u} = \frac{u}{\| u \|}$. The clustering probability can be denoted as:

$$\mathrm{prob}(u \in \mathrm{clus}_{i}) = \frac{e^{-\epsilon' (2 - 2\, \mathrm{sim}(u, \mathrm{cen}_{i}))}}{\sum_{j=1}^{K} e^{-\epsilon' (2 - 2\, \mathrm{sim}(u, \mathrm{cen}_{j}))}} = \frac{e^{\epsilon \cdot \mathrm{sim}(u, \mathrm{cen}_{i})}}{\sum_{j=1}^{K} e^{\epsilon \cdot \mathrm{sim}(u, \mathrm{cen}_{j})}},$$

(9)

where $\mathrm{clus}_{i}$ and $\mathrm{cen}_{i}$ represent the i-th cluster and i-th centroid vector, respectively, $K$ represents the number of clusters, and $\epsilon = 2 \epsilon'$ is a hyperparameter; the constant factor $e^{-2 \epsilon'}$ cancels between the numerator and the denominator. If $\epsilon$ increases, the probability distribution will concentrate on a certain cluster. If $\epsilon$ decreases, the probability distribution will approach the uniform distribution.
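The soft cluster assignment can be sketched as a softmax over cosine similarities to the centroids. The sketch assumes that the probability decays with the squared distance between normalized vectors, so that the distance form with scale $\epsilon'$ and the similarity form with scale $\epsilon = 2\epsilon'$ coincide; the function names and the default scale are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(shifted)
    return p / p.sum(axis=1, keepdims=True)

def cluster_probabilities(u, centroids, eps=10.0):
    """Similarity form: softmax over eps * cosine similarity to each centroid."""
    sims = l2_normalize(u) @ l2_normalize(centroids).T
    return softmax(eps * sims)

def cluster_probabilities_from_distance(u, centroids, eps_prime=5.0):
    """Distance form: softmax over -eps_prime * squared distance of normalized vectors."""
    un, cn = l2_normalize(u), l2_normalize(centroids)
    sq_dist = ((un[:, None, :] - cn[None, :, :]) ** 2).sum(axis=-1)
    return softmax(-eps_prime * sq_dist)
```

Setting `eps=0` yields a uniform distribution over the clusters, while a large `eps` approaches a hard K-means assignment.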

We can group images and texts based on their probabilities and then align them using contrastive learning within these groups. We can improve the contrastive learning process by using negative samples similar to positive ones. We use hard-sample mining to ensure sample diversity within each group. The global alignment process helps identify the most challenging cases. We can then enlarge the groups by selecting the top-ranked candidates from the previous alignments as negative samples.
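One simple way to pick such hard negatives is to rank the candidates by their similarity under the previous alignment and keep the highest-ranked non-matching ones. The sketch below assumes a square similarity matrix in which image i is paired with caption i; the function name and the parameter `k` (the number of negatives kept per query, unrelated to the number of clusters) are hypothetical.

```python
import numpy as np

def hard_negatives(sim_matrix, k):
    """Return, for each query row, the indices of the k most similar non-paired candidates."""
    s = np.asarray(sim_matrix, dtype=float).copy()
    np.fill_diagonal(s, -np.inf)                    # exclude the true pair (i, i)
    order = np.argsort(-s, axis=1, kind="stable")   # most similar candidates first
    return order[:, :k]

# Toy usage: similarities of 3 images (rows) to 3 captions (columns).
S = np.array([[1.0, 0.2, 0.9],
              [0.8, 1.0, 0.1],
              [0.3, 0.7, 1.0]])
negatives = hard_negatives(S, k=1)
```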

To clarify the roles of K-means clustering and the choice of hyperparameters, we conducted experiments comparing K-means and Agglomerative Clustering and varying the number of clusters. As shown in Table 1, increasing the number of clusters improves the retrieval in certain settings. However, this also requires training additional alignment matrices for the clusters. Therefore, we set the number of clusters to eight to strike a balance between the number of trainable parameters and the performance. K-means clustering is selected in GEMMA due to its slight empirical advantage over agglomerative clustering.

Table 1.

Sensitivity to clustering methods, where R@k denotes the top-k recall and #Param denotes the number of trainable parameters. All experiments are based on the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder with the Flickr30k (Young et al., 2014) dataset.

		Image-to-text			Text-to-image
Clustering	#Cluster	R@1	R@5	R@10	R@1	R@5	R@10	#Param
KMeans	4	84.1	95.7	96.6	65.3	90.1	93.4	5.2M
	8	86.3	98.2	99.4	73.2	94.2	97.2	10M
	16	86.4	98.1	99.6	73.4	94.2	97.3	20M
Agglomerative	4	84.0	94.4	96.2	64.8	90.0	92.2	5.2M
	8	85.5	97.7	98.7	72.9	92.8	96.1	10M
	16	86.0	96.9	99.5	73.4	93.7	97.0	20M

3.3 Feature selection

Clustering results provide pseudo-labels for further feature selection. The label can be denoted as:

label_{clus_{i}}^{u} = \begin{cases} 0, & \text{if } prob(u \in clus_{i}) < T, \\ 1, & \text{otherwise}. \end{cases}

(10)

Here, $label_{clus_{i}}^{u}$ indicates whether the data point $u$ belongs to group $i$, and $T \in (0, 1)$ is a user-defined threshold. With pseudo-labels, we can further adopt the Discriminant Feature Test (DFT) (Yang et al., 2022) to select informative features and reduce feature dimensions. DFT is a supervised feature selection process that measures dimension-wise importance. For a given 1D input feature, we order the samples by their feature values and bound the feature dimension by the sample maximum and sample minimum. Then, we partition the samples along the given dimension and calculate the partition purity by the weighted cross-entropy with the pseudo-labels obtained from Section 3.2. A feature is more discriminant if it has a lower loss value. Finally, we plot the loss values from the lowest to the highest and use the elbow point to select discriminant features from the whole feature set.
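The thresholding in Equation (10) and the per-dimension purity test can be sketched as below. This is a simplified illustration under stated assumptions, not the DFT implementation of Yang et al. (2022): the evenly spaced split points, the parameter `n_bins`, and the use of a weighted binary entropy as a stand-in for the weighted cross-entropy are choices made only for this example.

```python
import math
from typing import List, Sequence

def pseudo_labels(probs: Sequence[float], threshold: float) -> List[int]:
    """Eq. (10): label 1 if the cluster-membership probability reaches T, else 0."""
    return [0 if p < threshold else 1 for p in probs]

def _entropy(labels: Sequence[int]) -> float:
    """Binary entropy (in bits) of 0/1 labels; 0 for an empty or pure list."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def dft_loss(values: Sequence[float], labels: Sequence[int], n_bins: int = 8) -> float:
    """Lowest weighted entropy over evenly spaced splits between sample min and max."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return _entropy(labels)  # constant feature: no split is possible
    n = len(labels)
    best = float("inf")
    for b in range(1, n_bins):
        split = lo + (hi - lo) * b / n_bins
        left = [l for v, l in zip(values, labels) if v <= split]
        right = [l for v, l in zip(values, labels) if v > split]
        loss = len(left) / n * _entropy(left) + len(right) / n * _entropy(right)
        best = min(best, loss)
    return best

labels = pseudo_labels([0.9, 0.8, 0.7, 0.2, 0.1, 0.3], threshold=0.5)
separating = [5.0, 4.8, 4.5, 1.0, 0.5, 1.2]  # feature that tracks the labels
noisy = [1.0, 5.0, 2.0, 4.0, 3.0, 0.5]       # feature unrelated to the labels
print(labels, dft_loss(separating, labels), dft_loss(noisy, labels))
```

Sorting all feature dimensions by this loss and cutting at the elbow of the sorted curve would then give the selected subset; the elbow detection itself is left out of the sketch.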

Separating the whole dataset into subsets allows us to conduct the discriminant feature test among the tokens with the pseudo-labels from the clustering results. Thus, token-level alignments can be performed using the same procedure as global-level alignment.

3.4 Mathematical expression

The overall alignment process can be divided into three modules:

global matching;
subdomain clustering; and
subdomain matching.

The subdomain clustering and alignment will be conducted within the image and the text domain. We can aggregate the alignments in the subproblem to approximate the overall alignment:

\begin{array}{l} P(image | text) = \frac{P(image, text)}{P(text)} = P(text | image) \times \frac{P(image)}{P(text)} \\ = \frac{1}{P(text)} \sum_{c \in cluster} P(text | image \in c) \cdot P(image \in c) \\ \propto \sum_{c \in cluster} P(text | image \in c) \cdot P(image \in c), \end{array}

(11)

where $P(image | text)$ denotes the probability distribution of the images given a query text, $P(image)$ and $P(text)$ denote the probability distributions of images and texts, and $cluster$ is the set of clusters produced by our clustering module. We further assume that the probability distribution within each cluster can be approximated as uniform. The conditional probabilities reflect the stage-wise design of the proposed pipeline.
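To make the cluster-wise aggregation in Equation (11) concrete, the sketch below computes the unnormalized score for one image. The function name `aggregate_score` and all numeric values are made up for illustration; how the per-cluster matching terms are estimated is outside this example.

```python
from typing import Sequence

def aggregate_score(match_given_cluster: Sequence[float],
                    membership: Sequence[float]) -> float:
    """Unnormalized Eq. (11): sum over c of P(text | image in c) * P(image in c)."""
    if len(match_given_cluster) != len(membership):
        raise ValueError("need one matching score and one membership value per cluster")
    return sum(m * p for m, p in zip(match_given_cluster, membership))

# One query text scored against two candidate images over three clusters.
# The text matches cluster 0 best; image A mostly belongs to cluster 0,
# image B mostly to cluster 2.
image_a = aggregate_score([0.9, 0.2, 0.1], [0.8, 0.1, 0.1])
image_b = aggregate_score([0.9, 0.2, 0.1], [0.1, 0.1, 0.8])
print(image_a > image_b)  # True
```

Because the factor $1/P(text)$ is the same for every candidate image of a given query, dropping it does not change the ranking, which is why the proportional form is sufficient for retrieval.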

Furthermore, we use a similarity measurement to simplify the probability estimator; that is, we use $sim(image, text)$ to represent $P(image | text)$. In this work, we adopt the cosine similarity:

\begin{array}{l} sim(image, text) \\ = sim(W^{vis} F(image), W^{txt} G(text)) \\ = sim(W^{vis} [G_{vis}; T_{vis}], W^{txt} [G_{txt}; T_{txt}]) \\ \sim sim(W_{global}^{vis} G_{vis}, W_{global}^{txt} G_{txt}) \\ + sim(W_{tokens}^{vis} [G_{vis}; T_{vis}], W_{global}^{txt} G_{txt}) \\ + sim(W_{global}^{vis} G_{vis}, W_{tokens}^{txt} [G_{txt}; T_{txt}]), \end{array}

(12)

where $G_{vis}$ and $G_{txt}$ denote the pooled outputs of the feature extractors (global features), $T_{vis}$ and $T_{txt}$ denote the token (i.e., fine-grained) features, and the $W$ matrices denote the alignment matrices corresponding to the different subsets from the clustering results. $[G; T]$ denotes the concatenation of the global and token features. Due to the computational cost, we cannot directly use all token features. Therefore, we conduct the feature selection process based on the clustering results.
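The three-term approximation in Equation (12) can be written out as a small sketch. The helper names (`matvec`, `cosine`, `staged_similarity`) and the toy matrices are assumptions for illustration; in practice the alignment matrices are learned and the features are 768-dimensional.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]

def matvec(w: Matrix, x: Vector) -> List[float]:
    """Apply a linear alignment matrix to a feature vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def staged_similarity(g_vis: Vector, t_vis: Vector, g_txt: Vector, t_txt: Vector,
                      w_vis_global: Matrix, w_txt_global: Matrix,
                      w_vis_tokens: Matrix, w_txt_tokens: Matrix) -> float:
    """Sum of the three terms on the right-hand side of Eq. (12)."""
    gv = matvec(w_vis_global, g_vis)
    gt = matvec(w_txt_global, g_txt)
    tv = matvec(w_vis_tokens, list(g_vis) + list(t_vis))  # [G_vis; T_vis]
    tt = matvec(w_txt_tokens, list(g_txt) + list(t_txt))  # [G_txt; T_txt]
    return cosine(gv, gt) + cosine(tv, gt) + cosine(gv, tt)

# Toy setup: 2-D global and token features, identity global matrices, and
# token matrices that simply pick the global part of the concatenation.
identity = [[1.0, 0.0], [0.0, 1.0]]
pick_global = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
score = staged_similarity([1.0, 0.0], [0.3, 0.4], [1.0, 0.0], [0.2, 0.1],
                          identity, identity, pick_global, pick_global)
print(score)
```

Each term is a similarity between two linearly projected vectors, so the stages can be trained and inspected separately before their scores are summed.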

The feature selection process is an approximation based on the clustering results. The process is expressed as a combination of the conditional probabilities. For simplicity, we ignore the alignment matrix in the following representations:

\begin{array}{l} E[sim(F(image), G(text))] \\ = E[E[sim(F(image), G(text)) | image \in C_{1}; text \in C_{2}]] \\ \sim E[sim(G_{vis}, G_{txt})] \\ + E[E[sim(DFT([G_{vis}; T_{vis}]), G_{txt}) | image \in C_{1}]] \\ + E[E[sim(G_{vis}, DFT([G_{txt}; T_{txt}])) | text \in C_{2}]], \end{array}

(13)

where $DFT(\cdot)$ represents the feature selection and dimension reduction process in Section 3.3, and $C_{1}$ and $C_{2}$ represent the cluster sets obtained from K-means. Instead of training a complicated alignment process on the token-level output of the feature extractor, we propose a stage-wise decomposition of the dataset and train simpler structures for the subsets. Meanwhile, the alignments in all stages are linear, which allows an inverse operation and preserves dual accessibility between the image and text domains.

4. Experiments

4.1 Dataset

We perform image-to-text and text-to-image retrieval on two image-text benchmarks: Flickr30k and MS-COCO. The Flickr30k dataset (Young et al., 2014) contains 31,000 images, and every image has five paired captions. The training set contains 29,000 images; the validation and testing sets contain 1,000 images each. MS-COCO (Lin et al., 2014) is a larger-scale dataset with 123,287 images, each with at least five captions. We follow the ‘Karpathy’ split for the experiments (Karpathy and Fei-Fei, 2015): 113,287 images for training, 5,000 for validation, and 5,000 for testing. We use two benchmarks of different sizes to demonstrate the scalability and generalizability of our approach. The performance is evaluated using the Recall@K metric, where $K \in \{1, 5, 10\}$ refers to the top-K matches of the retrieval results. A retrieval is considered a true positive if the top-K matches include at least one of the paired ground-truth items. For example, in image-to-text retrieval, a query image is counted as a positive if its top-K matches contain at least one of its five corresponding captions.
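The Recall@K metric described above can be sketched as follows. The function name `recall_at_k` and the toy rankings are illustrative assumptions; item indices stand in for caption (or image) identifiers.

```python
from typing import Dict, Sequence, Set

def recall_at_k(rankings: Sequence[Sequence[int]],
                ground_truth: Sequence[Set[int]],
                ks: Sequence[int] = (1, 5, 10)) -> Dict[int, float]:
    """Percentage of queries whose top-k list contains at least one ground-truth item."""
    results = {}
    for k in ks:
        hits = sum(1 for ranked, gt in zip(rankings, ground_truth)
                   if gt.intersection(ranked[:k]))
        results[k] = 100.0 * hits / len(rankings)
    return results

# Two image queries; each image has five paired captions.
rankings = [
    [7, 0, 1, 8, 9, 2],  # query 0: first correct caption appears at rank 2
    [5, 0, 6, 1, 2, 3],  # query 1: correct caption at rank 1
]
ground_truth = [{0, 1, 2, 3, 4}, {5, 6, 7, 8, 9}]
print(recall_at_k(rankings, ground_truth, ks=(1, 5)))  # {1: 50.0, 5: 100.0}
```

For text-to-image retrieval each caption has a single paired image, so every ground-truth set contains one element and the same function applies.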

4.2 Hyperparameter settings

The overall algorithm is trained stage by stage. We adopt K-means as the clustering algorithm. The number of clusters is 8, and $ϵ$ is 50 for pseudo-labeling. For Flickr30k, we set the temperature parameter to 0.02, and the ratio between the losses is set to $α : β : γ = 1 : 0.5 : 0.6$ in the global alignment. In the alignment of the image subdomain, the temperature parameter is set to 0.015, and the ratio between the losses is set to $α : β : γ = 1 : 0.4 : 0.5$. In the text subdomain alignment, the temperature parameter is set to 0.01, and the ratio between the losses is set to $α : β : γ = 1 : 0.3 : 0.4$.

For the MS-COCO dataset, the temperature parameter is set to 0.05, and the ratio between the losses is set to $α : β : γ = 1 : 0.5 : 0.5$ in the global alignment. In the alignment of the image subdomain, the temperature parameter is set to 0.03, and the ratio between the losses is set to $α : β : γ = 1 : 0.3 : 0.5$. In the text subdomain alignment, the temperature parameter is set to 0.02, and the ratio between the losses is set to $α : β : γ = 1 : 0.2 : 0.4$.

The dimension of the joint space is set to 768, which follows the token dimension of the transformer encoders. All optimization is performed using AdamW with a learning rate of 0.001.
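For reference, the settings of this section can be collected in one structure. The values below are copied from the text; the dictionary layout, the key names, and the helper `stage_settings` are hypothetical and only meant as a convenient summary.

```python
# Hyperparameters of Section 4.2. Values come from the text; the layout and
# key names are illustrative, not taken from any released configuration.
GEMMA_CONFIG = {
    "clustering": {"method": "kmeans", "n_clusters": 8},
    "pseudo_label_epsilon": 50,
    "joint_dim": 768,
    "optimizer": {"name": "AdamW", "lr": 1e-3},
    "flickr30k": {
        # loss_ratio is (alpha, beta, gamma)
        "global": {"temperature": 0.02, "loss_ratio": (1.0, 0.5, 0.6)},
        "image_subdomain": {"temperature": 0.015, "loss_ratio": (1.0, 0.4, 0.5)},
        "text_subdomain": {"temperature": 0.01, "loss_ratio": (1.0, 0.3, 0.4)},
    },
    "mscoco": {
        "global": {"temperature": 0.05, "loss_ratio": (1.0, 0.5, 0.5)},
        "image_subdomain": {"temperature": 0.03, "loss_ratio": (1.0, 0.3, 0.5)},
        "text_subdomain": {"temperature": 0.02, "loss_ratio": (1.0, 0.2, 0.4)},
    },
}

def stage_settings(dataset: str, stage: str) -> dict:
    """Look up the temperature and loss ratio for one dataset and alignment stage."""
    return GEMMA_CONFIG[dataset][stage]

print(stage_settings("flickr30k", "image_subdomain"))
```

Laid out this way, it is easy to see that the temperature decreases from the global stage to the image and text subdomain stages on both datasets.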

4.3 Retrieval

We compared our alignment approach with SOTA retrieval models; the results are shown in Table 2. In these experiments, we extract features from the frozen CLIP image and text encoders. The CLIP encoders remain frozen during all alignment stages and serve as the baseline for our alignment process. The CLIP encoders contain more than 428M parameters. However, we do not fine-tune the encoders in our alignment process; instead, we train additional alignment matrices. The number of trainable parameters is thus reduced from 428M to 9.43M ($\sim 2.2\%$). Because the encoders are not updated while the alignment matrices are trained, GPU memory consumption scales with the number of trainable parameters and can be reduced to less than 10 percent of that of the fully fine-tuned approach.
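The parameter reduction quoted above follows from simple arithmetic, sketched here. The function name is hypothetical, and the per-matrix count assumes a square, bias-free 768-by-768 projection, which the text does not state explicitly.

```python
def trainable_fraction(trainable_millions: float, total_millions: float) -> float:
    """Share of trainable parameters, in percent."""
    return 100.0 * trainable_millions / total_millions

# 9.43M trainable alignment parameters versus 428M frozen encoder parameters.
print(round(trainable_fraction(9.43, 428.0), 1))  # 2.2

# Size of one alignment matrix in the 768-dimensional joint space,
# assuming a square projection without a bias term.
per_matrix = 768 * 768
print(per_matrix)  # 589824
```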

Table 2.

Retrieval performance on the Flickr30k (1k testing set) and MS-COCO (5k testing set) datasets. We compare the single-model performance among all multi-modal retrieval models. The numbers are taken from Diao et al. (2023). R@1 denotes Recall@1 for simplicity

	Flickr30k (1k testing set)						MS-COCO (5k testing set)
	Image-to-text			Text-to-image			Image-to-text			Text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
SCAN (Lee et al., 2018)	67.4	90.3	95.8	48.6	77.7	85.2	50.4	82.2	90.0	38.6	69.3	80.4
VSRN (Li et al., 2019)	71.3	90.6	96.0	54.7	81.8	88.2	53.0	81.1	89.4	40.5	70.6	81.1
CAAN (Zhang et al., 2020)	70.1	91.6	97.2	52.8	79.0	87.9	52.5	83.3	90.9	41.2	70.3	82.9
IMRAM (Chen et al., 2020)	74.1	93.0	96.6	53.9	79.4	87.2	53.7	83.2	91.0	39.7	69.1	79.8
MMCA (Wei et al., 2020)	74.2	92.8	96.4	54.8	81.4	87.8	54.0	82.5	90.7	38.7	69.7	80.8
GSMN (Liu et al., 2020)	76.4	94.3	97.3	57.4	82.3	89.0	–	–	–	–	–	–
SGRAF (Diao et al., 2021)	77.8	94.1	97.4	58.5	83.0	88.8	57.8	84.9	91.6	41.9	70.7	81.3
SHAN (Ji et al., 2021)	74.6	93.5	96.9	55.3	81.3	88.4	–	–	–	–	–	–
WCGL (Wang et al., 2021)	74.8	93.3	96.8	54.8	80.6	87.5	–	–	–	–	–	–
RCAR (Diao et al., 2023)	78.7	94.6	97.6	59.5	84.0	89.5	59.6	85.8	92.4	42.5	71.7	81.8
SGRAFS (Jawade et al., 2023)	79.2	95.3	97.7	58.3	83.1	89.2	58.0	85.1	91.6	41.7	71.2	81.5
CLIP (Radford et al., 2021)	88.0	98.7	99.4	68.7	90.6	95.2	58.4	81.5	88.1	37.8	62.4	72.2
GEMMA(Ours)	88.6	98.9	99.6	75.7	94.2	97.1	58.6	83.2	90.0	45.3	72.6	82.8

On Flickr30k (1k testing set), our approach outperforms the other image-to-text and text-to-image retrieval methods. Alignment improves Recall@1 by 0.6% in image-to-text retrieval and provides a 6% boost in text-to-image retrieval. RCAR (Diao et al., 2023) needs dual-way optimized models, namely image-to-text and text-to-image. In contrast, our method is optimized in a feed-forward manner and ensembles the substructures directly.

On MS-COCO (5k testing set), our method provides competitive performance in image-to-text retrieval and outperforms the others in text-to-image retrieval by 2.1% in Recall@1. We achieve the best text-to-image retrieval performance on both datasets, showcasing our approach’s scalability.

4.4 Generalizability

This section demonstrates the alignment between visual and text encoders that are trained separately. The encoders remain frozen in the alignment process. All alignments are based on the grouping and linear projection proposed in our pipeline. The performance of the CLIP visual and text encoders without GEMMA alignment is taken from the original CLIP paper (Radford et al., 2021). Starting from the jointly trained CLIP structure, we replace the text encoder with RoBERTa (Liu et al., 2019) and the visual encoder with a CNN-based object detector (Anderson et al., 2018). All experiments are carried out on the Flickr30k dataset and follow the parameter settings in Section 4.2.

The results are shown in Table 3. The best performance comes from the jointly trained models, whose representations are preliminarily aligned in the pre-training process. Compared to the CLIP visual encoder, the features of the object detector are weaker in the alignment process. However, the separately trained text encoder, RoBERTa (Liu et al., 2019), does not suffer from the unpaired training dataset. The representations from the CLIP visual encoder and the RoBERTa text encoder can provide competitive performance in image-to-text retrieval and better performance in text-to-image retrieval than the original CLIP. The encoder can be adapted to the retrieval application without fine-tuning with the paired image and text data.

Table 3.

Experiment results with different visual and text features for the alignment process. All experiments are conducted on the Flickr30k dataset

			Flickr30k (1k testing set)
		Alignment	Image-to-text			Text-to-image
Visual enc.	Text enc.	(GEMMA)	Recall@1	Recall@5	Recall@10	Recall@1	Recall@5	Recall@10
CLIP vis (Radford et al., 2021)	CLIP text (Radford et al., 2021)	x	88.0	98.7	99.4	68.7	90.6	95.2
DETR (Carion et al., 2020)	RoBERTa (Liu et al., 2019)	v	66.7	89.5	93.6	56.7	84.5	90.3
DETR (Carion et al., 2020)	CLIP text (Radford et al., 2021)	v	73.6	91.6	94.5	60.0	85.8	90.6
CLIP vis (Radford et al., 2021)	RoBERTa (Liu et al., 2019)	v	86.3	98.2	99.4	73.2	94.2	97.2
CLIP vis (Radford et al., 2021)	CLIP text (Radford et al., 2021)	v	88.6	98.9	99.6	74.8	94.2	97.1

Table 4.

Ablation studies on different stages, where R@k denotes the top-k recall. All experiments are based on the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

Alignment	Image-to-text			Text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10
Without alignment	64.5	71.7	84.3	32.7	61.6	80.1
Global	84.8	97.8	99.0	68.3	90.7	91.1
+Image cluster	85.4	98.0	99.1	70.3	91.5	94.3
+Text cluster (Final)	86.3	98.2	99.4	73.2	94.2	97.2

Table 5.

Experiments on Detector Features

		Flickr30k (1k testing set)
Vis Feat			Image-to-text			Text-to-image
Global Feat	Detail Feat	Text Feat	Recall@1	Recall@5	Recall@10	Recall@1	Recall@5	Recall@10
CLIP	CLIP	CLIP	85.3	91.9	93.3	72.1	90.6	92.2
DETR encoder	DETR decoder	CLIP	18.3	35.1	41.8	19.5	25.3	45.9
ResNet Backbone	DETR encoder	CLIP	66.7	89.5	93.3	56.7	84.5	90.3
ResNet Backbone	DETR decoder	CLIP	72.4	91.6	95.1	59.5	85.7	90.5
ResNet Backbone	DETR decoder	RoBERTa	64.5	84.5	88.4	53.3	83.3	87.3

In contrast, the Convolutional Neural Network (CNN)-based object detector representation cannot be applied directly to the image-text retrieval task. The decrease in performance results from a lack of global understanding: the object detector features are obtained from parts of the image, so the representations lack a global understanding of the image. As for the CLIP visual encoder, the pooled output of the visual tokens contains the global information of the input image, and the detailed token features remain available for the later alignment stages. A visual example can be found in Section 4.6. If the alignment process misses the global information at the very beginning, then the alignment on detailed information may lead to a misfocused result.

4.5 Ablation study on different stages

Owing to the modularized design, we can compare the stages from global alignment to subgroup alignment in the visual and textual domains. We choose encoders trained on different modalities to perform the alignment process. We use the CLIP visual encoder (Radford et al., 2021) and the RoBERTa (Liu et al., 2019) text encoder for the ablation study of stage-wise alignment on the Flickr30k dataset (Young et al., 2014). The two encoders remain frozen in the experiments. The ‘without alignment’ setting uses the direct dot product between the encoded features from the two models. Because the two embeddings lie in different semantic latent spaces, its performance is the lowest among all settings.

With global alignment, the features can provide basic performance in retrieval tasks. However, a naive linear projection cannot handle complex interactions between detailed information in the candidate set. Thus, recall rates increase as we add more stages of grouping, feature selection, and alignment. Feature selection provides statistical criteria for dimension reduction, preventing the latent dimension from increasing with additional tokens. We can carry the essential features into the next stage and reduce the computational cost at the same time. Hence, the three-stage alignment achieves the best performance with comparable efficiency.

4.6 From detection to alignment

To better understand the difference between the visual features of transformers and object detectors, we demonstrate the retrieval processing step by step.

The object detector can detect humans and vehicles, but the features lack a sensible relationship with each other. In the clustering stage, the clusters focus on a specific object in the figure, that is, the bus in Figure 8. After global alignment, the paired sentence ranks fifth. However, the correct caption falls to seventh when we perform the finer alignment, which clusters on cars and buses. Although object detectors can provide information fragments, the grouping process cannot link the features. The detector features cannot find the central concept in the picture and can be distracted by the surrounding objects.

Figure 8.

A visual text alignment example shows an urban street image with detected objects and two stages of caption alignment compared against ground truth descriptions.

The figure combines an urban street photograph with alignment results. On the left, the image shows a police officer standing beside cars on a city street, with bounding boxes highlighting vehicles and a person. Below, ground truth captions describe an officer near a car on a busy city street. On the right, two columns list first stage alignment and second stage alignment captions, clustered by car, bus, and human. The first stage includes several general street descriptions, while the second stage refines the list to fewer captions, retaining the description of a police officer standing in front of a car on a busy street as the selected alignment.

Error cases of object detector alignment. The object detector will give all objects equal weights and try to include all the objects in the captions

The object detector can provide the features with the local information, yet the patched information is not represented in a structured manner. That is, we can only obtain the partial contents in the image and lose the global semantic representation in the clustering process. We rely on the global and local information relationship to retrieve suitable captions in the proposed coarse-to-fine clustering process.

On the other hand, the visual transformer can provide more information about the tokens and integrate the representations through a global pooling process. Hence, the token information can be selected in our feature selection module (Section 3.3) and clustered according to the global features. The overall architecture can sort the rich representation in a coarse-to-fine manner and provide better multi-modal alignment performance.

When comparing the alignment process across different features, it becomes evident that performance is influenced by the types of features used. However, the alignment process cannot transform weak visual features into strong ones. Instead, it aims to bridge the gap caused by differences in modality. Consequently, performance improves when encoded features have larger receptive fields. The proposed alignment does not require jointly fine-tuning the encoders in the limited paired multi-modal data and generalizes the single-modal encoder with additional alignment matrices.

5. Conclusion and future work

Our approach can achieve outstanding performance in both image-to-text and text-to-image retrieval tasks. Furthermore, our method involves a step-by-step alignment process that maintains compatibility in the decision-making procedure. We divide the alignment into global and subdomain matching and apply a feature selection method to decrease the input feature dimensions. All subprocesses can be expressed mathematically and analyzed statistically, providing transparency compared to black-box output. To ensure computational efficiency, we froze the visual and text encoders and only trained the alignment matrices, which represent only about 3% of the parameters compared to the original model.

In addition, we conducted experiments on applying our alignment mechanism to individually trained text and image encoders. In the testing dataset, we found that the pre-trained text encoder can improve the performance of text-to-image retrieval. Replacement of the text encoder can also lead to similar performance in image-to-text retrieval.

We are working on developing a purely green learning solution for image understanding in the foreseeable future. By aiming not only for transparency but also computational efficiency, we can have a better understanding of the multi-modal information representation.

This work was supported by the DEVCOM Army Research Laboratory (ARL) under agreement W911NF2020157. Computation in the work was supported by the University of Southern California’s Center for Advanced Research Computing (carc.usc.edu).

References

Anderson

,

P.

,

He

,

X.

,

Buehler

,

C.

,

Teney

,

D.

,

Johnson

,

M.

,

Gould

,

S.

and

Zhang

,

L.

(

2018

), “

Bottom-up and top-down attention for image captioning and visual question answering

”,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, pp.

6077

-

6086

.

Google Scholar

Crossref

Carion

,

N.

,

Massa

,

F.

,

Synnaeve

,

G.

,

Usunier

,

N.

,

Kirillov

,

A.

and

Zagoruyko

,

S.

(

2020

), “

End-to-end object detection with transformers

”,

European Conference on Computer Vision

,

Springer

, pp.

213

-

229

.

Google Scholar

Crossref

Chen

,

T.

,

Kornblith

,

S.

,

Norouzi

,

M.

and

Hinton

,

G.

(

2020a

), “

A simple framework for contrastive learning of visual representations

”,

International Conference on Machine Learning

,

PMLR

, pp.

1597

-

1607

.

Google Scholar

Chen

,

H.

,

Ding

,

G.

,

Liu

,

X.

,

Lin

,

Z.

,

Liu

,

J.

and

Han

,

J.

(

2020b

), “

Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval

”,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

, pp.

12655

-

12663

.

Google Scholar

Crossref

Chen

,

Z.

,

Wu

,

J.

,

Wang

,

W.

,

Su

,

W.

,

Chen

,

G.

,

Xing

,

S.

,

Muyan

,

Z.

,

Zhang

,

Q.

,

Zhu

,

X.

,

Lu

,

L.

, et al. (

2023

), “

Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks

”,

arXiv preprint

arXiv:2312.14238

.

Google Scholar

Cheng

,

M.

,

Sun

,

Y.

,

Wang

,

L.

,

Zhu

,

X.

,

Yao

,

K.

,

Chen

,

J.

,

Song

,

G.

,

Han

,

J.

,

Liu

,

J.

,

Ding

,

E.

, et al. (

2022

), “

Vista: vision and scene text aggregation for cross-modal retrieval

”,

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

, pp.

5184

-

5193

.

Google Scholar

Crossref

Deng

,

J.

,

Dong

,

W.

,

Socher

,

R.

,

Li

,

L.-J.

,

Li

,

K.

and

Fei-Fei

,

L.

(

2009

), “

ImageNet: a large-scale hierarchical image database

”,

CVPR09

.

Google Scholar

Devlin

,

J.

(

2018

), “

Bert: pre-training of deep bidirectional transformers for language understanding

”,

arXiv preprint

arXiv:1810.04805

.

Google Scholar

Diao

,

H.

,

Zhang

,

Y.

,

Liu

,

W.

,

Ruan

,

X.

and

Lu

,

H.

(

2023

), “

Plug-and-play regulators for image-text matching

”,

IEEE Transactions on Image Processing.

Google Scholar

Diao

,

H.

,

Zhang

,

Y.

,

Ma

,

L.

and

Lu

,

H.

(

2021

), “

Similarity reasoning and filtration for image-text matching

”,

Proceedings of the AAAI Conference on Artificial Intelligence

, Vol.

35

, No.

2

, pp.

1218

-

1226

.

Google Scholar

Crossref

Frome

,

A.

,

Corrado

,

G.S.

,

Shlens

,

J.

,

Bengio

,

S.

,

Dean

,

J.

,

Ranzato

,

M.

and

Mikolov

,

T.

(

2013

), “

Devise: a deep visual-semantic embedding model

”,

Advances in Neural Information Processing Systems

, p.

26

.

Google Scholar

Gao

,

T.

,

Yao

,

X.

and

Chen

,

D.

(

2021

), “

SIMCSE: simple contrastive learning of sentence embeddings

”,

arXiv preprint

arXiv:2104.08821

.

Google Scholar

Hadsell

,

R.

,

Chopra

,

S.

and

LeCun

,

Y.

(

2006

), “

Dimensionality reduction by learning an invariant mapping

”,

2006 IEEE Computer Society Conference On Computer Vision And Pattern Recognition (CVPR’06)

,

IEEE

, Vol.

2

, pp.

1735

-

1742

.

Google Scholar

Crossref

Ho

,

J.

,

Jain

,

A.

and

Abbeel

,

P.

(

2020

), “

Denoising diffusion probabilistic models

”,

Advances in neural information processing systems

, Vol.

33

, pp.

6840

-

6851

.

Google Scholar

Jawade

,

B.

,

Mohan

,

D.D.

,

Ali

,

N.M.

,

Setlur

,

S.

and

Govindaraju

,

V.

(

2023

), “

NAPReg: nouns as proxies regularization for semantically aware crossmodal embeddings

”,

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision

, pp.

1135

-

1144

.

Google Scholar

Crossref

Ji

,

Z.

,

Chen

,

K.

and

Wang

,

H.

(

2021

), “

Step-wise hierarchical alignment network for image-text matching

”,

arXiv preprint

arXiv:2106.06509

.

Google Scholar

Karpathy

A.

and

Fei-Fei

,

L.

(

2015

), “

Deep visual-semantic alignments for generating image descriptions

”,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, pp.

3128

-

3137

.

Google Scholar

Crossref

Kim

,

W.

,

Son

,

B.

and

Kim

,

I.

(

2021

), “

Vilt: vision-and-language transformer without convolution or region supervision

”,

International Conference on Machine Learning

,

PMLR

, pp.

5583

-

5594

.

Google Scholar

Kuo

,

C.-C.J.

and

Madni

,

A.M.

(

2023

), “

Green learning: introduction, examples and outlook

”,

Journal of Visual Communication and Image Representation

, Vol.

90

, p.

103685

.

Google Scholar

Crossref

Kwon

,

G.

,

Cai

,

Z.

,

Ravichandran

,

A.

,

Bas

,

E.

,

Bhotika

,

R.

and

Soatto

,

S.

(

2022

), “

Masked vision and language modeling for multi-modal representation learning

”,

arXiv preprint

arXiv:2208.02131

.

Google Scholar

Lee

,

K.-H.

,

Chen

,

X.

,

Hua

,

G.

,

Hu

,

H.

and

He

,

X.

(

2018

), “

Stacked cross attention for image-text matching

”,

Proceedings of the European Conference on Computer Vision (ECCV)

, pp.

201

-

216

.

Google Scholar

Crossref

Li

,

K.

,

Zhang

,

Y.

,

Li

,

K.

,

Li

,

Y.

and

Fu

,

Y.

(

2019

), “

Visual semantic reasoning for image-text matching

”,

Proceedings of the IEEE/CVF International Conference on Computer Vision

, pp.

4654

-

4662

.

Google Scholar

Crossref

Lin

,

T.-Y.

,

Maire

,

M.

,

Belongie

,

S.

,

Hays

,

J.

,

Perona

,

P.

,

Ramanan

,

D.

,

Dollár

,

P.

and

Zitnick

,

C.L.

(

2014

), “

Microsoft coco: common objects in context

”,

Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13

,

Springer

, pp.

740

-

755

.

Google Scholar

Crossref

Liu

,

C.

,

Mao

,

Z.

,

Zhang

,

T.

,

Xie

,

H.

,

Wang

,

B.

and

Zhang

,

Y.

(

2020

), “

Graph structured network for image-text matching

”,

Proceedings of the IEEE/CVF Conference On Computer Vision And Pattern Recognition

, pp.

10921

-

10930

.

Google Scholar

Crossref

Liu

,

H.

,

Li

,

C.

,

Wu

,

Q.

and

Lee

,

Y.J.

(

2024

), “

Visual instruction tuning

”,

Advances in Neural Information Processing Systems

, p.

36

.

Google Scholar

Liu

,

Y.

,

Ott

,

M.

,

Goyal

,

N.

,

Du

,

J.

,

Joshi

,

M.

,

Chen

,

D.

,

Levy

,

O.

,

Lewis

,

M.

,

Zettlemoyer

,

L.

and

Stoyanov

,

V.

(

2019

), “

Roberta: a robustly optimized Bert pretraining approach

”,

arXiv preprint

arXiv:1907.11692

.

Google Scholar

Nam

,

H.

,

Ha

,

J.-W.

and

Kim

,

J.

(

2017

), “

Dual attention networks for multimodal reasoning and matching

”,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, pp.

299

-

307

.

Google Scholar

Crossref

Oh Song

,

H.

,

Xiang

,

Y.

,

Jegelka

,

S.

and

Savarese

,

S.

(

2016

), “

Deep metric learning via lifted structured feature embedding

”,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, pp.

4004

-

4012

.

Google Scholar

Crossref

Pennington

,

J.

,

Socher

,

R.

and

Manning

,

C.D.

(

2014

), “

Glove: global vectors for word representation

”, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.

1532

-

1543

.

Google Scholar

Crossref

Radford

,

A.

,

Kim

,

J.W.

,

Hallacy

,

C.

,

Ramesh

,

A.

,

Goh

,

G.

,

Agarwal

,

S.

,

Sastry

,

G.

,

Askell

,

A.

,

Mishkin

,

P.

,

Clark

,

J.

et al. (

2021

), “

Learning transferable visual models from natural language supervision

”,

International Conference on Machine Learning

,

PMLR

, pp.

8748

-

8763

.

Google Scholar

Robinson

,

J.

,

Chuang

,

C.-Y.

,

Sra

,

S.

and

Jegelka

,

S.

(

2020

), “

Contrastive learning with hard negative samples

”,

arXiv preprint

arXiv:2010.04592

.

Google Scholar

Schroff

,

F.

,

Kalenichenko

,

D.

and

Philbin

,

J.

(

2015

), “

Facenet: a unified embedding for face recognition and clustering

”,

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

, pp.

815

-

823

.

Google Scholar

Crossref

Singh, A., Hu, R., Goswami, V., Couairon, G., Galuba, W., Rohrbach, M. and Kiela, D. (2022), “FLAVA: a foundational language and vision alignment model”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15638-15650.

Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. and Ganguli, S. (2015), “Deep unsupervised learning using nonequilibrium thermodynamics”, International Conference on Machine Learning, PMLR, pp. 2256-2265.

Sohn, K. (2016), “Improved deep metric learning with multi-class n-pair loss objective”, Advances in Neural Information Processing Systems, Vol. 29.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017), “Attention is all you need”, Advances in Neural Information Processing Systems, Vol. 30.

Wang, L., Li, Y., Huang, J. and Lazebnik, S. (2018), “Learning two-branch neural networks for image-text matching tasks”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 41 No. 2, pp. 394-407.

Wang, P., Wu, Q., Cao, J., Shen, C., Gao, L. and Hengel, A.v.d. (2019), “Neighbourhood watch: referring expression comprehension via language-guided graph attention networks”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1960-1968.

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O.K., Singhal, S., Som, S. et al. (2022), “Image as a foreign language: BEiT pretraining for all vision and vision-language tasks”, arXiv preprint arXiv:2208.10442.

Wang, Y., Zhang, T., Zhang, X., Cui, Z., Huang, Y., Shen, P., Li, S. and Yang, J. (2021), “Wasserstein coupled graph learning for cross-modal retrieval”, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, pp. 1793-1802.

Wei, C., Pang, R. and Kuo, C.-C.J. (2024), “GWPT: a green word-embedding-based POS tagger”, arXiv preprint arXiv:2401.07475.

Wei, X., Zhang, T., Li, Y., Zhang, Y. and Wu, F. (2020), “Multi-modality cross attention network for image and sentence matching”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10941-10950.

Xuan, H., Stylianou, A., Liu, X. and Pless, R. (2020), “Hard negative examples are hard, but useful”, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, Springer, pp. 126-142.

Yang, Y., Wang, W., Fu, H., Kuo, C.-C.J. et al. (2022), “On supervised feature selection from high dimensional feature spaces”, APSIPA Transactions on Signal and Information Processing, Vol. 11 No. 1.

Young, P., Lai, A., Hodosh, M. and Hockenmaier, J. (2014), “From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions”, Transactions of the Association for Computational Linguistics, Vol. 2, pp. 67-78.

Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M. and Wu, Y. (2022), “CoCa: contrastive captioners are image-text foundation models”, arXiv preprint arXiv:2205.01917.

Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S. et al. (2023), “Scaling autoregressive multi-modal models: pretraining and instruction tuning”, arXiv preprint arXiv:2309.02591, Vol. 2 No. 3.

Zellers, R., Bisk, Y., Farhadi, A. and Choi, Y. (2019), “From recognition to cognition: visual commonsense reasoning”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6720-6731.

Zeng, Y., Zhang, X., Li, H., Wang, J., Zhang, J. and Zhou, W. (2023), “X2-VLM: all-in-one pre-trained model for vision-language tasks”, IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zhang, Q., Lei, Z., Zhang, Z. and Li, S.Z. (2020), “Context-aware attention network for image-text retrieval”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3536-3545.

Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M. and Shen, Y.-D. (2020), “Dual-path convolutional image-text embeddings with instance loss”, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 16 No. 2, pp. 1-23.

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A. and Fidler, S. (2015), “Aligning books and movies: towards story-like visual explanations by watching movies and reading books”, Proceedings of the IEEE International Conference on Computer Vision, pp. 19-27.

© 2025 Tsung-Shan Yang, Yun-Cheng Wang, Chengwei Wei, Suya You and C.-C. Jay Kuo. Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms are set out in the CC BY 4.0 licence.
