Image-text retrieval is a fundamental task in image understanding: given an image or a text query, the algorithm fetches the most relevant counterpart in the other modality. Large visual-language models are trained on paired image and text data to extract joint representations. However, they are computationally expensive, and it is hard to explain how data from the different modalities are aligned. To this end, we propose an efficient, stage-wise alignment of image and text representations, called Green Explainable Multi-Modal Alignment (GEMMA). GEMMA is computationally efficient: it reduces the trainable parameters to about 3% of those required to fine-tune the full image and text encoders. The intermediate clustering results demonstrate the explainability of the alignment mechanism in our model. Experiments show that GEMMA outperforms state-of-the-art retrieval models in text-to-image and image-to-text retrieval tasks on the Flickr30k and MS-COCO datasets. GEMMA also generalizes to unseen image-text pairs when the visual and text encoders are pre-trained separately.
1. Introduction
Image-text retrieval links textual and visual information and is a foundational image understanding application in computer vision. The goal of the task is to link textual descriptions and pixels in image arrays that represent similar concepts or semantics. The image-text retrieval task aims to find the most relevant information from the candidate sets in the counterpart modality. That is, when an image is given, the model needs to extract related captions by ranking them with higher scores and vice versa. Figure 1 shows an example of an image and its paired textual descriptions.
Image-text retrieval can provide the information for visual-textual applications, including visual question answering (Nam et al., 2017), image captioning (Anderson et al., 2018), visual grounding (Wang et al., 2019), and visual common sense reasoning (Zellers et al., 2019). With the thriving development of deep learning and computational resources, neural networks dominate the current research trend. Jointly trained neural network-based image and text encoders transform the input text and image into vectors in a common latent space. The two encoders are trained under metric learning schemes, which compare the cosine similarity between the paired and unpaired image and text samples. For example, an intuitive solution to representing the image and text in a joint latent space is optimizing two encoder models by minimizing contrastive loss (Chen et al., 2020a, 2020b). The loss function can gather the paired information but repel the unpaired data in the latent space.
Although end-to-end solutions perform remarkably well, explainability is crucial for image-understanding applications. In multi-modal application scenarios, humans expect a complete reasoning procedure rather than an unexplained answer from the model. However, neural networks obscure the reasoning process within the joint latent space through complex floating-point operations, e.g., calculating cosine similarities between vectors. The nonlinearities in the model make the whole inference process a black box. To this end, we propose a multi-stage methodology, dividing the retrieval process into three stages:
Global alignment;
Image cluster alignment; and
Text cluster alignment.
Each alignment stage consists of three modules:
alignment;
subdomain clustering; and
subdomain feature selection.
More fine-grained information is revealed in the module’s feature selection process.
The availability of paired image and text data is another challenge when training multi-modal models. Most datasets contain high-quality data in only a single modality. For example, ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014) contain diverse images but lack sentence-level textual descriptions associated with the images. In contrast, textual datasets such as the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) contain well-structured paragraphs but no corresponding images. Collecting paired images and captions is expensive and labor-intensive. Due to the subjectivity of caption labels, it is impractical to assume consistent captions for one image. However, the quality of collected pairs in both domains significantly impacts the performance of the jointly trained multi-modal encoders. To relieve the data scarcity, we adopt pre-trained encoders in the image and text domains instead of jointly training text and image encoders from scratch. We then propose a green learning alignment process to deal with the lack of paired information.
We propose a new Green Explainable Multi-Modal Alignment (GEMMA) scheme to deal with paired data scarcity and explainability. The method utilizes the frozen image and text encoder models and aligns the representations using the proposed alignment process. Our contributions are summarized as follows:
We reduce the number of parameters to around 3% compared to fine-tuning the whole encoders. Instead of fine-tuning the pre-trained encoders, we propose an alignment scheme from two pretrained encoders, making the pipeline computationally efficient.
In order to achieve pipeline transparency, we narrow the set of candidates in a stage-wise manner. The modular design divides the entire dataset into subsets. We can statistically understand the retrieval process and the crucial tokens by the feature selection modules in the sub-domain clustering.
We provide bidirectional retrieval in the proposed pipeline. The alignment modules consist of linear projections without incorporating any nonlinearity. Thus, the alignment process can be easily reversed from one to another.
We conduct extensive experiments on two public multi-modal datasets. The results demonstrate that our method can significantly improve the performance in text-to-image retrieval.
2. Related work
The existing methods can be classified into 1) cross-modal retrieval and 2) visual-language models (VLMs). Cross-modal models consist of a convolutional neural network (CNN) to extract features from images and a recurrent neural network (RNN) to process text data. The joint representations of the convolution and recurrent backbones are optimized by metric learning. On the other hand, VLMs employ Large Language Models (LLMs) that work in tandem with the Visual Transformer models (ViTs) for optimal performance. The VLM optimization can be performed by contrastive learning, masking filling, and generative matching.
2.1 Cross-modal retrieval
The cross-modal retrieval algorithms consist of representation matching and feature extraction. Metric learning schemes measure the similarity between the samples and predict the matching scores. Hadsell et al. (2006) propose the idea of contrastive learning. The loss formulation aims to reduce the distance in the latent space for similar samples and to increase the distance for different samples. Triplet loss (Schroff et al., 2015), lifted structure loss (Oh Song et al., 2016), and N-Pair loss (Sohn, 2016) construct the joint latent space by sampling training data. These losses pull positive pairs together and push negative pairs apart, where the positive and negative pairs are formed by the sampling schemes. Thus, optimization can be improved by the hard sampling process (Robinson et al., 2020; Xuan et al., 2020). With the thriving development of self-supervised applications, SimCSE (Gao et al., 2021) and SimCLR (Chen et al., 2020a, 2020b) provide metrics to reinforce the representations. The losses map the representations of the original and augmented images (crop, rotation, color distortion, etc.) onto the same latent space.
Frome et al. (2013) first proposed the concept of joint image and text embedding for ImageNet (Deng et al., 2009) classification. The pipeline utilizes the textual information from the label to construct a lookup table from the nearby concepts as the target embedding, leading to a hierarchical classification. Zheng et al. (2020) adopt deep CNNs as the basis for extracting the image and text features. The instance loss optimizes the two feature extractors, which project the representations from different modalities onto the joint latent space. Lee et al. (2018) utilize the bottom-up attention object detector (Anderson et al., 2018) to obtain semantic representations of images and to perform word-level matching in the captions. The bottom-up detector can provide the modifiers along with the noun, matching the corresponding sentence in detail.
Liu et al. (2020) formulate the information as a graph and adopt structural matching to retrieve the closest subgraph. The object detector obtains the visual graph. The node features are the region of interest (ROI) features of the model, and the vertices are constructed by the Multi-Layer Perceptron (MLP). The textual graph is the Part-Of-Speech (POS) prediction from the Gated Recurrent Unit (GRU) networks. Wang et al. (2018) adopt instance-wise matching for the subgraphs. The overall matching score aggregates the partial graph similarities in a bottom-up manner.
To further exploit the information in the query image, Cheng et al. (2022) adopt an optical character recognition (OCR) module to extract semantic information such as text embeddings of the scene. The model fuses the image tokens and the scene text for the joint representation. Diao et al. (2023) build the image tokens from the ROIs of the object detector and the textual tokens from a bidirectional GRU. The cross-modal attention module is used for the token-wise matching process. Jawade et al. (2023) construct the visual and textual tokens from the pre-trained model. However, that work merges the cross-modal information with cross-attention (Vaswani et al., 2017) modules and handles the retrieval task with transformer structures.
2.2 Visual-language model
Transformers (Vaswani et al., 2017) have achieved significant results in natural language processing and computer vision tasks. The image-text encoders can share similar architectures. Wang et al. (2022) crop the input images into patches and use the patches as visual tokens to formulate the images as a novel language. The jointly trained visual and text encoders (Chen et al., 2023; Zeng et al., 2023) are optimized end-to-end. Visual language models (VLMs) can be categorized into three families (Pennington et al., 2014) by the optimization process:
contrast-based VLMs;
VLMs with masking objectives; and
generative-based VLMs.
Contrast-based VLMs (Radford et al., 2021) are trained on paired multi-modal data with a contrastive loss as the objective. VLMs with masking objectives (Kim et al., 2021; Kwon et al., 2022; Singh et al., 2022) are obtained by self-supervised learning, in which the model needs to predict the masked visual and textual tokens. Generative-based VLMs (Liu et al., 2024; Yu et al., 2022, 2023) take advantage of the great success of AI chatbots and are trained on visual question answering, image captioning, and other downstream tasks.
CLIP (Radford et al., 2021) demonstrates impressive visual representations trained together with paired text descriptions. The transformer encoder takes the nonoverlapping patches and the words as input and utilizes the pooled encoded tokens to represent the images and sentences. The model uses a contrastive learning scheme to project image and text representations onto a shared latent space. This shared space allows for a better understanding of the relationship between the two modalities. The dual (image-text) encoder architecture is prevalent in multi-modal applications.
Kim et al. (2021) adopt the masked-token self-supervised learning used in transformers for natural language processing (Devlin, 2018). The model takes tokenized sentences and image patches as input. Training tasks include paired classification and masked token filling. Kwon et al. (2022) propose the uniform transformer with two pre-training objectives: masked vision and language modeling, and multi-modal alignment. Singh et al. (2022) propose a multi-modal encoder on top of visual and text encoders. The multi-modal encoder aligns the features from the two encoders with global contrastive learning and masked multi-modal modeling.
In addition to representation learning, the large language model provides incredible performance on text generation tasks. Yu et al. (2022) optimize the visual encoder with image captioning as a downstream task. With a jointly trained visual encoder and language decoder, the model provides unified text and visual representations for the transformer. Yu et al. (2023) employ the diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015) for image generation and reinforce cross-modal representations. Liu et al. (2024) combine the visual encoder with the LLM. The given image tokens are used as instructions for the detailed LLM responses. However, the training process requires large-scale paired images and texts, which is computationally expensive.
Despite achieving state-of-the-art performance, large visual-language pre-trained models still have shortcomings in inference. The matching process is not transparent, and humans cannot understand the decision-making within fully connected layers because they lack semantic meanings. In addition to the lack of explainability, the fine-tuning process is computationally expensive. These models have billions of trainable parameters, and high-quality image-text pairs are required for tuning.
2.3 Green learning
To handle the computationally intensive fine-tuning process and to extend image-text encoders using unpaired data, we introduce the Green Learning Alignment algorithm, which uses separately pre-trained image and text encoders. The idea of Green Learning was proposed by Kuo and Madni (2023) and aims to reduce the computational cost of backpropagation while providing a theoretically explainable learning process for various applications. The modular design divides the problem into subproblems, which can be solved using transparent algorithms.
3. Proposed GEMMA method
The GEMMA algorithm can be divided into three stages:
Global alignment;
Image cluster alignment; and
Text cluster alignment.
We adopt the multi-stage approach to approximate the complicated decision-making process rather than building a single large visual-language foundation model from scratch to ensure model efficiency. Starting from the pre-trained image and text feature extractors, we keep the pre-trained model frozen to maintain its ability to generalize with unpaired data in the matching process. We align the representations by training additional single-layer adapter matrices to project the representations onto the joint latent space. Specifically, the alignment process consists of three modules:
alignment;
clustering in subdomains; and
selection of subdomain features.
Clustering and feature selection are performed in both the image and text domains, as shown in Figure 2.
3.1 Alignment
In the alignment process, we do not fine-tune the pre-trained encoders. Instead, we train a lightweight linear transformation in each of the visual and textual domains to align the two representation spaces. The alignment module is illustrated in Figure 3. The visual and text embeddings can be formulated as:
$$x_I = f_I(I) \in \mathbb{R}^{d_I}, \qquad x_T = f_T(T) \in \mathbb{R}^{d_T},$$
where $x_I$ and $x_T$ are the image and text embeddings, $f_I$ and $f_T$ are the frozen image and text encoder models, and $d_I$ and $d_T$ are the dimensions of the image and text representations. With the deterministic representations, the matching process can be denoted as:
$$z_I = A x_I, \qquad z_T = B x_T, \qquad s(I, T) = \mathrm{sim}(z_I, z_T),$$
where $A \in \mathbb{R}^{d \times d_I}$ and $B \in \mathbb{R}^{d \times d_T}$ represent the trainable image-text alignment matrices, $z$ represents the vector in the $d$-dimensional joint space, and $\mathrm{sim}(\cdot, \cdot)$ represents the similarity metric. We adopt cosine similarity as the similarity metric, namely $\mathrm{sim}(u, v) = u^\top v / (\|u\|_2 \|v\|_2)$. We can further optimize the trainable parameters with the contrastive learning loss function (Chen et al., 2020a, 2020b):
$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp(\mathrm{sim}(z_i, z_k)/\tau)}.$$
Here, $(i, j)$ denotes the paired image and sentence in the sampled batch, $N$ denotes the batch size, and $\tau$ denotes the temperature hyperparameter. $\mathbb{1}_{[k \neq i]}$ is an indicator function whose value is one when $k \neq i$ and zero otherwise. The objective function maximizes the similarity of relevant image-text pairs while preventing negative image-text pairs from being embedded closely in the latent space.
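As a rough illustration of the alignment objective described above, the following NumPy sketch projects frozen image and text embeddings with two linear matrices and evaluates a contrastive loss on the resulting cosine similarities. It is a minimal sketch under stated assumptions, not the implementation used in this work: it uses the symmetric, cross-modal-only (CLIP-style) form of the loss rather than the exact formulation of Chen et al., it stores samples as rows so that the matrices act from the right, and the function names `l2_normalize` and `contrastive_alignment_loss` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)


def contrastive_alignment_loss(img_feats, txt_feats, A, B, tau=0.02):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    img_feats: (N, d_I) frozen image embeddings.
    txt_feats: (N, d_T) frozen text embeddings; row i is paired with image i.
    A: (d_I, d) and B: (d_T, d) trainable linear alignment matrices
       (row-vector convention, an assumption of this sketch).
    tau: temperature hyperparameter.
    """
    z_img = l2_normalize(img_feats @ A)
    z_txt = l2_normalize(txt_feats @ B)
    logits = (z_img @ z_txt.T) / tau  # (N, N) cosine similarities / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # shift for numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))
```

With identical one-hot embeddings and identity alignment matrices, the paired similarities dominate and the loss is close to zero; shuffling the captions so that the pairs no longer match raises it sharply.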
Hence, the alignment can be formulated as an optimization problem in which all transformations are linear. Let $z_I = A x_I$ and $z_T = B x_T$ denote the joint-space projections of the image and text embeddings $x_I$ and $x_T$. We can define the inverse projection from the joint latent space without nonlinearities:
$$\hat{x}_I = \tilde{A} z_I, \qquad \hat{x}_T = \tilde{B} z_T,$$
where $\tilde{A}$ and $\tilde{B}$ represent the inverse transformation from the joint space to the original image-text representations. We define the reconstruction loss for both the image and text modality:
$$L_{rec} = \| x_I - \tilde{A} z_I \|_2^2 + \| x_T - \tilde{B} z_T \|_2^2.$$
Furthermore, we use the auxiliary matrices to constrain the joint representations and define the loss of cross-modality reconstruction as:
$$L_{cross} = \| x_I - C z_T \|_2^2 + \| x_T - D z_I \|_2^2,$$
where $C$ and $D$ are the auxiliary transformation matrices from the joint space onto the image and text modality, respectively. In addition, $z_T$ and $z_I$ are obtained from the corresponding paired caption or image data. However, the C and D matrices will not be used during inference. The alignment process is a linear transformation carried out by matrices A and B. The objective function can be written as:
$$L = \lambda_1 L_{con} + \lambda_2 L_{rec} + \lambda_3 L_{cross},$$
where $L_{con}$ is the contrastive loss and $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent hyperparameters in training. Linear alignment provides an invertible transformation from the image-text modality to the joint latent space and vice versa. However, the single-layered alignment is too simple to match all the samples. Thus, we cluster the data to form sub-datasets and utilize the stage-wise alignments for the detailed decision.
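The reconstruction terms and the weighted objective can be sketched as follows. This is only an illustration under assumptions: the squared-error form, the mean reduction, the row-vector convention, and the names `reconstruction_losses`, `total_objective`, `A_back`, and `B_back` are all hypothetical choices made for this sketch rather than details confirmed by the method description.

```python
import numpy as np


def reconstruction_losses(x_img, x_txt, A, B, A_back, B_back, C, D):
    """Return (same-modality, cross-modality) reconstruction losses.

    x_img: (N, d_I), x_txt: (N, d_T) paired frozen embeddings (rows are samples).
    A: (d_I, d), B: (d_T, d) alignment matrices into the joint space.
    A_back: (d, d_I), B_back: (d, d_T) inverse projections (assumed learnable).
    C: (d, d_I), D: (d, d_T) auxiliary matrices, used in training only.
    """
    z_img = x_img @ A
    z_txt = x_txt @ B
    # Each modality should be recoverable from its own joint-space vector.
    rec = (np.mean((z_img @ A_back - x_img) ** 2)
           + np.mean((z_txt @ B_back - x_txt) ** 2))
    # The paired caption's joint vector should predict the image embedding,
    # and the paired image's joint vector should predict the text embedding.
    cross = (np.mean((z_txt @ C - x_img) ** 2)
             + np.mean((z_img @ D - x_txt) ** 2))
    return float(rec), float(cross)


def total_objective(contrastive, rec, cross, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three losses; the weights are training hyperparameters."""
    w_con, w_rec, w_cross = weights
    return w_con * contrastive + w_rec * rec + w_cross * cross
```

Because the auxiliary matrices only appear in the loss, they can be discarded after training, which matches the statement that only the two alignment matrices are needed at inference.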
3.2 Sub-domain clustering
With the alignment process, we can find similar representations by linear transformations. However, the transformation can only take global representations, which means that each image or caption is represented as a single vector of the image or text embedding dimension. In the prevailing transformer models, the image and sentence representations are the pooled output of the tokens. Previous research suggests that fine-grained information is also crucial in information-matching tasks.
Due to the complexity of the fine-grained token representations, it is challenging to train the token-wise alignment in a brute-force manner. Thus, we adopt the clustering algorithms and use the clustering results to obtain crucial tokens. The crucial token selection will be introduced in Section 3.3. We can reduce the feature dimension from the number of tokens and perform a second-stage alignment.
We adopt frequency analysis and statistical approaches to construct a transparent and human-sensible intermediate structure. The clustering is conducted through (1) concept aggregation and (2) representation aggregation.
3.2.1 Concept aggregation.
We extract the concrete concepts from the candidate sentences with a part-of-speech (POS) tagger (Wei et al., 2024). We collect the nouns as anchors and calculate the Term Frequency-Inverse Document Frequency (TF-IDF) to select the representative terms. As shown in Figure 4, the concepts follow a long-tailed distribution, leading to a biased probability estimation. Hence, we aggregate the high-frequency terms based on the detector results and divide the candidate set into subsets for more detailed alignments.
We construct the co-occurrence matrix of the POS tagging and object detection results in the training set. As shown in Figure 5, concepts have a significant relationship with detection results. Hence, we can group the POS-tagged concepts by their probability conditioned on the detection results. To visualize the meaning of the clusters, we use word clouds to show the high-frequency concepts in each cluster, as shown in Figure 6.
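The two statistics used in concept aggregation can be illustrated with a small, dependency-free sketch. It assumes that the nouns have already been extracted by a POS tagger and that detector labels are available per image; the plain TF-IDF variant (relative term frequency times the log inverse document frequency) and the function names `noun_tfidf` and `concept_given_detection` are assumptions of this example and may differ from the exact weighting used in the pipeline.

```python
import math
from collections import Counter


def noun_tfidf(captions_nouns):
    """TF-IDF score of each noun in each caption.

    captions_nouns: list of noun lists, one list per caption
    (assumed to come from a POS tagger).
    Returns one {noun: score} dict per caption.
    """
    n_docs = len(captions_nouns)
    doc_freq = Counter()
    for nouns in captions_nouns:
        doc_freq.update(set(nouns))  # count each noun once per caption
    scores = []
    for nouns in captions_nouns:
        tf = Counter(nouns)
        total = sum(tf.values())
        scores.append({w: (c / total) * math.log(n_docs / doc_freq[w])
                       for w, c in tf.items()})
    return scores


def concept_given_detection(pairs):
    """Estimate P(noun | detected label) from co-occurrence counts.

    pairs: iterable of (detected_label, noun) co-occurrences in the training set.
    Returns {label: {noun: probability}}.
    """
    counts = {}
    for label, noun in pairs:
        counts.setdefault(label, Counter())[noun] += 1
    return {label: {noun: c / sum(cnt.values()) for noun, c in cnt.items()}
            for label, cnt in counts.items()}
```

Rows of the conditional table with similar distributions indicate concepts that can be grouped under the same detector category.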
3.2.2 Representation aggregation.
The clustering is based on the K-means algorithm. To ensure consistency of alignment and clustering, we use the $\ell_2$-norm of normalized representations as a distance metric:
$$d(u, v) = \| \hat{u} - \hat{v} \|_2,$$
where $\hat{u}$ and $\hat{v}$ are the normalized representations, namely $\hat{u} = u / \|u\|_2$. The clustering probability can be denoted as:
$$P(C_i \mid u) = \frac{\exp(-\alpha \, d(u, c_i))}{\sum_{k=1}^{K} \exp(-\alpha \, d(u, c_k))},$$
where $C_i$ and $c_i$ represent the i-th cluster and i-th centroid vector, respectively. $K$ represents the number of clusters and $\alpha$ is a hyperparameter. If $\alpha$ increases, the probability distribution will concentrate on a certain cluster. If $\alpha$ decreases, the probability distribution will become uniform.
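A soft cluster assignment of this kind can be sketched in a few lines of NumPy. The sketch assumes a softmax over negative Euclidean distances between unit-normalized vectors, with a sharpness parameter called `alpha` here; the function name `soft_assignments` and the default value are hypothetical. For unit vectors the squared Euclidean distance equals two minus twice the cosine similarity, so this distance ranks centroids in the same order as the cosine similarity used in alignment.

```python
import numpy as np


def soft_assignments(x, centroids, alpha=10.0):
    """Soft cluster probabilities from distances between normalized vectors.

    x: (n, d) representations; centroids: (K, d) K-means centroids.
    alpha: sharpness; larger values concentrate the mass on the nearest
    centroid, smaller values push the distribution toward uniform.
    Returns an (n, K) array whose rows sum to one.
    """
    x_hat = x / np.linalg.norm(x, axis=1, keepdims=True)
    c_hat = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    # Pairwise Euclidean distances between samples and centroids, shape (n, K).
    dist = np.linalg.norm(x_hat[:, None, :] - c_hat[None, :, :], axis=2)
    logits = -alpha * dist
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)
```

A sample that lies at equal distance from two centroids receives probability one half for each, regardless of the sharpness.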
We can group images and texts based on their cluster probabilities and then align them using contrastive learning within these groups. Contrastive learning benefits from negative samples that are similar to the positive ones, so we use hard-sample mining to ensure sample diversity within each group. The global alignment process helps identify the most challenging cases: we enlarge each group by selecting the top-ranked candidates from the previous alignment as negative samples.
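The hard-negative selection can be sketched as a ranking over the similarity matrix produced by the global alignment. This is a minimal sketch: the layout of the matrix (queries in rows, candidates in columns, true pairs on the diagonal) and the function name `mine_hard_negatives` are assumptions made for illustration.

```python
import numpy as np


def mine_hard_negatives(sim, k):
    """Pick the k most similar non-paired candidates for every query.

    sim: (N, N) similarity matrix from the global alignment, where sim[i, j]
    scores query i against candidate j and the true pair sits on the diagonal.
    k: number of negatives per query (k <= N - 1).
    Returns an (N, k) array of candidate indices, most similar first.
    """
    masked = sim.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)    # never select the positive pair
    order = np.argsort(-masked, axis=1)  # descending similarity
    return order[:, :k]
```

The returned candidates are the ones the global stage confuses most with the true pair, which is what makes them informative negatives for the subdomain alignment.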
To clarify the roles of K-means clustering and the choice of hyperparameters, we conducted experiments comparing K-means and Agglomerative Clustering and varying the number of clusters. As shown in Table 1, increasing the number of clusters improves the retrieval in certain settings. However, this also requires training additional alignment matrices for the clusters. Therefore, we set the number of clusters to eight to strike a balance between the number of trainable parameters and the performance. K-means clustering is selected in GEMMA due to its slight empirical advantage over agglomerative clustering.
3.3 Feature selection
Clustering results provide pseudo-labels for further feature selection. The label can be denoted as:
$$y_u^{(i)} = \begin{cases} 1, & P(C_i \mid u) \geq \theta, \\ 0, & \text{otherwise}. \end{cases}$$
Here, $y_u^{(i)}$ represents the label of the data point u, indicating whether it belongs to the group i, $P(C_i \mid u)$ is the clustering probability from Section 3.2.2, and $\theta$ is a user-defined threshold. With pseudo-labels, we can further adopt the Discriminant Feature Test (DFT) (Yang et al., 2022) to select informative features and reduce feature dimensions. DFT is a supervised feature selection process that measures dimension-wise importance. For a given 1D input feature, we order the samples by their feature values and bound the feature range by the sample minimum and sample maximum. Then, we partition the samples along the given dimension and calculate the partition purity by the weighted cross-entropy with the pseudo-labels obtained from Section 3.2. A feature is more discriminant if it has a lower loss value. Finally, we plot the loss value curve from the lowest to the highest and use the elbow point to select discriminant features from the whole feature set.
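The dimension-wise test can be sketched as follows for binary pseudo-labels. This is a simplified illustration rather than the reference DFT procedure: it assumes uniformly spaced candidate thresholds between the sample minimum and maximum, measures purity with the size-weighted binary entropy of the two partitions, and leaves out the elbow-point selection. The function names and the `n_bins` parameter are hypothetical.

```python
import numpy as np


def binary_entropy(labels):
    """Entropy (in bits) of a 0/1 label array; zero for empty or pure sets."""
    if labels.size == 0:
        return 0.0
    p = float(labels.mean())
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))


def dft_loss(feature, labels, n_bins=16):
    """Lowest weighted entropy over candidate splits of one feature dimension."""
    lo, hi = feature.min(), feature.max()
    if hi == lo:  # constant feature: no split is possible
        return binary_entropy(labels)
    best = np.inf
    for t in np.linspace(lo, hi, n_bins + 1)[1:-1]:
        left, right = labels[feature <= t], labels[feature > t]
        loss = (left.size * binary_entropy(left)
                + right.size * binary_entropy(right)) / labels.size
        best = min(best, loss)
    return float(best)


def rank_features(X, labels, n_bins=16):
    """Return (indices sorted from most to least discriminant, per-feature losses)."""
    losses = np.array([dft_loss(X[:, j], labels, n_bins) for j in range(X.shape[1])])
    return np.argsort(losses), losses
```

A caller would sort the returned losses, locate the elbow of the curve, and keep the dimensions to its left.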
Separating the whole dataset into subsets allows us to conduct the discriminant feature test among the tokens with the pseudo-labels from the clustering results. Thus, token-level alignments can be performed using the same procedure as global-level alignment.
3.4 Mathematical expression
The overall alignment process can be divided into three modules:
global matching;
subdomain clustering; and
subdomain matching.
The subdomain clustering and alignment will be conducted within the image and the text domain. We can aggregate the alignments in the subproblem to approximate the overall alignment:
where denotes the probability distribution of the images with a given query text. and denote the probability distribution of image and text, and is the result of the clustering of our clustering modules. We further assume that the probability distribution within the cluster can be approximated as uniform. Conditional probability can reflect the stage-wise design in the proposed pipeline.
Furthermore, we use the similarity measurement to simplify the probability estimator, which means that we use to represent . In the work, we adopt the cosine similarity as:
where and denote the pooled outputs from the feature extractors (global features), and denote the token, i.e. fine-grained, features, W: denote the alignment matrices corresponding to different subsets from the clustering results. The [G; T] denotes the concatenated features of global and tokens. Due to computational cost, we cannot directly collect all token features. Therefore, we conduct the feature selection process based on the clustering results.
The feature selection process is an approximation based on the clustering results. The process is expressed as a combination of the conditional probabilities. For simplicity, we ignore the alignment matrix in the following representations:
where represents the feature selection and dimension reduction process in Section 3.3 and and represent the cluster sets of K-means. Instead of training a complicated alignment process from the token-level output of the feature extractor, we propose a stage-wise decomposition on the dataset and train simpler structures for the subsets. Meanwhile, the alignments in the stages are linear, which provides the inversion operation and preserves the dual accessibility in image and text domains.
4. Experiments
4.1 Dataset
We perform image-to-text and text-to-image retrieval on two image-text benchmarks: Flickr30k and MS-COCO. The Flickr30k dataset (Young et al., 2014) contains 31,000 images, and every image has five paired captions. The training set contains 29,000 images; the validation and testing sets contain 1,000 images each. MS-COCO (Lin et al., 2014) is a larger-scale dataset with 123,287 images, each with at least five captions. We follow the ‘Karpathy’ split for the experiments (Karpathy and Fei-Fei, 2015): 113,287 images for training, 5,000 for validation, and 5,000 for testing. We use the two benchmarks with different sizes to demonstrate the scalability and generalizability of our approach. The performance is evaluated using the Recall@K metric, where K refers to the top-K matches of the retrieval results. A retrieval is considered a true positive if the predicted matches include at least one of the paired ground-truth items. Specifically, if the top K matches contain one of the five corresponding captions for a given image, it is counted as a positive in the recall metrics.
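The metric can be computed directly from a query-by-candidate similarity matrix, as in the following sketch. The function name `recall_at_k` and the representation of the ground truth as one set of candidate indices per query are assumptions made for illustration.

```python
import numpy as np


def recall_at_k(sim, ground_truth, k):
    """Fraction of queries with at least one relevant item in the top k.

    sim: (n_queries, n_candidates) similarity matrix.
    ground_truth: list of sets; ground_truth[q] holds the candidate indices
    paired with query q (e.g., the five captions of an image).
    """
    top_k = np.argsort(-sim, axis=1)[:, :k]  # best-scoring candidates first
    hits = [len(set(row.tolist()) & relevant) > 0
            for row, relevant in zip(top_k, ground_truth)]
    return float(np.mean(hits))
```

The same function covers both directions: image-to-text retrieval uses images as queries and captions as candidates, and text-to-image retrieval uses the transposed similarity matrix with one paired image per caption.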
4.2 Hyperparameter settings
The overall algorithm is trained stage by stage. We adopt K-means as the clustering algorithm. The number of clusters is 8, and is 50 for pseudo-labeling. For Flickr30K, we set the temperature parameter at 0.02, and the ratio between losses is set to in the global alignment. In the alignment of the image subdomain, the temperature parameter is set to 0.015, and the ratio between the losses is set to . In the text subdomain alignment, the temperature parameter is set to 0.01, and the ratio between the losses is set to .
For the MS-COCO dataset, the temperature parameter is set to 0.05, and the ratio between the losses is set to in the global alignment. In the alignment of the image subdomain, the temperature parameter is set to 0.03, and the ratio between the losses is set to . In the text subdomain alignment, the temperature parameter is set to 0.02, and the ratio between the losses is set to .
The dimension of the joint space is set to 768, which follows the token dimension of the transformer encoders. All optimization is performed using AdamW with a learning rate of 0.001.
4.3 Retrieval
We conducted the experiments and compared our alignment approach to the SOTA retrieval models. The results are shown in Table 2. We extract information from the frozen CLIP image and text encoders in the experiments. The CLIP encoders remain frozen during further alignments and serve as the baseline for our alignment process. The CLIP encoders contain more than 428M parameters. However, we do not fine-tune the overall encoders in our alignment process; instead, we train additional alignment matrices. The trainable parameters can be reduced from 428M to 9.43M. The encoders remain untrainable during the training of alignment matrices. Therefore, GPU memory consumption is proportional to the trainable parameters, which can be reduced to less than 10 percent of the fully fine-tuned approach.
On Flickr30k (1k testing set), our approach outperforms other image-to-text and text-to-image retrieval methods. Alignment improves Recall@1 by 0.6% in image-to-text retrieval. Meanwhile, our approach provides a 6% boost in text-to-image retrieval. RCAR (Diao et al., 2023) needs dual-way optimized models, namely image-to-text and text-to-image. Our method is optimized in a feed-forward manner, and it ensembles the substructures directly.
On MS-COCO (5k testing set), our method provides competitive performance in image-to-text retrieval and outperforms the others in text-to-image retrieval by a boost of 2.1% in Recall@1. We achieve the best text-to-image retrieval performance on both datasets, showcasing the scalability of our approach.
4.4 Generalizability
This section demonstrates the alignment between visual and text encoders that are trained separately. The encoders remain frozen in the alignment process. All alignments are based on the grouping and linear projection proposed in our pipeline. The performance of the CLIP visual and text encoders without GEMMA alignment is taken from the original CLIP paper (Radford et al., 2021). Starting from the jointly trained CLIP structure, we replace the text encoder with RoBERTa (Liu et al., 2019) and the visual encoder with a CNN-based object detector (Anderson et al., 2018). All experiments are carried out on the Flickr30k dataset and follow the parameter settings in Section 4.2.
The results are shown in Table 3. The best performance comes from the jointly trained models, whose representations are preliminarily aligned in the pre-training process. Compared to the CLIP visual encoder, the features of the object detector are weaker in the alignment process. However, the separately trained text encoder, RoBERTa (Liu et al., 2019), does not suffer from the unpaired training dataset. The representations from the CLIP visual encoder and the RoBERTa text encoder can provide competitive performance in image-to-text retrieval and better performance in text-to-image retrieval than the original CLIP. The encoder can be adapted to the retrieval application without fine-tuning with the paired image and text data.
In contrast, the Convolutional Neural Network (CNN)-based object detector representation cannot be applied directly to the image-text retrieval task. The decrease in performance results from the lack of global understanding. The object detector features are obtained from parts of the image, and the representations lack a global understanding of the image. As for the CLIP visual encoder, the pooled output of the visual tokens contains the global information of the input image, and the detailed token features are available for the further alignment stages. A visual example can be found in Section 4.6. If the alignment process misses the global information at the very beginning, the alignment on detailed information may lead to a misfocused result.
4.5 Ablation study on different stages
Due to the modularized design, we can compare the design from global alignment to subgroup alignment in the visual and textual domains. We choose encoders trained in different modalities to perform the alignment process. We use the CLIP visual encoder (Radford et al., 2021) and the RoBERTa (Liu et al., 2019) text encoder for the ablation study of stage-wise alignment on Flickr30k dataset (Young et al., 2014). The two encoders remain frozen in the experiments. The ‘without alignment’ setting means the direct dot product between the encoded features from two models. The two embeddings are located in different semantic latent spaces. Hence, the performance is the lowest compared to the other alignment processes.
With global alignment, the features can provide basic performance in retrieval tasks. However, a naive linear projection cannot handle complex interactions between detailed information in the candidate set. Thus, recall rates increase as we add more stages of grouping, feature selection, and alignment. Feature selection provides statistical criteria for dimension reduction, preventing the latent dimension from increasing with additional tokens. We can carry the essential features into the next stage and reduce the computational cost at the same time. Hence, the three-stage alignment achieves the best performance with comparable efficiency.
4.6 From detection to alignment
To better understand the difference between the visual features of transformers and object detectors, we demonstrate the retrieval processing step by step.
The object detector can detect humans and vehicles, but the features lack a sensible relationship with each other. In the clustering stage, the clusters will focus on the specific object in the figure, that is, the bus in Figure 8. In global alignment, the paired sentence is fifth. However, the correct captions fall to the seventh when we perform the finer alignment, which clusters on cars and buses. Although object detectors can provide information fragments, the grouping process cannot link features. The detector features cannot find the central concept in the picture, but can be distracted by the surrounding objects.
The object detector can provide the features with the local information, yet the patched information is not represented in a structured manner. That is, we can only obtain the partial contents in the image and lose the global semantic representation in the clustering process. We rely on the global and local information relationship to retrieve suitable captions in the proposed coarse-to-fine clustering process.
On the other hand, the visual transformer can provide more information about the tokens and integrate the representations through a global pooling process. Hence, the token information can be selected in our feature selection module (Section 3.3) and clustered according to the global features. The overall architecture can sort the rich representation in a coarse-to-fine manner and provide a better multi-modal alignment performance.
When comparing the alignment process across different features, it becomes evident that performance is influenced by the types of features used. However, the alignment process cannot transform weak visual features into strong ones. Instead, it aims to bridge the gap caused by differences in modality. Consequently, performance improves when encoded features have larger receptive fields. The proposed alignment does not require jointly fine-tuning the encoders in the limited paired multi-modal data and generalizes the single-modal encoder with additional alignment matrices.
5. Conclusion and future work
Our approach can achieve outstanding performance in both image-to-text and text-to-image retrieval tasks. Furthermore, our method involves a step-by-step alignment process that maintains compatibility in the decision-making procedure. We divide the alignment into global and subdomain matching and apply a feature selection method to decrease the input feature dimensions. All subprocesses can be expressed mathematically and analyzed statistically, providing transparency compared to black-box output. To ensure computational efficiency, we froze the visual and text encoders and only trained the alignment matrices, which represent only about 3% of the parameters compared to the original model.
In addition, we conducted experiments on applying our alignment mechanism to individually trained text and image encoders. In the testing dataset, we found that the pre-trained text encoder can improve the performance of text-to-image retrieval. Replacement of the text encoder can also lead to similar performance in image-to-text retrieval.
We are working on developing a purely green learning solution for image understanding in the foreseeable future. By aiming not only for transparency but also computational efficiency, we can have a better understanding of the multi-modal information representation.
This work was supported by the DEVCOM Army Research Laboratory (ARL) under agreement W911NF2020157. Computation in the work was supported by the University of Southern California’s Center for Advanced Research Computing (carc.usc.edu).









