Image-text retrieval is a fundamental task in image understanding: given an image or a text, the algorithm fetches the most relevant counterpart in the other modality. Large visual-language models are trained on paired image and text data to extract joint representations. However, they are computationally expensive and do not explain how the data from different modalities are aligned. To this end, we propose an efficient, stage-wise alignment for image and text representations, called the Green Explainable Multi-Modal Alignment (GEMMA). GEMMA is computationally efficient, reducing the trainable parameters to 3% of those required to fine-tune all image and text encoders. The intermediate clustering results demonstrate the explainability of the alignment mechanism in our model. Experiments show that GEMMA outperforms state-of-the-art retrieval models in text-to-image and image-to-text retrieval tasks on the Flickr30k and MS-COCO datasets. GEMMA can also be generalized to unseen image-text pairs from separately pre-trained visual and text encoders.

Image-text retrieval links textual and visual information and is a foundational image understanding application in computer vision. The goal of the task is to link textual descriptions and pixels in image arrays that represent similar concepts or semantics. The image-text retrieval task aims to find the most relevant information from the candidate sets in the counterpart modality. That is, when an image is given, the model needs to extract related captions by ranking them with higher scores and vice versa. Figure 1 shows an example of an image and its paired textual descriptions.

Figure 1.
A construction scene shows a worker raised on a blue lifting platform beside a tall, worn building, with demolition or repair activity underway. The image shows the side of a tall, damaged building with peeling exterior surfaces and exposed concrete. In the foreground, a blue industrial lifting platform extends upward from the ground to the building wall. A single worker stands on the raised platform close to the structure, appearing to carry out construction or maintenance work. The platform arm reaches diagonally upward, positioning the worker several storeys above ground level. Adjacent buildings with windows are visible in the background, indicating an urban construction or renovation site. The scene focuses on elevated work using heavy machinery alongside an ageing building façade.

An example of image-to-text retrieval. Given an image, we need to retrieve the paired captions from the candidate set

Image-text retrieval can provide the information for visual-textual applications, including visual question answering (Nam et al., 2017), image captioning (Anderson et al., 2018), visual grounding (Wang et al., 2019), and visual common sense reasoning (Zellers et al., 2019). With the thriving development of deep learning and computational resources, neural networks dominate the current research trend. Jointly trained neural network-based image and text encoders transform the input text and image into vectors in a common latent space. The two encoders are trained under metric learning schemes, which compare the cosine similarity between the paired and unpaired image and text samples. For example, an intuitive solution to representing the image and text in a joint latent space is optimizing two encoder models by minimizing contrastive loss (Chen et al., 2020a, 2020b). The loss function can gather the paired information but repel the unpaired data in the latent space.
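
Once both modalities live in a common latent space, retrieval reduces to ranking candidates by cosine similarity. The following is a minimal sketch of that ranking step in both directions; the function names and the toy vectors are hypothetical and stand in for the outputs of real encoders.

```python
import numpy as np

def cosine_similarity_matrix(queries, candidates):
    """Return the (n_queries, n_candidates) matrix of cosine similarities."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return q @ c.T

def retrieve_top_k(queries, candidates, k):
    """Indices of the k most similar candidates for each query, best first."""
    scores = cosine_similarity_matrix(queries, candidates)
    return np.argsort(-scores, axis=1)[:, :k]

# Toy joint-space vectors (hypothetical): 2 image vectors and 3 caption vectors.
image_vecs = np.array([[1.0, 0.0], [0.0, 1.0]])
caption_vecs = np.array([[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]])

image_to_text = retrieve_top_k(image_vecs, caption_vecs, k=2)  # captions per image
text_to_image = retrieve_top_k(caption_vecs, image_vecs, k=1)  # image per caption
```

The same routine serves image-to-text and text-to-image retrieval; only the roles of query and candidate set are swapped.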

Although end-to-end solutions perform astonishingly, explainability is crucial for image-understanding applications. In the multi-modal application scenario, humans expect a complete reasoning procedure instead of a magic answer from the model. However, neural networks obscure the reasoning process within the joint latent space through complex floating-point operations, e.g., calculating cosine similarities between vectors. The nonlinearities in the model make the whole inference process a black box. To this end, we propose a multi-stage methodology, dividing the retrieval process into three stages:

  1. Global alignment;

  2. Image cluster alignment; and

  3. Text cluster alignment.

Each alignment stage consists of three modules:

  1. alignment;

  2. subdomain clustering; and

  3. subdomain feature selection.

More fine-grained information is revealed in the module’s feature selection process.

The availability of paired image and text data is another challenge when training multi-modal models. Most datasets contain only high-quality data in a single modality. For example, ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014) contain diverse images but lack sentence-level textual descriptions associated with the images. In contrast, textual datasets such as the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words) contain well-structured paragraphs, yet without corresponding images. Collecting paired images and captions is expensive and labor-intensive. Due to the subjectiveness of caption labels, it is impractical to assume consistent captions for one image. However, the quality of collected pairs in both domains significantly impacts the performance of the jointly trained multi-modal encoders. To relieve the data scarcity, we adopt pre-trained encoders in the image and text domains instead of jointly training text and image encoders from scratch. We then propose a green learning alignment process to deal with the lack of paired information.

We propose a new Green Explainable Multi-Modal Alignment (GEMMA) scheme to deal with paired data scarcity and explainability. The method utilizes the frozen image and text encoder models and aligns the representations using the proposed alignment process. Our contributions are summarized as follows:

  • We reduce the number of parameters to around 3% compared to fine-tuning the whole encoders. Instead of fine-tuning the pre-trained encoders, we propose an alignment scheme from two pretrained encoders, making the pipeline computationally efficient.

  • In order to achieve pipeline transparency, we narrow the set of candidates in a stage-wise manner. The modular design divides the entire dataset into subsets. We can statistically understand the retrieval process and the crucial tokens by the feature selection modules in the sub-domain clustering.

  • We provide bidirectional retrieval in the proposed pipeline. The alignment modules consist of linear projections without incorporating any nonlinearity. Thus, the alignment process can be easily reversed from one to another.

  • We conduct extensive experiments on two public multi-modal datasets. The results demonstrate that our method can significantly improve the performance in text-to-image retrieval.

The existing methods can be classified into 1) cross-modal retrieval and 2) visual-language models (VLMs). Cross-modal models consist of a convolutional neural network (CNN) to extract features from images and a recurrent neural network (RNN) to process text data. The joint representations of the convolutional and recurrent backbones are optimized by metric learning. On the other hand, VLMs employ Large Language Models (LLMs) that work in tandem with Vision Transformers (ViTs). The VLM optimization can be performed by contrastive learning, masked-token filling, and generative matching.

The cross-modal retrieval algorithms consist of representation matching and feature extraction. Metric learning schemes measure the similarity between the samples and predict the matching scores. Hadsell et al. (2006) propose the idea of contrastive learning. The loss formulation aims to reduce the distance in the latent space for similar samples and to increase the distance for different samples. Triplet loss (Schroff et al., 2015), lifted structure loss (Oh Song et al., 2016), and N-Pair loss (Sohn, 2016) construct the joint latent space by sampling training data. These losses pull positive pairs together and push negative pairs apart, where the positive and negative pairs are formed by the sampling scheme. Thus, optimization can be improved by hard sampling (Robinson et al., 2020; Xuan et al., 2020). With the thriving development of self-supervised applications, SimCSE (Gao et al., 2021) and SimCLR (Chen et al., 2020a, 2020b) provide metrics to reinforce the representations. The losses map the representations of the original and augmented images (crop, rotate, color distort, etc.) onto the same latent space.
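
As an illustration of how such metric-learning losses gather positives and repel negatives, the sketch below implements a triplet loss in the spirit of Schroff et al. (2015) on squared Euclidean distances. The function name and the margin value are illustrative assumptions, not settings taken from this work.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss over a batch of (anchor, positive, negative) triplets.

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin` (an illustrative value), in squared distance.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=1)
    d_neg = np.sum((anchor - negative) ** 2, axis=1)
    return float(np.maximum(d_pos - d_neg + margin, 0.0).mean())

anchor = np.array([[0.0, 0.0]])
near = np.array([[0.0, 1.0]])   # squared distance 1 from the anchor
far = np.array([[2.0, 0.0]])    # squared distance 4 from the anchor

easy = triplet_loss(anchor, near, far)   # positive already closer: no penalty
hard = triplet_loss(anchor, far, near)   # positive farther than negative: penalised
```

Hard sampling corresponds to preferring triplets such as the second one, whose loss is non-zero and therefore still carries gradient signal.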

Frome et al. (2013) first proposed the concept of joint image and text embedding in the ImageNet (Deng et al., 2009) classification. The pipeline utilizes the textual information from the label to construct a lookup table from the nearby concepts as the target embedding, leading to a hierarchical classification. Zheng et al. (2020) adopt deep CNNs as the basis for extracting the image and text features. The instance loss optimizes the two feature extractors, which can project the representations from different modalities onto the joint latent space. Lee et al. (2018) utilize a bottom-up attention object detector (Anderson et al., 2018) to obtain semantic representations of images and to perform word-level matching in the captions. The bottom-up detector can provide the modifiers with the noun, matching the corresponding sentence with the details.

Liu et al. (2020) formulate the information as a graph and adopt structural matching to retrieve the closest subgraph. The object detector obtains the visual graph. The node features are the region of interest (ROI) features of the model, and the vertices are constructed by the Multi-Layer Perceptron (MLP). The textual graph is the Part-Of-Speech (POS) prediction from the Gated Recurrent Unit (GRU) Networks. Wang et al. (2018) adopt instance-wise matching for the subgraphs. The overall matching score aggregates the partial graph similarities in a bottom-up manner.

To further exploit the information in the query image, Cheng et al. (2022) adopt an optical character recognition (OCR) module to extract semantic information such as text embeddings of the scene. The model fuses the image token and the scene text for the joint representation. Diao et al. (2023) build the image tokens from ROIs produced by the object detector and the textual tokens from a bidirectional GRU. The cross-modal attention module is used for the token-wise matching process. Jawade et al. (2023) construct the visual and textual tokens from the pre-trained model. However, the research merges the cross-modal information by cross-attention (Vaswani et al., 2017) modules and manages the retrieval task with the transformer structures.

Transformers (Vaswani et al., 2017) have achieved significant results in natural language processing and computer vision tasks. The image-text encoders can share similar architectures. Wang et al. (2022) crop the input images into patches and use the patches as visual tokens to formulate the images as a novel language. The jointly trained visual and text encoders (Chen et al., 2023; Zeng et al., 2023) are optimized end-to-end. Visual language models (VLMs) can be categorized into three families (Pennington et al., 2014) by the optimization process:

  1. contrast-based VLMs;

  2. VLMs with masking objectives; and

  3. generative-based VLMs.

Contrast-based VLMs (Radford et al., 2021) are trained on paired multi-modal data, and the objective is the contrastive loss. VLMs with masking objectives (Kim et al., 2021; Kwon et al., 2022; Singh et al., 2022) are obtained by a self-supervised learning scheme in which the model needs to predict the masked visual and textual tokens. Generative-based VLMs (Liu et al., 2024; Yu et al., 2022, 2023) take advantage of the great success of AI chatbots, which are trained in visual question answering, image captioning, and other downstream tasks.

CLIP (Radford et al., 2021) demonstrates impressive visual representations trained together with paired text descriptions. The transformer encoder takes the nonoverlapping patches and the words as input and utilizes the pooled encoded tokens to represent the images and sentences. The model uses a contrastive learning scheme to project image and text representations onto a shared latent space. This shared space allows for a better understanding of the relationship between the two modalities. The dual (image-text) encoder architecture is prevalent in multi-modal applications.

Kim et al. (2021) utilize the masked-token self-supervised learning used in transformers (Devlin, 2018) for natural language processing. The model takes tokenized sentences and image patches as input. Training tasks include paired classification and masked token filling. Kwon et al. (2022) propose a uniform transformer with two pre-training objectives: masked vision and language modeling, and multi-modal alignment. Singh et al. (2022) propose a multi-modal encoder on top of visual and text encoders. The multi-modal encoder aligns the features from the two encoders with global contrastive learning and masked multi-modal modeling.

In addition to representation learning, the large language model provides incredible performance on text generation tasks. Yu et al. (2022) optimize the visual encoder with image captioning as a downstream task. With a jointly trained visual encoder and language decoder, the model provides unified text and visual representations for the transformer. Yu et al. (2023) employ the diffusion models (Ho et al., 2020; Sohl-Dickstein et al., 2015) for image generation and reinforce cross-modal representations. Liu et al. (2024) combine the visual encoder with the LLM. The given image tokens are used as instructions for the detailed LLM responses. However, the training process requires large-scale paired images and texts, which is computationally expensive.

Despite achieving state-of-the-art performance, large visual-language pre-trained models still have shortcomings in inference. The matching process is not transparent, and humans cannot understand the decision-making within fully connected layers because they lack semantic meanings. In addition to the lack of explainability, the fine-tuning process is computationally expensive. These models have billions of trainable parameters, and high-quality image-text pairs are required for tuning.

To handle the computationally intensive fine-tuning process and expand the image-text encoder using unpaired data, we introduce the Green Learning Alignment algorithm, which uses separately pre-trained image-text encoders. The idea of Green Learning was proposed by Kuo and Madni (2023) and aims to reduce the computational cost of backpropagation while providing a theoretically explainable learning process for various applications. The modular designs can divide the problem into subproblems, which can be solved using transparent algorithms.

The GEMMA algorithm can be divided into three stages:

  1. Global alignment;

  2. Image cluster alignment; and

  3. Text cluster alignment.

We adopt the multi-stage approach to approximate the complicated decision-making process rather than building a single large visual-language foundation model from scratch to ensure model efficiency. Starting from the pre-trained image and text feature extractors, we keep the pre-trained model frozen to maintain its ability to generalize with unpaired data in the matching process. We align the representations by training additional single-layer adapter matrices to project the representations onto the joint latent space. Specifically, the alignment process consists of three modules:

  1. alignment;

  2. clustering in subdomains; and

  3. selection of subdomain features, where clustering and feature selection are performed in both the image and text domains, as shown in Figure 2.

Figure 2.
A multi stage framework shows global alignment of paired image text data, followed by image domain and text domain clustering, feature selection, and subgroup alignment. The diagram presents a structured framework divided into paired data, image domain, and text domain sections. Stage 1 shows global alignment of paired images and captions. In the image domain, stage 2a clusters similar images, stage 2b selects discriminant tokens, and stage 2c performs image domain subgroup alignment. In the text domain, captions are clustered in stage 3a, keywords are selected in stage 3b, and subgroup alignment is performed in stage 3c. Each stage is shown in separate labelled panels, with icons and example images or documents indicating the processing flow.

The overall algorithm design of Alignment. The first stage is the global alignment. The second and third stages include fine-grained clustering and feature selections in the image and text domain

In the alignment process, we do not fine-tune the pre-trained encoders. We train a lightweight linear transformation in the visual and textual domains to align the two representation spaces. The alignment module is illustrated in Figure 3. For an input image $I$ and an input text $T$, the visual and text embeddings can be formulated as:

$$e_{vis} = F(I) \in \mathbb{R}^{d_{vis}}, \qquad e_{txt} = G(T) \in \mathbb{R}^{d_{txt}} \tag{1}$$
Figure 3.
A dual stream model shows visual and text embeddings aligned into joint features, with inversion modules and reconstruction, cross reconstruction, and contrastive losses. The diagram shows a dual pathway architecture for visual and text data. On the left, visual embedding d vis and text embedding d txt feed into visual alignment and text alignment modules, which map both into joint features d joint on the right. From the joint visual feature, a reconstruction loss connects back through visual inversion to the visual embedding. From the joint text feature, a reconstruction loss connects back through text inversion to the text embedding. Cross text inversion and cross visual inversion connect joint features back to the opposite modality, forming a cross reconstruction loss between visual and text embeddings. A contrastive loss links the joint visual feature and joint text feature. Blue boxes represent features, yellow boxes represent alignment modules, and red boxes represent trainable inversion parameters.

The illustration of the alignment process. The blue boxes are the features extracted by the frozen encoders. The orange boxes are the trainable transformation matrices. The red boxes are the auxiliary matrices for constraining the representations in the joint space

where $e_{vis}, e_{txt}$ are the image and text embeddings, $F, G$ are the frozen image and text encoder models, and $d_{vis}, d_{txt}$ are the dimensions of the image and text representations. With the deterministic representations, the matching process can be denoted as:

$$z_{vis} = A e_{vis}, \qquad z_{txt} = B e_{txt}, \qquad \mathrm{sim}(z_{vis}, z_{txt}) = \mathrm{sim}(A e_{vis}, B e_{txt}) \tag{2}$$

where $A \in \mathbb{R}^{d_{joint} \times d_{vis}}$ and $B \in \mathbb{R}^{d_{joint} \times d_{txt}}$ represent the trainable image-text alignment matrices, $z \in \mathbb{R}^{d_{joint}}$ represents the vector in the joint space, and $\mathrm{sim}(\cdot, \cdot)$ represents the similarity metric. We adopt cosine similarity as the similarity metric, namely $\mathrm{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \lVert v \rVert}$. We can further optimize the trainable parameters with the contrastive learning loss function (Chen et al., 2020a, 2020b):

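$$\ell_{i,j} = -\log \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)} \tag{3}$$

Here, $(i, j)$ denotes the paired image and sentence in the sampled batch, $N$ denotes the batch size, and $\tau \in \mathbb{R}$ denotes the temperature hyperparameter. $\mathbb{1}_{[k \neq i]} \in \{0, 1\}$ is an indicator function whose value is one when $k \neq i$. The objective function maximizes the similarity of relevant image-text pairs while preventing negative image-text pairs from being embedded closely in the latent space.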
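
The linear alignment and the contrastive objective can be sketched together in a few lines of NumPy. This is a minimal illustration, assuming the NT-Xent form of Chen et al. (2020a) in which the $N$ image vectors and $N$ text vectors of a batch are stacked into $2N$ samples; the function name, the dimensions, the random stand-in embeddings, and the temperature value are all hypothetical.

```python
import numpy as np

def nt_xent_loss(z_vis, z_txt, tau=0.1):
    """Contrastive loss over N paired joint-space vectors (rows are pairs)."""
    n = z_vis.shape[0]
    z = np.concatenate([z_vis, z_txt], axis=0)            # (2N, d_joint)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # unit length -> cosine
    logits = (z @ z.T) / tau
    np.fill_diagonal(logits, -np.inf)                     # indicator 1[k != i]
    # The positive of row i is its paired sample from the other modality.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    row_max = logits.max(axis=1, keepdims=True)           # for numerical stability
    log_denominator = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    return float(np.mean(log_denominator - logits[np.arange(2 * n), pos]))

rng = np.random.default_rng(0)
d_vis, d_txt, d_joint = 8, 6, 4            # illustrative sizes only
e_vis = rng.normal(size=(5, d_vis))        # stand-in for frozen image embeddings
e_txt = rng.normal(size=(5, d_txt))        # stand-in for frozen text embeddings
A = rng.normal(size=(d_joint, d_vis))      # trainable image alignment matrix
B = rng.normal(size=(d_joint, d_txt))      # trainable text alignment matrix

z_vis, z_txt = e_vis @ A.T, e_txt @ B.T    # project both modalities to the joint space
loss = nt_xent_loss(z_vis, z_txt)
```

Only `A` and `B` would receive gradients in training, which is why the number of trainable parameters stays small relative to the frozen encoders.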

Hence, the problem can be formulated as an optimization problem, and all transformations are linear. We can define the inverse projection in the joint latent space without nonlinearities:

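$$\hat{e}_{vis} = A^{-1} z_{vis}, \qquad \hat{e}_{txt} = B^{-1} z_{txt} \tag{4}$$

where $A^{-1} \in \mathbb{R}^{d_{vis} \times d_{joint}}$ and $B^{-1} \in \mathbb{R}^{d_{txt} \times d_{joint}}$ represent the inverse transformations from the joint space to the original image-text representations. We define the reconstruction loss for both the image and text modalities: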

(5)

Furthermore, we use the auxiliary matrices to constrain the joint representations and define the loss of cross-modality reconstruction as:

(6)

where $C \in \mathbb{R}^{d_{vis} \times d_{joint}}$ and $D \in \mathbb{R}^{d_{txt} \times d_{joint}}$ are the auxiliary transformation matrices from the joint space onto the image and text modalities, respectively. In addition, $z_{vis}, z_{txt}$ are obtained from the corresponding paired caption or image data, $e_{vis}$ and $e_{txt}$. However, the $C$ and $D$ matrices will not be used during inference. The alignment process is a linear transformation carried out by matrices $A$ and $B$. The objective function can be written as:

(7)

where $\alpha$, $\beta$, and $\gamma \in \mathbb{R}$ represent hyperparameters in training. Linear alignment provides an invertible transformation from the image-text modality to the joint latent space and vice versa. However, the single-layered alignment is too simple to match all the samples. Thus, we cluster the data to form sub-datasets and use stage-wise alignments for finer-grained decisions.
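
To make the roles of the inversion and auxiliary matrices concrete, the sketch below combines the three loss terms into one objective. It is one plausible reading, not the exact formulation: it assumes mean squared error for both reconstruction terms, assumes that $C$ maps the paired caption's joint vector back to the image embedding and $D$ maps the paired image's joint vector back to the text embedding, and assumes the weights multiply the contrastive, reconstruction, and cross-reconstruction terms in that order. The function name and the `contrastive` argument (a precomputed scalar) are hypothetical.

```python
import numpy as np

def alignment_objective(e_vis, e_txt, A, B, A_inv, B_inv, C, D,
                        contrastive, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of contrastive, reconstruction and cross-reconstruction terms.

    The squared-error form and the placement of alpha/beta/gamma are
    assumptions made for illustration.
    """
    z_vis, z_txt = e_vis @ A.T, e_txt @ B.T
    # Reconstruction: map each joint vector back to its own modality.
    rec = (np.mean((z_vis @ A_inv.T - e_vis) ** 2)
           + np.mean((z_txt @ B_inv.T - e_txt) ** 2))
    # Cross reconstruction: recover each embedding from its pair's joint vector.
    cross = (np.mean((z_txt @ C.T - e_vis) ** 2)
             + np.mean((z_vis @ D.T - e_txt) ** 2))
    return float(alpha * contrastive + beta * rec + gamma * cross)

# Sanity check with identity maps: identical paired embeddings give zero
# reconstruction and cross-reconstruction error.
rng = np.random.default_rng(0)
e = rng.normal(size=(4, 3))
eye = np.eye(3)
only_contrastive = alignment_objective(e, e, eye, eye, eye, eye, eye, eye,
                                       contrastive=0.5, alpha=2.0)
with_mismatch = alignment_objective(e, e + 1.0, eye, eye, eye, eye, eye, eye,
                                    contrastive=0.0)
```

At inference time only `A` and `B` are needed; `A_inv`, `B_inv`, `C`, and `D` serve purely as training-time constraints on the joint space.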

With the alignment process, we can find similar representations by linear transformations. However, the transformation can only take global representations, which means that images or captions are represented as $d_{vis}$- or $d_{txt}$-dimensional vectors. The image and sentence representations are the pooled output of the tokens in the prevailing transformer models. It can be inferred from previous research that fine-grained information is also crucial in information-matching tasks.

Due to the complexity of the fine-grained token representations, it is challenging to train the token-wise alignment in a brute-force manner. Thus, we adopt the clustering algorithms and use the clustering results to obtain crucial tokens. The crucial token selection will be introduced in Section 3.3. We can reduce the feature dimension from the number of tokens and perform a second-stage alignment.

We adopt frequency analysis and statistical approaches to construct a transparent and human-sensible intermediate structure. The clustering is conducted through (1) concept aggregation and (2) representation aggregation.

3.2.1 Concept aggregation.

We extract the concrete concepts for the candidate sentences with a part-of-speech (POS) tagger (Wei et al., 2024). We collect the nouns as anchors and calculate the Term Frequency-Inverse Document Frequency (TF-IDF) to select the representative terms. As shown in Figure 4, the concepts follow a long-tailed distribution, leading to a biased probability estimation. Hence, we aggregate the high-frequency terms based on the detector results and divide the candidate set into subsets for more detailed alignments.
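
The TF-IDF scoring of noun anchors can be sketched as follows. The sketch assumes the nouns have already been extracted by a POS tagger, treats each caption as one document, and sums the per-caption TF-IDF values to rank terms; the function name, that aggregation rule, and the toy captions are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_scores(noun_lists):
    """Sum of per-caption TF-IDF for each noun (one noun list per caption)."""
    n_docs = len(noun_lists)
    # Document frequency: in how many captions does each noun appear?
    doc_freq = Counter(noun for nouns in noun_lists for noun in set(nouns))
    totals = Counter()
    for nouns in noun_lists:
        for noun, count in Counter(nouns).items():
            tf = count / len(nouns)
            idf = math.log(n_docs / doc_freq[noun])
            totals[noun] += tf * idf
    return dict(totals)

# Hypothetical nouns extracted from four captions.
captions = [["man", "shirt"], ["man", "dog"], ["dog", "ball"], ["man", "street"]]
scores = tfidf_scores(captions)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Very frequent nouns such as "man" receive a low inverse document frequency, which is one way the long-tailed concept distribution shows up in the scores.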

Figure 4.
A bar chart shows concept frequency by label, with a steep decline from the highest values and the top 100 concepts highlighted on the left. The bar chart titled concept frequency plots labels on the horizontal axis and frequencies on the vertical axis, ranging from 0 to about 2000. Bars are ordered from highest to lowest frequency, forming a long tail distribution. The leftmost section is enclosed by a highlighted box and labelled top 100, showing the most frequent concepts. These bars start close to 2000 and decrease rapidly to around 1000 within the highlighted region. Beyond the top 100, frequencies continue to decline gradually across many labels, with most concepts appearing far less often than the highest ranked ones.

The frequency bar chart of the extracted corpus concepts. The top 20 concepts and the corresponding counts are (‘man’, 36743), (‘woman’, 23845), (‘people’, 12810), (‘shirt’, 12743), (‘girl’, 10035), (‘dog’, 10030), (‘boy’, 9393), (‘men’, 8005), (‘child’, 7746), (‘street’, 7435), (‘group’, 6959), (‘front’, 6857), (‘water’, 5489), (‘hat’, 4075), (‘person’, 3810), (‘ball’, 3679), (‘jacket’, 3365), (‘building’, 3334), (‘hand’, 3113), and (‘player’, 3099)

We construct the co-occurrence matrix of the POS tagging and object detection results in the training set. As shown in Figure 5, concepts have a significant relationship with detection results. Hence, we can group the concepts tagged with POS with the probability conditional on the detection results. To visualize the physical meaning of the clusters, we can use the word clouds to show the high-frequency concepts in each cluster, shown in Figure 6.
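
A minimal sketch of the co-occurrence statistics is given below. It counts how often each POS concept appears together with each detected object class and then groups every concept with the detector class that maximises the conditional probability of the concept given the object. The function names, the integer encodings, and the argmax grouping rule are assumptions made for illustration.

```python
import numpy as np

def cooccurrence_matrix(concept_ids, object_ids, n_concepts, n_objects):
    """counts[c, o] = number of training pairs containing concept c and object o."""
    counts = np.zeros((n_concepts, n_objects))
    for concepts, objects in zip(concept_ids, object_ids):
        for c in set(concepts):
            for o in set(objects):
                counts[c, o] += 1
    return counts

def group_concepts(counts):
    """Assign each concept to the object class maximising P(concept | object)."""
    object_totals = counts.sum(axis=0, keepdims=True)
    cond = np.divide(counts, object_totals,
                     out=np.zeros_like(counts), where=object_totals > 0)
    return cond.argmax(axis=1), cond

# Hypothetical encoding: concepts 0=man, 1=dog, 2=ball;
# detector classes 0=person, 1=dog, 2=sports ball.
caption_concepts = [[0], [0, 2], [1], [1, 2]]
detected_objects = [[0], [0, 2], [1], [1, 2]]

counts = cooccurrence_matrix(caption_concepts, detected_objects, 3, 3)
groups, cond = group_concepts(counts)
```

Columns with no detections are left at zero probability rather than dividing by zero, so rarely detected classes do not distort the grouping.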

Figure 5.
A heatmap shows a co occurrence matrix between detector objects and P O S concepts, with sparse high value points across a largely low intensity background. The heatmap titled co occurrence matrix plots detector objects on the horizontal axis from 0 to about 80 and P O S concepts on the vertical axis from 0 to about 90. Most cells show very low co occurrence values, forming a dark background. Scattered brighter cells appear at specific intersections, indicating higher co occurrence between certain detector objects and P O S concepts. Vertical streaks at some detector indices suggest repeated associations across multiple P O S concepts. The overall pattern shows sparse but structured co occurrence rather than uniform distribution.

The co-occurrence matrix of POS tagging concepts and the detection results. The x-axis is the 80 object classes of the detector pretrained on the MS-COCO (Lin et al., 2014) dataset. The y-axis is the top 100 concepts from the POS tagger

Figure 6.
A word cloud shows clusters of frequently occurring visual concepts such as man, street, boat, bike, ball, dog, table, and glass, each surrounded by related terms. The word cloud presents several clusters of concepts grouped by co occurrence. One cluster centres on man, with related words including woman, people, boy, girl, and shirt. Another cluster focuses on street, surrounded by person, city, building, road, park, car, and sidewalk. A boat cluster includes river, wave, surf, surfer, dock, canoe, wetsuit, and fishing. A bike cluster contains bicycle, dirt, race, rider, helmet, track, and motorcycle. A ball cluster groups player, soccer, basketball, baseball, football, game, team, and uniform. A dog cluster includes field, beach, grass, sand, toy, snow, and jump. A table cluster shows baby, chair, pool, band, stage, room, and microphone. A glass cluster includes food, drink, kitchen, bar, cup, bottle, fruit, beer, and apron.

The visualization results of the clustering. The font size denotes the frequency of the word in the corpus

Figure 7.
A scatter plot shows data points along a feature dimension, divided by multiple dashed partition points, with a central solid line marking the optimal partition. The diagram titled optimal partition shows circular data points distributed along a horizontal feature dimension. Vertical dashed lines indicate candidate partition points that divide the feature space into segments. A single solid vertical line at the centre marks the selected optimal partition. Data points appear on both sides of each partition, with clusters forming within segments. The layout illustrates how different partition choices split the data, highlighting the central partition as the optimal separation among the available partition points.

Visualization of DFT. Red and orange dots represent the binary labels. The partition metric is the weighted sum of the left and right binary cross-entropy. Dashed lines denote the potential partition points.


3.2.2 Representation aggregation.

The clustering is based on the K-means algorithm. To keep the clustering consistent with the alignment, we use the l2-norm of the difference between normalized representations as the distance metric:

$$d(u, v) = \lVert \tilde{u} - \tilde{v} \rVert_2 \tag{8}$$

where $\tilde{u}$ and $\tilde{v}$ are the normalized representations, namely $\tilde{u} = u / \lVert u \rVert_2$. The clustering probability can be denoted as:

(9)

where $\mathrm{clus}_i$ and $\mathrm{cen}_i$ represent the i-th cluster and the i-th centroid vector, respectively, K is the number of clusters, and ϵ is a hyperparameter. As ϵ increases, the probability distribution concentrates on a single cluster. As ϵ decreases, the probability distribution approaches the uniform distribution.
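The effect of ϵ can be illustrated with a small NumPy sketch. The softmax over negative distances used here is an assumption that matches the behaviour described above; the exact form of Eq. (9) may differ, and the function names and the choice to normalize the centroids are hypothetical.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit l2-norm."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def cluster_probabilities(features, centroids, epsilon):
    """Soft assignment of each feature row to each centroid.

    Assumed form: softmax over -epsilon * distance, where the distance is
    the l2-norm between normalized vectors. Not necessarily Eq. (9).
    """
    u = l2_normalize(features)
    c = l2_normalize(centroids)
    # Pairwise distances, shape (n_samples, n_clusters).
    dist = np.linalg.norm(u[:, None, :] - c[None, :, :], axis=2)
    logits = -epsilon * dist
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 16))   # toy features
cents = rng.normal(size=(8, 16))   # K = 8 toy centroids
sharp = cluster_probabilities(feats, cents, epsilon=50.0)  # concentrated
flat = cluster_probabilities(feats, cents, epsilon=0.0)    # uniform
```

With ϵ = 0 every cluster receives probability 1/K, while a large ϵ pushes almost all of the mass onto the nearest centroid.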

We group images and texts according to these probabilities and then align them using contrastive learning within each group. Contrastive learning benefits from negative samples that are similar to the positive ones, so we use hard-sample mining to ensure sample diversity within each group. The global alignment stage identifies the most challenging cases: we enlarge each group by selecting the top-K candidates from the previous alignment as negative samples.
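A minimal sketch of the top-K selection step, assuming a square similarity matrix from the previous alignment stage in which entry (i, i) is the positive pair. The function name and the toy matrix are illustrative, not taken from the paper.

```python
import numpy as np

def top_k_hard_negatives(sim, k):
    """Indices of the k most similar non-matching candidates per query.

    Assumes sim[i, i] scores the positive pair, so the diagonal is masked.
    """
    masked = sim.astype(float).copy()
    np.fill_diagonal(masked, -np.inf)
    return np.argsort(-masked, axis=1)[:, :k]

# Toy similarity scores from a previous (e.g. global) alignment stage.
sim = np.array([[0.9, 0.8, 0.1, 0.3],
                [0.2, 0.7, 0.6, 0.1],
                [0.5, 0.4, 0.95, 0.3],
                [0.0, 0.6, 0.2, 0.8]])
hard = top_k_hard_negatives(sim, k=2)
```

For the first query, the selected negatives are candidates 1 and 3, which are the highest-scoring candidates other than the positive.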

To clarify the roles of K-means clustering and the choice of hyperparameters, we conducted experiments comparing K-means and Agglomerative Clustering and varying the number of clusters. As shown in Table 1, increasing the number of clusters improves the retrieval in certain settings. However, this also requires training additional alignment matrices for the clusters. Therefore, we set the number of clusters to eight to strike a balance between the number of trainable parameters and the performance. K-means clustering is selected in GEMMA due to its slight empirical advantage over agglomerative clustering.

Table 1.

Sensitivity to clustering methods, where R@k denotes the top-k recall and #Param denotes the number of trainable parameters. All experiments are based on the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

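| Clustering | #Cluster | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | #Param |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| KMeans | 4 | 84.1 | 95.7 | 96.6 | 65.3 | 90.1 | 93.4 | 5.2M |
| KMeans | 8 | 86.3 | 98.2 | 99.4 | 73.2 | 94.2 | 97.2 | 10M |
| KMeans | 16 | 86.4 | 98.1 | 99.6 | 73.4 | 94.2 | 97.3 | 20M |
| Agglomerative | 4 | 84.0 | 94.4 | 96.2 | 64.8 | 90.0 | 92.2 | 5.2M |
| Agglomerative | 8 | 85.5 | 97.7 | 98.7 | 72.9 | 92.8 | 96.1 | 10M |
| Agglomerative | 16 | 86.0 | 96.9 | 99.5 | 73.4 | 93.7 | 97.0 | 20M |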

Clustering results provide pseudo-labels for further feature selection. The label can be denoted as:

(10)

Here, $\mathrm{label}^{\mathrm{clus}_i}_{u}$ indicates whether the data point u belongs to group i, and T ∈ (0, 1) is a user-defined threshold. With pseudo-labels, we can further adopt the Discriminant Feature Test (DFT) (Yang et al., 2022) to select informative features and reduce the feature dimension. DFT is a supervised feature selection process that measures dimension-wise importance. For a given 1D input feature, we order the samples by their feature values and bound the feature dimension by the sample minimum and maximum. We then partition the samples along that dimension and measure the partition purity as the weighted cross-entropy with respect to the pseudo-labels obtained from Section 3.2. A feature is more discriminant if it has a lower loss value. Finally, we sort the features by loss value from lowest to highest, plot the resulting curve, and use its elbow point to select the discriminant features from the whole feature set.
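The per-dimension test can be sketched as follows. This is a simplified illustration rather than the implementation of Yang et al. (2022): the evenly spaced candidate partitions, the `n_bins` parameter, the 0.5 default threshold and all function names are assumptions, and the elbow-point selection is left out. The weighted entropy computed here equals the cross-entropy of the labels against the class frequency on each side of the partition.

```python
import numpy as np

def pseudo_labels(probs, cluster, threshold=0.5):
    """1 if the soft assignment to `cluster` exceeds the threshold T, else 0."""
    return (probs[:, cluster] > threshold).astype(int)

def binary_entropy(labels):
    """Entropy (in nats) of a 0/1 label array; 0 for an empty or pure set."""
    if labels.size == 0:
        return 0.0
    p = labels.mean()
    if p == 0.0 or p == 1.0:
        return 0.0
    return float(-(p * np.log(p) + (1 - p) * np.log(1 - p)))

def dft_loss(feature, labels, n_bins=16):
    """Lowest weighted entropy over evenly spaced partitions of one feature."""
    lo, hi = feature.min(), feature.max()
    candidates = np.linspace(lo, hi, n_bins + 1)[1:-1]  # interior points only
    best = binary_entropy(labels)  # loss when no partition is applied
    for t in candidates:
        left, right = labels[feature <= t], labels[feature > t]
        loss = (left.size * binary_entropy(left)
                + right.size * binary_entropy(right)) / labels.size
        best = min(best, loss)
    return best

def rank_features(X, labels, n_bins=16):
    """Return feature indices sorted from most to least discriminant."""
    losses = np.array([dft_loss(X[:, j], labels, n_bins)
                       for j in range(X.shape[1])])
    return np.argsort(losses), losses

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
informative = labels + 0.1 * rng.normal(size=200)  # separates the classes
noise = rng.normal(size=200)                       # unrelated to the labels
order, losses = rank_features(np.column_stack([noise, informative]), labels)
toy_labels = pseudo_labels(np.array([[0.9, 0.1], [0.2, 0.8]]), cluster=0)
```

In this toy setting the informative dimension receives a much lower loss than the noise dimension and is therefore ranked first.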

Separating the whole dataset into subsets allows us to conduct the discriminant feature test among the tokens with the pseudo-labels from the clustering results. Thus, token-level alignments can be performed using the same procedure as global-level alignment.

The overall alignment process can be divided into three modules:

  1. global matching;

  2. subdomain clustering; and

  3. subdomain matching.

The subdomain clustering and alignment will be conducted within the image and the text domain. We can aggregate the alignments in the subproblem to approximate the overall alignment:

(11)

where P(image|text) denotes the probability distribution of the images given a query text, P(image) and P(text) denote the probability distributions of the images and texts, and cluster is the output of our clustering module. We further assume that the probability distribution within a cluster can be approximated as uniform. The conditional probabilities reflect the stage-wise design of the proposed pipeline.
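As a reading aid, the stage-wise idea can be written with the law of total probability. This is a generic identity and not necessarily the paper's Eq. (11), which also involves P(image) and P(text); the cluster index c and the set of clusters $\mathcal{C}$ are notation assumed here.

```latex
P(\text{image} \mid \text{text})
  = \sum_{c \in \mathcal{C}}
    P(\text{image} \mid \text{cluster} = c, \text{text})\,
    P(\text{cluster} = c \mid \text{text})
```

Under this reading, the second factor corresponds to assigning the query to a subdomain, and the first factor corresponds to matching inside that subdomain.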

Furthermore, we use the similarity measurement to simplify the probability estimator, which means that we use sim(image, text) to represent P(image|text). In this work, we adopt the cosine similarity:

(12)

where $G_{vis}$ and $G_{txt}$ denote the pooled outputs of the feature extractors (global features), $T_{vis}$ and $T_{txt}$ denote the token (i.e. fine-grained) features, and W: denotes the alignment matrices corresponding to the different subsets from the clustering results. [G; T] denotes the concatenation of the global and token features. Because of the computational cost, we cannot directly collect all token features. Therefore, we conduct the feature selection process based on the clustering results.
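A minimal NumPy sketch of a cosine similarity over linearly aligned [G; T] features is given below. Applying one alignment matrix per modality, as well as the toy dimensions and function names, are assumptions made for illustration; Eq. (12) may place the matrices differently.

```python
import numpy as np

def cosine_similarity_matrix(a, b, eps=1e-12):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return a @ b.T

def aligned_similarity(g_vis, t_vis, g_txt, t_txt, w_vis, w_txt):
    """Cosine similarity between linearly projected [G; T] features."""
    vis = np.concatenate([g_vis, t_vis], axis=1) @ w_vis
    txt = np.concatenate([g_txt, t_txt], axis=1) @ w_txt
    return cosine_similarity_matrix(vis, txt)

rng = np.random.default_rng(0)
n_img, n_txt, d_global, d_token, d_joint = 3, 4, 8, 4, 6  # toy sizes
w_vis = rng.normal(size=(d_global + d_token, d_joint))  # hypothetical matrices
w_txt = rng.normal(size=(d_global + d_token, d_joint))
sim = aligned_similarity(
    rng.normal(size=(n_img, d_global)), rng.normal(size=(n_img, d_token)),
    rng.normal(size=(n_txt, d_global)), rng.normal(size=(n_txt, d_token)),
    w_vis, w_txt,
)
```

Each row of `sim` scores one image against every caption, so ranking a row gives the image-to-text retrieval order and ranking a column gives the text-to-image order.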

The feature selection process is an approximation based on the clustering results. The process is expressed as a combination of the conditional probabilities. For simplicity, we ignore the alignment matrix in the following representations:

(13)

where DFT(.) represents the feature selection and dimension reduction process in Section 3.3 and C1 and C2 represent the cluster sets of K-means. Instead of training a complicated alignment process from the token-level output of the feature extractor, we propose a stage-wise decomposition on the dataset and train simpler structures for the subsets. Meanwhile, the alignments in the stages are linear, which provides the inversion operation and preserves the dual accessibility in image and text domains.

We perform image-to-text and text-to-image retrieval on two image-text benchmarks: Flickr30k and MS-COCO. The Flickr30k dataset (Young et al., 2014) contains 31,000 images, and every image has five paired captions. The training set contains 29,000 images; the validation and testing sets contain 1,000 images each. MS-COCO (Lin et al., 2014) is a larger-scale dataset with 123,287 images, each with at least five captions. We follow the ‘Karpathy’ split for the experiments (Karpathy and Fei-Fei, 2015): 113,287 images for training, 5,000 for validation, and 5,000 for testing. We use two benchmarks of different sizes to demonstrate the scalability and generalizability of our approach. The performance is evaluated using the Recall@K metric, where K ∈ {1, 5, 10} refers to the top-K matches of the retrieval results. A retrieval is considered a true positive if the top-K matches include at least one of the paired ground-truth items. For example, in image-to-text retrieval, if the top-K matches contain one of the five captions paired with the query image, it is counted as a positive in the recall metric.
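The metric described above can be sketched as follows. The function name, the toy score matrix and the use of three captions per image (instead of five) are illustrative choices.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """Recall@K in percent.

    sim: (n_queries, n_candidates) similarity scores.
    gt:  list of sets; gt[q] holds the correct candidate indices for query q.
    A query counts as a hit if any correct candidate is in its top K.
    """
    ranking = np.argsort(-sim, axis=1)
    results = {}
    for k in ks:
        hits = [bool(gt[q] & set(ranking[q, :k].tolist()))
                for q in range(sim.shape[0])]
        results[k] = 100.0 * sum(hits) / len(hits)
    return results

# Two query images, six captions; captions 0-2 belong to image 0, 3-5 to image 1.
sim = np.array([[0.5, 0.1, 0.2, 0.8, 0.3, 0.4],
                [0.2, 0.3, 0.1, 0.5, 0.9, 0.6]])
gt = [{0, 1, 2}, {3, 4, 5}]
scores = recall_at_k(sim, gt, ks=(1, 2))
```

Here the first image's best caption is ranked second, so it misses at K = 1 but counts as a hit at K = 2.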

The overall algorithm is trained stage by stage. We adopt K-means as the clustering algorithm. The number of clusters is 8, and ϵ is 50 for pseudo-labeling. For Flickr30k, the temperature parameter is set to 0.02, and the ratio between the losses is set to α:β:γ=1:0.5:0.6 in the global alignment. In the alignment of the image subdomain, the temperature parameter is set to 0.015, and the ratio between the losses is set to α:β:γ=1:0.4:0.5. In the text subdomain alignment, the temperature parameter is set to 0.01, and the ratio between the losses is set to α:β:γ=1:0.3:0.4.

For the MS-COCO dataset, the temperature parameter is set to 0.05, and the ratio between the losses is set to α:β:γ=1:0.5:0.5 in the global alignment. In the alignment of the image subdomain, the temperature parameter is set to 0.03, and the ratio between the losses is set to α:β:γ=1:0.3:0.5. In the text subdomain alignment, the temperature parameter is set to 0.02, and the ratio between the losses is set to α:β:γ=1:0.2:0.4.

The dimension of the joint space is set to 768, which follows the token dimension of the transformer encoders. All optimization is performed using AdamW with the learning rate = 0.001.
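The settings above can be collected in one configuration, together with a sketch of where the temperature enters a contrastive loss. The dictionary keys are hypothetical names, and the symmetric InfoNCE loss is a standard stand-in used as an assumption here; it is not claimed to be one of the paper's α, β or γ loss terms.

```python
import numpy as np

# Hyperparameters as reported in the text; key names are illustrative.
CONFIG = {
    "flickr30k": {
        "global": {"temperature": 0.02, "loss_ratio": (1.0, 0.5, 0.6)},
        "image_subdomain": {"temperature": 0.015, "loss_ratio": (1.0, 0.4, 0.5)},
        "text_subdomain": {"temperature": 0.01, "loss_ratio": (1.0, 0.3, 0.4)},
    },
    "mscoco": {
        "global": {"temperature": 0.05, "loss_ratio": (1.0, 0.5, 0.5)},
        "image_subdomain": {"temperature": 0.03, "loss_ratio": (1.0, 0.3, 0.5)},
        "text_subdomain": {"temperature": 0.02, "loss_ratio": (1.0, 0.2, 0.4)},
    },
    "num_clusters": 8,
    "epsilon": 50,
    "joint_dim": 768,
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
}

def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def symmetric_info_nce(sim, temperature):
    """Mean of image-to-text and text-to-image InfoNCE losses.

    Assumes sim is square and sim[i, i] scores the positive pair.
    """
    logits = sim / temperature
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)

tau = CONFIG["flickr30k"]["global"]["temperature"]
loss_uninformative = symmetric_info_nce(np.zeros((4, 4)), tau)  # log(4)
loss_perfect = symmetric_info_nce(np.eye(4), tau)               # close to 0
```

A smaller temperature sharpens the softmax, so small similarity gaps between the positive and the hard negatives have a larger effect on the loss.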

We conducted the experiments and compared our alignment approach to the SOTA retrieval models. The results are shown in Table 2. We extract information from the frozen CLIP image and text encoder in the experiments. The CLIP encoders remain frozen during further alignments and serve as the baseline for our alignment process. The CLIP encoder contains more than 428M parameters. However, we do not fine-tune the overall encoder in our alignment process; instead, we train additional alignment matrices. The trainable parameters can be reduced from 428M to 9.43M (2.2%). The encoders remain untrainable during the training of alignment matrices. Therefore, GPU memory consumption is proportional to the trainable parameters, which can be reduced to less than 10 percent of the fully fine-tuned approach.
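The reported ratio can be checked with a few lines of arithmetic. The comparison with bias-free 768×768 projections is only an assumption for scale, based on the stated joint dimension; the paper does not state how the 9.43M parameters are distributed.

```python
total_encoder_params = 428e6   # frozen CLIP encoders, as reported
trainable_params = 9.43e6      # alignment matrices, as reported

ratio = trainable_params / total_encoder_params

# Assumption for scale: one bias-free square projection in the 768-d joint space.
per_matrix = 768 * 768
equivalent_matrices = trainable_params / per_matrix

print(f"trainable fraction: {ratio:.1%}")
print(f"equivalent 768x768 matrices: {equivalent_matrices:.1f}")
```

The trainable fraction rounds to the 2.2% quoted above, and the budget is on the order of sixteen such projections.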

Table 2.

Retrieval performance on the Flickr30k (1k testing set) and MS-COCO (5k testing set) datasets. We compare the single-model performance of all multi-modal retrieval models. The numbers are taken from Diao et al. (2023). R@1 denotes Recall@1 for simplicity.

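| Method | Flickr30k image-to-text R@1 | Flickr30k image-to-text R@5 | Flickr30k image-to-text R@10 | Flickr30k text-to-image R@1 | Flickr30k text-to-image R@5 | Flickr30k text-to-image R@10 | MS-COCO image-to-text R@1 | MS-COCO image-to-text R@5 | MS-COCO image-to-text R@10 | MS-COCO text-to-image R@1 | MS-COCO text-to-image R@5 | MS-COCO text-to-image R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCAN (Lee et al., 2018) | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 |
| VSRN (Li et al., 2019) | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 |
| CAAN (Zhang et al., 2020) | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 | 52.5 | 83.3 | 90.9 | 41.2 | 70.3 | 82.9 |
| IMRAM (Chen et al., 2020b) | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 |
| MMCA (Wei et al., 2020) | 74.2 | 92.8 | 96.4 | 54.8 | 81.4 | 87.8 | 54.0 | 82.5 | 90.7 | 38.7 | 69.7 | 80.8 |
| GSMN (Liu et al., 2020) | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | - | - | - | - | - | - |
| SGRAF (Diao et al., 2021) | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | 57.8 | 84.9 | 91.6 | 41.9 | 70.7 | 81.3 |
| SHAN (Ji et al., 2021) | 74.6 | 93.5 | 96.9 | 55.3 | 81.3 | 88.4 | - | - | - | - | - | - |
| WCGL (Wang et al., 2021) | 74.8 | 93.3 | 96.8 | 54.8 | 80.6 | 87.5 | - | - | - | - | - | - |
| RCAR (Diao et al., 2023) | 78.7 | 94.6 | 97.6 | 59.5 | 84.0 | 89.5 | 59.6 | 85.8 | 92.4 | 42.5 | 71.7 | 81.8 |
| SGRAFS (Jawade et al., 2023) | 79.2 | 95.3 | 97.7 | 58.3 | 83.1 | 89.2 | 58.0 | 85.1 | 91.6 | 41.7 | 71.2 | 81.5 |
| CLIP (Radford et al., 2021) | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 |
| GEMMA (Ours) | 88.6 | 98.9 | 99.6 | 75.7 | 94.2 | 97.1 | 58.6 | 83.2 | 90.0 | 45.3 | 72.6 | 82.8 |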

On Flickr30k (1k testing set), our approach outperforms the other methods in both image-to-text and text-to-image retrieval. Alignment improves Recall@1 by 0.6% in image-to-text retrieval. Meanwhile, our approach provides a 6% boost in text-to-image retrieval. RCAR (Diao et al., 2023) needs dual-way optimized models, namely image-to-text and text-to-image. Our method is optimized in a feed-forward manner, and it ensembles the substructures directly.

On MS-COCO (5k testing set), our method provides competitive performance in image-to-text retrieval and outperforms the others in text-to-image retrieval by a boost of 2.1% in Recall@1. We achieve the best text-to-image retrieval performance on both datasets, showcasing our approach’s scalability.

This section demonstrates the alignment between visual and text encoders that are trained separately. The encoders remain frozen in the alignment process. All alignments are based on the grouping and linear projection proposed in our pipeline. The performance of the CLIP visual and text encoders without GEMMA alignment is taken from the original CLIP paper (Radford et al., 2021). Starting from the jointly trained CLIP structure, we replace the text encoder with RoBERTa (Liu et al., 2019) and the visual encoder with a CNN-based object detector (Anderson et al., 2018). All experiments are carried out on the Flickr30k dataset and follow the parameter settings in Section 4.2.

The results are shown in Table 3. The best performance comes from the jointly trained models, whose representations are preliminarily aligned in the pre-training process. Compared to the CLIP visual encoder, the features of the object detector are weaker in the alignment process. However, the separately trained text encoder, RoBERTa (Liu et al., 2019), does not suffer from the unpaired training dataset. The representations from the CLIP visual encoder and the RoBERTa text encoder can provide competitive performance in image-to-text retrieval and better performance in text-to-image retrieval than the original CLIP. The encoder can be adapted to the retrieval application without fine-tuning with the paired image and text data.

Table 3.

The experiment results with different visual and text features for the alignment process. All experiments are conducted on the Flickr30k dataset (1k testing set).

| Visual enc. | Text enc. | Alignment (GEMMA) | Image-to-text Recall@1 | Image-to-text Recall@5 | Image-to-text Recall@10 | Text-to-image Recall@1 | Text-to-image Recall@5 | Text-to-image Recall@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP vis (Radford et al., 2021) | CLIP text (Radford et al., 2021) | No | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 |
| DETR (Carion et al., 2020) | RoBERTa (Liu et al., 2019) | Yes | 66.7 | 89.5 | 93.6 | 56.7 | 84.5 | 90.3 |
| DETR (Carion et al., 2020) | CLIP text (Radford et al., 2021) | Yes | 73.6 | 91.6 | 94.5 | 60.0 | 85.8 | 90.6 |
| CLIP vis (Radford et al., 2021) | RoBERTa (Liu et al., 2019) | Yes | 86.3 | 98.2 | 99.4 | 73.2 | 94.2 | 97.2 |
| CLIP vis (Radford et al., 2021) | CLIP text (Radford et al., 2021) | Yes | 88.6 | 98.9 | 99.6 | 74.8 | 94.2 | 97.1 |

Table 4.

Ablation studies on different stages, where R@k denotes the top-k recall. All experiments are based on the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

| Alignment | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 |
| --- | --- | --- | --- | --- | --- | --- |
| Without alignment | 64.5 | 71.7 | 84.3 | 32.7 | 61.6 | 80.1 |
| Global | 84.8 | 97.8 | 99.0 | 68.3 | 90.7 | 91.1 |
| +Image cluster | 85.4 | 98.0 | 99.1 | 70.3 | 91.5 | 94.3 |
| +Text cluster (Final) | 86.3 | 98.2 | 99.4 | 73.2 | 94.2 | 97.2 |

Table 5.

Experiments on detector features, Flickr30k (1k testing set).

| Visual global feat. | Visual detail feat. | Text feat. | Image-to-text Recall@1 | Image-to-text Recall@5 | Image-to-text Recall@10 | Text-to-image Recall@1 | Text-to-image Recall@5 | Text-to-image Recall@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| CLIP | CLIP | CLIP | 85.3 | 91.9 | 93.3 | 72.1 | 90.6 | 92.2 |
| DETR encoder | DETR decoder | CLIP | 18.3 | 35.1 | 41.8 | 19.5 | 25.3 | 45.9 |
| ResNet Backbone | DETR encoder | CLIP | 66.7 | 89.5 | 93.3 | 56.7 | 84.5 | 90.3 |
| ResNet Backbone | DETR decoder | CLIP | 72.4 | 91.6 | 95.1 | 59.5 | 85.7 | 90.5 |
| ResNet Backbone | DETR decoder | RoBERTa | 64.5 | 84.5 | 88.4 | 53.3 | 83.3 | 87.3 |

In contrast, the Convolutional Neural Network (CNN)-based object detector representation cannot be applied directly to the image-text retrieval task. The decrease in performance results from a lack of global understanding: the object detector features are obtained from parts of the image, so the representations lack a global understanding of the image. As for the CLIP visual encoder, the pooled output of the visual tokens contains the global information of the input image, and the detailed token features remain available for the further stage alignments. A visual example can be found in Section 4.6. If the alignment process misses the global information at the very beginning, the alignment on detailed information may lead to a misfocused result.

Due to the modularized design, we can compare the design from global alignment to subgroup alignment in the visual and textual domains. We choose encoders trained in different modalities to perform the alignment process. We use the CLIP visual encoder (Radford et al., 2021) and the RoBERTa (Liu et al., 2019) text encoder for the ablation study of stage-wise alignment on the Flickr30k dataset (Young et al., 2014). The two encoders remain frozen in the experiments. The ‘without alignment’ setting means the direct dot product between the encoded features from the two models. The two embeddings are located in different semantic latent spaces. Hence, the performance is the lowest compared to the other alignment processes.

With global alignment, the features can provide basic performance in retrieval tasks. However, a naive linear projection cannot handle complex interactions between detailed information in the candidate set. Thus, recall rates increase as we add more stages of grouping, feature selection, and alignment. Feature selection provides statistical criteria for dimension reduction, preventing the latent dimension from increasing with additional tokens. We can take the essential features into the next stage and reduce the computational cost simultaneously. Hence, the three-stage alignment can achieve the best performance with comparable efficiency.

To better understand the difference between the visual features of transformers and object detectors, we demonstrate the retrieval processing step by step.

The object detector can detect humans and vehicles, but the features lack a sensible relationship with each other. In the clustering stage, the clusters focus on a specific object in the figure, that is, the bus in Figure 8. In global alignment, the paired sentence is ranked fifth. However, the correct caption falls to seventh when we perform the finer alignment, which clusters on cars and buses. Although object detectors can provide information fragments, the grouping process cannot link the features. The detector features cannot find the central concept in the picture and can be distracted by the surrounding objects.

Figure 8.
A visual text alignment example shows an urban street image with detected objects and two stages of caption alignment compared against ground truth descriptions. The figure combines an urban street photograph with alignment results. On the left, the image shows a police officer standing beside cars on a city street, with bounding boxes highlighting vehicles and a person. Below, ground truth captions describe an officer near a car on a busy city street. On the right, two columns list first stage alignment and second stage alignment captions, clustered by car, bus, and human. The first stage includes several general street descriptions, while the second stage refines the list to fewer captions, retaining the description of a police officer standing in front of a car on a busy street as the selected alignment.

Error cases of object detector alignment. The object detector will give all objects equal weights and try to include all the objects in the captions.


The object detector can provide the features with the local information, yet the patched information is not represented in a structured manner. That is, we can only obtain the partial contents in the image and lose the global semantic representation in the clustering process. We rely on the global and local information relationship to retrieve suitable captions in the proposed coarse-to-fine clustering process.

On the other hand, the visual transformer can provide more information about the tokens and integrate the representations through a global pooling process. Hence, the token information can be selected in our feature selection module (Section 3.3) and clustered according to the global features. The overall architecture can sort the rich representation in a coarse-to-fine manner and provide a better multi-modal alignment performance.

When comparing the alignment process across different features, it becomes evident that performance is influenced by the types of features used. However, the alignment process cannot transform weak visual features into strong ones. Instead, it aims to bridge the gap caused by differences in modality. Consequently, performance improves when encoded features have larger receptive fields. The proposed alignment does not require jointly fine-tuning the encoders in the limited paired multi-modal data and generalizes the single-modal encoder with additional alignment matrices.

Our approach can achieve outstanding performance in both image-to-text and text-to-image retrieval tasks. Furthermore, our method involves a step-by-step alignment process that maintains compatibility in the decision-making procedure. We divide the alignment into global and subdomain matching and apply a feature selection method to decrease the input feature dimensions. All subprocesses can be expressed mathematically and analyzed statistically, providing transparency compared to black-box output. To ensure computational efficiency, we froze the visual and text encoders and only trained the alignment matrices, which represent only about 3% of the parameters compared to the original model.

In addition, we conducted experiments on applying our alignment mechanism to individually trained text and image encoders. In the testing dataset, we found that the pre-trained text encoder can improve the performance of text-to-image retrieval. Replacement of the text encoder can also lead to similar performance in image-to-text retrieval.

We are working on developing a purely green learning solution for image understanding in the foreseeable future. By aiming not only for transparency but also computational efficiency, we can have a better understanding of the multi-modal information representation.

This work was supported by the DEVCOM Army Research Laboratory (ARL) under agreement W911NF2020157. Computation in the work was supported by the University of Southern California’s Center for Advanced Research Computing (carc.usc.edu).

Anderson
,
P.
,
He
,
X.
,
Buehler
,
C.
,
Teney
,
D.
,
Johnson
,
M.
,
Gould
,
S.
and
Zhang
,
L.
(
2018
), “
Bottom-up and top-down attention for image captioning and visual question answering
”,
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp.
6077
-
6086
.
Carion
,
N.
,
Massa
,
F.
,
Synnaeve
,
G.
,
Usunier
,
N.
,
Kirillov
,
A.
and
Zagoruyko
,
S.
(
2020
), “
End-to-end object detection with transformers
”,
European Conference on Computer Vision
,
Springer
, pp.
213
-
229
.
Chen
,
T.
,
Kornblith
,
S.
,
Norouzi
,
M.
and
Hinton
,
G.
(
2020a
), “
A simple framework for contrastive learning of visual representations
”,
International Conference on Machine Learning
,
PMLR
, pp.
1597
-
1607
.
Chen
,
H.
,
Ding
,
G.
,
Liu
,
X.
,
Lin
,
Z.
,
Liu
,
J.
and
Han
,
J.
(
2020b
), “
Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval
”,
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp.
12655
-
12663
.
Chen
,
Z.
,
Wu
,
J.
,
Wang
,
W.
,
Su
,
W.
,
Chen
,
G.
,
Xing
,
S.
,
Muyan
,
Z.
,
Zhang
,
Q.
,
Zhu
,
X.
,
Lu
,
L.
, et al. (
2023
), “
Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks
”,
arXiv preprint
.
Cheng
,
M.
,
Sun
,
Y.
,
Wang
,
L.
,
Zhu
,
X.
,
Yao
,
K.
,
Chen
,
J.
,
Song
,
G.
,
Han
,
J.
,
Liu
,
J.
,
Ding
,
E.
, et al. (
2022
), “
Vista: vision and scene text aggregation for cross-modal retrieval
”,
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp.
5184
-
5193
.
Deng
,
J.
,
Dong
,
W.
,
Socher
,
R.
,
Li
,
L.-J.
,
Li
,
K.
and
Fei-Fei
,
L.
(
2009
), “
ImageNet: a large-scale hierarchical image database
”,
CVPR09
.
Devlin
,
J.
(
2018
), “
Bert: pre-training of deep bidirectional transformers for language understanding
”,
arXiv preprint
.
Diao
,
H.
,
Zhang
,
Y.
,
Liu
,
W.
,
Ruan
,
X.
and
Lu
,
H.
(
2023
), “
Plug-and-play regulators for image-text matching
”,
IEEE Transactions on Image Processing.
Diao
,
H.
,
Zhang
,
Y.
,
Ma
,
L.
and
Lu
,
H.
(
2021
), “
Similarity reasoning and filtration for image-text matching
”,
Proceedings of the AAAI Conference on Artificial Intelligence
, Vol.
35
, No.
2
, pp.
1218
-
1226
.
Frome
,
A.
,
Corrado
,
G.S.
,
Shlens
,
J.
,
Bengio
,
S.
,
Dean
,
J.
,
Ranzato
,
M.
and
Mikolov
,
T.
(
2013
), “
Devise: a deep visual-semantic embedding model
”,
Advances in Neural Information Processing Systems
, p.
26
.
Gao
,
T.
,
Yao
,
X.
and
Chen
,
D.
(
2021
), “
SIMCSE: simple contrastive learning of sentence embeddings
”,
arXiv preprint
.
Hadsell
,
R.
,
Chopra
,
S.
and
LeCun
,
Y.
(
2006
), “
Dimensionality reduction by learning an invariant mapping
”,
2006 IEEE Computer Society Conference On Computer Vision And Pattern Recognition (CVPR’06)
,
IEEE
, Vol.
2
, pp.
1735
-
1742
.
Ho
,
J.
,
Jain
,
A.
and
Abbeel
,
P.
(
2020
), “
Denoising diffusion probabilistic models
”,
Advances in neural information processing systems
, Vol.
33
, pp.
6840
-
6851
.
Jawade
,
B.
,
Mohan
,
D.D.
,
Ali
,
N.M.
,
Setlur
,
S.
and
Govindaraju
,
V.
(
2023
), “
NAPReg: nouns as proxies regularization for semantically aware crossmodal embeddings
”,
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
, pp.
1135
-
1144
.
Ji
,
Z.
,
Chen
,
K.
and
Wang
,
H.
(
2021
), “
Step-wise hierarchical alignment network for image-text matching
”,
arXiv preprint
.
Karpathy
A.
and
Fei-Fei
,
L.
(
2015
), “
Deep visual-semantic alignments for generating image descriptions
”,
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp.
3128
-
3137
.
Kim
,
W.
,
Son
,
B.
and
Kim
,
I.
(
2021
), “
Vilt: vision-and-language transformer without convolution or region supervision
”,
International Conference on Machine Learning
,
PMLR
, pp.
5583
-
5594
.
Kuo
,
C.-C.J.
and
Madni
,
A.M.
(
2023
), “
Green learning: introduction, examples and outlook
”,
Journal of Visual Communication and Image Representation
, Vol.
90
, p.
103685
.
Kwon
,
G.
,
Cai
,
Z.
,
Ravichandran
,
A.
,
Bas
,
E.
,
Bhotika
,
R.
and
Soatto
,
S.
(
2022
), “
Masked vision and language modeling for multi-modal representation learning
”,
arXiv preprint
.
Lee
,
K.-H.
,
Chen
,
X.
,
Hua
,
G.
,
Hu
,
H.
and
He
,
X.
(
2018
), “
Stacked cross attention for image-text matching
”,
Proceedings of the European Conference on Computer Vision (ECCV)
, pp.
201
-
216
.
Li
,
K.
,
Zhang
,
Y.
,
Li
,
K.
,
Li
,
Y.
and
Fu
,
Y.
(
2019
), “
Visual semantic reasoning for image-text matching
”,
Proceedings of the IEEE/CVF International Conference on Computer Vision
, pp.
4654
-
4662
.
Lin
,
T.-Y.
,
Maire
,
M.
,
Belongie
,
S.
,
Hays
,
J.
,
Perona
,
P.
,
Ramanan
,
D.
,
Dollár
,
P.
and
Zitnick
,
C.L.
(
2014
), “
Microsoft coco: common objects in context
”,
Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13
,
Springer
, pp.
740
-
755
.
Liu
,
C.
,
Mao
,
Z.
,
Zhang
,
T.
,
Xie
,
H.
,
Wang
,
B.
and
Zhang
,
Y.
(
2020
), “
Graph structured network for image-text matching
”,
Proceedings of the IEEE/CVF Conference On Computer Vision And Pattern Recognition
, pp.
10921
-
10930
.
Liu
,
H.
,
Li
,
C.
,
Wu
,
Q.
and
Lee
,
Y.J.
(
2024
), “
Visual instruction tuning
”,
Advances in Neural Information Processing Systems
, p.
36
.
Liu
,
Y.
,
Ott
,
M.
,
Goyal
,
N.
,
Du
,
J.
,
Joshi
,
M.
,
Chen
,
D.
,
Levy
,
O.
,
Lewis
,
M.
,
Zettlemoyer
,
L.
and
Stoyanov
,
V.
(
2019
), “
Roberta: a robustly optimized Bert pretraining approach
”,
arXiv preprint
.
Nam
,
H.
,
Ha
,
J.-W.
and
Kim
,
J.
(
2017
), “
Dual attention networks for multimodal reasoning and matching
”,
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp.
299
-
307
.
Oh Song
,
H.
,
Xiang
,
Y.
,
Jegelka
,
S.
and
Savarese
,
S.
(
2016
), “
Deep metric learning via lifted structured feature embedding
”,
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp.
4004
-
4012
.
Pennington
,
J.
,
Socher
,
R.
and
Manning
,
C.D.
(
2014
), “
Glove: global vectors for word representation
”, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.
1532
-
1543
.
Radford
,
A.
,
Kim
,
J.W.
,
Hallacy
,
C.
,
Ramesh
,
A.
,
Goh
,
G.
,
Agarwal
,
S.
,
Sastry
,
G.
,
Askell
,
A.
,
Mishkin
,
P.
,
Clark
,
J.
et al. (
2021
), “
Learning transferable visual models from natural language supervision
”,
International Conference on Machine Learning
,
PMLR
, pp.
8748
-
8763
.
Robinson
,
J.
,
Chuang
,
C.-Y.
,
Sra
,
S.
and
Jegelka
,
S.
(
2020
), “
Contrastive learning with hard negative samples
”,
arXiv preprint
.
Schroff
,
F.
,
Kalenichenko
,
D.
and
Philbin
,
J.
(
2015
), “
Facenet: a unified embedding for face recognition and clustering
”,
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp.
815
-
823
.
Singh
,
A.
,
Hu
,
R.
,
Goswami
,
V.
,
Couairon
,
G.
,
Galuba
,
W.
,
Rohrbach
,
M.
and
Kiela
,
D.
(
2022
), “
Flava: a foundational language and vision alignment model
”,
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
, pp.
15638
-
15650
.
Sohl-Dickstein
,
J.
,
Weiss
,
E.
,
Maheswaranathan
,
N.
and
Ganguli
,
S.
(
2015
), “
Deep unsupervised learning using nonequilibrium thermodynamics
”,
International Conference on Machine Learning
,
PMLR
, pp.
2256
-
2265
.
Sohn
,
K.
(
2016
), “
Improved deep metric learning with multi-class n-pair loss objective
”,
Advances in Neural Information Processing Systems
, p.
29
.
Vaswani
,
A.
,
Shazeer
,
N.
,
Parmar
,
N.
,
Uszkoreit
,
J.
,
Jones
,
L.
,
Gomez
,
Kaiser
,
A.N.
and
Polosukhin
,
I.
(
2017
), “
Attention is all you need
”,
Advances in Neural Information Processing Systems
, p.
30
.
Wang
,
L.
,
Li
,
Y.
,
Huang
,
J.
and
Lazebnik
,
S.
(
2018
), “
Learning two-branch neural networks for image-text matching tasks
”,
IEEE Transactions on Pattern Analysis and Machine Intelligence
, Vol.
41
No.
2
, pp.
394
-
407
.
Wang, P., Wu, Q., Cao, J., Shen, C., Gao, L. and Hengel, A.v.d. (2019), “Neighbourhood watch: referring expression comprehension via language-guided graph attention networks”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1960-1968.
Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O.K., Singhal, S., Som, S. et al. (2022), “Image as a foreign language: Beit pretraining for all vision and vision-language tasks”, arXiv preprint.
Wang, Y., Zhang, T., Zhang, X., Cui, Z., Huang, Y., Shen, P., Li, S. and Yang, J. (2021), “Wasserstein coupled graph learning for cross-modal retrieval”, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, pp. 1793-1802.
Wei, C., Pang, R. and Kuo, C.-C.J. (2024), “GWPT: a Green Word-embedding-based POS tagger”, arXiv preprint.
Wei, X., Zhang, T., Li, Y., Zhang, Y. and Wu, F. (2020), “Multi-modality cross attention network for image and sentence matching”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10941-10950.
Xuan, H., Stylianou, A., Liu, X. and Pless, R. (2020), “Hard negative examples are hard, but useful”, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, Springer, pp. 126-142.
Yang, Y., Wang, W., Fu, H., Kuo, C.-C.J. et al. (2022), “On supervised feature selection from high dimensional feature spaces”, APSIPA Transactions on Signal and Information Processing, Vol. 11 No. 1.
Young, P., Lai, A., Hodosh, M. and Hockenmaier, J. (2014), “From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions”, Transactions of the Association for Computational Linguistics, Vol. 2, pp. 67-78.
Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M. and Wu, Y. (2022), “Coca: contrastive captioners are image-text foundation models”, arXiv preprint.
Yu, L., Shi, B., Pasunuru, R., Muller, B., Golovneva, O., Wang, T., Babu, A., Tang, B., Karrer, B., Sheynin, S. et al. (2023), “Scaling autoregressive multi-modal models: pretraining and instruction tuning”, arXiv preprint, Vol. 2 No. 3.
Zellers, R., Bisk, Y., Farhadi, A. and Choi, Y. (2019), “From recognition to cognition: visual commonsense reasoning”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6720-6731.
Zeng, Y., Zhang, X., Li, H., Wang, J., Zhang, J. and Zhou, W. (2023), “X²-VLM: all-in-one pre-trained model for vision-language tasks”, IEEE Transactions on Pattern Analysis and Machine Intelligence.
Zhang, Q., Lei, Z., Zhang, Z. and Li, S.Z. (2020), “Context-aware attention network for image-text retrieval”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3536-3545.
Zheng, Z., Zheng, L., Garrett, M., Yang, Y., Xu, M. and Shen, Y.-D. (2020), “Dual-path convolutional image-text embeddings with instance loss”, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), Vol. 16 No. 2, pp. 1-23.
Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A. and Fidler, S. (2015), “Aligning books and movies: towards story-like visual explanations by watching movies and reading books”, Proceedings of the IEEE International Conference on Computer Vision, pp. 19-27.
Published by Emerald Publishing Limited. This article is published under the Creative Commons Attribution (CC BY 4.0) licence. Anyone may reproduce, distribute, translate and create derivative works of this article (for both commercial and non-commercial purposes), subject to full attribution to the original publication and authors. The full terms of this licence may be seen at Link to the terms of the CC BY 4.0 licence
