Image-text retrieval via green explainable multi-modal alignment (GEMMA)

Carion, Massa, Synnaeve, Usunier, Kirillov and Zagoruyko (2020), “End-to-end object detection with transformers”, European Conference on Computer Vision, Springer, pp. 213-229.

Chen, Kornblith, Norouzi and Hinton (2020a), “A simple framework for contrastive learning of visual representations”, International Conference on Machine Learning, PMLR, pp. 1597-1607.

Chen, Ding, Liu, Lin, Liu and Han (2020b), “Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12655-12663.

Chen, Wang, Chen, Xing, Muyan, Zhang, Zhu, et al. (2023), “Internvl: scaling up vision foundation models and aligning for generic visual-linguistic tasks”, arXiv preprint arXiv:2312.14238.

Cheng, Sun, Wang, Zhu, Yao, Chen, Song, Han, Liu, Ding, et al. (2022), “Vista: vision and scene text aggregation for cross-modal retrieval”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5184-5193.

Deng, Dong, Socher, Li, L.-J., Li and Fei-Fei (2009), “ImageNet: a large-scale hierarchical image database”, CVPR09.

Devlin (2018), “Bert: pre-training of deep bidirectional transformers for language understanding”, arXiv preprint arXiv:1810.04805.

Diao, Zhang, Liu, Ruan and Lu (2023), “Plug-and-play regulators for image-text matching”, IEEE Transactions on Image Processing.

Diao, Zhang, Ma and Lu (2021), “Similarity reasoning and filtration for image-text matching”, Proceedings of the AAAI Conference on Artificial Intelligence, pp. 1218-1226.

Frome, Corrado, G.S., Shlens, Bengio, Dean, Ranzato and Mikolov (2013), “Devise: a deep visual-semantic embedding model”, Advances in Neural Information Processing Systems.

Gao, Yao and Chen (2021), “SIMCSE: simple contrastive learning of sentence embeddings”, arXiv preprint arXiv:2104.08821.

Hadsell, Chopra and LeCun (2006), “Dimensionality reduction by learning an invariant mapping”, 2006 IEEE Computer Society Conference On Computer Vision And Pattern Recognition (CVPR’06), IEEE, pp. 1735-1742.

Ho, Jain and Abbeel (2020), “Denoising diffusion probabilistic models”, Advances in Neural Information Processing Systems, pp. 6840-6851.

Jawade, Mohan, D.D., Ali, N.M., Setlur and Govindaraju (2023), “NAPReg: nouns as proxies regularization for semantically aware cross-modal embeddings”, Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1135-1144.

Ji, Chen and Wang (2021), “Step-wise hierarchical alignment network for image-text matching”, arXiv preprint arXiv:2106.06509.

Karpathy and Fei-Fei (2015), “Deep visual-semantic alignments for generating image descriptions”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128-3137.

Kim, Son and Kim (2021), “Vilt: vision-and-language transformer without convolution or region supervision”, International Conference on Machine Learning, PMLR, pp. 5583-5594.

Kuo, C.-C.J. and Madni, A.M. (2023), “Green learning: introduction, examples and outlook”, Journal of Visual Communication and Image Representation, p. 103685.

Kwon, Cai, Ravichandran, Bas, Bhotika and Soatto (2022), “Masked vision and language modeling for multi-modal representation learning”, arXiv preprint arXiv:2208.02131.

Lee, K.-H., Chen, Hua, Hu and He (2018), “Stacked cross attention for image-text matching”, Proceedings of the European Conference on Computer Vision (ECCV), pp. 201-216.

Li, Zhang, Li, Li and Fu (2019), “Visual semantic reasoning for image-text matching”, Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4654-4662.

Lin, T.-Y., Maire, Belongie, Hays, Perona, Ramanan, Dollár and Zitnick, C.L. (2014), “Microsoft coco: common objects in context”, Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, Springer, pp. 740-755.

Liu, Mao, Zhang, Xie, Wang and Zhang (2020), “Graph structured network for image-text matching”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10921-10930.

Liu and Lee, Y.J. (2024), “Visual instruction tuning”, Advances in Neural Information Processing Systems.

Liu, Ott, Goyal, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov (2019), “Roberta: a robustly optimized Bert pretraining approach”, arXiv preprint arXiv:1907.11692.

Nam, Ha, J.-W. and Kim (2017), “Dual attention networks for multimodal reasoning and matching”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 299-307.

Oh Song, Xiang, Jegelka and Savarese (2016), “Deep metric learning via lifted structured feature embedding”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004-4012.

Pennington, Socher and Manning, C.D. (2014), “Glove: global vectors for word representation”, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532-1543.

Radford, Kim, J.W., Hallacy, Ramesh, Goh, Agarwal, Sastry, Askell, Mishkin, Clark, et al. (2021), “Learning transferable visual models from natural language supervision”, International Conference on Machine Learning, PMLR, pp. 8748-8763.

Robinson, Chuang, C.-Y., Sra and Jegelka (2020), “Contrastive learning with hard negative samples”, arXiv preprint arXiv:2010.04592.

Schroff, Kalenichenko and Philbin (2015), “Facenet: a unified embedding for face recognition and clustering”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 815-823.

Singh, Goswami, Couairon, Galuba, Rohrbach and Kiela (2022), “Flava: a foundational language and vision alignment model”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15638-15650.

Sohl-Dickstein, Weiss, Maheswaranathan and Ganguli (2015), “Deep unsupervised learning using nonequilibrium thermodynamics”, International Conference on Machine Learning, PMLR, pp. 2256-2265.

Sohn (2016), “Improved deep metric learning with multi-class n-pair loss objective”, Advances in Neural Information Processing Systems.

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, A.N., Kaiser and Polosukhin (2017), “Attention is all you need”, Advances in Neural Information Processing Systems.

Wang, Huang and Lazebnik (2018), “Learning two-branch neural networks for image-text matching tasks”, IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 394-407.

Wang, Cao, Shen, Gao and Hengel, A.v.d. (2019), “Neighbourhood watch: referring expression comprehension via language-guided graph attention networks”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1960-1968.

Wang, Bao, Dong, Bjorck, Peng, Liu, Aggarwal, Mohammed, O.K., Singhal, Som, et al. (2022), “Image as a foreign language: Beit pretraining for all vision and vision-language tasks”, arXiv preprint arXiv:2208.10442.

Wang, Zhang, Cui, Huang, Shen and Yang (2021), “Wasserstein coupled graph learning for cross-modal retrieval”, 2021 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, pp. 1793-1802.

Wei, Pang and Kuo, C.-C.J. (2024), “GWPT: a green word-embedding-based POS tagger”, arXiv preprint arXiv:2401.07475.

Wei, Zhang, Li, Zhang and Wu (2020), “Multi-modality cross attention network for image and sentence matching”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10941-10950.

Xuan, Stylianou, Liu and Pless (2020), “Hard negative examples are hard, but useful”, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16, Springer, pp. 126-142.

Yang, Wang, Kuo, C.-C.J., et al. (2022), “On supervised feature selection from high dimensional feature spaces”, APSIPA Transactions on Signal and Information Processing.

Young, Lai, Hodosh and Hockenmaier (2014), “From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions”, Transactions of the Association for Computational Linguistics.

Yu, Wang, Vasudevan, Yeung, Seyedhosseini and Wu (2022), “Coca: contrastive captioners are image-text foundation models”, arXiv preprint arXiv:2205.01917.

Yu, Shi, Pasunuru, Muller, Golovneva, Wang, Babu, Tang, Karrer, Sheynin, et al. (2023), “Scaling autoregressive multi-modal models: pretraining and instruction tuning”, arXiv preprint arXiv:2309.02591.

Zellers, Bisk, Farhadi and Choi (2019), “From recognition to cognition: visual commonsense reasoning”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6720-6731.

Zeng, Zhang, Wang, Zhang and Zhou (2023), “X2-vlm: all-in-one pre-trained model for vision-language tasks”, IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zhang, Lei, Zhang and Li, S.Z. (2020), “Context-aware attention network for image-text retrieval”, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3536-3545.

Zheng, Garrett, Yang and Shen, Y.-D. (2020), “Dual-path convolutional image-text embeddings with instance loss”, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM).

Zhu, Kiros, Zemel, Salakhutdinov, Urtasun, Torralba and Fidler (2015), “Aligning books and movies: towards story-like visual explanations by watching movies and reading books”, Proceedings of the IEEE International Conference on Computer Vision.

2025

Tsung-Shan Yang, Yun-Cheng Wang, Chengwei Wei, Suya You and C.-C. Jay Kuo

Figure 1.

The image shows the side of a tall, damaged building with peeling exterior surfaces and exposed concrete. In the foreground, a blue industrial lifting platform extends upward from the ground to the building wall. A single worker stands on the raised platform close to the structure, appearing to carry out construction or maintenance work. The platform arm reaches diagonally upward, positioning the worker several storeys above ground level. Adjacent buildings with windows are visible in the background, indicating an urban construction or renovation site. The scene focuses on elevated work using heavy machinery alongside an ageing building façade.

An example of image-to-text retrieval. Given a query image, the task is to retrieve its paired captions from the candidate set.

Figure 2.

A multi stage framework shows global alignment of paired image text data, followed by image domain and text domain clustering, feature selection, and subgroup alignment.

The diagram presents a structured framework divided into paired data, image domain, and text domain sections. Stage 1 shows global alignment of paired images and captions. In the image domain, stage 2a clusters similar images, stage 2b selects discriminant tokens, and stage 2c performs image domain subgroup alignment. In the text domain, captions are clustered in stage 3a, keywords are selected in stage 3b, and subgroup alignment is performed in stage 3c. Each stage is shown in separate labelled panels, with icons and example images or documents indicating the processing flow.

The overall design of the alignment algorithm. The first stage performs global alignment. The second and third stages perform fine-grained clustering and feature selection in the image and text domains, respectively.

Figure 3.

A dual stream model shows visual and text embeddings aligned into joint features, with inversion modules and reconstruction, cross reconstruction, and contrastive losses.

The diagram shows a dual pathway architecture for visual and text data. On the left, visual embedding d vis and text embedding d txt feed into visual alignment and text alignment modules, which map both into joint features d joint on the right. From the joint visual feature, a reconstruction loss connects back through visual inversion to the visual embedding. From the joint text feature, a reconstruction loss connects back through text inversion to the text embedding. Cross text inversion and cross visual inversion connect joint features back to the opposite modality, forming a cross reconstruction loss between visual and text embeddings. A contrastive loss links the joint visual feature and joint text feature. Blue boxes represent features, yellow boxes represent alignment modules, and red boxes represent trainable inversion parameters.

An illustration of the alignment process. The blue boxes are the features extracted by the frozen encoders. The orange boxes are the trainable transformation matrices. The red boxes are auxiliary matrices that constrain the representations in the joint space.
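
The three loss terms named in Figure 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the use of mean squared error for the two reconstruction terms, a symmetric InfoNCE form for the contrastive term, the temperature of 0.07, and all parameter names (`align_v`, `inv_v`, `cross_inv_t`, and so on) are assumptions made for the example.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two arrays of the same shape."""
    return float(np.mean((a - b) ** 2))


def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss; row i of `a` is paired with row i of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_prob)))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))


def alignment_losses(vis, txt, p):
    """Compute the three (assumed) loss terms for a batch of paired features."""
    joint_v = vis @ p["align_v"]  # visual embedding -> joint space
    joint_t = txt @ p["align_t"]  # text embedding -> joint space
    return {
        # Each joint feature should map back to its own modality.
        "reconstruction": mse(joint_v @ p["inv_v"], vis)
        + mse(joint_t @ p["inv_t"], txt),
        # Each joint feature should also map to the paired other modality.
        "cross_reconstruction": mse(joint_v @ p["cross_inv_t"], txt)
        + mse(joint_t @ p["cross_inv_v"], vis),
        # Paired joint features should be closer than unpaired ones.
        "contrastive": info_nce(joint_v, joint_t),
    }


# Toy batch of frozen-encoder features with hypothetical dimensions.
rng = np.random.default_rng(0)
n, d_vis, d_txt, d_joint = 4, 6, 5, 3
vis = rng.normal(size=(n, d_vis))
txt = rng.normal(size=(n, d_txt))
params = {
    "align_v": rng.normal(size=(d_vis, d_joint)),
    "align_t": rng.normal(size=(d_txt, d_joint)),
    "inv_v": rng.normal(size=(d_joint, d_vis)),
    "inv_t": rng.normal(size=(d_joint, d_txt)),
    "cross_inv_t": rng.normal(size=(d_joint, d_txt)),
    "cross_inv_v": rng.normal(size=(d_joint, d_vis)),
}
losses = alignment_losses(vis, txt, params)
```

In this sketch only the matrices in `params` would be trained, while `vis` and `txt` stay fixed, which mirrors the frozen-encoder setting described in the caption.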

Figure 4.

A bar chart shows concept frequency by label, with a steep decline from the highest values and the top 100 concepts highlighted on the left.

The bar chart titled concept frequency plots labels on the horizontal axis and frequencies on the vertical axis, ranging from 0 to about 2000. Bars are ordered from highest to lowest frequency, forming a long tail distribution. The leftmost section is enclosed by a highlighted box and labelled top 100, showing the most frequent concepts. These bars start close to 2000 and decrease rapidly to around 1000 within the highlighted region. Beyond the top 100, frequencies continue to decline gradually across many labels, with most concepts appearing far less often than the highest ranked ones.

The frequency bar chart of the extracted corpus concepts. The top 20 concepts and their counts are (‘man’, 36743), (‘woman’, 23845), (‘people’, 12810), (‘shirt’, 12743), (‘girl’, 10035), (‘dog’, 10030), (‘boy’, 9393), (‘men’, 8005), (‘child’, 7746), (‘street’, 7435), (‘group’, 6959), (‘front’, 6857), (‘water’, 5489), (‘hat’, 4075), (‘person’, 3810), (‘ball’, 3679), (‘jacket’, 3365), (‘building’, 3334), (‘hand’, 3113), and (‘player’, 3099).

Figure 5.

A heatmap shows a co-occurrence matrix between detector objects and POS concepts, with sparse high value points across a largely low intensity background.

The heatmap titled co-occurrence matrix plots detector objects on the horizontal axis from 0 to about 80 and POS concepts on the vertical axis from 0 to about 90. Most cells show very low co-occurrence values, forming a dark background. Scattered brighter cells appear at specific intersections, indicating higher co-occurrence between certain detector objects and POS concepts. Vertical streaks at some detector indices suggest repeated associations across multiple POS concepts. The overall pattern shows sparse but structured co-occurrence rather than uniform distribution.

The co-occurrence matrix of the POS tagging concepts and the detection results. The x-axis shows the 80 object classes of the detector pretrained on the MS-COCO (Lin et al., 2014) dataset. The y-axis shows the top 100 concepts from the POS tagger.
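
A matrix like the one in Figure 5 can be built by counting, for each image-caption pair, which caption concepts appear together with which detected classes. The sketch below is illustrative only: the function name, the input layout, and the choice to count each (concept, class) pair at most once per sample are assumptions, and the toy vocabulary stands in for the 100 concepts and 80 detector classes.

```python
def cooccurrence_matrix(samples, concepts, detector_classes):
    """Count how often each text concept appears with each detected class.

    `samples` is a list of (caption_concepts, detected_classes) pairs, one per
    image-caption pair. Rows follow `concepts`; columns follow
    `detector_classes`.
    """
    row = {c: i for i, c in enumerate(concepts)}
    col = {d: j for j, d in enumerate(detector_classes)}
    matrix = [[0] * len(detector_classes) for _ in concepts]
    for caption_concepts, detected in samples:
        # Sets ensure a sample contributes at most once to each cell.
        for c in set(caption_concepts):
            if c not in row:
                continue
            for d in set(detected):
                if d in col:
                    matrix[row[c]][col[d]] += 1
    return matrix


# Toy data: three image-caption pairs.
concepts = ["man", "dog", "street"]
classes = ["person", "dog", "car"]
samples = [
    (["man", "street"], ["person", "car"]),
    (["dog"], ["dog"]),
    (["man", "dog"], ["person", "dog"]),
]
matrix = cooccurrence_matrix(samples, concepts, classes)
```

Bright cells in the heatmap correspond to large counts in this matrix, for example the (‘man’, ‘person’) cell in the toy data.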

Figure 6.

A word cloud shows clusters of frequently occurring visual concepts such as man, street, boat, bike, ball, dog, table, and glass, each surrounded by related terms.

The word cloud presents several clusters of concepts grouped by co-occurrence. One cluster centres on man, with related words including woman, people, boy, girl, and shirt. Another cluster focuses on street, surrounded by person, city, building, road, park, car, and sidewalk. A boat cluster includes river, wave, surf, surfer, dock, canoe, wetsuit, and fishing. A bike cluster contains bicycle, dirt, race, rider, helmet, track, and motorcycle. A ball cluster groups player, soccer, basketball, baseball, football, game, team, and uniform. A dog cluster includes field, beach, grass, sand, toy, snow, and jump. A table cluster shows baby, chair, pool, band, stage, room, and microphone. A glass cluster includes food, drink, kitchen, bar, cup, bottle, fruit, beer, and apron.

Visualization of the clustering results. The font size denotes the frequency of the word in the corpus.

Figure 7.

A scatter plot shows data points along a feature dimension, divided by multiple dashed partition points, with a central solid line marking the optimal partition.

The diagram titled optimal partition shows circular data points distributed along a horizontal feature dimension. Vertical dashed lines indicate candidate partition points that divide the feature space into segments. A single solid vertical line at the centre marks the selected optimal partition. Data points appear on both sides of each partition, with clusters forming within segments. The layout illustrates how different partition choices split the data, highlighting the central partition as the optimal separation among the available partition points.

Visualization of the discriminant feature test (DFT). Red and orange dots represent the binary labels. The partition metric is the weighted sum of the left and right binary cross-entropy. Dashed lines denote the candidate partition points.
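
The partition search described in the caption can be sketched for a single feature dimension as follows. This is a hedged illustration rather than the authors' code: it assumes evenly spaced candidate partition points, and it scores each side by the entropy of its empirical label rate as a stand-in for the binary cross-entropy mentioned in the caption. The function names and the `num_candidates` parameter are hypothetical.

```python
import math


def binary_entropy(labels):
    """Entropy in bits of a list of 0/1 labels; 0.0 for an empty or pure list."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_partition(values, labels, num_candidates=7):
    """Return the candidate threshold with the lowest weighted entropy."""
    lo, hi = min(values), max(values)
    n = len(values)
    best_t, best_loss = None, math.inf
    for i in range(1, num_candidates + 1):
        # Evenly spaced candidate partition points strictly inside (lo, hi).
        t = lo + (hi - lo) * i / (num_candidates + 1)
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        # Weight each side's entropy by the share of samples it holds.
        loss = (len(left) / n) * binary_entropy(left) + (
            len(right) / n
        ) * binary_entropy(right)
        if loss < best_loss:
            best_t, best_loss = t, loss
    return best_t, best_loss


# Toy feature dimension whose two classes are perfectly separable.
values = [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, loss = best_partition(values, labels)
```

A feature dimension whose best partition has a low loss separates the two classes well, so ranking dimensions by this loss gives a simple feature-selection criterion.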

Figure 8.

A visual text alignment example shows an urban street image with detected objects and two stages of caption alignment compared against ground truth descriptions.

The figure combines an urban street photograph with alignment results. On the left, the image shows a police officer standing beside cars on a city street, with bounding boxes highlighting vehicles and a person. Below, ground truth captions describe an officer near a car on a busy city street. On the right, two columns list first stage alignment and second stage alignment captions, clustered by car, bus, and human. The first stage includes several general street descriptions, while the second stage refines the list to fewer captions, retaining the description of a police officer standing in front of a car on a busy street as the selected alignment.

Error cases of object-detector alignment. The object detector gives all objects equal weight, so the alignment tries to include every detected object in the captions.

Table 1.

Sensitivity to clustering methods, where R@k denotes the top-k recall and #Param denotes the number of trainable parameters. All experiments use the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

		Image-to-text			Text-to-image
Clustering	#Cluster	R@1	R@5	R@10	R@1	R@5	R@10	#Param
KMeans	4	84.1	95.7	96.6	65.3	90.1	93.4	5.2M
	8	86.3	98.2	99.4	73.2	94.2	97.2	10M
	16	86.4	98.1	99.6	73.4	94.2	97.3	20M
Agglomerative	4	84.0	94.4	96.2	64.8	90.0	92.2	5.2M
	8	85.5	97.7	98.7	72.9	92.8	96.1	10M
	16	86.0	96.9	99.5	73.4	93.7	97.0	20M
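
The R@k metric reported in these tables can be computed as in the sketch below. It assumes a similarity score for every query-candidate pair and counts a query as a hit when at least one of its ground-truth candidates is ranked in the top k; the function name and input layout are illustrative choices, not taken from the paper.

```python
def recall_at_k(similarity, ground_truth, k):
    """Percentage of queries with at least one correct candidate in the top k.

    similarity[q][c] is the score of candidate c for query q.
    ground_truth[q] is the set of correct candidate indices for query q.
    """
    hits = 0
    for q, scores in enumerate(similarity):
        # Rank candidate indices by descending score and keep the k best.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if ground_truth[q] & set(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: 3 image queries, 4 candidate captions.
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct caption 0 is ranked first
    [0.2, 0.4, 0.8, 0.1],  # correct caption 1 is ranked second
    [0.5, 0.6, 0.1, 0.7],  # correct caption 2 is ranked last
]
gt = [{0}, {1}, {2}]
r1 = recall_at_k(sim, gt, 1)
r2 = recall_at_k(sim, gt, 2)
```

For image-to-text retrieval each image typically has several paired captions, so `ground_truth[q]` holds more than one index; for text-to-image retrieval it holds a single image index.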

Table 2.

Retrieval performance on the Flickr30k (1k testing set) and MS-COCO (5k testing set) datasets. We compare single-model performance across multi-modal retrieval models. The numbers are taken from Diao et al. (2023). For simplicity, R@1 denotes Recall@1.

	Flickr30k (1k testing set)						MS-COCO (5k testing set)
	Image-to-text			Text-to-image			Image-to-text			Text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
SCAN (Lee et al., 2018)	67.4	90.3	95.8	48.6	77.7	85.2	50.4	82.2	90.0	38.6	69.3	80.4
VSRN (Li et al., 2019)	71.3	90.6	96.0	54.7	81.8	88.2	53.0	81.1	89.4	40.5	70.6	81.1
CAAN (Zhang et al., 2020)	70.1	91.6	97.2	52.8	79.0	87.9	52.5	83.3	90.9	41.2	70.3	82.9
IMRAM (Chen et al., 2020b)	74.1	93.0	96.6	53.9	79.4	87.2	53.7	83.2	91.0	39.7	69.1	79.8
MMCA (Wei et al., 2020)	74.2	92.8	96.4	54.8	81.4	87.8	54.0	82.5	90.7	38.7	69.7	80.8
GSMN (Liu et al., 2020)	76.4	94.3	97.3	57.4	82.3	89.0	–	–	–	–	–	–
SGRAF (Diao et al., 2021)	77.8	94.1	97.4	58.5	83.0	88.8	57.8	84.9	91.6	41.9	70.7	81.3
SHAN (Ji et al., 2021)	74.6	93.5	96.9	55.3	81.3	88.4	–	–	–	–	–	–
WCGL (Wang et al., 2021)	74.8	93.3	96.8	54.8	80.6	87.5	–	–	–	–	–	–
RCAR (Diao et al., 2023)	78.7	94.6	97.6	59.5	84.0	89.5	59.6	85.8	92.4	42.5	71.7	81.8
SGRAFS (Jawade et al., 2023)	79.2	95.3	97.7	58.3	83.1	89.2	58.0	85.1	91.6	41.7	71.2	81.5
CLIP (Radford et al., 2021)	88.0	98.7	99.4	68.7	90.6	95.2	58.4	81.5	88.1	37.8	62.4	72.2
GEMMA (Ours)	88.6	98.9	99.6	75.7	94.2	97.1	58.6	83.2	90.0	45.3	72.6	82.8

Table 3.

Experimental results with different visual and text features for the alignment process. All experiments are conducted on the Flickr30k dataset.

			Flickr30k (1k testing set)
		Alignment	Image-to-text			Text-to-image
Visual enc.	Text enc.	(GEMMA)	Recall@1	Recall@5	Recall@10	Recall@1	Recall@5	Recall@10
CLIP vis (Radford et al., 2021)	CLIP text (Radford et al., 2021)	x	88.0	98.7	99.4	68.7	90.6	95.2
DETR (Carion et al., 2020)	RoBERTa (Liu et al., 2019)	v	66.7	89.5	93.6	56.7	84.5	90.3
DETR (Carion et al., 2020)	CLIP text (Radford et al., 2021)	v	73.6	91.6	94.5	60.0	85.8	90.6
CLIP vis (Radford et al., 2021)	RoBERTa (Liu et al., 2019)	v	86.3	98.2	99.4	73.2	94.2	97.2
CLIP vis (Radford et al., 2021)	CLIP text (Radford et al., 2021)	v	88.6	98.9	99.6	74.8	94.2	97.1

Table 4.

Ablation study on the different stages, where R@k denotes the top-k recall and #Param denotes the number of trainable parameters. All experiments use the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

Alignment	Image-to-text			Text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10
Without alignment	64.5	71.7	84.3	32.7	61.6	80.1
Global	84.8	97.8	99.0	68.3	90.7	91.1
+Image cluster	85.4	98.0	99.1	70.3	91.5	94.3
+Text cluster (Final)	86.3	98.2	99.4	73.2	94.2	97.2

Table 5.

Experiments on Detector Features

		Flickr30k (1k testing set)
Vis Feat			Image-to-text			Text-to-image
Global Feat	Detail Feat	Text Feat	Recall@1	Recall@5	Recall@10	Recall@1	Recall@5	Recall@10
CLIP	CLIP	CLIP	85.3	91.9	93.3	72.1	90.6	92.2
DETR encoder	DETR decoder	CLIP	18.3	35.1	41.8	19.5	25.3	45.9
ResNet Backbone	DETR encoder	CLIP	66.7	89.5	93.3	56.7	84.5	90.3
ResNet Backbone	DETR decoder	CLIP	72.4	91.6	95.1	59.5	85.7	90.5
ResNet Backbone	DETR decoder	RoBERTa	64.5	84.5	88.4	53.3	83.3	87.3

Anderson, Buehler, Teney, Johnson, Gould and Zhang (2018), “Bottom-up and top-down attention for image captioning and visual question answering”, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6077-6086.
