Table 2.

Retrieval performance on the Flickr30k (1k testing set) and MS-COCO (5k testing set) datasets. We compare single-model performance across all multi-modal retrieval models. The numbers are taken from Diao et al. (2023). For simplicity, R@1 denotes Recall@1.

	Flickr30k (1k testing set)						MS-COCO (5k testing set)
	Image-to-text			Text-to-image			Image-to-text			Text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10	R@1	R@5	R@10
SCAN (Lee et al., 2018)	67.4	90.3	95.8	48.6	77.7	85.2	50.4	82.2	90.0	38.6	69.3	80.4
VSRN (Li et al., 2019)	71.3	90.6	96.0	54.7	81.8	88.2	53.0	81.1	89.4	40.5	70.6	81.1
CAAN (Zhang et al., 2020)	70.1	91.6	97.2	52.8	79.0	87.9	52.5	83.3	90.9	41.2	70.3	82.9
IMRAM (Chen et al., 2020)	74.1	93.0	96.6	53.9	79.4	87.2	53.7	83.2	91.0	39.7	69.1	79.8
MMCA (Wei et al., 2020)	74.2	92.8	96.4	54.8	81.4	87.8	54.0	82.5	90.7	38.7	69.7	80.8
GSMN (Liu et al., 2020)	76.4	94.3	97.3	57.4	82.3	89.0	–	–	–	–	–	–
SGRAF (Diao et al., 2021)	77.8	94.1	97.4	58.5	83.0	88.8	57.8	84.9	91.6	41.9	70.7	81.3
SHAN (Ji et al., 2021)	74.6	93.5	96.9	55.3	81.3	88.4	–	–	–	–	–	–
WCGL (Wang et al., 2021)	74.8	93.3	96.8	54.8	80.6	87.5	–	–	–	–	–	–
RCAR (Diao et al., 2023)	78.7	94.6	97.6	59.5	84.0	89.5	59.6	85.8	92.4	42.5	71.7	81.8
SGRAFS (Jawade et al., 2023)	79.2	95.3	97.7	58.3	83.1	89.2	58.0	85.1	91.6	41.7	71.2	81.5
CLIP (Radford et al., 2021)	88.0	98.7	99.4	68.7	90.6	95.2	58.4	81.5	88.1	37.8	62.4	72.2
GEMMA (Ours)	88.6	98.9	99.6	75.7	94.2	97.1	58.6	83.2	90.0	45.3	72.6	82.8
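
As a rough illustration of what the R@K columns measure, the sketch below computes Recall@K as the percentage of queries for which at least one correct candidate appears among the top K ranked results. This is the commonly used definition for image-text retrieval and is an assumption here, not the evaluation code behind the table. The function name `recall_at_k`, its arguments, and the toy similarity scores are all hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(
    scores: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int,
) -> float:
    """Percentage of queries with at least one relevant item in the top k.

    scores[q][c]  -- similarity between query q and candidate c (higher is better)
    relevant[q]   -- indices of the candidates that count as correct for query q
    """
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have one entry per query")
    if not scores:
        raise ValueError("at least one query is required")
    hits = 0
    for row, gold in zip(scores, relevant):
        # Rank candidate indices by descending similarity and keep the top k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if gold.intersection(top_k):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy example: 3 queries, 4 candidates (values are made up for illustration).
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked first
    [0.2, 0.4, 0.8, 0.5],  # correct candidate 1 is ranked third
    [0.1, 0.2, 0.3, 0.7],  # correct candidate 3 is ranked first
]
toy_relevant = [{0}, {1}, {3}]

r1 = recall_at_k(toy_scores, toy_relevant, k=1)
r3 = recall_at_k(toy_scores, toy_relevant, k=3)
```

In this toy case two of the three queries rank a correct candidate first, so `r1` is about 66.7, while all three have a correct candidate within the top three, so `r3` is 100.0. Passing a set of indices per query lets the same function cover image-to-text retrieval, where an image may have several correct captions.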
