
Table 4. Ablation studies on different alignment stages, where R@k denotes top-k recall and #Param denotes the number of trainable parameters. All experiments use the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

Alignment	Image-to-text			Text-to-image
	R@1	R@5	R@10	R@1	R@5	R@10
Without alignment	64.5	71.7	84.3	32.7	61.6	80.1
Global	84.8	97.8	99.0	68.3	90.7	91.1
+Image cluster	85.4	98.0	99.1	70.3	91.5	94.3
+Text cluster (Final)	86.3	98.2	99.4	73.2	94.2	97.2
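
For readers unfamiliar with the R@k metric reported above, the following is a minimal sketch of how top-k recall is commonly computed for retrieval. It is not the authors' evaluation code. The function name `recall_at_k`, the toy similarity scores, and the relevance sets are all hypothetical, and the sketch assumes a query counts as a hit when at least one of its ground-truth items appears among the k highest-scoring candidates.

```python
from typing import Sequence, Set


def recall_at_k(
    scores: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int,
) -> float:
    """Return the percentage of queries with a relevant item in the top k.

    scores[i][j] is the similarity between query i and candidate j
    (hypothetical inputs; any similarity measure could be used).
    relevant[i] is the set of candidate indices that are correct for query i.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not scores or len(scores) != len(relevant):
        raise ValueError("scores and relevant must be non-empty and equal in length")

    hits = 0
    for row, gold in zip(scores, relevant):
        # Rank candidate indices by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # A query is a hit if any ground-truth candidate is in the top k.
        if gold.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (assumed for illustration only): 3 queries, 3 candidates.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
toy_relevant = [{0}, {1}, {0}]

r_at_1 = recall_at_k(toy_scores, toy_relevant, k=1)
r_at_2 = recall_at_k(toy_scores, toy_relevant, k=2)
```

Using a set of relevant indices per query lets the same function cover both directions in the table: in image-to-text retrieval an image may have several matching captions, while in text-to-image retrieval each caption has a single matching image.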

