Table 1.

Sensitivity to clustering methods. R@k denotes top-k recall, and #Param denotes the number of trainable parameters. All experiments use the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

		Image-to-text			Text-to-image
Clustering	#Cluster	R@1	R@5	R@10	R@1	R@5	R@10	#Param
KMeans	4	84.1	95.7	96.6	65.3	90.1	93.4	5.2M
	8	86.3	98.2	99.4	73.2	94.2	97.2	10M
	16	86.4	98.1	99.6	73.4	94.2	97.3	20M
Agglomerative	4	84.0	94.4	96.2	64.8	90.0	92.2	5.2M
	8	85.5	97.7	98.7	72.9	92.8	96.1	10M
	16	86.0	96.9	99.5	73.4	93.7	97.0	20M