Table 1.

Sensitivity to the clustering method, where R@k denotes top-k recall and #Param denotes the number of trainable parameters. All experiments use the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

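| Clustering | #Cluster | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | #Param |
|---|---|---|---|---|---|---|---|---|
| KMeans | 4 | 84.1 | 95.7 | 96.6 | 65.3 | 90.1 | 93.4 | 5.2M |
| KMeans | 8 | 86.3 | 98.2 | 99.4 | 73.2 | 94.2 | 97.2 | 10M |
| KMeans | 16 | 86.4 | 98.1 | 99.6 | 73.4 | 94.2 | 97.3 | 20M |
| Agglomerative | 4 | 84.0 | 94.4 | 96.2 | 64.8 | 90.0 | 92.2 | 5.2M |
| Agglomerative | 8 | 85.5 | 97.7 | 98.7 | 72.9 | 92.8 | 96.1 | 10M |
| Agglomerative | 16 | 86.0 | 96.9 | 99.5 | 73.4 | 93.7 | 97.0 | 20M |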
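
For readers unfamiliar with the R@k columns, the following is a minimal sketch of how top-k recall is commonly computed for retrieval. It is not the authors' evaluation code. The function name `recall_at_k`, the `similarity` matrix, and the `relevant` sets are hypothetical names introduced for illustration, and the sketch assumes that a query counts as a hit when at least one of its ground-truth items appears among its k highest-scoring candidates.

```python
import numpy as np


def recall_at_k(similarity, relevant, k):
    """Percentage of queries with at least one relevant candidate in the top k.

    similarity: (num_queries, num_candidates) array of query-candidate scores.
    relevant:   one set of ground-truth candidate indices per query
                (hypothetical input format, assumed for this sketch).
    """
    # Indices of the k highest-scoring candidates for each query (row).
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # A query is a hit if its top-k indices overlap its ground-truth set.
    hits = [bool(set(row.tolist()) & rel) for row, rel in zip(top_k, relevant)]
    return 100.0 * float(np.mean(hits))


# Toy data: 3 queries scored against 4 candidates.
similarity = np.array([
    [0.9, 0.1, 0.3, 0.2],  # ranking: 0, 2, 3, 1
    [0.2, 0.4, 0.8, 0.1],  # ranking: 2, 1, 0, 3
    [0.5, 0.6, 0.1, 0.7],  # ranking: 3, 1, 0, 2
])
relevant = [{0}, {1}, {0}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(similarity, relevant, k):.1f}")
```

For image-to-text retrieval the rows would be images and the columns captions; for text-to-image retrieval the roles are swapped. On Flickr30k each image typically has several reference captions, which is why `relevant` holds a set per query rather than a single index.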
