Table 4.

Ablation studies on the different alignment stages, where R@k denotes top-k recall and #Param denotes the number of trainable parameters. All experiments use the CLIP (Radford et al., 2021) visual encoder and the RoBERTa (Liu et al., 2019) text encoder on the Flickr30k (Young et al., 2014) dataset.

Alignment               Image-to-text           Text-to-image
                        R@1    R@5    R@10      R@1    R@5    R@10
Without alignment       64.5   71.7   84.3      32.7   61.6   80.1
Global                  84.8   97.8   99.0      68.3   90.7   91.1
+Image cluster          85.4   98.0   99.1      70.3   91.5   94.3
+Text cluster (Final)   86.3   98.2   99.4      73.2   94.2   97.2
