Table 3.

Experimental results with different visual and text features for the alignment process. All experiments are conducted on the Flickr30k dataset.

Flickr30k (1k testing set)

| Visual enc. | Text enc. | Alignment (GEMMA) | Image-to-text Recall@1 | Image-to-text Recall@5 | Image-to-text Recall@10 | Text-to-image Recall@1 | Text-to-image Recall@5 | Text-to-image Recall@10 |
|---|---|---|---|---|---|---|---|---|
| CLIP vis (Radford et al., 2021) | CLIP text (Radford et al., 2021) | x | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 |
| DETR (Carion et al., 2020) | RoBERTa (Liu et al., 2019) | v | 66.7 | 89.5 | 93.6 | 56.7 | 84.5 | 90.3 |
| DETR (Carion et al., 2020) | CLIP text (Radford et al., 2021) | v | 73.6 | 91.6 | 94.5 | 60.0 | 85.8 | 90.6 |
| CLIP vis (Radford et al., 2021) | RoBERTa (Liu et al., 2019) | v | 86.3 | 98.2 | 99.4 | 73.2 | 94.2 | 97.2 |
| CLIP vis (Radford et al., 2021) | CLIP text (Radford et al., 2021) | v | 88.6 | 98.9 | 99.6 | 74.8 | 94.2 | 97.1 |
