Table 2.

Retrieval performance on the Flickr30k (1k testing set) and MS-COCO (5k testing set) datasets. We compare single-model performance across all multi-modal retrieval models. The numbers are taken from Diao et al. (2023). R@1 denotes Recall@1; R@5 and R@10 are defined analogously.

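Column key: F30k = Flickr30k (1k testing set); COCO = MS-COCO (5k testing set); I2T = image-to-text; T2I = text-to-image. A "-" marks a value that is not reported.

| Method | F30k I2T R@1 | F30k I2T R@5 | F30k I2T R@10 | F30k T2I R@1 | F30k T2I R@5 | F30k T2I R@10 | COCO I2T R@1 | COCO I2T R@5 | COCO I2T R@10 | COCO T2I R@1 | COCO T2I R@5 | COCO T2I R@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SCAN (Lee et al., 2018) | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 |
| VSRN (Li et al., 2019) | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 |
| CAAN (Zhang et al., 2020) | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 | 52.5 | 83.3 | 90.9 | 41.2 | 70.3 | 82.9 |
| IMRAM (Chen et al., 2020) | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 |
| MMCA (Wei et al., 2020) | 74.2 | 92.8 | 96.4 | 54.8 | 81.4 | 87.8 | 54.0 | 82.5 | 90.7 | 38.7 | 69.7 | 80.8 |
| GSMN (Liu et al., 2020) | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | - | - | - | - | - | - |
| SGRAF (Diao et al., 2021) | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | 57.8 | 84.9 | 91.6 | 41.9 | 70.7 | 81.3 |
| SHAN (Ji et al., 2021) | 74.6 | 93.5 | 96.9 | 55.3 | 81.3 | 88.4 | - | - | - | - | - | - |
| WCGL (Wang et al., 2021) | 74.8 | 93.3 | 96.8 | 54.8 | 80.6 | 87.5 | - | - | - | - | - | - |
| RCAR (Diao et al., 2023) | 78.7 | 94.6 | 97.6 | 59.5 | 84.0 | 89.5 | 59.6 | 85.8 | 92.4 | 42.5 | 71.7 | 81.8 |
| SGRAFS (Jawade et al., 2023) | 79.2 | 95.3 | 97.7 | 58.3 | 83.1 | 89.2 | 58.0 | 85.1 | 91.6 | 41.7 | 71.2 | 81.5 |
| CLIP (Radford et al., 2021) | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 |
| GEMMA (Ours) | 88.6 | 98.9 | 99.6 | 75.7 | 94.2 | 97.1 | 58.6 | 83.2 | 90.0 | 45.3 | 72.6 | 82.8 |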
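
For readers unfamiliar with the metric, the following is a minimal sketch of how a Recall@K score like those in Table 2 is commonly computed. It is not the evaluation code behind the reported numbers. The function name `recall_at_k`, the score layout, and the toy data are illustrative assumptions. A query counts as a hit when at least one of its correct candidates appears among its K highest-scoring candidates, and the result is given as a percentage.

```python
from typing import Sequence, Set


def recall_at_k(
    scores: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    k: int,
) -> float:
    """Percentage of queries with at least one relevant candidate in the top k.

    scores[i][j] is an assumed similarity between query i and candidate j.
    relevant[i] is the set of candidate indices that are correct for query i.
    """
    if len(scores) != len(relevant):
        raise ValueError("scores and relevant must have one entry per query")
    if not scores:
        raise ValueError("at least one query is required")

    hits = 0
    for row, gold in zip(scores, relevant):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        # The query is a hit if any correct candidate is in the top k.
        if gold.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(scores)


# Toy data (hypothetical): 3 queries scored against 4 candidates.
toy_scores = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.5],
    [0.6, 0.7, 0.1, 0.3],
]
toy_relevant = [{0}, {1}, {3}]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(toy_scores, toy_relevant, k):.1f}")
```

Passing a set of correct indices per query lets the same function cover both directions: in image-to-text retrieval an image typically has several matching captions, while in text-to-image retrieval each caption has a single matching image.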
