Retrieval performance on Flickr30k (1k testing set) and MS-COCO (5k testing set). We compare single-model performance across multi-modal retrieval models. The numbers are taken from Diao et al. (2023). For brevity, R@1 denotes Recall@1 (likewise R@5 and R@10).
| | Flickr30k (1k testing set) | | | | | | MS-COCO (5k testing set) | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | Image-to-text | | | Text-to-image | | | Image-to-text | | | Text-to-image | | |
| Method | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 | R@1 | R@5 | R@10 |
| SCAN (Lee et al., 2018) | 67.4 | 90.3 | 95.8 | 48.6 | 77.7 | 85.2 | 50.4 | 82.2 | 90.0 | 38.6 | 69.3 | 80.4 |
| VSRN (Li et al., 2019) | 71.3 | 90.6 | 96.0 | 54.7 | 81.8 | 88.2 | 53.0 | 81.1 | 89.4 | 40.5 | 70.6 | 81.1 |
| CAAN (Zhang et al., 2020) | 70.1 | 91.6 | 97.2 | 52.8 | 79.0 | 87.9 | 52.5 | 83.3 | 90.9 | 41.2 | 70.3 | 82.9 |
| IMRAM (Chen et al., 2020) | 74.1 | 93.0 | 96.6 | 53.9 | 79.4 | 87.2 | 53.7 | 83.2 | 91.0 | 39.7 | 69.1 | 79.8 |
| MMCA (Wei et al., 2020) | 74.2 | 92.8 | 96.4 | 54.8 | 81.4 | 87.8 | 54.0 | 82.5 | 90.7 | 38.7 | 69.7 | 80.8 |
| GSMN (Liu et al., 2020) | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | – | – | – | – | – | – |
| SGRAF (Diao et al., 2021) | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | 57.8 | 84.9 | 91.6 | 41.9 | 70.7 | 81.3 |
| SHAN (Ji et al., 2021) | 74.6 | 93.5 | 96.9 | 55.3 | 81.3 | 88.4 | – | – | – | – | – | – |
| WCGL (Wang et al., 2021) | 74.8 | 93.3 | 96.8 | 54.8 | 80.6 | 87.5 | – | – | – | – | – | – |
| RCAR (Diao et al., 2023) | 78.7 | 94.6 | 97.6 | 59.5 | 84.0 | 89.5 | 59.6 | 85.8 | 92.4 | 42.5 | 71.7 | 81.8 |
| SGRAFS (Jawade et al., 2023) | 79.2 | 95.3 | 97.7 | 58.3 | 83.1 | 89.2 | 58.0 | 85.1 | 91.6 | 41.7 | 71.2 | 81.5 |
| CLIP (Radford et al., 2021) | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 |
| GEMMA (Ours) | 88.6 | 98.9 | 99.6 | 75.7 | 94.2 | 97.1 | 58.6 | 83.2 | 90.0 | 45.3 | 72.6 | 82.8 |
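
To make the R@K columns concrete, here is a minimal sketch of how Recall@K is commonly computed for cross-modal retrieval. It assumes the usual convention that a query counts as a hit when at least one of its ground-truth matches appears among the K highest-scoring candidates. The function name `recall_at_k`, the similarity scores, and the toy ground-truth sets are all illustrative assumptions and are not taken from the evaluated models or from Diao et al. (2023).

```python
def recall_at_k(sim, gt, k):
    """Percentage of queries with at least one correct candidate in the top k.

    sim: list of rows; sim[i][j] is the similarity of query i to candidate j.
    gt:  list of sets; gt[i] holds the candidate indices correct for query i.
    (Both names are hypothetical, chosen for this sketch.)
    """
    hits = 0
    for scores, correct in zip(sim, gt):
        # Rank candidate indices from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if correct.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(sim)


# Toy image-to-text setup: 2 images, 4 captions (2 captions per image).
sim_i2t = [
    [0.9, 0.1, 0.8, 0.2],  # image 0; captions 0 and 1 are correct
    [0.7, 0.3, 0.6, 0.5],  # image 1; captions 2 and 3 are correct
]
gt_i2t = [{0, 1}, {2, 3}]

# Text-to-image reuses the same scores, transposed: 4 caption queries, 2 images.
sim_t2i = [list(col) for col in zip(*sim_i2t)]
gt_t2i = [{0}, {0}, {1}, {1}]

print(recall_at_k(sim_i2t, gt_i2t, 1))  # 50.0
print(recall_at_k(sim_i2t, gt_i2t, 2))  # 100.0
print(recall_at_k(sim_t2i, gt_t2i, 1))  # 50.0
```

Under this convention, image-to-text queries have several correct captions each, while text-to-image queries have exactly one correct image. This is one reason the two directions are reported as separate columns.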