Experimental results with different visual and text features for the alignment process. All experiments are conducted on the Flickr30k dataset.

Flickr30k (1k testing set)

| Visual enc. | Text enc. | Alignment (GEMMA) | Image-to-text Recall@1 | Image-to-text Recall@5 | Image-to-text Recall@10 | Text-to-image Recall@1 | Text-to-image Recall@5 | Text-to-image Recall@10 |
|---|---|---|---|---|---|---|---|---|
| CLIP vis (Radford et al., 2021) | CLIP text (Radford et al., 2021) | x | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 |
| DETR (Carion et al., 2020) | RoBERTa (Liu et al., 2019) | v | 66.7 | 89.5 | 93.6 | 56.7 | 84.5 | 90.3 |
| DETR (Carion et al., 2020) | CLIP text (Radford et al., 2021) | v | 73.6 | 91.6 | 94.5 | 60.0 | 85.8 | 90.6 |
| CLIP vis (Radford et al., 2021) | RoBERTa (Liu et al., 2019) | v | 86.3 | 98.2 | 99.4 | 73.2 | 94.2 | 97.2 |
| CLIP vis (Radford et al., 2021) | CLIP text (Radford et al., 2021) | v | 88.6 | 98.9 | 99.6 | 74.8 | 94.2 | 97.1 |
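
For readers unfamiliar with the metric, the following is a minimal sketch of how Recall@K is commonly computed for cross-modal retrieval. It is not the authors' evaluation code: the function name `recall_at_k`, the `similarity` matrix layout (queries in rows, candidates in columns), and the `gt` list of ground-truth index sets are all assumptions made for illustration. A query counts as a hit at K if any of its ground-truth candidates appears among its K highest-scoring candidates.

```python
import numpy as np


def recall_at_k(similarity, gt, ks=(1, 5, 10)):
    """Compute Recall@K (in percent) for a retrieval task.

    similarity: array of shape (num_queries, num_candidates); higher = better match.
    gt: list with one set per query, holding the indices of its correct candidates
        (hypothetical layout, chosen for this sketch).
    """
    similarity = np.asarray(similarity)
    # Rank candidates for each query from most to least similar.
    ranking = np.argsort(-similarity, axis=1)
    results = {}
    for k in ks:
        top_k = ranking[:, :k]
        # A query is a hit if any ground-truth candidate is in its top K.
        hits = [bool(gt[q] & set(top_k[q].tolist())) for q in range(len(gt))]
        results[k] = 100.0 * float(np.mean(hits))
    return results


# Toy example: 3 queries scored against 3 candidates.
toy_similarity = np.array([
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.7],
    [0.6, 0.5, 0.4],
])
toy_gt = [{0}, {2}, {2}]
toy_recall = recall_at_k(toy_similarity, toy_gt, ks=(1, 2, 3))
print(toy_recall)
```

For image-to-text retrieval each image query may have several correct captions, which is why `gt` holds a set per query; for text-to-image retrieval each set would typically contain a single image index.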