Experiments on Detector Features

Flickr30k (1k testing set)

| Global Feat (visual) | Detail Feat (visual) | Text Feat | Image-to-text Recall@1 | Image-to-text Recall@5 | Image-to-text Recall@10 | Text-to-image Recall@1 | Text-to-image Recall@5 | Text-to-image Recall@10 |
|---|---|---|---|---|---|---|---|---|
| CLIP | CLIP | CLIP | 85.3 | 91.9 | 93.3 | 72.1 | 90.6 | 92.2 |
| DETR encoder | DETR decoder | CLIP | 18.3 | 35.1 | 41.8 | 19.5 | 25.3 | 45.9 |
| ResNet Backbone | DETR encoder | CLIP | 66.7 | 89.5 | 93.3 | 56.7 | 84.5 | 90.3 |
| ResNet Backbone | DETR decoder | CLIP | 72.4 | 91.6 | 95.1 | 59.5 | 85.7 | 90.5 |
| ResNet Backbone | DETR decoder | RoBERTa | 64.5 | 84.5 | 88.4 | 53.3 | 83.3 | 87.3 |
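
The Recall@K columns above report the percentage of queries for which a correct match appears among the top K retrieved candidates. The sketch below shows one common way to compute this metric from a query-by-candidate similarity matrix. It is an illustration only, not the evaluation code behind these numbers: the function name `recall_at_k`, its arguments, and the toy data are all assumptions made for this example.

```python
import numpy as np

def recall_at_k(sim, relevant, ks=(1, 5, 10)):
    """Hypothetical helper: percentage of queries with a correct match in the top k.

    sim:      (n_queries, n_candidates) similarity scores, higher is better.
    relevant: list of sets; relevant[q] holds the candidate indices that
              count as correct for query q (e.g. all captions of an image).
    """
    sim = np.asarray(sim, dtype=float)
    # Rank candidates for each query by descending similarity.
    ranking = np.argsort(-sim, axis=1)
    results = {}
    for k in ks:
        hits = [
            any(int(c) in relevant[q] for c in ranking[q, :k])
            for q in range(sim.shape[0])
        ]
        results[k] = 100.0 * float(np.mean(hits))
    return results

# Toy data (made up for illustration): 3 queries, 4 candidates.
toy_sim = np.array([
    [0.9, 0.1, 0.3, 0.2],  # correct candidate 0 is ranked 1st
    [0.2, 0.8, 0.5, 0.1],  # correct candidate 2 is ranked 2nd
    [0.4, 0.3, 0.2, 0.1],  # correct candidate 3 is ranked 4th
])
toy_relevant = [{0}, {2}, {3}]
toy_scores = recall_at_k(toy_sim, toy_relevant, ks=(1, 2))
print(toy_scores)
```

For image-to-text retrieval each image is a query and `relevant[q]` would contain every caption paired with that image. For text-to-image retrieval the roles are swapped and each caption has a single correct image.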