Table 3.

Experimental results with different visual and text features for the alignment process. All experiments were conducted on the Flickr30k dataset.

Flickr30k (1k testing set)
Visual enc.	Text enc.	Alignment (GEMMA)	Image-to-text Recall@1	Image-to-text Recall@5	Image-to-text Recall@10	Text-to-image Recall@1	Text-to-image Recall@5	Text-to-image Recall@10
CLIP vis (Radford et al., 2021)	CLIP text (Radford et al., 2021)	x	88.0	98.7	99.4	68.7	90.6	95.2
DETR (Carion et al., 2020)	RoBERTa (Liu et al., 2019)	v	66.7	89.5	93.6	56.7	84.5	90.3
DETR (Carion et al., 2020)	CLIP text (Radford et al., 2021)	v	73.6	91.6	94.5	60.0	85.8	90.6
CLIP vis (Radford et al., 2021)	RoBERTa (Liu et al., 2019)	v	86.3	98.2	99.4	73.2	94.2	97.2
CLIP vis (Radford et al., 2021)	CLIP text (Radford et al., 2021)	v	88.6	98.9	99.6	74.8	94.2	97.1
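
The Recall@K columns above report, as a percentage, how often a correct match appears among the top K retrieved items. The sketch below shows one common way to compute this metric from a query-by-candidate similarity matrix. It is an illustration only and not the authors' evaluation code: the function name `recall_at_k`, the `relevant` structure, and the toy scores are all assumptions made for this example.

```python
def recall_at_k(sim, relevant, k):
    """Percentage of queries with at least one correct candidate in the top k.

    sim: list of rows, sim[i][j] is the similarity of query i to candidate j.
    relevant: list of sets, relevant[i] holds the correct candidate indices
              for query i (hypothetical ground-truth format).
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate indices by descending similarity and keep the top k.
        top_k = sorted(range(len(row)), key=lambda j: -row[j])[:k]
        if relevant[i] & set(top_k):
            hits += 1
    return 100.0 * hits / len(sim)


# Toy data (made up): 3 queries, 3 candidates, query i matches candidate i.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.7, 0.6],
]
relevant = [{0}, {1}, {2}]

r1 = recall_at_k(sim, relevant, 1)
r2 = recall_at_k(sim, relevant, 2)
print(f"Recall@1 = {r1:.1f}, Recall@2 = {r2:.1f}")
```

Using a set of correct indices per query lets the same function cover both directions: in image-to-text retrieval a query image can have several correct captions, while in text-to-image retrieval each caption has a single correct image.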
