The purpose of this study is to explore and evaluate advanced text-to-image synthesis methods that generate realistic and semantically aligned images from textual descriptions. By leveraging modern deep learning approaches, the research aims to improve image generation quality, diversity and textual coherence using artificial intelligence techniques.
The research focuses on designing, implementing and training four different text-to-image generator architectures based on generative adversarial networks (GANs) and transformer-based text embeddings. Two distinct text-to-image fusion strategies were applied: deep fusion (DF) via affine transformations and a semantic-spatial attention mechanism. The models were trained on three large datasets (CUB-200, MS-COCO and ImageNet), resulting in 12 unique generator-discriminator configurations. Performance was evaluated using the inception score and Fréchet inception distance (FID).
The proposed architecture, combining DF blocks and Semantic-Spatial Aware Convolution Network (SSACN) blocks, achieved competitive results, outperforming several existing models such as AttnGAN, MirrorGAN and DF-GAN in terms of FID on the MS-COCO dataset. The best-performing model demonstrated its ability to generate diverse and high-quality images that are semantically consistent with the input captions. The use of semantic-spatial fusion further improved the focus and alignment of generated content to the relevant regions described in the text.
This work contributes to the field of text-to-image synthesis by introducing and experimentally validating a hybrid fusion approach that integrates global and spatially aware semantic conditioning. The developed models, supported by a systematic evaluation across multiple datasets, demonstrate improved performance over several state-of-the-art solutions, offering a valuable framework for future research in multimodal content generation.
1. Introduction
Text-to-image synthesis, the task of generating realistic images from textual descriptions, has gained significant attention due to its applications in graphic design, marketing and gaming [1]. Advances in deep learning, particularly generative adversarial networks (GANs) [2] and transformers [3], have enabled models to produce high-quality, semantically consistent images [4]. However, challenges remain in effectively fusing textual and visual features to ensure alignment between generated images and captions.
This paper investigates GAN-based text-to-image synthesis architectures, focusing on novel techniques to integrate textual information into image generation. We propose models that leverage advanced fusion mechanisms to enhance semantic consistency, building on recent methods ([5, 6]). Our approach combines multiple generator configurations, trained and evaluated on large-scale datasets [7], including CUB-200 [8], MS-COCO [9] and ImageNet [10]. Using the inception score (IS) [11] and Fréchet inception distance (FID) [12], we assess the quality and diversity of generated images, comparing our results to state-of-the-art models. Our findings demonstrate improved performance in generating diverse, high-quality images aligned with textual inputs, with potential applications in automated content creation and visualization.
2. Related work
Early approaches to artificial intelligence (AI)-generated art focused on style transfer, from simple image analogies [13] to neural feature visualization like DeepDream [14]. Optimization-based methods [15] were later accelerated by feed-forward generation [16].
The introduction of GANs [2] enabled novel applications such as image-to-image translation [17], artistic style deviation [18] and improved realism through large-scale models [19]. In text-to-image synthesis, early models [20] introduced text-conditioned GANs, extended in StackGAN [21] via multi-stage generation and conditioning augmentation.
More recently, transformer and diffusion-based models such as DALL·E [22], Guided Language to Image Diffusion for Generation and Editing (GLIDE) [23], Imagen [24] and Stable Diffusion [25] have set new benchmarks in fidelity and text-image alignment [26]. However, these models require massive resources and often lack transparency.
Our work builds on GAN-based methods, introducing a hybrid architecture that integrates deep fusion (DF) and semantic-spatial attention. It is evaluated on public datasets (CUB-200, MS-COCO and ImageNet) and outperforms baselines like AttnGAN, MirrorGAN and DF-GAN in FID on MS-COCO, while maintaining interpretability and efficiency.
3. Methodology
This study develops and evaluates text-to-image synthesis models based on GANs, incorporating advanced text-to-image fusion techniques. The proposed system consists of a generator and a discriminator, both conditioned on textual captions encoded using the Deep Attentional Multimodal Similarity Model (DAMSM) [27].
Generator architecture. The generator is a deep deconvolutional neural network that maps a random noise vector and a text embedding vector to a red-green-blue (RGB) image of dimensions 3 × 256 × 256. The architecture comprises seven fusion blocks, each followed by an UpBlock that doubles the feature map size using nearest-neighbor interpolation, starting from ngf×8 × 4 × 4 to ngf×256×256 (where ngf = 64). The final layers include batch normalization, a leaky ReLU (slope 0.2), a deconvolutional layer (kernel size 3×3, stride 1 and valid padding) and a hyperbolic tangent activation to produce the output image. Two types of fusion blocks are employed: DF Blocks [5] and Semantic-Spatial Aware Convolution Network (SSACN) Blocks [6].
DF Block: Takes feature maps and a text embedding as input and processes them through two affine transformations (each using fully connected layers to predict scaling ($\gamma$) and shifting ($\beta$) parameters), followed by ReLU and a convolutional layer (kernel 3×3, stride 1, valid padding). This enables DF of textual and visual features [5].
SSACN Block: Extends the DF Block by incorporating a mask predictor that generates a spatial mask to focus text fusion on relevant image subregions. The mask predictor consists of a convolutional layer (kernel 3×3, 100 output channels), batch normalization, ReLU, another convolutional layer (kernel 1×1, 1 output channel) and a sigmoid activation. The mask modulates the affine transformation parameters, enhancing semantic consistency [6].
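The two fusion operations can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions, not the implementation used in this study: a single linear layer stands in for the fully connected layers, the weights are random, the helper names (`make_affine_params`, `df_affine`, `masked_affine`) are hypothetical, and the mask is assumed to gate both the scaling and the shifting parameters at every spatial position.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_affine_params(text_dim, channels):
    """Hypothetical weights: one linear layer for gamma and one for beta."""
    return {
        "w_gamma": rng.normal(0.0, 0.02, size=(text_dim, channels)),
        "b_gamma": np.ones(channels),
        "w_beta": rng.normal(0.0, 0.02, size=(text_dim, channels)),
        "b_beta": np.zeros(channels),
    }

def predict_gamma_beta(text_emb, params):
    """Map the text embedding to per-channel scaling and shifting parameters."""
    gamma = text_emb @ params["w_gamma"] + params["b_gamma"]
    beta = text_emb @ params["w_beta"] + params["b_beta"]
    return gamma, beta

def df_affine(features, text_emb, params):
    """DF-style affine: the same (gamma, beta) is applied at every position."""
    gamma, beta = predict_gamma_beta(text_emb, params)
    return gamma[:, None, None] * features + beta[:, None, None]

def masked_affine(features, text_emb, params, mask):
    """SSACN-style affine (assumed form): a mask in [0, 1] of shape (H, W)
    gates gamma and beta, so fusion is concentrated where the mask is high."""
    gamma, beta = predict_gamma_beta(text_emb, params)
    gated_gamma = mask[None] * gamma[:, None, None]
    gated_beta = mask[None] * beta[:, None, None]
    return gated_gamma * features + gated_beta

# Toy usage: 8 channels at 4 x 4 resolution, 256-dimensional text embedding.
params = make_affine_params(text_dim=256, channels=8)
features = rng.normal(size=(8, 4, 4))
text_emb = rng.normal(size=256)
fused = df_affine(features, text_emb, params)
fused_masked = masked_affine(features, text_emb, params, np.full((4, 4), 0.5))
print(fused.shape, fused_masked.shape)
```

With a mask of all ones the masked variant reduces to the plain channel-wise affine transformation, which is one way to see the SSACN Block as an extension of the DF Block.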
Four generator configurations were designed by varying the fusion blocks: (1) 7 DF Blocks (7DF, 28.7M parameters), (2) 4 DF Blocks followed by 3 SSACN Blocks (4DF-3SSACN, 29.5M parameters), (3) 4 SSACN Blocks followed by 3 DF Blocks (4SSACN-3DF, 30.5M parameters) and (4) 7 SSACN Blocks (7SSACN, 31.4M parameters).
Discriminator architecture. The discriminator is a deep convolutional neural network that evaluates image-caption pairs, outputting a score in [0, 1] (0 for fake/mismatched, 1 for real/matched). It processes an RGB image (3 × 256 × 256) and a text embedding $e$ through seven convolutional layers. The first layer uses a 3×3 kernel, stride 1 and valid padding, producing ndf feature maps (ndf = 64). The next five layers use 4×4 kernels, stride 2 and padding 1, doubling the channels and halving the spatial dimensions. The final layer (kernel 4×4, stride 2 and valid padding) outputs a scalar. Leaky ReLU (slope 0.2) follows all but the first layer. The total parameters are approximately 80.4M.
The models were evaluated using the IS and FID on 10,000 generated images per model, compared against real images from each dataset. The 4DF-3SSACN configuration was selected for comparison with baselines (AttnGAN, MirrorGAN, DF-GAN and Semantic-Spatial Aware Generative Adversarial Network (SSA-GAN)) based on its balanced performance.
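The IS evaluates the quality and diversity of generated images [11]. It is computed as:
$$\mathrm{IS} = \exp\left(\mathbb{E}_{x \sim p_g}\left[D_{KL}\left(p(y \mid x)\,\|\,p(y)\right)\right]\right)$$
where $\mathbb{E}_{x \sim p_g}$ denotes the expectation over generated images $x$, $D_{KL}$ is the Kullback-Leibler divergence, $p(y \mid x)$ is the conditional class probability distribution from a pretrained Inception v3 model [28] and $p(y)$ is the marginal class distribution. The KL divergence is defined as:
$$D_{KL}(P\,\|\,Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}$$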
Higher IS values indicate high-quality and diverse images [29].
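As a hedged illustration of the IS definition (the exponential of the mean KL divergence between each conditional class distribution and the marginal class distribution), the following pure-Python sketch computes the score from a list of class-probability vectors. In practice these vectors would come from the Inception v3 softmax output; here they are toy inputs, and the function name `inception_score` is chosen for illustration only.

```python
import math

def inception_score(cond_probs):
    """cond_probs: list of per-image class distributions p(y|x), each summing to 1."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal class distribution p(y): average of the conditional distributions.
    marginal = [sum(p[j] for p in cond_probs) / n for j in range(k)]
    total_kl = 0.0
    for p in cond_probs:
        # KL(p(y|x) || p(y)); terms with p_j = 0 contribute nothing.
        total_kl += sum(pj * math.log(pj / marginal[j])
                        for j, pj in enumerate(p) if pj > 0.0)
    return math.exp(total_kl / n)

# Confident predictions spread evenly over 4 classes: high quality and diversity.
confident = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
# Completely uncertain predictions: the score collapses to its minimum.
uncertain = [[0.25, 0.25, 0.25, 0.25]] * 4
print(inception_score(confident), inception_score(uncertain))
```

The two toy cases mark the extremes of the metric: the score equals the number of classes when predictions are one-hot and evenly spread, and equals 1 when every prediction is uniform.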
The FID measures similarity between feature distributions of real and synthetic images [26]. It is computed as:
$$\mathrm{FID} = \|\mu_g - \mu_r\|_2^2 + \mathrm{Tr}\left(\Sigma_g + \Sigma_r - 2\left(\Sigma_g \Sigma_r\right)^{1/2}\right)$$
where $\mu_g$ and $\Sigma_g$ are the mean and covariance matrix of features from synthetic images, extracted from the last pooling layer of Inception v3, $\mu_r$ and $\Sigma_r$ are the mean and covariance of real image features, $\|\cdot\|_2^2$ is the squared L2 norm and $\mathrm{Tr}$ is the matrix trace (sum of diagonal elements). Lower FID values indicate better quality and diversity.
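The Fréchet distance between two Gaussians fitted to feature sets (squared distance of the means plus a trace term over the covariances) can be sketched in NumPy as follows. This is a minimal sketch, not the evaluation code used in the study: the features are random toy vectors rather than Inception v3 activations, the function names are hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of the covariance product instead of an explicit matrix square root.

```python
import numpy as np

def frechet_distance(mu_g, sigma_g, mu_r, sigma_r):
    """Frechet distance between N(mu_g, sigma_g) and N(mu_r, sigma_r)."""
    diff = mu_g - mu_r
    # Tr((Sigma_g Sigma_r)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of Sigma_g Sigma_r, which are real and non-negative for
    # positive semi-definite covariances (small negatives are clipped).
    eigvals = np.linalg.eigvals(sigma_g @ sigma_r)
    sqrt_trace = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma_g) + np.trace(sigma_r)
                 - 2.0 * sqrt_trace)

def fid_from_features(feats_g, feats_r):
    """feats_*: arrays of shape (num_images, feature_dim)."""
    mu_g, mu_r = feats_g.mean(axis=0), feats_r.mean(axis=0)
    sigma_g = np.cov(feats_g, rowvar=False)
    sigma_r = np.cov(feats_r, rowvar=False)
    return frechet_distance(mu_g, sigma_g, mu_r, sigma_r)

# Toy usage with 4-dimensional stand-in features.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
shifted = real + 1.0  # same covariance, mean moved by 1 in each dimension
print(fid_from_features(real, real), fid_from_features(shifted, real))
```

In the toy usage, identical feature sets give a distance of approximately zero, while shifting every feature by 1 in four dimensions leaves only the mean term, a squared norm of 4.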
4. Experiment setup and result analysis
The models were trained on three datasets: CUB-200 (11,788 images, 200 bird species) [8], MS-COCO (328,000 images, diverse scenes) [9] and ImageNet (1,200,000 images, 1,000 categories) [10]. Captions were encoded using the DAMSM [27], producing 256-dimensional sentence feature vectors. The training followed a minimax optimization in which the generator $G$ minimizes and the discriminator $D$ maximizes the adversarial objective, with $G$ generating images from noise $z$ and text embedding $e$, $D$ classifying image-caption pairs ($x$: real image, $\hat{e}$: mismatched caption) and a weighting coefficient balancing the loss contributions of mismatched and fake pairs. Manifold interpolation, blending text embeddings to create synthetic captions, was applied to enhance data diversity [20]. The Adam optimizer (learning rate 0.0002, , ) was used, with 500 epochs for CUB-200 and MS-COCO and 20 epochs for ImageNet, using a batch size of 32.
Table 1 shows the results of the FID and IS measures on the CUB-200 test images. The first column lists the generator models by their abbreviations, the second column holds the achieved FID scores and the third column lists the achieved IS scores. As noted before, a lower FID value means higher quality of images, whereas a higher IS value means higher quality and variety of images.
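The training signals described above can be sketched in a few lines of pure Python. The exact objective used in the study is not reproduced here; the sketch assumes a standard log-likelihood GAN objective with a matching-aware term, a hypothetical `weight` parameter (default 0.5) that splits the penalty between mismatched and fake pairs, and linear blending of two text embeddings for manifold interpolation.

```python
import math

def discriminator_objective(d_real, d_mismatch, d_fake, weight=0.5):
    """Value the discriminator tries to maximize for scores in (0, 1).

    d_real:     D(real image, matching caption)
    d_mismatch: D(real image, mismatched caption)
    d_fake:     D(generated image, matching caption)
    weight:     assumed coefficient balancing mismatched vs. fake pairs
    """
    return (math.log(d_real)
            + weight * math.log(1.0 - d_mismatch)
            + (1.0 - weight) * math.log(1.0 - d_fake))

def generator_objective(d_fake):
    """Value the generator tries to minimize: low when D is fooled."""
    return math.log(1.0 - d_fake)

def interpolate_embeddings(e1, e2, t=0.5):
    """Blend two text embeddings into a synthetic caption embedding."""
    return [t * a + (1.0 - t) * b for a, b in zip(e1, e2)]

# A discriminator that separates the three pair types scores higher
# than one that outputs 0.5 for everything.
print(discriminator_objective(0.9, 0.1, 0.1))
print(discriminator_objective(0.5, 0.5, 0.5))
print(interpolate_embeddings([0.0, 2.0], [2.0, 0.0]))
```

Treating mismatched pairs as a separate negative class pushes the discriminator to judge caption agreement and not only image realism, which is the property the text-conditioned setup relies on.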
The results of FID and IS on the CUB-200 images
| Model | FID (↓) | IS (↑) |
|---|---|---|
| 7DF | 27.11 | 4.02 |
| 4DF-3SSACN | 24.77 | 4.45 |
| 4SSACN-3DF | 31.57 | 4.10 |
| 7SSACN | 24.16 | 3.94 |
The 7SSACN generator model achieves the best FID score of 24.16, followed by the 4DF-3SSACN model with a slightly higher score of 24.77. The 4SSACN-3DF model yields the highest FID score of 31.57, which is significantly worse than the other three models. The 4DF-3SSACN generator achieved the highest IS value of 4.45, while the 7SSACN model produced the lowest value of 3.94.
Table 2 shows the results of the different generator configurations on the MS-COCO dataset. The 7SSACN model achieves the lowest FID score of 21.18, and the 7DF and 4DF-3SSACN models follow with slightly higher scores of 21.49 and 21.61. The 4SSACN-3DF performs worst, with a FID score of 22.62. As for the IS measure, the 4SSACN-3DF achieves the worst result of 18.82, while the 7DF performs best, with a score of 19.68.
The results of FID and IS on the MS-COCO images
| Model | FID (↓) | IS (↑) |
|---|---|---|
| 7DF | 21.49 | 19.68 |
| 4DF-3SSACN | 21.61 | 18.85 |
| 4SSACN-3DF | 22.62 | 18.82 |
| 7SSACN | 21.18 | 19.14 |
Lastly, Table 3 shows the results of FID and IS measures on the ImageNet images. The model with the best performance was 7DF, with a FID score of 129.72, and an IS of 5.74. The worst model considering both measures was 4DF-3SSACN, with a FID of 138.17, and an IS value of 5.27. The FID scores of the ImageNet images are significantly higher than those of the other two datasets. The reason for this is the failure of the models to converge during the training process.
The results of FID and IS on the ImageNet images
| Model | FID (↓) | IS (↑) |
|---|---|---|
| 7DF | 129.72 | 5.74 |
| 4DF-3SSACN | 138.17 | 5.27 |
| 4SSACN-3DF | 132.46 | 5.60 |
| 7SSACN | 136.11 | 5.33 |
Considering the results on all three datasets, the 7SSACN model proved to be the most effective. Its FID score was the lowest on both the CUB-200 and MS-COCO datasets, and its IS score was satisfactory on all three datasets. Although slightly worse than the 7SSACN model, the 4DF-3SSACN model also achieved significant results. It had the highest IS value on the CUB-200 dataset, and it also achieved a considerably lower FID score than the 4SSACN-3DF and 7DF models on the CUB-200 dataset.
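The per-dataset rankings discussed in this section can be checked mechanically from the values reported in Tables 1 to 3. The short sketch below only restates those numbers; the dictionary layout and the helper names `best_by_fid` and `best_by_is` are illustrative choices.

```python
# (FID, IS) per generator configuration, copied from Tables 1 to 3.
results = {
    "CUB-200": {"7DF": (27.11, 4.02), "4DF-3SSACN": (24.77, 4.45),
                "4SSACN-3DF": (31.57, 4.10), "7SSACN": (24.16, 3.94)},
    "MS-COCO": {"7DF": (21.49, 19.68), "4DF-3SSACN": (21.61, 18.85),
                "4SSACN-3DF": (22.62, 18.82), "7SSACN": (21.18, 19.14)},
    "ImageNet": {"7DF": (129.72, 5.74), "4DF-3SSACN": (138.17, 5.27),
                 "4SSACN-3DF": (132.46, 5.60), "7SSACN": (136.11, 5.33)},
}

def best_by_fid(scores):
    """Lower FID is better."""
    return min(scores, key=lambda model: scores[model][0])

def best_by_is(scores):
    """Higher IS is better."""
    return max(scores, key=lambda model: scores[model][1])

for dataset, scores in results.items():
    print(dataset, "best FID:", best_by_fid(scores), "best IS:", best_by_is(scores))
```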
The generator model that performed the worst overall was 4SSACN-3DF. It had the highest (worst) FID score on the CUB-200 and MS-COCO datasets and the lowest IS score on the MS-COCO dataset. The reason for this could be the use of SSACN blocks in shallow layers of the generator architecture. In Table 4, we can see that the early layers of the generator network have smaller sizes of feature maps. Consequently, the spatial masks corresponding to these feature maps are also smaller. For example, the first layer outputs feature maps of size (4, 4), which means the mask predictor component produces a spatial mask of the same size. As the size of feature maps gradually increases throughout the network, each cell of such a small mask affects a large part of the final image. Because of this, text-to-image fusion may be inaccurate in some text-relevant subregions of the final image, which can result in images that are less semantically consistent with the provided captions.
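The effect of mask resolution can be made concrete with a little arithmetic. The sketch below assumes that feature maps double in size from 4 × 4 up to the 256 × 256 output and that each mask cell maps onto a square block of output pixels, ignoring the receptive fields of the intermediate convolutions; the function name `mask_footprint` is hypothetical.

```python
def mask_footprint(mask_size, output_size=256):
    """Side length, in output pixels, of the region governed by one mask cell."""
    return output_size // mask_size

# Assumed feature-map sizes along the generator, doubling from 4 to 256.
feature_sizes = [4, 8, 16, 32, 64, 128, 256]
footprints = {size: mask_footprint(size) for size in feature_sizes}

for size, side in footprints.items():
    share = 100.0 * (side * side) / (256 * 256)
    print(f"{size:>3} x {size:<3} mask: one cell covers {side} x {side} px "
          f"({share:.4f}% of the image)")
```

Under these assumptions a single cell of a 4 × 4 mask governs a 64 × 64 pixel block, one sixteenth of the final image, whereas a cell of a 64 × 64 mask governs only a 4 × 4 block. This is consistent with the explanation that masks predicted in shallow layers are too coarse to localize text-relevant regions.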
Comparison of FID results on the CUB-200 and MS-COCO images
| Model | CUB-200 | MS-COCO |
|---|---|---|
| AttnGAN | 23.98 | 35.49 |
| MirrorGAN | 18.34 | 34.71 |
| DF-GAN | 19.24 | 28.92 |
| SSA-GAN | 15.61 | 19.37 |
| 4DF-3SSACN (ours) | 24.77 | 21.18 |
In Table 4, we show a comparison of our FID results with results of some of the existing research efforts in the field of text-to-image synthesis. Specifically, we compare it with the following GAN models: AttnGAN [27], MirrorGAN [30], DF-GAN [5] and SSA-GAN [6]. As shown in the table, the SSA-GAN achieved the lowest FID scores on both datasets. Although our model performed worst on the CUB-200 images, it was second best on the MS-COCO dataset, where it and the SSA-GAN model significantly outperformed the other three models. The model with the worst FID score on the MS-COCO dataset is the AttnGAN, with a value of 35.49.
In Table 5, we show a comparison of our IS results with the results of the other GAN models. Because some of the IS scores on the MS-COCO dataset are unavailable, we only show the results on the CUB-200 dataset. As we can see from the table, the SSA-GAN has the highest IS score, with a value of 5.17. With all results being similar, our model slightly outperformed the AttnGAN, which achieved an IS of 4.36.
Comparison of IS results on the CUB-200 images
| Model | IS (↑) |
|---|---|
| AttnGAN | 4.36 |
| MirrorGAN | 4.56 |
| DF-GAN | 4.86 |
| SSA-GAN | 5.17 |
| 4DF-3SSACN (ours) | 4.45 |
Upon closer analysis of the evaluation results of the models, we can notice that most of the achieved IS values are very low. Considering that the upper limit of the IS is 1,000, it is reasonable to assume that state-of-the-art models, such as SSA-GAN, would yield a significantly higher IS value than 5.17. An explanation for these unexpected results can be found in Ref. [29], where the authors call for researchers in the field of text-to-image to be cautious when using the IS metric to compare different generative models.
One of the limitations of the IS, which certainly disrupts the scores we achieved on the CUB-200 and MS-COCO datasets, is that it doesn't provide reliable results on datasets other than ImageNet, which was used to train its underlying Inception network. As the classes in the CUB-200 and MS-COCO aren't analogous to the ones in the ImageNet dataset, the predicted classes of the Inception network and the actual classes of images generated by the trained models are expected to mismatch to a certain degree.
More specifically, the dataset classes can be misaligned in two different ways: some classes can be present in the ImageNet dataset and not in the other two datasets, and conversely, there are classes that might be present in the CUB-200 and MS-COCO datasets but not in ImageNet. In the first scenario, the larger number of ImageNet classes reduces the calculated entropy of the marginal probability distribution $p(y)$, which estimates the diversity of images, and consequently decreases the computed IS. For instance, if a text-to-image synthesis model is trained on images of animals and is meant to generate images of this type only, the computed IS is going to be poor, no matter how effectively the model generates images of animals. In the second scenario, the lack of classes in the Inception network causes it to incorrectly classify objects depicted in the generated images, raising the entropy of the conditional probability distribution $p(y \mid x)$ and thereby lowering the estimated quality of images and the overall IS of the dataset. For instance, if the evaluated text-to-image model generates a large number of images showing a specific type of animal that the Inception network is unfamiliar with, the resulting IS is likely going to be low, even if the images are of high quality and diverse.
This issue with the IS is especially prominent in the CUB-200 dataset, since its class labels consist of 200 different bird species, and the ImageNet dataset only contains 57 class labels related to bird species [31]. Because of this misalignment, the Inception network is ideally only able to differentiate 57 different classes when calculating the IS on synthetic images generated by the model trained on the CUB-200 dataset. The first scenario of misalignment in class labels also applies, as ImageNet contains five times as many class labels as the CUB-200 dataset, resulting in a low estimation of diversity in generated images. Lastly, we point out that out of all ImageNet images labeled as a type of bird, 7% are incorrectly annotated, decreasing the accuracy of the Inception network and the overall IS [31]. Considering these observations, we conclude that the IS is not the preferred metric when dealing with the CUB-200 dataset, and a similar inference can be made in the case of the MS-COCO dataset.
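The ceiling that class misalignment places on the IS can be illustrated with an idealized calculation. The sketch assumes every generated image is classified with full confidence (one-hot conditional distribution) and that the classes the classifier can actually distinguish are equally frequent; under these assumptions the score equals the number of distinguishable classes. The function name `ideal_inception_score` is hypothetical.

```python
import math

def ideal_inception_score(num_covered_classes):
    """IS under ideal conditions: one-hot p(y|x), uniform p(y) over k classes."""
    k = num_covered_classes
    marginal = 1.0 / k
    # For a one-hot p(y|x), the KL divergence to p(y) reduces to log(1 / p(y)).
    kl = math.log(1.0 / marginal)
    return math.exp(kl)

# 57 bird-related ImageNet classes versus the full 1,000-class label set.
print(ideal_inception_score(57))
print(ideal_inception_score(1000))
```

Even in this best case, a model evaluated only on bird images could not exceed an IS of about 57, far below the theoretical limit of 1,000. Real scores are lower still, because predictions are neither fully confident nor uniformly spread.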
Another undesirable property of the IS worth considering, which might have also influenced the results obtained in the testing procedure of our models, is the sensitivity of the metric to the version of the Inception network used to calculate the scores. More specifically, minor changes in weights of the inception network have proved to result in extreme changes in the output IS for the identical input dataset. Although the inception networks with slightly altered weights produce almost exactly the same classification accuracies when run on a validation dataset, the difference in the computed IS values can reach up to 11.5% [29]. Even if all of the models being compared use the same version of the inception network in the testing procedure, which is in our case the Inception V3 architecture, the mere difference in the implementation, i.e. framework used to evaluate the generative model, can lead to a significant difference in the obtained results [32].
Next, we review the quality of images created by the generator models trained on the CUB-200 and MS-COCO datasets and assess their semantic consistency with the provided textual descriptions. As the generator models trained on the ImageNet dataset didn't converge, and the images created using these models seem distorted, we don't include them in the qualitative evaluation.
Figure 1 shows several images generated by the generator model trained on the CUB-200 dataset. More specifically, the images were created using the generator model, which consisted of four DF Blocks, followed by three SSACN Blocks. The images in each column correspond to the same textual description, which is displayed above the column.
Figure 1. Images generated by the generator model developed in this study, trained on the CUB-200 dataset [8]. Each of the three columns corresponds to one caption: “The bird has a yellow breast and belly as well as a small bill”, “This bird has a white and grey overall body color aside from its black head” and “This bird has a pointed bill, with an orange breast”.
The generated images are mostly semantically consistent with the given description. All three birds depicted in the first column have a yellow breast and belly and a relatively small bill. The birds look realistic, and the fine-grained details, such as the eyes, wings and bill, are outlined accurately. The birds in the second column also match the description well – all three have a white and grey body and a black head. The shape of the birds is natural and realistic. Lastly, the birds in the third column are of the right color and have a pointy bill, which is consistent with the description.
Although the images seem authentic and the portrayed birds all have adequate proportions and shape, some imperfections can be noticed. One that is common to most of the displayed images is the inaccuracy in depicting the bird's feet and talons. These parts of the images are often blurry. An example of this can be seen in the third image of the second column, where the branch that the bird is standing on appears distorted.
Figure 2 shows images created by the generator trained on the MS-COCO dataset. Same as before, the generator model consisted of four DF Blocks, followed by three SSACN Blocks.
Figure 2. Images generated by the same model architecture, trained on the MS-COCO dataset [9]. Each of the three columns contains three images for one caption: “A very large pizza covered in cheese and toppings”, “A giraffe in a field with trees in the background” and “A skier is in the snow going downhill”.
In the first column, the images of pizzas are all consistent with the description given. The depicted pizzas appear large and are covered in toppings. Also, apart from the slightly odd shape of the pizza in the second image, the images are of high quality and seem authentic. In the second column, the giraffes in the images also appear realistic. The long legs and neck, along with the skin patterns of dark brown spots and bright stripes, make the giraffes easily recognizable. In all three images, the trees are visible in the background, as the caption suggests. The images in the third column clearly show a person skiing. All three are of high quality and realistic, and fine-grained details such as the jacket of the person in the third image are depicted accurately. With a clear line between the snow and the sky, or forest in the case of the second image, the background of these images is also outlined very well. The images even show details such as ski traces and small bumps on the snow surface.
As for the imperfections in the images generated, a similar problem as in the case of CUB-200 images occurs. More specifically, the legs of the giraffes appear distorted, which is most visible in the third image of the second column. This flaw can also be seen in the first two images of the third column, where the arms and legs of the people skiing have odd shapes. The reason for this inability of our models to accurately portray arms and legs could be the complex structure of these body parts. The multiple joints in arms and legs enable a large variety of different positions and gestures, making it challenging for the generator model to correctly recognize a pattern. Even if the proportions of the arms and legs are portrayed correctly, their positioning seems unnatural and off-balance in most cases. This issue could possibly be resolved by expanding our datasets with diverse images of arms and legs in various positions [33].
5. Conclusion
In this work [1], we examined several different text-to-image architectures based on GANs and transformers. Using fully connected layers to compress the text embeddings into parameters of the standard affine transformation and applying channel-wise scaling and shifting operations, we achieved deep text-to-image fusion in the generator models. We also used a modified, semantic-spatial aware version of the affine transformation, which predicts a spatial mask in order to only fuse textual features into text-relevant subregions of the feature maps. By combining these two text-to-image fusion techniques, we constructed four different generator network architectures. The discriminator network, which is the same for all four generator network architectures, is a convolutional neural network that classifies input images as real or fake. In the GAN training process, the generator and discriminator models compete in a minimax optimization problem, where the objective of the generator is to model the distribution of real images, and the objective of the discriminator is to learn to separate real images from fake ones.
We trained four different GAN configurations on three large text-to-image datasets (CUB-200, MS-COCO and ImageNet), resulting in a total of 12 different generator-discriminator model pairs. Using the IS and FID, we evaluated the quality and variety of images created with the trained generator models, determined the generator architecture that proved to be most effective and compared the achieved results to some of the other existing research efforts in the field of text-to-image synthesis. Our generator model, consisting of four DF Blocks followed by three SSACN Blocks, achieved a decent IS score and a better FID score on the MS-COCO dataset than several other existing GAN-based text-to-image synthesis systems. Outperforming models such as AttnGAN, MirrorGAN and DF-GAN on that dataset, our model proved effective in generating diverse and high-quality images that are semantically consistent with the provided captions.
Our methodology has promising applications in fields requiring high-fidelity, text-guided image synthesis. For example, in digital content creation, our model can streamline the production of tailored visuals for advertising [34]. In education, it can generate illustrative diagrams from textual descriptions [35], enhancing learning materials [36]. Additionally, in virtual reality, our approach supports the creation of immersive environments by generating contextually relevant visuals, leveraging the demonstrated semantic consistency.
Generative AI
Generative AI (ChatGPT developed by OpenAI) was used in the writing process to improve the readability and language of the manuscript.

