Confidence-aware Reward Optimization for
Fine-tuning Text-to-Image Models

Kyuyoung Kim1 Jongheon Jeong21 Minyong An3
Mohammad Ghavamzadeh4 Krishnamurthy Dvijotham5 Jinwoo Shin1 Kimin Lee1
1KAIST 2Korea University 3Yonsei University 4Amazon 5Google DeepMind


Fine-tuning text-to-image models with reward functions trained on human feedback data has proven effective for aligning model behavior with human intent. However, excessive optimization with such reward models, which serve as mere proxy objectives, can compromise the performance of fine-tuned models, a phenomenon known as reward overoptimization. To investigate this issue in depth, we introduce the Text-Image Alignment Assessment (TIA2) benchmark, which comprises a diverse collection of text prompts, images, and human annotations. Our evaluation of several state-of-the-art reward models on this benchmark reveals their frequent misalignment with human assessment. We empirically demonstrate that overoptimization occurs notably when a poorly aligned reward model is used as the fine-tuning objective. To address this, we propose TextNorm, a simple method that enhances alignment based on a measure of reward model confidence estimated across a set of semantically contrastive text prompts. We demonstrate that incorporating the confidence-calibrated rewards in fine-tuning effectively reduces overoptimization, resulting in twice as many wins in human evaluation for text-image alignment compared against the baseline reward models.

1 Introduction↩︎

Large-scale text-to-image models have been successful in generating high-quality and creative images given text prompts as input [1][3]. However, current models have several weaknesses [4], including limited ability in compositional generation [5], inaccurate text rendering [6], and difficulty with spatial understanding [7]. Moreover, large-scale datasets used to train state-of-the-art text-to-image models often contain malicious content [8] and undesirable biases [9] that models can potentially learn from during training.

Learning from human feedback has emerged as a powerful method for addressing these limitations and aligning text-to-image models with human intent [9][11]. This method involves learning a reward function that approximates human objective (i.e., the true objective) from human feedback data, followed by optimizing the model against the learned reward to enhance alignment. However, optimizing too much against proxy objectives can hinder the true objective, a phenomenon commonly known as reward overoptimization [12].

In this work, we investigate the issue of overoptimization in text-to-image generation, evaluating several state-of-the-art reward models. To facilitate our evaluation, we introduce the Text-Image Alignment Assessment2 (TIA2) benchmark, a diverse compilation of text prompts, images, and human annotations. Our findings indicate that reward models fine-tuned on human feedback data, such as ImageReward [11] and PickScore [13], exhibit stronger correlations with human assessments compared to pre-trained models like CLIP [14]. Nevertheless, all of the models struggle to fully capture human preferences. As Figure 1 demonstrates, excessive optimization can compromise both text-image alignment and image fidelity. We show that this is particularly the case when the reward signal is poorly aligned with human judgment.

To alleviate overoptimization, we propose textual normalization (TextNorm), a simple method to enhance the alignment of reward models by calibrating rewards based on a measure of model confidence. Specifically, we consider a set of semantically contrastive prompts and adjust the reward conditioned on the input prompt relative to those conditioned on the contrastive prompts. The key idea is to leverage the relative comparison of the rewards as a measure of model confidence to calibrate rewards. In constructing the set of contrastive prompts, we show that both simple rule-based approaches and leveraging large language models (LLMs) such as ChatGPT [15] can be effective. We also propose and demonstrate that ensemble methods, which combine multiple reward models, can be used to achieve further improvement. Our experimental results demonstrate that TextNorm significantly enhances alignment with human judgment on the TIA2 benchmark. This improvement renders the fine-tuning of text-to-image models more robust against overoptimization, a conclusion supported by human evaluations.

Our main contributions are as follows:

  • We introduce a benchmark for evaluating reward models in text-to-image generation, providing key insights into the alignment of several state-of-the-art models with human judgment.

  • We empirically demonstrate the adverse effects of excessive optimization against learned reward models. Importantly, we show that overoptimization is conceivable, even for reward models trained on extensive human preference data.

  • We propose TextNorm, a simple method for enhancing alignment by calibrating rewards based on a measure of model confidence. Through extensive experiments, we demonstrate that our approach could substantially mitigate overoptimization.

Figure 1: Images generated using Stable Diffusion v2.1 [2] fined-tuned with CLIP [14], BLIP-2 [16], ImageReward (IR; [11]), and PickScore (PS; [13]). Both text-image alignment and image fidelity exhibit degradation when subjected to excessive optimization. Our proposed method (TextNorm) demonstrates its robustness against overoptimization, as illustrated in the last column.

2 Related Work↩︎

Text-to-image generation. Diffusion model [17], [18] is a family of generative models that have achieved state-of-the-art results across various domains, including image synthesis [19], 3D synthesis [20], and robotics [21]. Different conditioning mechanisms guide the diffusion process allowing, for instance, generation of images from textual descriptions. This has led to the development of many of the recent state-of-the-art text-to-image models [1], [2], [22]. However, generating images that fully respect the input prompt remains a challenge.

Evaluating text-to-image generation. Evaluation of text-to-image models often requires considerable human effort, prompting a rising interest in the development of automatic evaluation methods. Vision-language models (VLMs), such as CLIP [14] and BLIP [16], measure text-image alignment using the cosine similarity between their embeddings. Methods such as DALL-Eval [23] and VISOR [7] utilize object detection to determine whether the objects are present in the correct quantities and locations. ImageReward [11] and PickScore [13] fine-tune VLMs on a large set of human preference data to directly predict preference. In this work, we examine multiple of these models and propose a method for improving their alignment with human judgment.

Reward overoptimization. Training a reward model on human feedback data and fine-tuning LLMs on this reward signal has proven effective in aligning the models with human objectives [24], [25]. Similar methods have been applied to improve text-to-image models [9][11], enabling the models to generate more human-preferred images. However, excessive optimization against a reward model, which is an imperfect proxy, can degrade model quality [12], [26]. In this study, we present a comprehensive empirical demonstration of overoptimization in text-to-image generation and show that employing better calibrated rewards can effectively alleviate the issue.

3 TIA2: A Benchmark for Text-Image Alignment Assessment↩︎

3.1 Data collection and statistics↩︎

We collect 100 text prompts sampled from diverse sources, including DrawBench [1], PartiPrompt [3], ImageRewardDB [11], and Pick-a-Pic [13], to construct the comprehensive set. We additionally create synthetic prompts describing quantities of objects or combinations of two distinct objects to form the counting and composition sets. For the synthetic prompts, we utilize a total of 25 object classes, comprising 10 from CIFAR-10 [27] and 15 from MS-COCO [28]. The counting set contains prompts describing the quantity of objects from 1 to 6 for each object class, such as “three deer,” while the composition set comprises prompts describing combinations of two distinct object classes, such as “a cat and a dog.” For each prompt, we generate 50 images using Stable Diffusion v2.1 [2] to include in the benchmark. Table 2 provides basic statistics on the benchmark. More details on the dataset can be found in Appendix 8.

To evaluate text-image alignment, we gather binary feedback (good or bad) from three human annotators for each text-image pair. We consolidate annotators’ responses by assigning a good label if at least two out of three are positive; otherwise, we assign a bad label. Figure [fig:motiv95imgs] illustrates two sets of images, highlighting instances where the reward models fail to fully capture human assessment.

3.2 Benchmarking reward models↩︎

Problem setup. Our benchmark presents a dataset \(\mathcal{D} := \{(x_i, y_i, z_i)\}_{i=1}^{n}\) comprising triplets \((x, y, z)\) of a text prompt \(x \in \mathcal{X}\), an image \(y \in \mathcal{Y}\), and a binary human label \(z \in \{0, 1\}\) indicating whether \(x\) and \(y\) are semantically consistent (“1”) or not (“0”). The goal of reward modeling for predicting text-image semantic consistency is then to derive a function \(r({x}, {y}) \in \mathbb{R}\) based on which we can infer the label \(z\) for the prompt \(x\) and the image \(y\). In this view, we can frame reward modeling as a binary classification task and assess reward models based on their performance as binary classifiers of human labels. For instance, rewards computed using \(r\) can be converted to binary predictions based on a chosen threshold, and then compared to the human labels.

Figure 2: Sample images for which the reward models do not fully agree with human labels.

Evaluation metrics. We employ standard metrics for binary classification, including the Area Under the Receiver Operating Characteristic (AUROC) and the Area Under the Precision-Recall Curve (AUPRC), to evaluate the ability of reward models as classifiers in distinguishing between semantically consistent and inconsistent samples. These threshold-independent metrics are used to assess the overall alignment with human labels. When reward models are used as optimization objective, ensuring that the high-scoring samples align closely with human assessment becomes particularly important. To gauge this alignment, we use average precision (AP) at \(k \in \{ 5, 10, 25 \}\) to evaluate how well the rewards match the human labels for the top-\(k\) samples. Lastly, we consider rank correlation statistics, namely Spearman’s \(\rho\) and Kendall’s \(\tau\), as additional metrics to assess the similarity between the rewards and human labels. For a more fine-grained comparison, we aggregate the three individual human labels for each example. We convert a “good” label to 1, a “bad” label to 0, and an “inconclusive” label to 0.5. The score is then computed as the mean of these three numbers, and rank correlations are calculated between the rewards and these aggregate scores. We compute the metrics for each prompt using the 50 samples provided in the benchmark.

4 Confidence-calibrated Reward Design↩︎

In this section, we explore methods that enhance the alignment of reward models based on a measure of model confidence. We quantitatively assess a range of reward models on our benchmark, showing that even those fine-tuned on human feedback data often inadequately capture human preferences. Subsequently, we demonstrate the effectiveness of our method in comparison.

4.1 Textual normalization↩︎

Given a reward model \(r(x_0, y) \in \mathbb{R}\) that measures text-image semantic consistency between a text prompt \(x_0\) and an image \(y\), we propose a simple method that leverages a measure of reward model confidence for enhancing its alignment. Specifically, we consider a set of alternative prompts \(X\), and normalize the reward \(r(x_0, y)\) using the softmax function based on the rewards conditioned on these prompts. The idea is to view the relative comparison of the rewards as a measure of model confidence in \(r(x_0, y)\) and calibrate the reward accordingly. For instance, if the rewards conditioned on the alternative prompts are relatively close to \(r(x_0, y)\), it suggests that the model is less confident in \(r(x_0, y)\), prompting a more significant adjustment in the reward. This softmax-based approach, while simple, can be effective in estimating model confidence [29].

With access to all semantically distinct prompts, we could compute this model confidence precisely. However, computing the softmax function over this prompt set would involve an infinite sum. Instead, we approximate this sum by considering a finite set \(X\) of prompts each of which shares syntactic similarity with \(x_0\) while being semantically distinct. The rationale behind this approach is based on the hypothesis that prompts that differ from \(x_0\) in both syntax and semantics are unlikely to yield high reward values \(r(x, y)\), thus making negligible contributions to the softmax score. We empirically evaluate this hypothesis and confirm that it holds in practice. Using the set of prompts \(X = \{x_j\}_{j=1}^m\), we define the following score for the reward model \(r\): \[\label{eq:prompt95norm} r^{X}(x_0, y) := \frac{\exp\left(\tfrac{1}{\tau} \cdot r(x_0, y)\right)}{\exp\left(\tfrac{1}{\tau} \cdot r(x_0, y) \right) + \sum_{i=1}^m \exp\left(\tfrac{1}{\tau} \cdot r(x_i, y) \right)},\tag{1}\] where \(\tau > 0\) is a temperature scale.

To illustrate the types of prompts in \(X\), consider the text prompt \(x_0 = \textit{``a photo of two dogs''}\). We could include prompts such as “a photo of three dogs” and “a photo of three cats” that are syntactically similar to but semantically different from \(x_0\). In contrast, prompts such as “describe a colorful abstract painting” would not be as useful to be part of \(X\).

Prompt set synthesis via language models. An important step in implementing TextNorm involves creating contrastive prompts over which to normalize rewards. For simple input prompts, a rule-based approach can often be effective. For instance, when the input prompt describes a quantity of an object, prompts that describe various other quantities of the same object can be used. For more linguistically complex input prompts, we can leverage LLMs to generate contrastive prompts, providing appropriate few-shot examples to guide the generation:

“Create captions that are different from the original input used for the text-to-image generation model, referencing the provided failure cases ... [few-shot examples] ... [input prompt] ...”

When composing the LLM prompt, we analyze common failure scenarios of text-to-image models for the type of the input \(x_0\) and present their textual descriptions as few-shot examples. For instance, we observed that models often generate images containing incorrect quantities or types of objects compared to those described in \(x_0\). Incorporating these prompts into \(X\) results in better calibrated rewards, particularly when the image is indeed inconsistent with the input prompt. Also, depending on the nature of \(x_0\), we include few-shot examples considering other relevant aspects such as spatial relationship, color, and material. See Appendix 10 for further details on the use of LLMs.

Figure 3: TextNorm normalizes rewards over a set of contrastive prompts generated using an LLM. Combining an ensemble of normalized rewards of multiple models can further enhance alignment.

4.2 Reward model ensemble↩︎

On our benchmark, we found PickScore to be overall better aligned with human judgment than other baselines (see Section 4.3 for a more comprehensive evaluation). However, we noted instances where competitive baselines such as ImageReward showed stronger alignment, as shown in the examples in Appendix 8.4. Motivated by the observation, we propose combining an ensemble of reward models to further enhance alignment. Given a set of reward models \(\{ r_1, \dots, r_k \}\), we first apply TextNorm to derive corresponding rewards \(\{ r^{X}_1, \dots, r^{X}_k \}\) normalized over the set of prompts \(X\). This ensemble of normalized rewards can then be combined in several natural ways. See Figure 3 for an illustration of an overview of the method.

Mean ensemble. A method as simple as averaging can be effective, particularly when the reward models within the ensemble are of comparable quality. \[\label{eq:mean95ensemble} r_{\mu}^{X}(x_0, y) \mathrel{\vcenter{:}}= \frac{1}{k} \sum_i^k r^{X}_i(x_0, y),\tag{2}\] where \(x_0\) is the prompt, and \(y\) denotes the image.

Uncertainty-regularized ensemble. If the reward models in the ensemble disagree significantly, a penalty based on the variance of the normalized rewards can act as a form of regularization. That is, \[\label{eq:up95ensemble} r_{\lambda}^{X}(x_0, y) \mathrel{\vcenter{:}}= r_{\mu}^{X}(x_0, y) - \lambda \cdot \frac{1}{k} \sum_i^k \left( r^{X}_i(x_0, y) - r_{\mu}^{X}(x_0, y) \right)^2,\tag{3}\] where \(x_0\) is the prompt, \(y\) denotes the image, and \(\lambda\) is the weight of the variance penalty.

Note that while the ensemble methods above may visually resemble those proposed in the concurrent work of [30], we consider an ensemble of reward models not necessarily trained on identical data or with identical hyperparameters. Moreover, we combine rewards normalized according to our proposed TextNorm method and demonstrate the approach in the context of text-to-image generation, while [30] focuses on language modeling. Evaluating the effectiveness of various ensemble techniques across diverse domains would be an interesting future research.

4.3 Quantitative evaluation↩︎



Table 1: Ablation study results.
\(X\) \(\lambda\) AUROC AUPRC
0.764 0.669
Rand 0.641 0.540
LLM 0.780 0.704
LLM 0.822 0.752

We evaluate TextNorm using the metrics outlined in Section 3.2, averaged across prompts within each set in the benchmark, with the following four baselines: CLIP, BLIP-2, ImageReward, and PickScore. We use an ensemble of ImageReward and PickScore with appropriately constructed prompt sets which we release alongside the benchmark. To demonstrate both ensemble methods proposed, we use the mean ensemble for the composition set prompts and the uncertainty-regularized ensemble for prompts from the counting and comprehensive sets.

Table [table:metric95comparison] summarizes the quantitative results. We note that ImageReward and PickScore, which have been trained on considerable human feedback data, generally outperform other baseline models. Nevertheless, they exhibit weaker alignment, particularly on certain prompt types, such as those in the counting set. In comparison, TextNorm notably improves alignment across all three sets of prompts on the benchmark, as evidenced by improvements in all measured metrics.

We also conduct ablation studies to assess TextNorm based on the prompt set type used and whether an ensemble method is utilized. Table 1 summarizes the results in terms of AUROC and AUPRC averaged over all prompts in the benchmark, with PickScore as the baseline. The column \(X\) denotes the use of TextNorm and specifies whether it is applied, and if so, whether a random prompt set or a set constructed with an LLM is used. The column \(\lambda\) denotes whether an ensemble method is additionally used. The results indicate that (a) leveraging TextNorm with a suitable prompt set can enhance alignment, and (b) using an ensemble method can yield further improvement.

5 Experiments↩︎

In this section, we assess the effectiveness of TextNorm in mitigating overoptimization across three optimization methods: best-of-\(n\) sampling, supervised fine-tuning (SFT), and policy gradient-based reinforcement learning (RL). We first present qualitative comparisons in , followed by quantitative results from human evaluations in Section 5.4.

5.1 Setup↩︎

In the experiments, we use Stable Diffusion v2.1 (SD v2.1) as our base text-to-image model. For best-of-\(n\) sampling, we generate a set of images using this base model and then select a subset based on the reward models for comparison. For SFT and RL fine-tuning, we use low-rank adaptation (LoRA; [31]) to fine-tune SD v2.1 with the reward models and compare the images generated using the fine-tuned text-to-image models. We choose the earliest checkpoint at which the number of generated images with higher scores, as measured by the reward model, than those generated using SD v2.1 reaches the maximum. Hence, a better aligned reward model improves fine-tuning by providing (a) more aligned signals for optimization and (b) a better measure based on which to select checkpoints. We apply the same parameters used for the quantitative evaluation in Section 4.3 to TextNorm for these experiments. We select and use 30 prompts from each of the comprehensive, counting, and composition sets in our benchmark for best-of-\(n\) sampling, and 10 prompts from the 30 for the SFT and RL fine-tuning experiments. Further experimental details can be found in Appendix 7.

5.2 Best-of-n sampling↩︎

We begin by evaluating all reward models using best-of-\(n\) sampling, which is a simple yet effective inference-time algorithm that selects the optimal sample from a set of \(n\) candidates based on a given reward model. This allows us to assess the alignment of the rewards in isolation without involving fine-tuning. Specifically, for a given prompt, we generate a set of \(n \in \{16, 64, 256\}\) images using the text-to-image model and select the image with the highest reward.

Figure 4: Images sampled using best-of-\(n\) for \(n \in \{16, 64, 256\}\) with the five reward models.

Figure 4 illustrates best-of-\(n\) samples selected based on the five reward models. It highlights that selecting the best image from a larger pool of samples may be less desirable, especially when a poorly aligned reward model is used. Specifically, the quality of images chosen using BLIP-2 degrades as the value of \(n\) increases. Other reward models generally yield better results; however, with baseline models, some selected images often depict objects that are only partially visible (see ImageReward on the left), or are arguably less realistic for higher values of \(n\) (see CLIP on the right).

5.3 Fine-tuning with reward models↩︎


Supervised fine-tuning with RWR


RL fine-tuning with DDPO

Figure 5: Images generated using SD v2.1 fine-tuned with the reward models..

We further our evaluation by fine-tuning SD v2.1 using the reward models and comparing the images generated using the resulting models. We consider both SFT and RL-based fine-tuning.

Supervised fine-tuning. We adopt a setup similar to that in [10], using reward-weighted regression (RWR) with the reward model scores as weights for fine-tuning. For each text prompt, we generate 100 images, then select the top 10 based on the reward model to create \(\mathcal{D}^{\text{model}}\), and fine-tune SD v2.1 to maximize the reward-weighted likelihood of the data: \[\mathcal{J}(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{D}^{\tt{model}}} [r(x,y) \log{p_\theta}(y\mid x) + \beta \log{ \left( p_\theta(y \mid x)/p_{\theta_0}(y \mid x) \right)} ],\] where \(\beta\) is the coefficient for the Kullback-Leibler (KL) regularizer used to prevent excessive deviation from the original model \(p_{\theta_0}\) [9]. We experiment with multiple KL coefficients and select the one that achieves the most improvement.

RL fine-tuning. We also consider RL-based fine-tuning, where we iteratively collect data using the current policy and perform optimization, instead of fitting a fixed distribution. We use the denoising diffusion policy optimization (DDPO; [26]), where the denoising process is treated as a multi-step Markov decision process, allowing fine-tuning of models using policy gradient methods. Given a reward model \(r\) and a prompt distribution \(p(x)\), the following objective is maximized: \[\mathcal{J}(\theta) = \mathbb{E}_{x \sim p(x), y \sim p_\theta(y \mid x)} [r(x, y) + \lambda(\theta_0, \theta, x, y)],\] where \(\lambda\) is a divergence-based regularizer to penalize deviating too far from the original model \(\theta_0\).

Figure 5 shows sample images generated using the fine-tuned models. Common text-image alignment issues we observe with the models include: (a) missing an object (upper row 1, column 2), (b) containing an unmentioned object (lower row 2, column 2), (c) depicting incorrect quantities of objects (upper rows 1 and 2, column 3), and (d) disregarding the relations between objects as described in the prompt (lower row 1, column 5). TextNorm, which aligns more closely with human judgment, provides improved rewards for optimization and a more reliable metric for selecting checkpoints, leading to images with better text-image alignment.


Supervised fine-tuning with RWR


RL fine-tuning with DDPO

Figure 6: Human evaluation results. We compare the models fine-tuned with TextNorm against SD v2.1 (Original) and those fine-tuned with ImageReward (IR) and PickScore (PS). TextNorm consistently achieves significantly better alignment with comparable image quality in the case of SFT, though with a slightly greater sacrifice in the case of RL..

5.4 Human evaluation↩︎

For human evaluation, we generate eight images per prompt using the fine-tuned models and create two sets of four images, resulting in a total of 60 sets evaluated by three independent annotators. Figure 6 reports the overall win, tie, and loss rates of the models fine-tuned with TextNorm compared to SD v2.1 and those fine-tuned with ImageReward and PickScore. For SFT, TextNorm achieves approximately twice as many wins for text-image alignment, with comparable or even slightly better image quality in all three comparisons. For RL, TextNorm achieves an even more dramatic improvement in text-image alignment but at the expense of image quality. We suspect that this is because in constructing the benchmark data, which are used to tune the parameters of TextNorm, the annotators primarily considered alignment in their assessment. As RL is a more powerful optimization algorithm, it possibly optimized more for alignment at the expense of image quality. Considering multiple objectives, such as alignment and quality, and leveraging the corresponding reward models in fine-tuning would be a promising approach to achieving a better balance among the objectives.

The complete human evaluation results for the best-of-\(n\) sampling experiment, which also demonstrate the effectiveness of TextNorm, are provided in Appendix 9.

6 Conclusion↩︎

Fine-tuning models on a reward function trained on human feedback data has emerged as a promising method for aligning models with human intent. However, excessive optimization can degrade model performance. In this work, we introduce the TIA2 benchmark to assess several state-of-the-art reward models in text-to-image generation. We find that the reward models, even those trained on human data, are often not well-aligned with human assessment. We demonstrate that overoptimization occurs particularly when poorly aligned reward models are used for fine-tuning. To address this issue, we introduce TextNorm, a simple method that enhances alignment through reward calibration using semantically contrastive prompts. We demonstrate both quantitatively and qualitatively that the confidence-calibrated scores effectively reduces overoptimization.

Ethics Statement↩︎

Generative models for image synthesis, like many machine learning models in general, are susceptible to learning biases inherent in the training data [32]. While fine-tuning pre-trained models with a suitable reward can enhance the models, overoptimzing against an imperfect proxy objective can instead degrade model quality, as we investigate in depth in this work. Hence, it is important to understand the limitations of both the pre-trained models and the rewards with which they are fine-tuned. Our work is a timely exploration of this issue in text-to-image generation, examining various state-of-the-art reward models and reporting their limitations as both evaluation metrics and training objectives. We also introduce a simple method for better aligning reward models with human intent. While we demonstrate the effectiveness of our method in reducing overoptimization, the implementation relies on external models such as an LLM and a VLM. Therefore, careful selection of these models is crucial, as the success of the implementation hinges on their quality.

Reproducibility Statement↩︎

We provide implementation details including the experiment setup, the models used, and the training specifications (e.g., hyperparameters) in Section 5 and in Appendix 7.


We thank Seunghwan Lee, Sojeong Kim, Dongjun Lee, Changyeon Kim, Jongjin Park, Yisol Choi, Hyunsub Jeong, Junesuk Choi, and Younghyun Kim for their help with labeling the data of the main experiments of this work.

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST); No.2021-0-02068, Artificial Intelligence Innovation Hub; No.2022-0-00959, Few-shot Learning of Casual Inference in Vision and Language for Decision Making).

7 Experimental Details↩︎

7.1 Baseline reward models↩︎

We use publicly available implementations of the baseline reward models. For CLIP, we use the CLIP ViT-B/32 model from the official CLIP implementation provided by OpenAI.3 For BLIP-2, we use the pretrain version of the blip2 model from the Salesforce LAVIS library.4 For ImageReward5 and PickScore,6 we use the official models provided by the authors of the papers, which are based on the BLIP and CLIP ViT-H/14 architectures, respectively.

7.2 Fine-tuning text-to-image models↩︎

Our fine-tuning implementations are based on publicly available Python libraries. Specifically, we used the diffusers library7 for the SFT experiments and the trl library8 for the RL experiments. For SFT, we used stabilityai/stable-diffusion-2-1 as our base model and fine-tuned it for up to 1,000 steps. We experimented with KL coefficients ranging from 0.00 to 1.00, in increments of 0.25, and selected the one that yielded the most improvement. For RL, we used stabilityai/stable-diffusion-2-1-base as our base model and fine-tuned it for up to 500 epochs over the training prompts. Note that we have conducted minimal hyperparameter tuning and, in many cases, adopted the default values provided by the library code.

l|l*2|Y & Parameters & RWR & DDPO
Diffusion & Denoising steps & 50 & 50
& Guidance scale & 7.5 & 5.0
Optimization & Optimizer & AdamW & AdamW
& Learning rate & 1e-5 & 3e-4
& Weight decay & 1e-2 & 1e-4
& \(\beta_1\) & 0.9 & 0.9
& \(\beta_2\) & 0.999 & 0.999
& \(\epsilon\) & 1e-8 & 1e-8
& Max gradient norm & 1.0 & 1.0
Training & Batch size & 32 & 64
& Samples per iteration & - & 256
& Gradient updates per iteration & - & 1
& Mixed precision & fp16 & fp16

7.3 Parameters for TextNorm↩︎

The TextNorm score used for both the quantitative analysis in Section 4.3 and the experiments in Section 5 is derived from an ensemble of ImageReward and PickScore. To demonstrate both ensemble methods discussed in Section 4.2, we employed the mean ensemble for the composition set prompts and the uncertainty-regularized ensemble for prompts from the counting and comprehensive sets. We only lightly tuned the temperature parameter and the uncertainty penalty coefficient based on the average of the three AP metrics considered in this work.

8 Details on the TIA2 Benchmark↩︎

8.1 Examples and statistics↩︎

Table [table:benchmark95examples] provides sample prompts from each of the three sets in the benchmark.


Table [benchmark:objects] reports the complete set of object classes used to create the synthetic prompts. Table [table:benchmark95open95stats] provides further statistics on the subcategories of the comprehensive set.


Statistics on each subcategory of the comprehensive set in TIA2.

8.2 Labeling procedure↩︎

For each text-image pair in the benchmark, we asked three human annotators to assign a binary label indicating whether the text and image are semantically aligned, or to indicate that the assessment is inconclusive. We asked the annotators to assess text-image alignment first and foremost, and to consider image fidelity at their discretion if it was noticeable enough to affect alignment. Table 2 shows an excerpt of the labeling instructions we provided to the annotators for the task. Figure 7 is a screenshot of the labeling interface the annotators used.

Table 2: Excerpt from the labeling instructions given to annotators.
Options = {1: Good, 2: Bad and 3: Skip (hard to answer)}
(1) If the provided prompt aligns well with the image, select 1; otherwise, select 2.
(2) If the quality of the image is so poor that it affects the alignment, select 2.
(3) If you feel you mislabeled something, you can relabel it.

Figure 7: Screenshot of the labeling interface.

8.3 Additional analysis↩︎

We conduct additional analysis on the reward models, reporting the average AUROC for each synthetic prompt included in the benchmark. Since each synthetic prompt corresponds to either a pair of two object classes or a combination of an object class and a count, this analysis offers deeper insights into how well the reward models align with human judgment, depending on the type and quantity of object classes considered in the dataset.

In the tables below, we present heat maps displaying the per-prompt average AUROC values. Cells with a value of 0.75 remain uncolored, while those with higher values are shaded in varying intensities of blue, indicating higher scores. Conversely, cells with lower values are shaded in varying intensities of red, denoting lower scores. As the heat maps suggest, both ImageReward and PickScore generally attain higher AUROC values compared to CLIP and BLIP-2, with TextNorm consistently outperforming all baselines.











8.4 Reward model scores on sample images↩︎

While PickScore generally outperforms the other baselines, there are instances where those baselines exhibit stronger alignment as shown in Figure 8. This observation naturally led to considering ensemble methods to achieve further improvement.



Figure 8: Reward model scores on sample images and prompts.. a — image, b — image

9 More Human Evaluation Results↩︎

For the best-of-\(n\) sampling experiment, annotators evaluate the top four images selected from the \(n\) samples based on the reward models for each prompt. As illustrated by the evaluation results summarized in Figure 9, TextNorm consistently outperforms all baselines in terms of text-image alignment, often by a substantial margin, across all three values of \(n\). Especially compared to CLIP and BLIP-2, TextNorm achieves notably better text-image alignment as well as image quality. Even when compared to ImageReward and PickScore, TextNorm achieves two to three times as many wins as losses in alignment, with comparable or only minor compromises in image quality.







Figure 9: Results of human evaluation of the images selected using best-of-\(n\) sampling..

10 Details on Prompt Synthesis via LLMs↩︎

Algorithm 10 outlines detailed pseudocode for instructing ChatGPT to generate a contrastive prompt set based on a given input prompt \(x_0\). Depending on the nature of \(x_0\), we assign it appropriate categories and provide an optional list of few-shot examples to guide generation through in-context learning. For example, given the text prompt “A black colored car”, we categorize it under colors and counting and provide few-shot examples that encompass common failure cases for the two categories. In our experiments, we use the gpt-4-0613 model for contrastive prompt generation, configuring the temperature to \(0.0\) and the frequency penalty to \(0.2\). These hyperparameters affect the generation process based on the token frequencies.

Table 3 presents the few-shot examples considered in our experiments for eight distinct categories: text, style, composition, counting, creative, location, colors, and spatial. To generate diverse and high-quality prompts, we vary the quantity and type of objects described in the input prompt, along with other relevant properties like spatial relations.

Figure 10: Prompt set synthesis via ChatGPT

Table 3: Few-shot examples for each category of prompts used for in-context learning.
Input: “A sign that says ‘Diffusion’."
Output: [“A sign misspelled as ‘Difision’.",”A sign containing a bizarre accent ‘Difśion’."]
Category: Style
Input: “Greek statue of a man tripping over a cat.”
Output: [“Greek statue of a man",”Greek statue of two men",“Greek statue of two men tripping over a cat.",”Greek statue of a man tripping over a dog.",“Greek statue of two men tripping over a dog.",”Greek statue of a man tripping over a ball."]
Category: Composition
Input: “A red car and a white sheep.”
Output: [“A red car and two white sheep.”,“A red car and a herd of sheep.”,“A red car and a dominant white sheep among the gray ones.”,“A red car with two real sheep on the side.”]
Category: Counting
Input: “Four cars on the street.”
Output: [“Two cars on the street.”,“Three cars on the street.”,“Five cars on the street.”,“Six cars on the street.”]
Category: Creative
Input: “A heart made of chocolate”
Output: [“a star made of caramel”,“a flower made of marshmallows,”,“a diamond made of gummy bears”,“a moon made of licorice”,“a sun made of jelly beans”,“a butterfly made of lollipops”,“a crown made of cotton candy”,“a rainbow made of skittles”,“a cloud made of cotton candy”,“a tree made of chocolate-covered pretzels”,“a snowflake made of peppermint candies”,“a fish made of sour gummies”,“a bird made of chocolate-covered almonds”,“a car made of chocolate bars”,“a house made of chocolate cookies”,“a boat made of chocolate-covered strawberries”,“an airplane made of chocolate truffles”,“a guitar made of chocolate wafer sticks”,“a camera made of chocolate coins”,“a dinosaur made of chocolate eggs”]
Category: Location
Input: “A glowing mushroom in the forest”
Output: [“a sparkling flower in the garden",”a luminous firefly in the night sky",“a shimmering starfish in the ocean",”a radiant sunflower in the field",“a glowing jellyfish in the deep sea",”a gleaming diamond in the jewelry store",“a luminescent moon in the night sky",”a glowing firefly in the meadow",“a sparkling gemstone in the cave",”a luminous butterfly in the garden",“a shimmering seashell on the beach",”a radiant rainbow in the sky",“a glowing lantern in the dark",”a luminescent lightning bug in the field",“a sparkling crystal in the cave",”a shimmering waterfall in the forest",“a radiant star in the night sky",”a glowing firefly in the park",“a luminous pearl in the oyster",”a sparkling diamond in the jewelry box"]
Category: Colors
Input: “A blue bird and a brown bear.”
Output: [“A pair of bears, one blue and the other brown.”,“A blue bear and a brown bear, no bird in sight.”,“Two bears, both brown, no blue bird.”,“Two bears, one brown and the other unexpectedly blue.”]
Category: Spatial
Input: “An umbrella on top of a spoon.”
Output: [“A spoon.”,“An umbrella.”,“An umbrella on the right of a spoon.”,“An umbrella on the left of a spoon.”,“An umbrella at the bottom of a spoon.”,“Two umbrellas on top of a spoon.”]


Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. In Advances in Neural Information Processing Systems, 2022.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Conference on Computer Vision and Pattern Recognition, 2022.
Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan, exander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich text-to-image generation. Transactions on Machine Learning Research, 2022. ISSN 2835-8856.
Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. : Accurate and interpretable text-to-image faithfulness evaluation with question answering. In International Conference on Computer Vision, 2023.
Weixi Feng, Xuehai He, Tsu-Jui Fu, Varun Jampani, Arjun Akula, Pradyumna Narayana, Sugato Basu, Xin Eric Wang, and William Yang Wang. Training-free structured diffusion guidance for compositional text-to-image synthesis. In International Conference on Learning Representations, 2023.
Rosanne Liu, Dan Garrette, Chitwan Saharia, William Chan, Adam Roberts, Sharan Narang, Irina Blok, RJ Mical, Mohammad Norouzi, and Noah Constant. Character-aware models improve visual text rendering. arXiv preprint arXiv:2212.10562, 2022.
Tejas Gokhale, Hamid Palangi, Besmira Nushi, Vibhav Vineet, Eric Horvitz, Ece Kamar, Chitta Baral, and Yezhou Yang. Benchmarking spatial relationships in text-to-image generation. arXiv preprint arXiv:2212.10015, 2022.
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. : Open dataset of CLIP-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. : Reinforcement learning for fine-tuning text-to-image diffusion models. In Advances in Neural Information Processing Systems, 2023.
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. arXiv preprint arXiv:2302.12192, 2023.
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. : Learning and evaluating human preferences for text-to-image generation. In Advances in Neural Information Processing Systems, 2023.
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, 2023.
Yuval Kirstain, Adam Polyak, Uriel Singer, Shahbuland Matiana, Joe Penna, and Omer Levy. Pick-a-Pic: An open dataset of user preferences for text-to-image generation. In Advances in Neural Information Processing Systems, 2023.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021.
OpenAI. Introducing ChatGPT., 2022.
Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. -2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, 2021.
Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3D using 2D diffusion. In International Conference on Learning Representations, 2023.
Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Robotics: Science and Systems, 2023.
Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 2022.
Jaemin Cho, Abhay Zala, and Mohit Bansal. : Probing the reasoning skills and social biases of text-to-image generative models. In International Conference on Computer Vision, 2023.
Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning. arXiv preprint arXiv:2305.13301, 2023.
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. University of Toronto, 2009.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, 2014.
Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. arXiv preprint arXiv:2310.02743, 2023.
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. : Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.

  1. Work done at KAIST.↩︎

  2. The benchmark data is available at↩︎