Fine-Grained Semantically Aligned
Vision-Language Pre-Training

Juncheng Li\(~\textsuperscript{\rm 1, 2}\) Xin He\(~\textsuperscript{\rm 2}\) Longhui Wei\(~\textsuperscript{\rm 2}\) Long Qian\(~\textsuperscript{\rm 1}\) Linchao Zhu\(~\textsuperscript{\rm 1}\) Lingxi Xie\(~\textsuperscript{\rm 2}\) Yueting Zhuang\(~\textsuperscript{\rm 1}\) Qi Tian\(~\textsuperscript{\rm 2}\) Siliang Tang\(~\textsuperscript{\rm 1}\)
\(~\textsuperscript{\rm 1}\) Zhejiang University, \(~\textsuperscript{\rm 2}\) Huawei Cloud
{junchengli, qianlong0926, yzhuang, siliang}@zju.edu.cn
{hexin80, weilonghui1, tian.qi1}@huawei.com
{zhulinchao7, 198808xc}@gmail.com


Abstract

Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks. Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts, or advanced cross-modal attention upon image and text features. However, they fail to explicitly learn the fine-grained semantic alignment between visual regions and textual phrases, as only global image-text alignment information is available. In this paper, we introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions. To efficiently compute the game-theoretic interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Experiments show that LOUPE achieves state-of-the-art performance on a variety of vision-language tasks. Furthermore, without any object-level human annotations and fine-tuning, LOUPE achieves competitive performance on object detection and visual grounding. More importantly, LOUPE opens a new promising direction of learning fine-grained semantics from large-scale raw image-text pairs. The repository of this work is at https://github.com/YYJMJC/LOUPE.

1 Introduction↩︎

Learning transferable cross-modal representations from large-scale vision-language pre-training has exhibited remarkable performance on a wide variety of downstream tasks. Most existing works can be classified into two categories: dual-encoder and fusion-encoder. The dual-encoder methods [1][4] adopt two separate encoders to embed images and texts, and model the cross-modal alignment by the cosine similarity between the global features of images and texts. While such an architecture is efficient for large-scale image-text retrieval, since image and text representations can be pre-computed offline, it fails to model fine-grained semantic alignment between visual regions and textual phrases. On the other hand, the fusion-encoder methods [5][12] attempt to use a single multi-modal encoder to jointly model the concatenated sequence of images and texts. These methods simulate soft alignment via advanced cross-modal attention [13]. However, they can only learn implicit alignment through end-to-end training, lacking explicit supervision to encourage semantic alignment between visual regions and textual phrases, and the learned cross-modal attention matrices are often scattered and uninterpretable. Further, they are inefficient for retrieval because every image-text pair must be jointly encoded during inference.

Learning fine-grained semantic alignment from image-text pre-training is crucial to many cross-modal reasoning tasks (e.g., visual grounding [14], image captioning [15]), but it is particularly challenging because the alignment information between visual regions and textual phrases is not available, making fine-grained semantic alignment learning a weakly-supervised problem. In this paper, we address this problem while simultaneously maintaining high retrieval efficiency by proposing LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, from the novel perspective of game theory. We formulate input patch and word tokens as players in a cooperative game and quantify game-theoretic interactions (i.e., Shapley interaction [16], [17]) among them to investigate semantic alignment. LOUPE learns fine-grained semantic alignment in two stages: token-level Shapley interaction modeling and semantics-level Shapley interaction modeling, where we first learn to identify semantic regions of images that correspond to semantically meaningful entities, and then align these regions with phrases in the paired text.

Specifically, token-level Shapley interaction modeling aims to group patch tokens of images into semantic regions that semantically correspond to visual instances. From the game-theoretic view, we take patch tokens as players and the similarity score between images and texts as the game function. Intuitively, if a set of patch tokens corresponds to a visual instance in the image, these tokens tend to interact strongly to form the complete semantics of that instance, which contributes to a better similarity judgment with the paired text. Based on this insight, we take the token-level Shapley interaction as soft supervision labels to encourage the model to capture semantic regions from images. Then, semantics-level Shapley interaction modeling infers the fine-grained semantic alignment between semantic regions and phrases. We consider every region and phrase as players and define a fine-grained similarity score as the game function. If a region and a phrase have strong correspondence, they tend to interact with each other and contribute to the fine-grained similarity score. By measuring the Shapley interaction between each region-phrase pair, we obtain the alignment information to guide the pre-training model. As computing the exact Shapley interaction is NP-hard [18], existing methods mainly employ a sampling-based method [19] to obtain unbiased estimates. However, as the number of players grows, such methods require thousands of model evaluations. To reduce the computational cost, we further propose an efficient hybrid Shapley interaction learning strategy, in which an uncertainty-aware neural Shapley interaction learning module cooperates with the sampling-based method. Experimental results show that our hybrid strategy significantly reduces the computational cost while maintaining estimation accuracy; see Section 4.5 for further analysis.

Our framework serves as a proxy training objective that explicitly establishes the fine-grained semantic alignment between local region and phrase representations. This proxy objective can be directly removed for downstream tasks, rendering an efficient and semantics-sensitive dual-encoder model. Experiments show that LOUPE achieves new state-of-the-art on image-text retrieval benchmarks. For text-to-image retrieval on MSCOCO, LOUPE surpasses its strongest competitor by 4.2% on recall@1. Further, without any fine-tuning, LOUPE successfully transfers to object detection and visual grounding tasks in a zero-shot manner. For object detection, it achieves 12.1% mAP on COCO and 19.5% mAP on PASCAL VOC. For visual grounding, it achieves 26.8% accuracy on RefCOCO and 23.6% accuracy on RefCOCO+. Our contributions are summarized as follows:

  • We propose LOUPE that explicitly learns fine-grained semantic alignment between visual regions and textual phrases while preserving the high retrieval efficiency of dual-encoder.

  • We introduce an efficient and effective hybrid Shapley interaction learning strategy, based on an uncertainty-aware neural Shapley interaction learning module and a sampling-based method.

  • Pre-trained on image-text data, LOUPE achieves new state-of-the-art on image-text retrieval and successfully transfers to the tasks that require more fine-grained object-level visual understanding (i.e., object detection and visual grounding) without any fine-tuning.

  • As manually annotating the vast number of object categories is time-consuming and unscalable, our work demonstrates a promising alternative, that is, learning fine-grained semantics from raw text about images, which is easily available and covers a broader set of visual concepts.

2 Related Work↩︎

Vision-Language Pre-Training. The great success of the pre-train-and-fine-tune paradigm in natural language processing [20], [21] and computer vision [22][24] has been expanded to the joint domain of vision and language [25][27]. Dominant vision-language pre-training models can be categorized into two groups: dual-encoder and fusion-encoder. The dual-encoder methods [1][4] adopt two individual encoders to embed images and texts separately, and model the cross-modal interaction by cosine similarity. Such an architecture is efficient for large-scale image-text retrieval as image and text representations can be pre-computed offline. However, simply measuring the cosine similarity between global representations is too coarse to capture fine-grained semantic relationships between regions and phrases. The fusion-encoder methods [5][12], [28][32] adopt a single multi-modal encoder to jointly model the concatenated sequence of images and texts, which achieves deeper cross-modal interaction. However, these methods are less efficient because images and texts are intertwined to compute the cross-modal attention and cannot be pre-computed offline. Further, there are no explicit supervision signals to encourage the alignment between regions and phrases. Some works [5], [8], [9], [11], [30][33] leverage an off-the-shelf object detector to extract object features for pre-training. However, the detector is usually pre-trained on limited object categories. Furthermore, given the excessive memory and computation demands, existing methods usually fix the parameters of the detection model and regard region detection as a pre-processing step, disconnected from vision-language pre-training, so the performance is also restricted by the quality of the detection model. FILIP [4] uses a token-wise maximum similarity to enhance the cross-modal interaction of dual-encoder methods. To learn explicit fine-grained semantic alignment, GLIP [34] and X-VLM [35] utilize human-annotated datasets, where regions with bounding-box annotations are aligned with text descriptions. Such annotation is time-consuming and hard to scale to larger raw image-text corpora from the Internet. In contrast, our proposed framework explicitly learns fine-grained semantic alignment from raw image-text data while maintaining the high efficiency of a dual-encoder. Detailed discussions can be found in Appendix K.

Shapley Values. The Shapley value [17] was initially introduced in game theory. It has been theoretically proven to be the unique metric that fairly estimates the contribution of each player in a cooperative game while satisfying certain desirable axioms [36]. With this solid theoretical foundation, the Shapley value has recently been studied as a post-hoc explanation method for deep neural networks (DNNs) [37][39]. Lundberg et al. [38] propose a unified attribution method based on the Shapley value to interpret the predictions of DNNs. Ren et al. [40] propose to explain adversarial attacks by the Shapley value. In this paper, we propose to model fine-grained semantic alignment by game-theoretic interactions, along with an efficient Shapley interaction learning strategy.

3 Method↩︎

In this section, we first introduce the problem formulation of fine-grained semantically aligned vision-language pre-training in Section 3.1. Then, we propose the corresponding LOUPE framework for fine-grained semantic alignment learning in Section 3.2 and an efficient approach for Shapley interaction learning in Section 3.3.

3.1 Problem Formulation and Model Overview↩︎

Generally, vision-language pre-training aims to learn an image encoder \(f_\mathrm{I}\) and a text encoder \(f_\mathrm{T}\) by cross-modal contrastive learning, where matched image-text pairs are pulled closer and mismatched pairs are pushed apart. Let \(f_\mathrm{I}(I_i)\) and \(f_\mathrm{T}(T_i)\) denote the global representations of the image and text. Then the cross-modal contrastive loss can be formulated as:

\[\mathcal{L}_\mathrm{CMC} = - \log\frac{\exp(f_\mathrm{I}(I_i)^\top f_\mathrm{T}(T_i)/\tau)}{\sum_{j=1}^B \exp(f_\mathrm{I}(I_i)^\top f_\mathrm{T}(T_j)/\tau)} - \log\frac{\exp(f_\mathrm{I}(I_i)^\top f_\mathrm{T}(T_i)/\tau)}{\sum_{j=1}^B \exp(f_\mathrm{I}(I_j)^\top f_\mathrm{T}(T_i)/\tau)}\] where \(B\) is the batch size and \(\tau\) is the temperature hyper-parameter.
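To make the objective concrete, here is a minimal PyTorch sketch of this symmetric contrastive loss; the encoder outputs, batch size, and temperature value are placeholders rather than the released implementation.

```python
import torch
import torch.nn.functional as F

def cmc_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric cross-modal contrastive loss (L_CMC).

    img_emb, txt_emb: L2-normalized [CLS_I] / [CLS_T] features of shape (B, d),
    where the i-th image and the i-th text form a matched pair.
    """
    logits = img_emb @ txt_emb.t() / tau                  # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image-to-text term
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text-to-image term
    return loss_i2t + loss_t2i
```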

While intuitive, such a manner can only learn coarse alignment between images and texts but fails to explicitly capture the fine-grained semantic alignment between visual regions and textual phrases. To learn fine-grained semantic alignment while simultaneously maintaining high retrieval efficiency, we propose LOUPE, a fine-grained semantically aligned vision-language pre-training framework that germinates from cooperative game theory.

As illustrated in Figure 1, LOUPE learns fine-grained semantic alignment from two stages: token-level Shapley interaction modeling and semantics-level Shapley interaction modeling. For token-level Shapley interaction modeling, we learn to aggregate patch tokens of images into semantic regions that semantically correspond to some visual concepts, under the guidance of token-based semantic aggregation loss \(\mathcal{L}_\mathrm{TSA}\). As for semantics-level Shapley interaction modeling, the semantic alignment between the aggregated regions and textual phrases is learned, supervised by the fine-grained semantic alignment loss \(\mathcal{L}_\mathrm{FSA}\). Combined with the two newly proposed losses, the full objective of fine-grained semantically aligned vision-language pre-training can be formulated as:

\[\mathcal{L} = \mathcal{L}_\mathrm{CMC} + \mathcal{L}_\mathrm{TSA} + \mathcal{L}_\mathrm{FSA}\]

Such a new pre-training objective enforces the image encoder to capture semantic regions and establishes fine-grained semantic alignment between visual regions and textual phrases. During inference, it can be directly removed, rendering an efficient and semantics-sensitive dual-encoder.

Figure 1: Overview of LOUPE. Our framework serves as a proxy training objective that encourages the image encoder to capture semantic regions and establishes the semantic alignment between region and phrase representations. The proxy training objective can be easily removed for downstream tasks, rendering an efficient and semantics-sensitive dual-encoder.

3.2 Interpreting Fine-Grained Semantic Alignment as Game-Theoretic Interaction↩︎

3.2.1 Preliminaries↩︎

Shapley Values. The Shapley value [17] is a classic game theory solution for the unbiased estimation of the importance or contribution of each player in a cooperative game. Consider a game with \(\mathcal{N} = \{1, ..., n\}\) players, where \(\mathcal{S} \subseteq \mathcal{N}\) denotes a potential subset of players. A game \(v(\cdot)\) is implemented as a function that maps each subset \(\mathcal{S}\) of players to a score, modeling the outcome of the game when the players in \(\mathcal{S}\) participate. Specifically, \(v(\mathcal{N}) - v(\varnothing)\) denotes the contribution obtained by all players in the game. The Shapley value \(\phi(i|\mathcal{N})\) for player \(i\) is defined as the average marginal contribution of player \(i\) to all possible coalitions \(\mathcal{S}\) that are formed without \(i\):

\[\phi(i|\mathcal{N}) = \sum_{\mathcal{S} \subseteq \mathcal{N}\setminus \{i\} } p(\mathcal{S}) [v(\mathcal{S}\cup\{i\}) - v(\mathcal{S})], \quad p(\mathcal{S}) = \frac{|\mathcal{S}|!(|\mathcal{N}| - |\mathcal{S}| - 1)!}{|\mathcal{N}|!} \label{e3}\tag{1}\] where \(p(\mathcal{S})\) is the likelihood of \(\mathcal{S}\) being sampled. The Shapley value has been proved to be the unique metric that satisfies the following axioms: Linearity, Symmetry, Dummy, and Efficiency [36]. We summarize these axioms in Appendix B.
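As an illustration of Equation 1, the brute-force sketch below enumerates all coalitions of a toy game to compute exact Shapley values. It is exponential in the number of players and is only meant to make the definition concrete; it does not reflect how LOUPE estimates interactions.

```python
from itertools import combinations
from math import factorial

def shapley_value(players, i, v):
    """Exact Shapley value of player `i` (Equation 1) for a small game.

    `players` is a list of player ids; `v` maps a frozenset of players to a
    game score. Exponential in len(players), so only usable for toy examples.
    """
    others = [p for p in players if p != i]
    n = len(players)
    phi = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            S = frozenset(subset)
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (v(S | {i}) - v(S))
    return phi

# Toy game: the score is the number of participating players, plus a bonus
# of 2 when players 0 and 1 cooperate.
v = lambda S: len(S) + (2.0 if {0, 1} <= S else 0.0)
print(shapley_value([0, 1, 2], 0, v))  # 2.0: players 0 and 1 share the bonus
```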

Shapley Interaction. In game theory, some players tend to form a coalition and always participate in the game together. The players in the coalition might interact or cooperate with each other, which brings an additional contribution to the game. The Shapley interaction [16] measures this additional contribution of the coalition compared with the case where the players work individually. For a coalition \(\mathcal{S}\), we consider \([\mathcal{S}]\) as a single hypothetical player, which is the union of the players in \(\mathcal{S}\). The reduced game is then formed by removing the individual players in \(\mathcal{S}\) from the game and adding \([\mathcal{S}]\) to the game. The Shapley value \(\phi([\mathcal{S}]|\mathcal{N} \setminus \mathcal{S} \cup \{[\mathcal{S}]\})\) for player \([\mathcal{S}]\) can be computed using Equation 1 over the reduced game. Similarly, we can obtain \(\phi(i|\mathcal{N} \setminus \mathcal{S} \cup \{i\})\), where \(i\) is an individual player in \(\mathcal{S}\). Finally, the Shapley interaction for coalition \(\mathcal{S}\) is formulated as:

\[\mathfrak{I}([\mathcal{S}]) = \phi([\mathcal{S}]|\mathcal{N} \setminus \mathcal{S} \cup \{[\mathcal{S}]\}) - \sum_{i \in \mathcal{S}} \phi(i|\mathcal{N} \setminus \mathcal{S} \cup \{i\}) \label{e4}\tag{2}\]

In this way, \(\mathfrak{I}([\mathcal{S}])\) reflects the interactions inside \(\mathcal{S}\). A higher value of \(\mathfrak{I}([\mathcal{S}])\) indicates that the players in \(\mathcal{S}\) cooperate closely with each other.
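Continuing the toy example, Equation 2 can be evaluated by treating the coalition as a single hypothetical player in a reduced game; the helper below reuses the `shapley_value` function from the previous sketch.

```python
def shapley_interaction(players, coalition, v):
    """Shapley interaction I([S]) of `coalition` (Equation 2).

    `coalition` participates as one hypothetical player "[S]" in a reduced
    game; its members are then scored individually in games without the rest
    of the coalition, and the difference is the interaction.
    """
    coalition = frozenset(coalition)
    outside = [p for p in players if p not in coalition]

    def v_reduced(S):
        # Replace the hypothetical player "[S]" by the whole coalition.
        base = frozenset(p for p in S if p != "[S]")
        return v(base | coalition) if "[S]" in S else v(base)

    phi_coalition = shapley_value(outside + ["[S]"], "[S]", v_reduced)
    phi_individuals = sum(shapley_value(outside + [i], i, v) for i in coalition)
    return phi_coalition - phi_individuals

v = lambda S: len(S) + (2.0 if {0, 1} <= S else 0.0)
print(shapley_interaction([0, 1, 2], {0, 1}, v))  # 2.0: exactly the cooperation bonus
```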

3.2.2 Token-Level Shapley Interaction Modeling↩︎

Due to the inherent mismatch of semantic units between texts and images, it is ineffective to directly compute the alignment between words and pixels (patches). A textual phrase usually refers to a specific image region, which is composed of multiple patches and represents a visual instance. Thus, we first introduce token-level Shapley interaction modeling to aggregate patches into semantic regions.

Input Representations. Given an image-text pair, the input image \(I\) is sliced into patches and flattened. After a linear projection layer and the addition of position embeddings, we obtain the patch token sequence \(\mathcal{X}^I = \{\mathbf{x}^I_i\}_{i=1}^{L_1}\) with an additional [CLS_I] token embedding. The input text \(T\) is tokenized and embedded into the word token sequence \(\mathcal{X}^T = \{\mathbf{x}^T_i\}_{i=1}^{L_2}\), with position embeddings added. We also prepend a learnable special token [CLS_T] to the word token sequence. Then, we adopt a dual-encoder structure to encode the patch token sequence and word token sequence separately. On top of the image and text encoders, we obtain the representations of the patch token sequence \(\tilde{\mathcal{X}}^I = \{\tilde{\mathbf{x}}^I_i\}_{i=1}^{\tilde{L}_1}\) and word token sequence \(\tilde{\mathcal{X}}^T = \{\tilde{\mathbf{x}}^T_i\}_{i=1}^{\tilde{L}_2}\). We take the learned representations of the [CLS_I] and [CLS_T] tokens as the global representations of the image and text, and the global similarity of an image-text pair is measured by the cosine similarity between them.

Understanding Semantic Regions via Shapley Interaction. Suppose a set of patches represents a complete visual instance in an image; these patches then tend to have a strong Shapley interaction because they work jointly to form the visual instance, which contributes to a better similarity judgment with the text. From the game-theoretic view, we take patch tokens and word tokens as players \(\mathcal{X} = \mathcal{X}^I \cup \mathcal{X}^T\), and the global similarity between images and texts as the game score \(v_1(\cdot)\). To compute \(v_1(\mathcal{S})\), we keep the tokens in \(\mathcal{S}\) and mask the input tokens in \(\mathcal{X} \setminus \mathcal{S}\) to zeros. Thus, the global similarity only considers the tokens in \(\mathcal{S}\), which reflects their contribution to the global similarity judgment.
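A hedged sketch of how such a game score \(v_1(\mathcal{S})\) could be evaluated: tokens outside the coalition are zeroed before encoding, and the cosine similarity of the resulting [CLS] features is the score. The encoder interfaces here are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def game_score_v1(image_tokens, text_tokens, keep_img, keep_txt,
                  image_encoder, text_encoder) -> float:
    """Game score v1(S): global image-text similarity when only the tokens in
    the coalition S participate; tokens outside S are masked to zeros.

    Assumptions (placeholders): image_tokens (L1, d) and text_tokens (L2, d)
    are input token embeddings with the [CLS] token at index 0, keep_* are
    boolean masks selecting S, and each encoder maps (1, L, d) -> (1, L, d).
    """
    masked_img = image_tokens * keep_img.unsqueeze(-1).float()
    masked_txt = text_tokens * keep_txt.unsqueeze(-1).float()
    img_cls = image_encoder(masked_img.unsqueeze(0))[:, 0]   # [CLS_I] representation
    txt_cls = text_encoder(masked_txt.unsqueeze(0))[:, 0]    # [CLS_T] representation
    return F.cosine_similarity(img_cls, txt_cls).item()
```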

Semantic Region Generation. Inspired by YOLOv3 [41], we design a lightweight region generation module. It takes each patch token representation \(\tilde{\mathbf{x}}^I_i\) as input and generates a bounding box prediction centered on \(\tilde{\mathbf{x}}^I_i\), which corresponds to a visual region \(\mathcal{R}_i = \{\mathbf{x}^I_{i, k} \}_{k=1}^{K_i}\) with \({K_i}\) patch tokens. The region generation module also predicts a confidence score \(s(\mathcal{R}_i)\) for each region. We select the top-\(M\) predictions as the semantic regions. Then, the Shapley interaction of \(\mathcal{R}_i\) can be defined as:

\[\mathfrak{I}([\mathcal{R}_i]) = \phi([\mathcal{R}_i]|\mathcal{X} \setminus \mathcal{R}_i \cup \{[\mathcal{R}_i]\}) - \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} \phi(\mathbf{x}_{i, k}^I |\mathcal{X} \setminus \mathcal{R}_i \cup \{\mathbf{x}_{i,k}^I\})\]

According to Equation 1, we can reformulate the Shapley value into the form of an expectation:

\[\phi([\mathcal{R}_i]|\mathcal{X} \setminus \mathcal{R}_i \cup \{[\mathcal{R}_i]\}) = \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c} [v_1(\mathcal{S} \cup \mathcal{R}_i) - v_1(\mathcal{S})] \}\] where \(c\) represents the coalition size. \(\phi(\mathbf{x}_{i, k}^I |\mathcal{X} \setminus \mathcal{R}_i \cup \{\mathbf{x}_{i, k}^I\})\) can be defined in a similar manner, and the Shapley interaction of \(\mathcal{R}_i\) can be reformulated as (we provide the proof in Appendix C):

\[\mathfrak{I}([\mathcal{R}_i]) = \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c} [v_1(\mathcal{S} \cup \mathcal{R}_i) - \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} v_1(\mathcal{S} \cup \{\mathbf{x}_{i, k}^I\}) + (K - 1) v_1(\mathcal{S})]\} \label{e7}\tag{3}\]

Taking the normalized interaction \(\mathfrak{I}'([\mathcal{R}_i])\) as the soft supervision label, the token-based semantic aggregation loss is defined as a binary cross-entropy loss:

\[\mathcal{L}_\mathrm{TSA} = -\frac{1}{M} \sum_{i=1}^{M} [ \mathfrak{I}'([\mathcal{R}_i]) \log(s(\mathcal{R}_i)) + (1 - \mathfrak{I}'([\mathcal{R}_i]) )\log(1 - s(\mathcal{R}_i)) ]\] which propagates gradients to the region generation module and image encoder to adjust bounding box predictions such that more accurate semantic regions can be captured.
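The expectation in Equation 3 can be approximated by sampling coalitions of context tokens, and the normalized estimates then serve as soft labels for \(\mathcal{L}_\mathrm{TSA}\). A simplified sketch, assuming `v1` maps a set of token ids to the global similarity score and that interactions and confidences are already normalized to \([0, 1]\):

```python
import random
import torch
import torch.nn.functional as F

def estimate_region_interaction(all_tokens, region_tokens, v1, num_samples=500):
    """Monte Carlo estimate of the token-level Shapley interaction I([R_i])
    in Equation 3. `all_tokens` and `region_tokens` are collections of token
    ids; `v1` maps a set of token ids to the global similarity score."""
    context = list(set(all_tokens) - set(region_tokens))
    K = len(region_tokens)
    total = 0.0
    for _ in range(num_samples):
        c = random.randint(0, len(context))          # coalition size
        S = set(random.sample(context, c))            # coalition of context tokens
        total += (v1(S | set(region_tokens))
                  - sum(v1(S | {t}) for t in region_tokens)
                  + (K - 1) * v1(S))
    return total / num_samples

def tsa_loss(interactions: torch.Tensor, confidences: torch.Tensor) -> torch.Tensor:
    """Token-based semantic aggregation loss: binary cross-entropy with the
    normalized interactions (shape (M,)) as soft labels for the region
    confidence scores s(R_i) (shape (M,), in (0, 1))."""
    return F.binary_cross_entropy(confidences, interactions)
```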

3.2.3 Semantics-Level Shapley Interaction Modeling↩︎

After obtaining the inferred semantic regions, we propose semantics-level Shapley interaction modeling to explicitly model the fine-grained semantic alignment between regions and phrases. We first define the fine-grained similarity score and then explain semantic alignment based on game theory.

We adopt Avg-Pooling over the learned patch representations in each \(\mathcal{R}_i\) to obtain the region representation \(\mathbf{h}_i^I \in \mathbb{R}^d\). We employ an off-the-shelf constituency parser to extract phrases from the text and obtain each phrase representation \(\mathbf{h}_j^T \in \mathbb{R}^d\) by Avg-Pooling. In total, we obtain \(M\) regions \(\mathcal{H}^I = \{\mathbf{h}^I_i\}_{i=1}^{M}\) and \(N\) phrases \(\mathcal{H}^T = \{\mathbf{h}^T_j\}_{j=1}^{N}\). The alignment matrix is defined as \(\mathbf{A} = [a_{ij}]^{M\times N}\), where \(a_{ij}={\mathbf{h}^I_i}^\top \mathbf{h}^T_j\) represents the alignment score between the \(i\)-th region and the \(j\)-th phrase. Next, we apply softmax-normalization over each row of \(\mathbf{A}\), obtaining \(\tilde{\mathbf{A}}\). For the \(i\)-th region, we calculate its maximum alignment score as \(\max_j \tilde{a}_{ij}\). Then, we use the average maximum alignment score over all regions as the fine-grained image-to-text similarity \(p_1\). Similarly, we can obtain the fine-grained text-to-image similarity \(p_2\), and the total fine-grained similarity score is defined as \(p = (p_1 + p_2)/2\).

Understanding Semantic Alignment via Shapley Interaction. If a region and a phrase have strong semantic correspondence, they tend to cooperate with each other and contribute to the fine-grained similarity score. Thus, we consider \(\mathcal{H} = \mathcal{H}^I \cup \mathcal{H}^T\) as the players and the fine-grained similarity score \(p\) as the game score \(v_2(\cdot)\). The Shapley interaction between the \(i\)-th region and the \(j\)-th phrase can be formulated as:

\[\begin{align} \mathfrak{I}([\mathcal{H}_{ij}]) &= \phi([\mathcal{H}_{ij}]|\mathcal{H} \setminus \mathcal{H}_{ij} \cup \{[\mathcal{H}_{ij}]\}) - \phi(\mathbf{h}^I_i|\mathcal{H} \setminus \mathcal{H}_{ij} \cup \{\mathbf{h}^I_i\}) - \phi(\mathbf{h}^T_j| \mathcal{H} \setminus \mathcal{H}_{ij} \cup \{\mathbf{h}^T_j\}) \\ & = \mathop{\mathbb{E}}\limits_{c} \{\mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{H} \setminus \mathcal{H}_{ij} \atop |\mathcal{S}| = c} [v_2(\mathcal{S} \cup \mathcal{H}_{ij}) - v_2(\mathcal{S} \cup \{\mathbf{h}^I_i\}) - v_2(\mathcal{S} \cup \{\mathbf{h}^T_j\}) + v_2(\mathcal{S})] \} \label{e10} \end{align}\tag{4}\] where \([\mathcal{H}_{ij}]\) represents the single player formed by the coalition of \(i\)-th region and \(j\)-th phrase. Taking normalized \(\mathfrak{I}'([\mathcal{H}_{ij}])\) as soft labels, the fine-grained semantic alignment loss can be defined as:

\[\mathcal{L}_\mathrm{FSA} = -\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \mathfrak{I}'([\mathcal{H}_{ij}]) \log (\tilde{a}_{ij})\]
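For clarity, the following PyTorch sketch computes the fine-grained similarity \(p\) and the alignment loss \(\mathcal{L}_\mathrm{FSA}\) from given region and phrase features; the normalized interaction matrix is assumed to be provided as the soft label.

```python
import torch

def fine_grained_similarity(regions: torch.Tensor, phrases: torch.Tensor):
    """regions: (M, d) region features, phrases: (N, d) phrase features.
    Returns the row-normalized alignment matrix and p = (p1 + p2) / 2."""
    A = regions @ phrases.t()                     # (M, N) alignment scores a_ij
    A_i2t = A.softmax(dim=1)                      # normalize over phrases per region
    A_t2i = A.t().softmax(dim=1)                  # normalize over regions per phrase
    p1 = A_i2t.max(dim=1).values.mean()           # fine-grained image-to-text similarity
    p2 = A_t2i.max(dim=1).values.mean()           # fine-grained text-to-image similarity
    return A_i2t, (p1 + p2) / 2

def fsa_loss(A_i2t: torch.Tensor, interactions: torch.Tensor) -> torch.Tensor:
    """Fine-grained semantic alignment loss: cross-entropy between the
    normalized Shapley interactions (soft labels, shape (M, N)) and the
    normalized alignment matrix."""
    return -(interactions * A_i2t.clamp_min(1e-8).log()).mean()
```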

3.3 Uncertainty-Aware Neural Shapley Interaction Learning↩︎

According to Equation 1 and Equation 2, computing the exact Shapley value is an NP-hard problem [18]. Previous methods mainly apply a sampling-based method [19] to approximate it. While the sampling-based approximation is unbiased, an accurate approximation requires thousands of model evaluations. To reduce the computational cost, we propose an uncertainty-aware neural Shapley interaction learning (UNSIL) module that cooperates with the sampling-based method, rendering an efficient and effective hybrid strategy.

Specifically, the sampling-based method [19] estimates the expectation terms in Equation 3 and Equation 4 by sampling to compute the Shapley interaction. Inspired by noisy label learning [42], the UNSIL module learns to predict the Shapley interaction and the corresponding uncertainty \(\sigma~\in~(0, 1)\). Intuitively, if the UNSIL module makes a prediction with low uncertainty, we can directly apply its prediction to \(\mathcal{L}_\mathrm{TSA}\) and \(\mathcal{L}_\mathrm{FSA}\), avoiding thousands of model evaluations. If the uncertainty is high, we then resort to the sampling-based method for a more accurate estimation.

During training, the UNSIL module first predicts the target interaction with uncertainty \(\sigma\). Then, we sample a value \(\epsilon\) from a uniform distribution on \((0, 1)\). If \(\epsilon > \sigma\), we directly use its prediction. If \(\epsilon \leq \sigma\), we then use the sampling-based method to compute the Shapley Interaction and update the UNSIL module based on the sampling-based results. Note that, for the first few iterations, we employ the sampling-based method directly, and use its results to train the UNSIL module.

Let \({\mathfrak{I}}^*\) and \(\hat{\mathfrak{I}}\) denote the results from the sampling-based method and UNSIL module, respectively. Taking \({\mathfrak{I}}^*\) as the ground-truth, the UNSIL module is trained by:

\[\mathcal{L}_\mathrm{UNSIL} = \frac{1}{\beta_1\sigma} \mathcal{L}_\mathrm{MSE}( \hat{\mathfrak{I}}, {\mathfrak{I}}^*) + \beta_2 \sigma\]

where the first term is the mean squared error \(\mathcal{L}_\mathrm{MSE}\) weighted by the uncertainty, the second term serves as a regularization term on the prediction uncertainty, and \(\beta_1\) and \(\beta_2\) are weight hyper-parameters. The UNSIL module implicitly learns the uncertainty from this regression loss. We discuss the implementation details of the UNSIL module in Section 4.5 and Appendix D.
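A hedged sketch of one estimation step under this hybrid strategy: the UNSIL module predicts an interaction and its uncertainty, a uniform sample decides whether to trust the prediction or to fall back to sampling, and in the latter case the module is trained against the sampling-based estimate with the uncertainty-weighted loss above. The module and sampling interfaces are placeholders.

```python
import torch

def hybrid_interaction_step(unsil, inputs, sampling_estimator,
                            beta1: float = 1.0, beta2: float = 1.0):
    """One step of the hybrid Shapley interaction estimation.

    `unsil(inputs)` is assumed to return (predicted interaction, uncertainty
    sigma in (0, 1)); `sampling_estimator(inputs)` returns the sampling-based
    estimate I*. Returns the interaction to use and the UNSIL loss (or None)."""
    pred, sigma = unsil(inputs)
    eps = torch.rand(())                                  # epsilon ~ U(0, 1)
    if eps > sigma:                                       # confident: use the prediction
        return pred.detach(), None
    target = sampling_estimator(inputs)                   # fall back to sampling
    mse = (pred - target) ** 2                            # L_MSE against I*
    unsil_loss = mse / (beta1 * sigma) + beta2 * sigma    # uncertainty-weighted loss
    return target, unsil_loss
```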

4 Experiments↩︎

4.1 Pre-training Details↩︎

As sufficient data is a prerequisite for vision-language pre-training, we construct a dataset of 240M image-text pairs from the Internet. We implement the image encoder with Swin-L [43] and the text encoder with BERT-Small [21]. The input images are resized to \(224 \times 224\) and the input texts are tokenized by WordPiece with a maximum length of 60. We pre-train the model for 20 epochs with a batch size of 512 on 128 NVIDIA V100 GPUs. We use the AdamW [44] optimizer with a learning rate of \(2\times10^{-4}\) and a weight decay of 0.01. More pre-training and evaluation details are provided in Appendices D and E. We also analyze the image encoder and training efficiency in Appendices G and J.

Table [t_retrieval]: Zero-shot image-text retrieval results (Recall@K, %) on Flickr30K and MSCOCO (prior pre-training models vs. LOUPE in the last row).

|  | Flickr30K I2T R@1 | R@5 | R@10 | Flickr30K T2I R@1 | R@5 | R@10 | MSCOCO I2T R@1 | R@5 | R@10 | MSCOCO T2I R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | 70.7 | 90.2 | 94.0 | 54.3 | 79.6 | 87.5 | 44.0 | 71.2 | 80.4 | 32.3 | 59.0 | 70.2 |
|  | 83.6 | 95.7 | 97.7 | 68.7 | 89.2 | 93.9 | - | - | - | - | - | - |
|  | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | 58.4 | 81.5 | 88.1 | 37.8 | 62.4 | 72.2 |
|  | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | 58.6 | 83.0 | 89.7 | 45.6 | 69.8 | 78.6 |
|  | 89.8 | 99.2 | 99.8 | 75.0 | 93.4 | 96.3 | 61.3 | 84.3 | 90.4 | 45.9 | 70.6 | 79.3 |
| LOUPE | 90.5 | 99.5 | 99.8 | 76.3 | 93.9 | 96.7 | 62.3 | 85.1 | 91.2 | 50.1 | 75.4 | 83.3 |

Table 1: Top-1 accuracy (%) of zero-shot image classification over 11 datasets.
CLIP 96.2 92.9 77.3 67.7 78.7 34.9 57.7 36.1 93.5 92.6 75.3 73.0
LOUPE 95.9 94.3 79.9 69.8 87.4 37.8 53.3 54.9 94.1 93.9 76.1 76.1

4.2 Zero-Shot Image-Text Retrieval↩︎

We evaluate zero-shot image-text retrieval on the widely used MSCOCO [45] and Flickr30K [46] datasets. First, the results in Table [t_retrieval] show that LOUPE achieves new state-of-the-art zero-shot performance on most metrics of the two datasets, demonstrating the stronger generalizability of our pre-training framework. Second, while previous works mainly pre-train on larger datasets (CLIP 400M, ALIGN 1800M, FILIP 340M), LOUPE still achieves superior performance using less training data (240M). Third, compared with FILIP, which directly computes token-wise similarity, our model captures semantic alignment between visual regions and textual phrases, which is more semantically meaningful. For text-to-image retrieval on MSCOCO, LOUPE significantly outperforms FILIP by 4.2% on recall@1.

4.3 Zero-Shot Image Classification↩︎

In this section, we evaluate LOUPE on the zero-shot image classification task. We compare LOUPE with CLIP on 11 downstream classification datasets, following the same evaluation setting as CLIP [3]. Table 1 summarizes the results. As shown in Table 1, LOUPE outperforms CLIP with an average improvement of 3.1%. Notably, on ImageNet, the largest of the 11 datasets, LOUPE surpasses CLIP by 0.8%. We also observe that LOUPE achieves substantial performance gains on several fine-grained image classification datasets (e.g., Flowers102 and Aircrafts), which demonstrates the superiority of LOUPE in fine-grained semantic understanding.

We also evaluate the linear probing performance of our LOUPE on image classification. The detailed results can be found in Appendix I.

Table [t_detection_grounding]: Zero-shot transfer to object detection (mAP, %) on COCO and PASCAL VOC, and to visual grounding (top-1 accuracy, %) on RefCOCO and RefCOCO+ (CLIP variants and AdaptCLIP vs. LOUPE in the last row).

|  | COCO mAP@0.3 | COCO mAP@0.5 | VOC mAP@0.3 | VOC mAP@0.5 | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB |
|---|---|---|---|---|---|---|---|---|---|---|
|  | 8.5 | 4.5 | 18.2 | 7.3 | 6.7 | 6.2 | 5.8 | 6.1 | 7.0 | 5.7 |
|  | 6.4 | 1.9 | 11.7 | 4.8 | 2.1 | 2.3 | 1.7 | 1.7 | 2.0 | 2.8 |
|  | 7.1 | 3.2 | 19.1 | 8.2 | 5.5 | 5.2 | 4.8 | 4.4 | 5.6 | 4.9 |
|  | 14.9 | 6.6 | 28.7 | 12.9 | 16.7 | 18.4 | 18.0 | 17.5 | 18.9 | 19.6 |
| LOUPE | 25.3 | 12.1 | 30.3 | 19.5 | 25.2 | 26.8 | 24.5 | 22.9 | 23.3 | 23.6 |

Figure 2: An example of LOUPE zero-shot transferring to object detection using prompt templates.

4.4 Zero-Shot Transfer to Object Detection and Visual Grounding↩︎

To answer whether our model has learned fine-grained semantics, we further evaluate LOUPE on object detection [47] and visual grounding [14], which require more fine-grained semantic understanding to identify specific visual regions in images according to object labels or referring expressions. Visual grounding can be seen as generalized object detection, where the pre-defined class labels are replaced by referring expression sentences. As LOUPE generates a set of semantic regions that are aligned with textual phrases, it can be applied to object detection and visual grounding without architectural modification. For visual grounding, we take referring expressions as the input text. For object detection, as illustrated in Figure 2, we use prompt templates to expand detection labels into input text. We then encode the input text with the learned text encoder, and both tasks are completed by measuring the similarity between candidate region representations and text representations.

For comparison, we zero-shot transfer CLIP (ViT-L/14) to object detection and visual grounding by applying several non-parametric approaches to the spatial feature maps of CLIP. We also compare with AdaptCLIP [48], a concurrent unpublished method that leverages classic super-pixel (SLIC [49]) and bounding box proposal (selective search [50]) methods to zero-shot transfer CLIP to phrase localization. We use its official public implementation to obtain the results. For object detection, we report mean Average Precision (mAP) at IoU thresholds of \(\{0.3, 0.5\}\) on COCO [45] (65 classes) and PASCAL VOC [51] (20 classes). For visual grounding, we report top-1 accuracy at an IoU threshold of 0.5 on RefCOCO [14] and RefCOCO+ [14]. The experiment details of the CLIP variants and LOUPE are provided in Appendix E.

Table [t_detection_grounding] summarizes the results. 1) Overall, LOUPE outperforms all CLIP variants by a large margin. The significantly higher performance illustrates the stronger zero-shot transfer ability of our fine-grained semantically aligned pre-training paradigm. 2) All CLIP variants rely on pre-processing steps over CLIP’s feature map (e.g., AdaptCLIP first uses SLIC to group pixels and then uses selective search to generate a large number of proposals), which is time-consuming. In contrast, our method directly predicts semantic regions from the patch token representations. 3) The consistently competitive performance across the four benchmarks validates that LOUPE can learn fine-grained semantics from raw text supervision. LOUPE demonstrates a promising alternative: learning fine-grained semantics from large-scale raw image-text pairs, which are easily available and contain a broader set of visual concepts.

As time-consuming human annotation is unscalable for the massive number of object classes in the real world, some recent works [52], [53] aim to train object detectors with annotations on base object classes that generalize to the remaining object classes of the same dataset. The latest works [54], [55] leverage the generalizability of vision-language pre-training models to further improve zero-shot performance on novel classes. However, these zero-shot approaches still require bounding box annotations on base classes for task-specific supervised learning. In contrast, LOUPE is trained on large-scale raw image-text pairs, which are readily accessible on the Internet and contain more diverse semantics.

Table [t_ablation]: Ablation study of individual components on image-text retrieval (I2T/T2I), object detection (mAP), visual grounding (val/testA/testB), and training time.

| Row | Setting | Retrieval I2T | Retrieval T2I | Detection mAP@0.3 | Detection mAP@0.5 | Grounding val | Grounding testA | Grounding testB | Training Time (sec/iter) |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Backbone (\(\mathcal{L}_\mathrm{CMC}\) only) | 31.0 | 24.8 | 3.8 | 1.0 | 1.3 | 0.9 | 0.8 | 1.17 |
| 2 | + \(\mathcal{L}_\mathrm{TSA}\) | 32.4 | 26.2 | 7.6 | 3.3 | 1.8 | 2.0 | 2.6 | 8.38 |
| 3 | + \(\mathcal{L}_\mathrm{TSA}\) + \(\mathcal{L}_\mathrm{FSA}\) | 33.5 | 28.3 | 9.4 | 5.9 | 4.1 | 4.6 | 4.3 | 9.90 |
| 4 | + \(\mathcal{L}_\mathrm{TSA}\) + \(\mathcal{L}_\mathrm{FSA}\) + UNSIL | 33.3 | 28.1 | 9.0 | 5.6 | 4.5 | 4.9 | 4.4 | 1.93 |

Figure 3: (a) Instability of the Shapley interaction approximation with respect to different sampling numbers. (b, c) Uncertainty and error of the UNSIL module with different structures.

4.5 Ablation Study↩︎

Effectiveness of Individual Components. In this section, we investigate the effectiveness of each component in Table [t_ablation]. Given the costly training time, all ablation studies are based on a relatively small dataset (Conceptual Captions 3M [56]). We start with the backbone model, a dual-encoder trained by the cross-modal contrastive loss. We then gradually add the token-level Shapley interaction modeling supervision \(\mathcal{L}_\mathrm{TSA}\) (Row 2), the semantics-level Shapley interaction modeling supervision \(\mathcal{L}_\mathrm{FSA}\) (Row 3), and the UNSIL module (Row 4). For Row 2 and Row 3, the Shapley interaction is computed only by the sampling-based method. The results in Table [t_ablation] show that both \(\mathcal{L}_\mathrm{TSA}\) and \(\mathcal{L}_\mathrm{FSA}\) bring significant improvements on all tasks. We observe that \(\mathcal{L}_\mathrm{TSA}\) brings a 3.8% improvement on object detection, and the improved fine-grained visual semantic understanding further facilitates cross-modal retrieval (+1.4%). The semantics-level Shapley interaction modeling further improves the performance on all tasks by modeling the semantic alignment between visual regions and textual phrases. Comparing Row 3 and Row 4, we observe that the UNSIL module maintains the estimation accuracy while avoiding intensive computations: the average training time is reduced from 9.90 seconds per iteration to 1.93 seconds per iteration.

Accuracy of the Shapley Interaction Learning. Since we use the sampling-based method [19] to compute the Shapley interaction and train the UNSIL module, we conduct a study to evaluate the accuracy of the sampling-based method and the error of the UNSIL module. Following [39], we compute the interaction multiple times and measure their instability. A lower instability means that different sampling processes yield similar interactions, indicating higher accuracy. Specifically, the instability is defined as \(\frac{\mathbb{E}_{u, v: u \neq v}|\mathfrak{I}_u - \mathfrak{I}_v|}{\mathbb{E}_w |\mathfrak{I}_w|}\), where \(\mathfrak{I}_w\) denotes the interaction computed in the \(w\)-th run. We average the instability values over the Shapley interactions of 100 image-text pairs and report them with respect to different sampling numbers. As shown in Figure 3 (a), the instability decreases as the sampling number increases. When the sampling number is larger than 500, the approximated Shapley interaction is stable enough, with instability less than 0.06. Further, we try different models (i.e., Conv1D, 3-Layer MLP + Attention, 3-Layer Transformer) to implement the UNSIL module (see Appendix D for details). We test them on 1000 samples and report their mean uncertainty and relative error in Figure 3 (b) and (c). We observe that MLP + Attention is sufficient to predict the interaction with lower complexity. Thus, we implement the UNSIL module with MLP + Attention.
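For reference, a short sketch of the instability metric used here; `interactions` holds the interaction values estimated in repeated runs for the same input.

```python
import itertools
import torch

def instability(interactions: torch.Tensor) -> torch.Tensor:
    """Instability of repeated Shapley interaction estimates.

    `interactions` has shape (W,), one value per estimation run. Returns
    E_{u != v} |I_u - I_v| / E_w |I_w|; lower values mean more stable estimates."""
    pairs = list(itertools.combinations(range(interactions.numel()), 2))
    diffs = torch.stack([(interactions[u] - interactions[v]).abs() for u, v in pairs])
    return diffs.mean() / interactions.abs().mean()
```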

Figure 4: Qualitative examples of object detection on COCO and visual grounding on RefCOCO+.
Figure 5: Visualization of learned fine-grained semantic alignment and corresponding Shapley interaction values. The values in the red boxes represent the Shapley interaction of regions.

4.6 Qualitative Analysis↩︎

Qualitative Examples. As shown in Figure 4, LOUPE successfully captures the regions that correspond to the detected objects, and grounds the referring expressions onto the referred regions.

Visualization of Learned Fine-Grained Semantic Alignment. In Figure 5, we visualize some key semantic regions and corresponding alignment matrices inferred by LOUPE. We present the regions with top-3 confidence (Region 1 – 3) and two randomly sampled regions (white boxes). The red boxes at the bottom of bounding boxes indicate their normalized token-level Shapley interaction values. Comparing their Shapley interaction values, we observe that the token-level Shapley interaction successfully distinguishes semantic regions from randomly sampled regions. The semantically meaningful regions tend to have stronger interaction. It indicates that token-level Shapley interaction can provide correct supervision for semantic region generation. Further, we show the alignment matrices inferred by semantics-level Shapley interaction and LOUPE, respectively. As shown in the right case of Figure 5, LOUPE successfully recognizes the leash region and aligns it with the “a leash” phrase. Note that existing object detection datasets do not contain the “leash” category.

5 Conclusion↩︎

This paper introduces a novel vision-language pre-training framework, LOUPE, which models the fine-grained semantic alignment between visual regions and textual phrases by game-theoretic interactions. To efficiently compute the interactions, we further propose an uncertainty-aware neural Shapley interaction learning module. Comprehensive experiments show that LOUPE achieves new state-of-the-art on image-text retrieval datasets and can transfer to object detection and visual grounding in a zero-shot manner. This work demonstrates a new promising direction of learning fine-grained semantics from large-scale raw image-text data.

Limitations. 1) The phrases are extracted by off-the-shelf constituency parsers, whose predictions might not be completely accurate. 2) The web data might inevitably contain mismatched image-text pairs, leading to noisy supervision.

Social Impacts. Our model is trained on noisy data from the Internet, which may contain unsuitable images, violent text, or private information. Thus, additional analysis of the data is necessary. Further, the use of our model for privacy surveillance or other nefarious purposes should be prohibited.

Acknowledgment. This work has been supported in part by the National Key Research and Development Program of China (2018AAA0101900), Zhejiang NSF (LR21F020004), Key Research and Development Program of Zhejiang Province, China (No. 2021C01013), Chinese Knowledge Center of Engineering Science and Technology (CKCEST). We thank all the reviewers for their valuable comments.

6 Appendix Overview↩︎

In this appendix, we present:

  • Axiomatic Properties of Shapley Value (Section 7).

  • Proofs of Equation 3 and Equation 4 (Section 8).

  • Hyperparameters and Implementation Details (Section 9).

  • Pre-Training and Evaluation Details (Section 10).

  • More Experiment Results on Downstream Vision-Language Generation Task (Section 11).

  • Further Analysis on the Image Encoder (Section 12).

  • More Qualitative Examples on Object Detection and Visual Grounding (Section 13).

  • Linear Probing Evaluation (Section 14).

  • Training Efficiency Discussion (Section 15).

  • Detailed Discussion with Some Related Works (Section 16).

7 Axiomatic Properties of Shapley Value↩︎

In this section, we mainly introduce the axiomatic properties of Shapley value. Weber et al. [36] have proved that Shapley value is the unique metric that satisfies the following axioms: Linearity, Symmetry, Dummy, and Efficiency.

Linearity Axiom. If two independent games \(u\) and \(v\) can be linearly merged into one game \(w(\mathcal{S}) = u(\mathcal{S}) + v(\mathcal{S})\), then the Shapley value of each player \(i \in \mathcal{N}\) in the new game \(w\) is the sum of Shapley values of the player \(i\) in the game \(u\) and \(v\), which can be formulated as:

\[\phi_w(i|\mathcal{N}) = \phi_u(i|\mathcal{N}) + \phi_v(i|\mathcal{N})\]

Symmetry Axiom. Considering two players \(i\) and \(j\) in a game \(v\), if they satisfy: \[\forall \mathcal{S} \subseteq \mathcal{N} \setminus \{i, j\}, \quad v(\mathcal{S} \cup \{i\}) = v(\mathcal{S} \cup \{j\})\] then \(\phi_v(i|\mathcal{N}) = \phi_v(j|\mathcal{N})\).

Dummy Axiom. The dummy player is defined as a player that has no interaction with other players. Formally, if a player \(i\) in a game \(v\) satisfies: \[\forall \mathcal{S} \subseteq \mathcal{N} \setminus \{i\}, \quad v(\mathcal{S} \cup \{i\}) = v(\mathcal{S}) + v(\{i\})\] then this player is defined as a dummy player. In this way, the dummy player \(i\) has no interaction with other players, i.e., \(v(\{i\}) = \phi_v(i|\mathcal{N})\).

Efficiency Axiom. The efficiency axiom ensures that the overall reward can be assigned to all players, which can be formulated as: \[\sum_{i \in \mathcal{N}} \phi_v(i) = v(\mathcal{N}) - v(\varnothing)\]

8 Proofs of Equation 3 and Equation 4↩︎

In this section, we provide detailed proofs for Equation 3 in Section 3.2.2 and Equation 4 in Section 3.2.3.

We first provide the proof for Equation 3. The token-level Shapley interaction for \(\mathcal{R}_i\) can be decomposed as follows: \[\begin{align} \mathfrak{I}([\mathcal{R}_i]) &= \phi([\mathcal{R}_i]|\mathcal{X} \setminus \mathcal{R}_i \cup \{[\mathcal{R}_i]\}) - \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} \phi(\mathbf{x}_{i, k}^I |\mathcal{X} \setminus \mathcal{R}_i \cup \{\mathbf{x}_{i,k}^I\}) \\ &= \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c} [v_1(\mathcal{S} \cup \mathcal{R}_i) - v_1(\mathcal{S})]\} - \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c}[ v_1(\mathcal{S} \cup \{\mathbf{x}_{i, k}^I\}) - v_1(\mathcal{S})]\}\\ &= \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c} [v_1(\mathcal{S} \cup \mathcal{R}_i) - v_1(\mathcal{S})]\} - \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c}[ \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} (v_1(\mathcal{S} \cup \{\mathbf{x}_{i, k}^I\}) - v_1(\mathcal{S}))]\}\\ &= \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c} [v_1(\mathcal{S} \cup \mathcal{R}_i) - v_1(\mathcal{S}) - \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i}(v_1(\mathcal{S} \cup \{\mathbf{x}_{i, k}^I\}) - v_1(\mathcal{S}))]\}\\ &= \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c} [v_1(\mathcal{S} \cup \mathcal{R}_i) - v_1(\mathcal{S}) - \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} v_1(\mathcal{S} \cup \{\mathbf{x}_{i, k}^I\}) + \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} v_1(\mathcal{S})]\}\\ &= \mathop{\mathbb{E}}\limits_{c}\{ \mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{X} \setminus \mathcal{R}_i \atop |\mathcal{S}| = c} [v_1(\mathcal{S} \cup \mathcal{R}_i) - \sum_{\mathbf{x}_{i, k}^I \in \mathcal{R}_i} v_1(\mathcal{S} \cup \{\mathbf{x}_{i, k}^I\}) + (K - 1) v_1(\mathcal{S})]\} \end{align}\]

We then provide the proof for Equation 4. The semantics-level Shapley interaction between region \(i\) and phrase \(j\) can be decomposed as follows:

\[\begin{align} \mathfrak{I}([\mathcal{H}_{ij}]) &= \phi([\mathcal{H}_{ij}]|\mathcal{H} \setminus \mathcal{H}_{ij} \cup \{[\mathcal{H}_{ij}]\}) - \phi(\mathbf{h}^I_i|\mathcal{H} \setminus \mathcal{H}_{ij} \cup \{\mathbf{h}^I_i\}) - \phi(\mathbf{h}^T_j| \mathcal{H} \setminus \mathcal{H}_{ij} \cup \{\mathbf{h}^T_j\}) \\ &= \mathop{\mathbb{E}}\limits_{c} \{\mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{H} \setminus \mathcal{H}_{ij} \atop |\mathcal{S}| = c} [v_2(\mathcal{S} \cup \mathcal{H}_{ij}) - v_2(\mathcal{S})] \} - \mathop{\mathbb{E}}\limits_{c} \{\mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{H} \setminus \mathcal{H}_{ij} \atop |\mathcal{S}| = c} [v_2(\mathcal{S} \cup \{\mathbf{h}_i^I\}) - v_2(\mathcal{S})] \}\\ & - \mathop{\mathbb{E}}\limits_{c} \{\mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{H} \setminus \mathcal{H}_{ij} \atop |\mathcal{S}| = c} [v_2(\mathcal{S} \cup \{\mathbf{h}_j^T\}) - v_2(\mathcal{S})] \}\\ &= \mathop{\mathbb{E}}\limits_{c} \{\mathop{\mathbb{E}}\limits_{\mathcal{S} \subseteq \mathcal{H} \setminus \mathcal{H}_{ij} \atop |\mathcal{S}| = c} [v_2(\mathcal{S} \cup \mathcal{H}_{ij}) - v_2(\mathcal{S} \cup \{\mathbf{h}^I_i\}) - v_2(\mathcal{S} \cup \{\mathbf{h}^T_j\}) + v_2(\mathcal{S})\;]\; \} \end{align}\]

Table 2: A summary of various hyperparameters in LOUPE.
Hyperparameter Value
Image Encoder - Swin-L
input image size \(224\times224\)
stage 1 - patch size \(4\times4\)
stage 1 - hidden size 192
stage 1 - window size \(7\times7\)
stage 1 - number of heads 6
stage 2 - patch size \(8\times8\)
stage 2 - hidden size 384
stage 2 - window size \(7\times7\)
stage 2 - number of heads 12
stage 3 - patch size \(16\times16\)
stage 3 - hidden size 768
stage 3 - window size \(7\times7\)
stage 3 - number of heads 24
stage 4 - patch size \(32\times32\)
stage 4 - hidden size 1536
stage 4 - window size \(7\times7\)
stage 4 - number of heads 48
Text Encoder - BERT-Small
maximum length of word tokens 60
vocabulary size 30522
attention dropout probability 0.1
hidden activation function GELU
hidden dropout probability 0.1
initializer range 0.02
intermediate size 2048
layer norm eps \(1e^{-12}\)
hidden size 512
number of attention heads 8
number of hidden layers 4
Pre-Training
number of epochs 20
batch size 512
learning rate 2e-4
learning schedule OneCycle
cycle momentum True
base momentum 0.85
max momentum 0.95
AdamW weight decay 0.01
AdamW \(\beta_1\) 0.9
AdamW \(\beta_2\) 0.999

9 Hyperparameters and Implementation Details↩︎

In this section, we summarize the hyperparameters in our LOUPE model in Table 2, including the hyperparameters of the image encoder, text encoder, and pre-training process. For the uncertainty-aware neural Shapley interaction learning module, we attempt three kinds of models (i.e., Conv1D, 3-Layer MLP + Attention, 3-Layer Transformer) to implement it for token-level and semantics-level Shapley interaction approximation.

For token-level Shapley interaction approximation, it takes the patch token sequence \(\mathcal{X}^I = \{\mathbf{x}^I_i\}_{i=1}^{L_1}\), word token sequence \(\mathcal{X}^T = \{\mathbf{x}^T_i\}_{i=1}^{L_2}\), and the visual region \(\mathcal{R}_i = \{\mathbf{x}^I_{i, k} \}_{k=1}^{K_i}\) as input, and estimates the corresponding token-level Shapley interaction value for \(\mathcal{R}_i\) along with the uncertainty \(\sigma\).

The Conv1D model first performs Avg-Pooling over the learned patch representations of \(\mathcal{R}_i\) to obtain the region representation \(\mathbf{h}^I_i\), and then fuses the word and patch token representations with the region representation \(\mathbf{h}^I_i\), respectively. Specifically, we project them into a unified semantic space by fully-connected layers and then fuse them through the Hadamard product as:

\[\mathcal{F}^I = (\mathcal{W}_1 \mathbf{h}^I_i \mathfrak{1}^T ) \odot (\mathcal{W}_2 \mathcal{X}^I )\]

where \(\mathcal{W}_1\) and \(\mathcal{W}_2\) are learnable projection parameters, \(\mathfrak{1}^T\) is the transpose of an all-ones vector, and \(\odot\) represents the Hadamard product. We obtain \(\mathcal{F}^T\) in a similar manner. Then, we apply a 1D convolution with kernel size 4 and stride 2 over \(\mathcal{F}^I\) and \(\mathcal{F}^T\), respectively. Followed by a Max-Pooling operation, we obtain \(\tilde{\mathbf{f}}^I \in \mathbb{R}^d\) and \(\tilde{\mathbf{f}}^T \in \mathbb{R}^d\). Next, we concatenate them with \(\mathbf{h}^I_i\) and feed the result to two separate 1-layer fully-connected heads to obtain the Shapley interaction estimate and the corresponding uncertainty.

The 3-Layer MLP + Attention model first performs Avg-Pooling over the learned patch representations of \(\mathcal{R}_i\) to obtain the region representation \(\mathbf{h}^I_i\). Then, we use \(\mathbf{h}^I_i\) as the query to attend to the patch token sequence and compute a weighted sum of the patch token representations as:

\[\begin{align} \tilde{\alpha}^I_j &= \mathcal{W}_3(\tanh(\mathcal{W}_4 \mathbf{h}^I_i + \mathcal{W}_5 \mathbf{x}^I_{j}))\\ \mathbf{\alpha}^I &= \mathrm{softmax}([\tilde{\alpha}^I_1, ..., \tilde{\alpha}^I_{L_1}])\\ \mathbf{e}^I &= \sum_{j=1}^{L_1} \alpha^I_j \mathbf{x}^I_j \end{align}\]

where \(L_1\) is the number of patch tokens. We can obtain \(\mathbf{e}^T\) for the word token sequence in a similar manner. We then concatenate \(\mathbf{e}^I\), \(\mathbf{e}^T\), and \(\mathbf{h}^I_i\) and feed the result to two separate 3-layer fully-connected heads to obtain the Shapley interaction estimate and the corresponding uncertainty.
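A hedged PyTorch sketch of the additive attention pooling described above, where the region representation queries the token sequence; layer names and sizes are illustrative rather than the released implementation.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Additive attention pooling: a region feature queries the token sequence
    and a weighted sum of token features is returned (cf. the equations above)."""

    def __init__(self, dim: int):
        super().__init__()
        self.w_query = nn.Linear(dim, dim)   # plays the role of W_4
        self.w_key = nn.Linear(dim, dim)     # plays the role of W_5
        self.w_score = nn.Linear(dim, 1)     # plays the role of W_3

    def forward(self, region: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # region: (d,), tokens: (L, d)
        scores = self.w_score(torch.tanh(self.w_query(region) + self.w_key(tokens)))  # (L, 1)
        alpha = scores.softmax(dim=0)                                                 # attention weights
        return (alpha * tokens).sum(dim=0)                                            # pooled feature (d,)
```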

The 3-Layer Transformer model takes the concatenated sequence of \(\mathcal{X}^I\) and \(\mathcal{X}^T\) as input. We add position embeddings and three kinds of token type embeddings (i.e., word token, context patch token, region patch token) to the sequence. We then apply three transformer blocks to jointly encode the input sequence and take the output [CLS] token to predict the Shapley interaction estimate and the corresponding uncertainty, separately.

For semantics-level Shapley interaction approximation, it takes the \(M\) regions \(\mathcal{H}^I = \{\mathbf{h}^I_i\}_{i=1}^{M}\), \(N\) phrases \(\mathcal{H}^T = \{\mathbf{h}^T_j\}_{j=1}^{N}\), and the target region-phrase pair \(<\mathbf{h}^I_i, \mathbf{h}^T_j>\) as input, and estimates the corresponding semantics-level Shapley interaction value for \(<\mathbf{h}^I_i, \mathbf{h}^T_j>\) along with the uncertainty \(\sigma\). The architectures of the three models are consistent with their token-level implementations.

10 Pre-Training and Evaluation Details↩︎

10.1 Pre-Training Dataset Details↩︎

As recent works [1], [3], [4] have shown that pre-training models obtain substantial performance gains by scaling up the dataset, we construct a large-scale dataset consisting of 240 million image-text pairs that covers a broad set of visual concepts. We elaborate the details below.

Raw image-text pair collection. We first harvest large-scale noisy image-text pairs from the web and design multiple filtering rules to improve the quality of the web data.

Image-based filtering. Following ALIGN [1], we remove pornographic images and keep only images whose dimensions are both larger than 200 pixels. We also remove images whose aspect ratio is larger than 10. To prevent leakage of test data, we remove images that appear in any downstream evaluation dataset (e.g., MSCOCO, Flickr30K).

Text-based filtering. We remove repeated captions and keep only English texts. Texts shorter than 3 words or longer than 100 words are discarded. Following ALIGN [1], we also remove texts that contain any rare token (outside the 100 million most frequent unigrams and bigrams in the raw dataset).

Joint image-text filtering. Although the above rules filter out much noisy data, they can hardly detect mismatched image-text pairs, where the text does not accurately describe the visual content of the image, resulting in undesirable noisy signals for vision-language pre-training. Inspired by BLIP [57], we train a discriminator as a filtering model to predict whether the text matches the image. Specifically, the filtering model consists of an image encoder and an image-grounded text encoder, which uses cross-attention to fuse image features and text features. The filtering model is trained on the CC12M dataset using an image-text contrastive loss and an image-text matching loss.

10.2 Evaluation Details↩︎

Zero-Shot Image-Text Retrieval. We evaluate the zero-shot performance of LOUPE on the image-text retrieval task over the widely used Flickr30K [46] and MSCOCO [45] datasets. The image-text retrieval consists of two subtasks: image-to-text retrieval and text-to-image retrieval, where a model is required to identify an image from candidates given a caption describing its content, or vice versa. The MSCOCO dataset consists of 123,287 images, and each image is aligned with five captions. The Flickr30K dataset contains 31,783 images and five captions for each image. Following previous works [1], [4], we evaluate the zero-shot performance on the 1K and 5K test sets of Flickr30K and MSCOCO, respectively. We take the final representation of [CLS] tokens as the global representations of images and texts, and use them to measure the image-text similarity. We first compute the similarity scores for all image-text pairs. Then, we take the top-K candidates for ranking and report the top-K retrieval results.

Zero-Shot Transfer to Object Detection. Without any fine-tuning, we evaluate the zero-shot transfer performance of LOUPE on the object detection task [47] over the COCO [45] and PASCAL VOC [51] datasets. For COCO, we use the 2017 validation split for evaluation. Previous zero-shot object detection models [52], [53], [58] follow the split proposed by [52], which consists of 48 base classes and 17 novel classes: they train on the base classes and evaluate on the novel classes. In contrast, we directly evaluate the zero-shot transfer performance on both the base and novel classes, without fine-tuning on the base classes. In total, we evaluate models on 4,836 test images that contain 33,152 instances of 65 classes. PASCAL VOC is a widely used object detection dataset with 20 object classes; we evaluate models on 9,657 instances in 5,072 images. To perform object detection, we first use the region generation module to generate a set of candidate regions and then use a prompt template (i.e., an image of [object class name].) to expand each detection label into a sentence. Next, we encode the sentence for each object class with the learned text encoder and measure its similarity with the candidate regions as the classification scores. Following most zero-shot object detection methods, we use mean Average Precision (mAP) at IoU thresholds of \(\{0.3, 0.5\}\) as evaluation metrics.
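A minimal sketch of this prompt-based zero-shot classification of candidate regions, assuming pre-computed region features and the learned text encoder; the tokenizer and encoder interfaces are placeholders.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats: torch.Tensor, class_names, text_encoder, tokenizer):
    """region_feats: (M, d) L2-normalized candidate region features.
    Returns an (M, C) matrix of classification scores over the detection classes.

    `tokenizer` and `text_encoder` are assumed to batch-tokenize the prompts
    and return (C, d) sentence-level features, respectively."""
    prompts = [f"an image of {name}." for name in class_names]   # prompt template
    tokens = tokenizer(prompts)
    text_feats = F.normalize(text_encoder(tokens), dim=-1)       # (C, d)
    return region_feats @ text_feats.t()                         # cosine similarities as scores
```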

Zero-Shot Transfer to Visual Grounding. Visual grounding [14] (also known as phrase localization and referring expression comprehension) aims to locate a specific visual region of the input image according to a language referring expression. Visual grounding can be seen as generalized object detection, where the pre-defined class labels are replaced by referring expression sentences. Without any fine-tuning, we evaluate the zero-shot transfer performance of LOUPE on the visual grounding task over the RefCOCO [14] and RefCOCO+ [14] datasets. These two datasets were collected through the ReferitGame [59], where one player writes a language expression to refer to a specific object in the image, and another player is required to locate the target object given the image and the referring expression. The RefCOCO dataset consists of 142,209 referring expressions for 50,000 objects in 19,994 images, split into train (120,624 expressions), val (10,834 expressions), testA (5,657 expressions), and testB (5,095 expressions) sets. Images in the testA set contain multiple people, while images in the testB set contain multiple objects. The RefCOCO+ dataset consists of 141,564 expressions for 49,856 objects in 19,992 images, split into train (120,191 expressions), val (10,758 expressions), testA (5,726 expressions), and testB (4,889 expressions) sets. We report the zero-shot transfer performance on the val, testA, and testB sets of both datasets.
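Grounding then reduces to ranking the candidate regions against the referring expression; a minimal sketch under the same assumed interfaces as the detection example:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ground_referring_expression(region_feats, region_boxes, expression, text_encoder):
    """Return the bounding box of the candidate region most similar to the referring expression."""
    expr_emb = F.normalize(text_encoder([expression]), dim=-1)     # [1, D]
    region_embs = F.normalize(region_feats, dim=-1)                # [R, D]
    best = (region_embs @ expr_emb.T).squeeze(-1).argmax().item()  # index of best-matching region
    return region_boxes[best]
```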

Table 3: Image captioning evaluation results on the COCO “Karpathy” test split.
Method  BLEU@4  METEOR  CIDEr  SPICE
VLP [60]  36.5  28.4  117.7  21.3
\(\mathrm{OSCAR}_{\mathrm{large}}\) [8]  37.4  30.7  127.8  23.5
\(\mathrm{VinVL}_{\mathrm{large}}\) [32]  38.5  30.4  130.8  23.4
\(\mathrm{BLIP}_{\mathrm{ViT-L}}\) [57]  40.4  -  136.7  -
\(\mathrm{LEMON}_{\mathrm{large}}\) [61]  40.6  30.4  135.7  23.5
LOUPE  40.9  31.5  137.8  24.3

11 More Experiment Results on Vision-Language Generation Task↩︎

To further validate the generalization ability of the cross-modal representations learned by LOUPE, we adapt the pre-trained LOUPE to a vision-language generation task, i.e., image captioning [25]. Image captioning is the task of describing images with natural language, which requires models to identify and describe the fine-grained semantics of images. The input images are encoded by the learned image encoder. Following BLIP [57], we train an image-grounded text decoder that shares the feed-forward layers with the learned text encoder and uses cross-attention to attend to the image features. The text decoder is trained with a language modeling loss to generate captions conditioned on the images.
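A minimal sketch of such an image-grounded decoder trained with a language-modeling (next-token) loss is given below; the module sizes and interfaces are illustrative assumptions and do not reflect the exact LOUPE implementation:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Text decoder that attends to image features through cross-attention."""
    def __init__(self, vocab_size, d_model=768, n_layers=6, n_heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, image_feats):
        # causal mask so each position only attends to previous tokens
        T = tokens.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        hidden = self.decoder(self.embed(tokens), memory=image_feats, tgt_mask=mask)
        return self.lm_head(hidden)

def language_modeling_loss(decoder, tokens, image_feats):
    """Shift the caption by one position and predict the next token at every step."""
    logits = decoder(tokens[:, :-1], image_feats)
    return nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1))
```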

We evaluate image captioning performance on the MSCOCO [45] dataset, which is split into train (113,287 images), val (5,000 images), and “Karpathy” test (5,000 images) splits. Each image has 5 captions. We use the train split to train the image-grounded text decoder and report performance on the public “Karpathy” 5K test split. Following standard practice, we use BLEU@4, METEOR, CIDEr, and SPICE as evaluation metrics. We compare our LOUPE model with recent vision-language pre-training generation models [8], [32], [57], [60], [61]. All methods are fine-tuned with cross-entropy loss only, without CIDEr optimization. As shown in Table 3, LOUPE achieves competitive performance on all metrics, which verifies the strong generalization ability of our model on downstream vision-language generation tasks.

Table: Comparison of different image encoders on zero-shot image-text retrieval (R@1) on Flickr30K and MSCOCO.
Method  Image Encoder  Flickr30K image-to-text  Flickr30K text-to-image  MSCOCO image-to-text  MSCOCO text-to-image
ALIGN [1]  EfficientNet  88.6  75.7  58.6  45.6
FILIP [4]  ViT-L  89.8  75.0  61.3  45.9
CLIP [3]  ViT-L  88.0  68.7  58.4  37.8
\(\mathrm{CLIP}^*\)  Swin-L  88.7  74.3  59.3  46.2
LOUPE  Swin-L  90.5  76.3  62.3  50.1

12 Further Analysis on the Image Encoder↩︎

In our work, we adopt Swin-L [43] as our image encoder for the following reasons. (1) The shifted windowing scheme of the Swin Transformer achieves linear computational complexity with respect to image size, which is more efficient than ViT [22]. This is particularly beneficial for vision-language pre-training, where we need to process a large number of images (240M). (2) The hierarchical architecture of the Swin Transformer is more flexible for modeling semantic regions at various scales.

To further verify that the performance gain comes from our proposed fine-grained semantically aligned vision-language pre-training framework, we implement a variant of \(\mathrm{CLIP}\) that adopts Swin-L as the image encoder (Row 4 in the table above), using the same training dataset as LOUPE. It can also be viewed as the backbone of our LOUPE (without optimization from our token-level and semantics-level Shapley interaction modeling). As shown in the table, comparing \(\mathrm{CLIP}^*\) with \(\mathrm{CLIP}\), the Swin-L image encoder does bring some improvement over \(\mathrm{CLIP}\). However, there is still a clear performance gap between \(\mathrm{CLIP}^*\) and our LOUPE. With the same architecture, LOUPE achieves an average R@1 that is 2.68 points higher than \(\mathrm{CLIP}^*\) over the two datasets. This further verifies that the main performance gain comes from our proposed fine-grained semantically aligned vision-language pre-training framework. Notably, the text-to-image retrieval results of our implementation are noticeably higher than those of \(\mathrm{CLIP}\). This phenomenon has also been observed by [1], [4] (see Row 1 and Row 2 in the table above). We suppose it might be caused by differences in training details or in the dataset collection of \(\mathrm{CLIP}\).
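For concreteness, the 2.68-point figure is the average of the four R@1 gaps between LOUPE and \(\mathrm{CLIP}^*\) in the table above (a simple check, assuming the last two rows of the table correspond to \(\mathrm{CLIP}^*\) and LOUPE):

\[
\frac{(90.5-88.7)+(76.3-74.3)+(62.3-59.3)+(50.1-46.2)}{4}=\frac{1.8+2.0+3.0+3.9}{4}=\frac{10.7}{4}\approx 2.68
\]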

13 More Qualitative Examples on Object Detection and Visual Grounding↩︎

For a more intuitive view of the performance of our model on object detection and visual grounding, we visualize more qualitative examples. Concretely, Figure 6 and Figure 7 show more object detection examples on the COCO [45] and PASCAL VOC [51] datasets. Figure 8 and Figure 9 show more visual grounding examples on the RefCOCO [14] and RefCOCO+ [14] datasets.

Figure 6: Qualitative examples of object detection on COCO Objects dataset.
Figure 7: Qualitative examples of object detection on PASCAL VOC dataset.
Figure 8: Qualitative examples of visual grounding on RefCOCO dataset.
Figure 9: Qualitative examples of visual grounding on RefCOCO+ dataset.

14 Linear Probing Evaluation↩︎

In this section, we evaluate the linear probing performance of LOUPE on image classification. Following the same evaluation setting as CLIP [3], we freeze the whole backbone network and only fine-tune the last linear classification layer, which takes the [CLS] token as input. We report the linear probing performance over 11 datasets in Table 4. Our LOUPE outperforms CLIP with an average improvement of 1.6%. Notably, on ImageNet, the largest of the 11 datasets, LOUPE surpasses CLIP by 1.8%.
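A minimal sketch of this linear-probe protocol, assuming the backbone returns the [CLS] feature; the training loop and hyperparameters are illustrative placeholders:

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, train_loader, epochs=10, lr=1e-3):
    """Freeze the pre-trained backbone and train only a linear classifier on [CLS] features."""
    for p in backbone.parameters():
        p.requires_grad = False          # freeze the whole backbone
    backbone.eval()
    head = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.AdamW(head.parameters(), lr=lr)
    for _ in range(epochs):
        for images, labels in train_loader:
            with torch.no_grad():
                cls_feat = backbone(images)          # assumed to return the [CLS] feature, [B, feat_dim]
            loss = nn.functional.cross_entropy(head(cls_feat), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return head
```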

Table 4: Linear probing performance (top-1 accuracy) over 11 datasets.
CLIP 98.0 95.2 90.9 81.8 99.2 46.4 72.9 69.4 95.1 96.5 83.9
LOUPE 97.6 96.0 92.1 82.6 99.5 49.3 70.7 80.2 95.5 97.5 85.7
Table 5: Comparison of training cost and architecture parameters.
Method  Pre-Training Image-Text Pairs  Parameters  GPUs  Days  GPU Days
CLIP  400M  425M  256 V100  12 days  3072
ALIGN  1800M  820M  1024 TPUv3  -  -
FILIP  340M  417M  192 V100  24 days  4608
LOUPE  240M  226M  128 V100  20 days  2560

15 Training Efficiency Discussion↩︎

Although our proposed Shapley interaction modeling increases the training time per iteration, it enables the model to converge in fewer total iterations by encouraging it to learn fine-grained region-phrase alignment beyond coarse image-text alignment. As shown in Table 5, LOUPE achieves the best performance while using relatively few GPU days (128 GPUs \(\times\) 20 days = 2,560 GPU days).

Indeed, the proposed Shapley interaction modeling increases the training time per iteration, but it enables our model to learn fine-grained region-phrase alignment from raw image-text pairs without any object-level human annotations, and LOUPE can then be used as a zero-shot object detector without any fine-tuning. Compared with the expensive cost of human annotations, the increased training time might be acceptable. Moreover, manually annotating the extremely diverse object categories of the real world is unscalable, if not impossible, whereas our model demonstrates a promising alternative: learning fine-grained semantics from raw text about images, which is easily available and covers a broader set of visual concepts. For example, the right case of Figure 4 in the main paper shows that LOUPE successfully recognizes the leash region and aligns it with the phrase “a leash”, even though the “leash” category has never appeared in any existing object detection dataset.

On the other hand, our method is much more efficient than methods that rely on off-the-shelf object detectors (e.g., Faster R-CNN) to extract visual features. Recent studies [4], [6] have noted that extracting visual features with object detectors greatly slows down training (about 20 FPS per GPU) and requires more GPU memory. Our model avoids this heavy burden while still identifying semantic-rich visual regions without any pre-trained detectors or human annotations.

Figure 10: Comparison of the LOUPE with existing methods.

16 Detailed Discussion with Some Related Works↩︎

In this section, we first provide a comparison (Figure 10) to highlight the key differences between our LOUPE and various existing methods. We then provide a detailed discussion of three recent works (i.e., FILIP [4], RegionCLIP [33], and X-VLM [35]) that also investigate fine-grained semantic alignment.

We highlight the key differences in Figure 10. Our LOUPE differs in that it explicitly learns fine-grained region-phrase alignment from the novel perspective of game-theoretic interactions, without resorting to any object-level human annotations or a pre-trained Region Proposal Network (RPN). Notably, human bounding-box annotations are usually limited to pre-defined object categories, and an RPN can only detect regions belonging to the pre-defined categories of the object detection datasets it was pre-trained on. Thus, methods that use human bounding-box annotations or a pre-trained RPN usually struggle to detect novel objects beyond the pre-defined categories, while LOUPE learns from large-scale raw image-text pairs, which are more scalable and contain a broader set of visual concepts.

Compared with FILIP, the advantages of our Shapley interaction modeling are mainly three-fold: 1) We argue that directly computing token-wise alignment between every patch token and word token is neither efficient nor meaningful, because an individual word token or patch token might not carry complete semantics. A semantic-rich phrase (e.g., “a girl in a blue coat”) usually consists of multiple words, and its corresponding visual region is composed of multiple patches. Also, some words (e.g., “is”, “the”) and patches (e.g., background pixels) are not meaningful. Based on this insight, our LOUPE differs as we first propose token-level Shapley interaction modeling to aggregate patches into semantically meaningful regions, and then introduce semantics-level Shapley interaction modeling to explicitly model the fine-grained semantic alignment between these regions and phrases. 2) Although FILIP computes token-wise similarity to simulate fine-grained alignment, it can only learn implicit alignment from the supervision of the image-text contrastive loss, lacking training signals that explicitly encourage semantic alignment between visual regions and textual phrases. In contrast, our Shapley interaction modeling provides explicit supervision signals (e.g., the alignment matrices visualized in Figure 4) for learning fine-grained alignment. The consistently superior performance of LOUPE over FILIP on all metrics also demonstrates the benefit of explicit fine-grained alignment learning. 3) Because it relies on implicit token-wise alignment learning, FILIP cannot be directly applied to object detection and visual grounding, while our LOUPE transfers to these tasks immediately without any fine-tuning, since the proposed Shapley interaction modeling enables our model to identify semantic regions and align them with language. As shown in Table 2, without any bounding-box annotations or fine-tuning, LOUPE achieves competitive performance across four object detection and visual grounding benchmarks.

Our LOUPE is different from RegionCLIP in the following aspects: 1) RegionCLIP uses a pre-trained Region Proposal Network (RPN) to detect regions in images. However, the RPN is usually pre-trained on pre-defined object categories (e.g., 80 classes for MSCOCO), which cannot cover the extensive object categories in a large-scale pre-training dataset. Furthermore, since the RPN places excessive demands on memory and computation, existing methods (i.e., RegionCLIP) usually fix the parameters of the RPN and treat region detection as a pre-processing step, disconnected from vision-language pre-training. Thus, the performance of RegionCLIP is also restricted by the quality of the RPN. In contrast, our LOUPE learns to identify semantic regions of images through token-level Shapley interaction modeling, which is more scalable and enables LOUPE to learn a broader set of visual concepts from the large-scale pre-training dataset. 2) RegionCLIP constructs a pool of object concepts from the image-text corpus and aligns visual regions with these concepts, which are usually individual nouns (e.g., boy, kite, bus). In contrast, our LOUPE focuses on phrases that involve rich context (e.g., “a boy running on the grass”). By aligning visual regions with phrases that contain rich semantic context, LOUPE can learn a broader set of visual concepts (e.g., objects, actions, relations) from the large-scale pre-training dataset.

As for X-VLM, the main differences are three-fold: 1) X-VLM is trained on well-annotated datasets, where regions with bounding-box annotations are provided and each region is associated with a description text. Such a setup is time-consuming to construct and hard to scale to larger raw image-text data from the web. Our LOUPE differs as it is trained on noisy image-text pairs from the Internet. 2) X-VLM takes ground-truth regions as input and is trained to predict bounding boxes under a regression loss on the ground-truth coordinates. In contrast, our LOUPE learns to identify semantic regions of images without such strong supervision signals from human annotations. 3) X-VLM has ground-truth alignment information between regions and their corresponding description texts, which provides strong supervision signals for region-text matching. By comparison, our LOUPE learns fine-grained region-phrase alignment from game-theoretic interactions.

References↩︎

[1]
C. Jia et al., “Scaling up visual and vision-language representation learning with noisy text supervision,” PMLR, 2021, pp. 4904–4916.
[2]
Y. Li et al., “Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm,” arXiv preprint arXiv:2110.05208, 2021.
[3]
A. Radford et al., “Learning transferable visual models from natural language supervision,” PMLR, 2021, pp. 8748–8763.
[4]
L. Yao et al., “FILIP: Fine-grained interactive language-image pre-training,” arXiv preprint arXiv:2111.07783, 2021.
[5]
Y.-C. Chen et al., “Uniter: Universal image-text representation learning,” Springer, 2020, pp. 104–120.
[6]
W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning, PMLR, 2021, pp. 5583–5594.
[7]
J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, “Align before fuse: Vision and language representation learning with momentum distillation,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[8]
X. Li et al., “Oscar: Object-semantics aligned pre-training for vision-language tasks,” Springer, 2020, pp. 121–137.
[9]
J. Lu, D. Batra, D. Parikh, and S. Lee, “Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks,” Advances in neural information processing systems, vol. 32, 2019.
[10]
D. Qi, L. Su, J. Song, E. Cui, T. Bharti, and A. Sacheti, “Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data,” arXiv preprint arXiv:2001.07966, 2020.
[11]
H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” arXiv preprint arXiv:1908.07490, 2019.
[12]
W. Su et al., “Vl-bert: Pre-training of generic visual-linguistic representations,” arXiv preprint arXiv:1908.08530, 2019.
[13]
A. Vaswani et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[14]
L. Yu, P. Poirson, S. Yang, A. C. Berg, and T. L. Berg, “Modeling context in referring expressions,” in European Conference on Computer Vision, Springer, 2016, pp. 69–85.
[15]
K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” PMLR, 2015, pp. 2048–2057.
[16]
M. Grabisch and M. Roubens, “An axiomatic approach to the concept of interaction among players in cooperative games,” International Journal of game theory, vol. 28, no. 4, pp. 547–565, 1999.
[17]
L. S. Shapley, “A value for n-person games,” in Contributions to the Theory of Games, vol. 2, pp. 307–317. Princeton University Press, Princeton, NJ, USA, 1953.
[18]
Y. Matsui and T. Matsui, “NP-completeness for calculating power indices of weighted majority games,” Theoretical Computer Science, vol. 263, no. 1–2, pp. 305–310, 2001.
[19]
J. Castro, D. Gómez, and J. Tejada, “Polynomial calculation of the shapley value based on sampling,” Computers & Operations Research, vol. 36, no. 5, pp. 1726–1730, 2009.
[20]
T. Brown et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
[21]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[22]
A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
[23]
K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[24]
L. Wei, L. Xie, W. Zhou, H. Li, and Q. Tian, “MVP: Multimodality-guided visual pre-training,” arXiv preprint arXiv:2203.05175, 2022.
[25]
P. Anderson et al., “Bottom-up and top-down attention for image captioning and visual question answering,” 2018, pp. 6077–6086.
[26]
S. Antol et al., “Vqa: Visual question answering,” 2015, pp. 2425–2433.
[27]
J. Li et al., “Unsupervised reinforcement learning of transferable meta-skills for embodied navigation,” 2020, pp. 12123–12132.
[28]
Z. Huang, Z. Zeng, Y. Huang, B. Liu, D. Fu, and J. Fu, “Seeing out of the box: End-to-end pre-training for vision-language representation learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 12976–12985.
[29]
Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu, “Pixel-bert: Aligning image pixels with text by deep multi-modal transformers,” arXiv preprint arXiv:2004.00849, 2020.
[30]
W. Li et al., “Unimo: Towards unified-modal understanding and generation via cross-modal contrastive learning,” arXiv preprint arXiv:2012.15409, 2020.
[31]
F. Yu et al., “Ernie-vil: Knowledge enhanced vision-language representations through scene graph,” arXiv preprint arXiv:2006.16934, 2020.
[32]
P. Zhang et al., “Vinvl: Revisiting visual representations in vision-language models,” 2021, pp. 5579–5588.
[33]
Y. Zhong et al., “Regionclip: Region-based language-image pretraining,” 2022, pp. 16793–16803.
[34]
L. H. Li et al., “Grounded language-image pre-training,” arXiv preprint arXiv:2112.03857, 2021.
[35]
Y. Zeng, X. Zhang, and H. Li, “Multi-grained vision language pre-training: Aligning texts with visual concepts,” arXiv preprint arXiv:2111.08276, 2021.
[36]
R. J. Weber, “Probabilistic values for games,” The Shapley Value. Essays in Honor of Lloyd S. Shapley, pp. 101–119, 1988.
[37]
A. Datta, S. Sen, and Y. Zick, “Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems,” in 2016 IEEE Symposium on Security and Privacy (SP), IEEE, 2016, pp. 598–617.
[38]
S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
[39]
H. Zhang, Y. Xie, L. Zheng, D. Zhang, and Q. Zhang, “Interpreting multivariate shapley interactions in dnns,” arXiv preprint arXiv:2010.05045, 2020.
[40]
J. Ren et al., “A unified game-theoretic interpretation of adversarial robustness,” arXiv preprint arXiv:2111.03536, 2021.
[41]
J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv e-prints, 2018.
[42]
A. Kendall and Y. Gal, “What uncertainties do we need in bayesian deep learning for computer vision?” Advances in neural information processing systems, vol. 30, 2017.
[43]
Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” 2021.
[44]
I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017.
[45]
T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” Springer, 2014, pp. 740–755.
[46]
B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2641–2649.
[47]
S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” Advances in neural information processing systems, vol. 28, 2015.
[48]
J. Li, G. Shakhnarovich, and R. A. Yeh, “Adapting CLIP for phrase localization without further training,” arXiv preprint arXiv:2204.03647, 2022.
[49]
R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “SLIC superpixels compared to state-of-the-art superpixel methods,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.
[50]
J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
[51]
M. Everingham et al., “The PASCAL visual object classes challenge 2007 (VOC2007) results,” 2008.
[52]
A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, “Zero-shot object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 384–400.
[53]
S. Rahman, S. Khan, and N. Barnes, “Improved visual-semantic alignment for zero-shot object detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 11932–11939.
[54]
X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” arXiv preprint arXiv:2104.13921, 2021.
[55]
A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, “Open-vocabulary object detection using captions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 14393–14402.
[56]
P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
[57]
J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” arXiv preprint arXiv:2201.12086, 2022.
[58]
P. Zhu, H. Wang, and V. Saligrama, “Don’t even look once: Synthesizing features for zero-shot detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 11693–11702.
[59]
S. Kazemzadeh, V. Ordonez, M. Matten, and T. Berg, “Referitgame: Referring to objects in photographs of natural scenes,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 787–798.
[60]
L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, “Unified vision-language pre-training for image captioning and vqa,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, vol. 34, pp. 13041–13049.
[61]
X. Hu et al., “Scaling up vision-language pre-training for image captioning,” arXiv preprint arXiv:2111.12233, 2021.

  1. Work done when interning at Huawei Cloud.↩︎

  2. Equal Contribution.↩︎

  3. Corresponding Authors.↩︎