Causal Debiasing for Visual Commonsense Reasoning


Abstract

Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.

Visual commonsense reasoning, bias analysis, OOD dataset, causal graphs, backdoor adjustment.

1 Introduction

The Visual Commonsense Reasoning (VCR) [1] task is an extension of Visual Question Answering (VQA) [2], [3] and consists of two sub-tasks: four-choice question answering (Q\(\rightarrow\)A) and reasoning (QA\(\rightarrow\)R). In the Q\(\rightarrow\)A sub-task, the model selects the correct answer from four options based on the image. In the QA\(\rightarrow\)R sub-task, the model selects the appropriate reasoning for the chosen answer.

Existing research, including task-specific models [4]–[8] and pre-training models [9]–[14], has made substantial progress on VCR. Task-specific models use attention mechanisms [15]–[17] or graph networks [18]–[20] to capture image-question correlations for accurate answering and reasoning. Pre-trained models leverage prior training on other datasets and fine-tuning on VCR, benefiting from learned relationships and enhanced visual-textual alignment [21], [22].

Despite significant advancements in VCR, researchers have found that models may rely on biases rather than reasoning capabilities to answer questions [7], [23]. This issue was first investigated by Ye et al. [24], who find that models depend heavily on co-occurrence for predictions when encountering repeated references. However, their study focuses solely on the co-occurrence of person tags and does not extensively analyze other types of bias. Subsequent research has not adequately addressed the challenge of bias mitigation [25].

To bridge this gap, our study investigates bias challenges in VCR, focusing on co-occurrence bias and statistical bias. Co-occurrence bias arises from co-occurring components between modalities. During multi-modal feature fusion, the model may pay more attention to the words shared by questions and answers, or to the co-occurrence between visual objects and answers, leading to biased predictions. To validate this, we conduct experiments using the R2C model [1] on dataset splits with and without co-occurring words between questions and answers. The results in Table 1 confirm that the baseline model achieves higher accuracy on the co-occurrence split, supporting our hypothesis.

Moreover, we observe that certain question types and image types can introduce statistical bias. For example, when questions include specific verbs, the frequency of words in the ground-truth answers may vary. In such cases, the model may rely excessively on high-frequency words and neglect visual information. To investigate this, we divide the dataset based on the frequency of words in the answers for each question type. This yields two subsets: one with high-frequency words (head-labeled) and another with low-frequency words (tail-labeled). The experiments in Table 1 show that the Q\(\rightarrow\)A accuracy is higher on the head-labeled subset, providing evidence for statistical bias.

Table 1: Accuracy (%) of the R2C model on different dataset splits, verifying the existence of the two biases.
Subsets Q\(\rightarrow\)A QA\(\rightarrow\)R
Co-occurrence 36.9 44.4
Non-co-occurrence 31.3 30.9
Head-labeled 29.6 -
Tail-labeled 27.9 -

After analyzing these biases, we introduce the out-of-distribution (OOD) VCR-OOD dataset, which addresses both co-occurrence bias and statistical bias. For the text modality, we create VCR-OOD-QA by filtering out samples with co-occurring words between questions and answers and balancing the verb frequency in answers for different question types. For the visual modality, we apply similar procedures to obtain VCR-OOD-VA.

To combat bias, we develop a debiasing model based on causal intervention [26], [27]. Given the causal graph, we depict the causal relationships in VCR and analyze the prediction shortcuts, introduced by confounding factors, that lead the model to learn bias. Employing the backdoor adjustment method, we design an answer dictionary to disrupt the propagation path of these confounding factors and eliminate bias. Backdoor adjustment allows us to account for the effect of confounding factors by conditioning on variables that block the spurious associations between the input and the output. We build the dictionary from the set of correct answers, ensuring that the model focuses on the relevant information rather than relying on shortcuts arising from biases in the training data.

In summary, our contributions can be categorized into three main areas:

  • We analyze two types of biases in VCR, including co-occurrence bias and statistical bias, and design VCR-OOD datasets.

  • In VCR, we utilize causal graphs to capture the influence of confounding factors on predictions.

  • To cut off the shortcut and alleviate bias, we introduce a debiasing method based on the backdoor adjustment to correct multi-modal features.

2 Construction of OOD Datasets

In this section, we leverage the in-distribution (ID) dataset VCR [1] to create the VCR-OOD dataset, which comprises two subsets: VCR-OOD-QA for the text modality and VCR-OOD-VA for the visual modality. Below we will introduce our processing method in detail.

For the text modality, the VCR-OOD-QA dataset is constructed with a focus on co-occurrence bias and statistical bias. We filter out samples where there are co-occurring words between the questions and the correct answers, as well as between the concatenated question-answer pairs and the correct reasoning. This process creates an auxiliary dataset. To tackle statistical bias, we first categorize the questions based on the verbs they contain, forming different question sets (\(Q_{1}\), \(Q_{2}\), etc.). We then record the corresponding correct answer sets (\(A_{1}\), \(A_{2}\), etc.) and extract the verbs or nouns from them. Based on the frequency of these words, we divide the samples for each question type into "head-labeled" (high-frequency) and "tail-labeled" (low-frequency) subsets. The division threshold, denoted as \(\alpha\), is the median word frequency. Finally, we randomly retain a threshold number of "head-labeled" samples from each question type to create the VCR-OOD-QA dataset. To maintain dataset consistency, we randomly select 10,000 training samples to balance the 1,045 validation samples.
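
To make this procedure concrete, the sketch below (an illustrative sketch, not our exact construction script) groups questions by verb with spaCy and splits each question set at the median answer-word frequency; the sample field names question and answer are assumptions made for illustration.

```python
import spacy
from collections import Counter, defaultdict

nlp = spacy.load("en_core_web_sm")

def key_word(text, pos_priority=("VERB", "NOUN")):
    """Lemma of the first verb (or, failing that, noun) found in the text."""
    doc = nlp(text)
    for pos in pos_priority:
        for tok in doc:
            if tok.pos_ == pos:
                return tok.lemma_
    return "<none>"

def build_question_sets(samples):
    """Group samples into question sets Q1, Q2, ... by the verb in each question."""
    groups = defaultdict(list)
    for s in samples:
        groups[key_word(s["question"], pos_priority=("VERB",))].append(s)
    return groups

def head_tail_split(samples):
    """Split one question set into head (high-frequency) and tail (low-frequency)
    subsets, using the median answer-word frequency as the threshold alpha."""
    words = [key_word(s["answer"]) for s in samples]
    freq = Counter(words)
    alpha = sorted(freq.values())[len(freq) // 2]   # median frequency as threshold
    head = [s for s, w in zip(samples, words) if freq[w] >= alpha]
    tail = [s for s, w in zip(samples, words) if freq[w] < alpha]
    return head, tail
```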

In the visual modality, establishing correspondence between objects and answer text can be challenging. However, since many questions describe objects present in the images, we leverage the auxiliary dataset of VCR-OOD-QA to mitigate this bias. We generate VCR-OOD-VA using a similar methodology, but categorize images based on the objects they contain rather than the question text. This results in image sets \(I_{1}\), \(I_{2}\), etc. We then apply the head-tail division and sampling approach used for VCR-OOD-QA to construct the 1,687-sample VCR-OOD-VA validation set, while the training set remains the same as VCR-OOD-QA.
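
An analogous sketch for the visual side, reusing the hypothetical head_tail_split helper above and assuming each sample stores its detected object labels in an objects field (again an illustrative assumption, not the released pipeline):

```python
def build_image_sets(samples):
    """Group samples into image sets I1, I2, ... by the objects each image contains,
    then apply the same head/tail division used for VCR-OOD-QA."""
    groups = defaultdict(list)
    for s in samples:
        for obj in set(s["objects"]):            # e.g. {"person", "car", ...}
            groups[obj].append(s)
    return {obj: head_tail_split(group) for obj, group in groups.items()}
```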

3 Our Method

To address the bias issue, we analyze the causal relationships [28] and present the shortcuts in the causal graph [29]. Then, we adopt the backdoor adjustment method [30] for debiasing.

Figure 1: (a) is the causal graph in Q\(\rightarrow\)A. (b) shows the confounder in Q\(\rightarrow\)A. (c) is our deconfounded graph in Q\(\rightarrow\)A. (d), (e) and (f) are the corresponding representations in QA\(\rightarrow\)R. The blue arrows indicate the prediction shortcuts.

Causal graphs are directed acyclic graphs (DAGs) [31] consisting of vertices (V) and edges (E) that represent causal relationships between variables. Figure 1 (a) and (d) show the causal graphs of the two sub-tasks in VCR, including the image (I), question (Q), fused features (M), predicted answer (A), concatenated question and ground-truth answer (QA), and predicted reasoning (R). The blue arrows indicate prediction shortcuts, such as I\(\rightarrow\)A. In (b) and (e), we introduce a hybrid variable \(C_{m}\) to represent the effects of co-occurrence and statistical regularity. The direct causal effect from M to A captures the influence of the image-text features on the predicted answers. However, the causal path \(M\leftarrow C_{m}\rightarrow A\) indicates that the confounding factor creates a backdoor path originating from M, leading to a spurious correlation between M and A. The predicted probability can be expressed as \(P(A|M)=\sum_{c\in C_{m}}P_{c}(A|M,c)P(c|M)\).

As shown in Figure 1 (c) and (f), we apply a causal intervention, do-calculus [32], to cut off the edge between \(C_{m}\) and \(M\). By considering all possible values of the confounding factor, we can estimate the predicted probability of a candidate answer using the backdoor adjustment: \[\begin{align} P(A|do(M))& = \sum_{c\in C_{m}}P_{c}(A|M,c)P(c)\\ \end{align}\] In typical models, the last layer is a linear softmax classifier, so the predicted probability given the multi-modal features \(m\) produced by the encoder \(f(\cdot)\) is computed as \(P(A|M) = \text{softmax}(f(m))\). To incorporate the confounding factor, we include it directly in the feature processing during the softmax calculation, resulting in \(P_{c}(A|M,c) = \text{softmax}(f(m),c)\). Then, we employ the Normalized Weighted Geometric Mean (NWGM) approximation [33] to simplify the representation of \(P(A|do(M))\). Taking into account the independence between the multi-modal features \(m\) and \(c\in C_{m}\), Formula (1) is simplified as: \[\begin{align} & P(A|do(M))\\ =& \mathbb{E}_{c\in C_{m}}[\text{softmax}(f(m),c)]\\ \approx&\text{softmax}[ \mathbb{E}_{c\in C_{m}}(f(m)+c)]\\ =&\text{softmax}[f(m)+ \mathbb{E}_{c\in C_{m}}(c)] \end{align}\] where \(\mathbb{E}_{c\in C_{m}}\) denotes the expectation with respect to \(c\in C_{m}\).

Figure 2 illustrates the framework of our model, with a primary focus on approximating the confounding factor. To achieve this, we design a dictionary \(D[C_{m}]\) of dimensions \(N\times d\), where \(N\) denotes the predetermined number of candidate answers and \(d\) is the hidden layer size of the model. Specifically, we randomly select \(N\) correct answers from the training set, feed them into the text encoder, and store the resulting features in the dictionary. We construct a corresponding dictionary for each of the two VCR sub-tasks. Thanks to the linearity of expectation, we can use \(f(m)+ \mathbb{E}_{c\in C_{m}}(c)\) to compute \(\mathbb{E}_{c\in C_{m}}(f(m)+c)\). Specifically, \(\mathbb{E}_{c\in C_{m}}(c)=\text{softmax}(L^\mathsf{T}K)\,D[C_{m}]\), where \(L = W_{1}f(m)\) and \(K = W_{2}D[C_{m}]\) are linear projections of the fused features and the dictionary, and \(W_{1}\), \(W_{2}\) are the mapping matrices.
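
As an illustration of how the dictionary-based adjustment can be wired into a model, the following PyTorch sketch (using the paper's \(N\) = 1000 and \(d\) = 512, with names such as CausalDebias, fused, and scorer being our own, and the frozen dictionary being an assumption) computes \(\mathbb{E}_{c\in C_{m}}(c)\) by attending over the dictionary and adds it to \(f(m)\) before scoring the four candidates; it is a minimal sketch, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDebias(nn.Module):
    """Approximates P(A|do(M)) ~ softmax(f(m) + E_c[c]) with an answer dictionary."""

    def __init__(self, hidden_dim=512, num_answers=1000):
        super().__init__()
        # D[C_m]: N encoded correct answers (e.g. BERT features projected to d);
        # keeping it fixed (non-trainable) is an assumption, not a stated detail.
        self.dictionary = nn.Parameter(torch.randn(num_answers, hidden_dim),
                                       requires_grad=False)
        self.w1 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # L = W1 f(m)
        self.w2 = nn.Linear(hidden_dim, hidden_dim, bias=False)  # K = W2 D[C_m]
        self.scorer = nn.Linear(hidden_dim, 1)                   # one score per candidate

    def forward(self, fused):
        # fused: (batch, 4, d), one fused image-text feature f(m) per candidate answer.
        L = self.w1(fused)                                  # (batch, 4, d)
        K = self.w2(self.dictionary)                        # (N, d)
        attn = F.softmax(L @ K.t(), dim=-1)                 # (batch, 4, N): weights over c
        expected_c = attn @ self.dictionary                 # (batch, 4, d) = E_{c}[c]
        logits = self.scorer(fused + expected_c).squeeze(-1)  # (batch, 4)
        return F.log_softmax(logits, dim=-1)                # ~ P(A|do(M)) over candidates
```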

Figure 2: The framework diagram of causal debiasing in the model. The blue arrows represent the prediction process of the backbone, and the orange arrows are the operations that introduce the do operator.

Considering that the answer dictionary is a feature of the text modality, the model may ignore the information of the visual modality. To further enhance the model’s focus on visual content, we introduce negative samples. For each sample triple \((Q, I, A)\) in Q\(\rightarrow\)A, consisting of the query, image, and answer, we randomly select another image in the batch to form a negative sample \((Q, I^-, A)\) and feed it into the model to compute the loss: \[\begin{align} Loss_{neg}=-\sum_{i=1}^4y_{i}\log(f_{base}(Q,I^{-},A)) \end{align}\] where \(i\) indexes the four candidates, \(y\) is the ground truth, and \(f_{base}\) denotes the prediction of the baseline model. For each sample triple \((QA, I, R)\) in QA\(\rightarrow\)R, Formula (3) can be rewritten as: \[\begin{align} Loss_{neg}=-\sum_{i=1}^4y_{i}\log(f_{base}(QA,I^{-},R)) \end{align}\] The main distinction between this negative loss and the baseline loss \(Loss_{base}\) [34] lies in the input image. This loss serves as a penalty that incentivizes the model to concentrate on the correct visual content. Hence, the training loss is formulated as: \[\begin{align} Loss=Loss_{base}+\lambda \cdot Loss_{neg} \end{align}\] where \(\lambda\) is a hyper-parameter.
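
A minimal sketch of this penalty, assuming a baseline f_base that returns four candidate logits for a (question, image, answers) triple; forming \((Q, I^{-}, A)\) by rolling the image batch is one simple realization and is our assumption rather than the paper's exact sampling strategy.

```python
import torch
import torch.nn.functional as F

def negative_sample_loss(f_base, questions, images, answers, labels):
    """Loss_neg = -sum_i y_i * log f_base(Q, I^-, A): score each question against a
    mismatched in-batch image I^- and compare with the ground-truth answer index."""
    neg_images = torch.roll(images, shifts=1, dims=0)    # pair each sample with another image
    neg_logits = f_base(questions, neg_images, answers)  # (batch, 4) candidate logits
    # cross_entropy applies log-softmax, matching the -sum y_i log(.) form above.
    return F.cross_entropy(neg_logits, labels)

def total_loss(loss_base, loss_neg, lam=3.0):
    """Loss = Loss_base + lambda * Loss_neg, with lambda a tunable hyper-parameter."""
    return loss_base + lam * loss_neg
```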

4 Experiments

Table 2: Accuracy (%) of baseline methods and our method on the VCR-OOD-QA validation set. The highest accuracy is shown in bold and the second-best is underlined.
Methods Q\(\rightarrow\)A QA\(\rightarrow\)R Q\(\rightarrow\)AR
R2C [1] 30.3 31.0 9.7
HGL [4] 36.3 36.7 14.7
MCC [6] 40.4 34.4 14.0
ARC [7] 33.6 37.8 14.7
MSGT [8] 37.7 38.1 15.7
R2C+ours 34.4 37.2 13.7
MSGT+ours 42.2 40.3 18.4
Table 3: Accuracy (%) of baseline methods and our method on the VCR-OOD-VA validation set. The highest accuracy is shown in bold and the second-best is underlined.
Methods Q\(\rightarrow\)A QA\(\rightarrow\)R Q\(\rightarrow\)AR
R2C [1] 32.2 34.0 10.8
HGL [4] 41.8 38.1 17.4
MCC [6] 46.6 38.9 18.5
ARC [7] 37.8 37.5 15.5
MSGT [8] 46.5 35.1 16.6
R2C+ours 34.8 38.4 15.0
MSGT+ours 47.3 40.3 19.2

4.1 Experimental Setup

We conduct extensive experiments on VCR [1] and our VCR-OOD dataset, using three commonly used evaluation metrics: Q\(\rightarrow\)A, QA\(\rightarrow\)R, and Q\(\rightarrow\)AR. The ID dataset VCR contains 290,000 four-option multiple-choice questions from 110,000 movie scenes. In the experiments, we select several VCR task-specific models as competitors: R2C [1], HGL [4], MCC [6], ARC [7], and MSGT [8]. We train and test these five classic VCR models on VCR-OOD. Since our method is model-agnostic, we add it to R2C [1] and MSGT [8] for comparison. To further verify the effectiveness of each module, we conduct ablation experiments. The spaCy toolkit [35] is used to extract verbs and nouns from the text in the VCR-OOD dataset. The answer features in the dictionary are obtained with BERT. We set the dictionary size manually, with \(N\) = 1000 and \(d\) = 512. For a fair comparison, our parameter settings follow those reported in the original papers.
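
As one possible realization of the dictionary construction described above, the sketch below encodes \(N\) randomly sampled correct answers with a HuggingFace BERT model and projects them from BERT's 768-dimensional output to the model's hidden size \(d\) = 512; the projection layer and function name are assumptions, since we only specify that the sampled answers are fed into the text encoder.

```python
import random
import torch
from transformers import BertModel, BertTokenizer

def build_answer_dictionary(correct_answers, n=1000, d=512, device="cpu"):
    """Encode n randomly sampled correct answers into an (n x d) dictionary D[C_m]."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()
    proj = torch.nn.Linear(768, d).to(device)         # BERT hidden size 768 -> model size d
    sampled = random.sample(list(correct_answers), n)
    feats = []
    with torch.no_grad():
        for ans in sampled:
            tokens = tokenizer(ans, return_tensors="pt", truncation=True).to(device)
            cls = bert(**tokens).last_hidden_state[:, 0]   # [CLS] feature, shape (1, 768)
            feats.append(proj(cls))
    return torch.cat(feats, dim=0)                     # (n, d) tensor used as D[C_m]
```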

4.2 Quantitative Results

In this section, we evaluate five task-specific models on VCR-OOD and compare our method’s performance against the baseline models R2C [1] and MSGT [8]. We also conduct experiments on the ID dataset VCR [1].

Table 4: Accuracy (%) of our method based on MSGT on the VCR validation set, which is an ID dataset rather than an OOD dataset.
Methods Q\(\rightarrow\)A QA\(\rightarrow\)R Q\(\rightarrow\)AR
R2C [1] 63.8 67.2 43.1
HGL [4] 69.4 70.6 49.1
MCC [6] 71.7 73.4 52.9
ARC [7] 70.5 73.1 51.8
MSGT [8] 72.2 73.6 53.3
MSGT+ours 71.3 72.9 52.2

Table 2 shows the results of the five baseline models and our method on VCR-OOD-QA. We find that the graph-based models HGL [4] and MSGT [8] outperform the basic R2C [1] model. While the contrastive-learning-based MCC [6] achieves the highest accuracy of 40.4% on Q\(\rightarrow\)A, it struggles on reasoning, scoring only 14.0% on Q\(\rightarrow\)AR. ARC [7] matches HGL’s [4] 14.7% on Q\(\rightarrow\)AR. Notably, MSGT [8] achieves the highest accuracies among the baselines, with 38.1% on QA\(\rightarrow\)R and 15.7% on Q\(\rightarrow\)AR. We implement our method on R2C [1] and MSGT [8] respectively. On R2C [1], applying our method improves the prediction results over the baseline on all three metrics. With MSGT [8] as the baseline, our method achieves the best results on all three metrics, with prediction accuracies of 42.2%, 40.3%, and 18.4%.

The results on VCR-OOD-VA are shown in Table 3. HGL [4] shows significant improvement over R2C [1] across all three metrics. MCC [6] achieves the highest baseline accuracies, with 46.6% on Q\(\rightarrow\)A, 38.9% on QA\(\rightarrow\)R, and 18.5% on Q\(\rightarrow\)AR, exceeding R2C [1] by 7.7 points on the latter. While the distillation in ARC [7] does not work well on this dataset, scoring only 15.5% on Q\(\rightarrow\)AR, MSGT [8] achieves 46.5% on Q\(\rightarrow\)A. Importantly, our method outperforms MSGT [8], reaching the highest accuracies of 47.3% on Q\(\rightarrow\)A, 40.3% on QA\(\rightarrow\)R, and 19.2% on Q\(\rightarrow\)AR.

On the ID dataset VCR, after adding our debiasing method, the prediction accuracy of the model is slightly lower than the baseline but remains competitive. Table 4 shows that our method achieves a prediction accuracy of 52.2% on Q\(\rightarrow\)AR. Thus, our debiasing method does not severely degrade in-distribution performance, which is consistent with other debiasing works [36]–[38] in VQA.

4.3 Ablation Study

To evaluate the effectiveness of each component and the influence of \(\lambda\) in the loss, we use MSGT [8] as the baseline and perform ablation experiments on VCR-OOD. The results of each component on VCR-OOD-QA and VCR-OOD-VA are presented in Table 5. Compared to the baseline, the model demonstrates improved performance on each sub-task when incorporating causal intervention. When both components are combined, the model achieves the best performance, with Q\(\rightarrow\)AR accuracies of 18.4% and 19.2% on VCR-OOD-QA and VCR-OOD-VA, respectively. This suggests that both components contribute to the model’s prediction accuracy.

Table 5: Ablation results (%) of causal inference and \(loss_{neg}\) on VCR-OOD, with MSGT [8] as the baseline.
Dataset Causal inference \(loss_{neg}\) Q\(\rightarrow\)A QA\(\rightarrow\)R Q\(\rightarrow\)AR
VCR-OOD-QA \(\times\) \(\times\) 37.7 38.1 15.7
VCR-OOD-QA ✔ \(\times\) 41.2 38.3 16.4
VCR-OOD-QA \(\times\) ✔ 41.9 36.3 16.0
VCR-OOD-QA ✔ ✔ 42.2 40.3 18.4
VCR-OOD-VA \(\times\) \(\times\) 46.5 35.1 16.6
VCR-OOD-VA ✔ \(\times\) 47.0 35.6 16.8
VCR-OOD-VA \(\times\) ✔ 46.3 38.6 17.9
VCR-OOD-VA ✔ ✔ 47.3 40.3 19.2

Figure 3: Performance changes with different values of the parameter \(\lambda\) on the (a) VCR-OOD-QA and (b) VCR-OOD-VA datasets.

Furthermore, we investigate the impact of the hyper-parameter \(\lambda\) in the loss function, as shown in Figure 3. The Q\(\rightarrow\)AR accuracy is computed by requiring correct predictions on both Q\(\rightarrow\)A and QA\(\rightarrow\)R under the same \(\lambda\) value. On VCR-OOD-QA, the model achieves the highest accuracy of 40.3% on QA\(\rightarrow\)R when \(\lambda\) is 4, and 42.2% on Q\(\rightarrow\)A and 18.4% on Q\(\rightarrow\)AR when \(\lambda\) is 3. On VCR-OOD-VA, the accuracy is highest on Q\(\rightarrow\)A, QA\(\rightarrow\)R, and Q\(\rightarrow\)AR when \(\lambda\) is 5, 2, and 5, respectively.

5 Conclusion

In this paper, we conduct a comprehensive analysis of the bias issue in the VCR task, considering both text and vision modalities. To evaluate the model’s generalization, we propose the VCR-OOD dataset. To address the issue of bias learning, we begin by constructing a task-specific causal graph that incorporates bias effects. By employing the backdoor adjustment method in causal intervention, we design an answer dictionary to disrupt the prediction shortcuts. Finally, we evaluate the performance of our proposed method on both ID and OOD datasets.

References

[1]
Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi, “From recognition to cognition: Visual commonsense reasoning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 6720–6731.
[2]
Chen Feng, Duolikun Danier, Fan Zhang, and David Bull, “Rankdvqa: Deep vqa based on ranking-inspired hybrid training,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 1648–1658.
[3]
Alkesh Patel, Akanksha Bindal, Hadas Kotek, Christopher Klein, and Jason Williams, “Generating natural questions from images for multimodal assistants,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 2270–2274.
[4]
Weijiang Yu, Jingwen Zhou, Weihao Yu, Xiaodan Liang, and Nong Xiao, “Heterogeneous graph learning for visual commonsense reasoning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[5]
Xi Zhang, Feifei Zhang, and Changsheng Xu, “Explicit cross-modal representation learning for visual commonsense reasoning,” IEEE Transactions on Multimedia, vol. 24, pp. 2986–2997, 2021.
[6]
Xi Zhang, Feifei Zhang, and Changsheng Xu, “Multi-level counterfactual contrast for visual commonsense reasoning,” in Proceedings of the 29th ACM International Conference on Multimedia, 2021, pp. 1793–1802.
[7]
Zhenyang Li, Yangyang Guo, Kejie Wang, Yinwei Wei, Liqiang Nie, and Mohan Kankanhalli, “Joint answering and explanation for visual commonsense reasoning,” IEEE Transactions on Image Processing, 2023.
[8]
Jian Zhu, Hanli Wang, and Bin He, “Multi-modal structure-embedding graph transformer for visual commonsense reasoning,” IEEE Transactions on Multimedia, vol. 26, pp. 1295–1305, 2023.
[9]
Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu, “Uniter: Universal image-text representation learning,” in European conference on computer vision. Springer, 2020, pp. 104–120.
[10]
Zhecan Wang, Haoxuan You, Liunian Harold Li, Alireza Zareian, Suji Park, Yiqing Liang, Kai-Wei Chang, and Shih-Fu Chang, “Sgeitl: Scene graph enhanced image-text learning for visual commonsense reasoning,” in Proceedings of the AAAI conference on artificial intelligence, 2022, vol. 36, pp. 5914–5922.
[11]
Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal, “Unifying vision-and-language tasks via text generation,” in International Conference on Machine Learning. PMLR, 2021, pp. 1931–1942.
[12]
Mengqi Yuan, Gengyun Jia, and Bing-Kun Bao, “Gpt-based knowledge guiding network for commonsense video captioning,” IEEE Transactions on Multimedia, 2023.
[13]
Mingjie Ma, Zhihuan Yu, Yichao Ma, and Guohui Li, “Eventlens: Leveraging event-aware pretraining and cross-modal linking enhances visual commonsense reasoning,” arXiv preprint arXiv:2404.13847, 2024.
[14]
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee, “Vip-llava: Making large multimodal models understand arbitrary visual prompts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 12914–12923.
[15]
Yuansheng Song and Ping Jian, “Deep hierarchical attention flow for visual commonsense reasoning,” in Natural Language Processing and Chinese Computing: 9th CCF International Conference, NLPCC 2020, Zhengzhou, China, October 14–18, 2020, Proceedings, Part I 9. Springer, 2020, pp. 16–28.
[16]
Siyu Lu, Mingzhe Liu, Lirong Yin, Zhengtong Yin, Xuan Liu, and Wenfeng Zheng, “The multi-modal fusion in visual question answering: a review of attention mechanisms,” PeerJ Computer Science, vol. 9, pp. e1400, 2023.
[17]
Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, and Andreas Bulling, “Multimodal integration of human-like attention in visual question answering,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2648–2658.
[18]
Zijie Song, Zhenzhen Hu, and Richang Hong, “Efficient and self-adaptive rationale knowledge base for visual commonsense reasoning,” Multimedia Systems, vol. 29, no. 5, pp. 3017–3026, 2023.
[19]
Yanan Wang, Michihiro Yasunaga, Hongyu Ren, Shinya Wada, and Jure Leskovec, “Vqa-gnn: Reasoning with multimodal knowledge via graph neural networks for visual question answering,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 21582–21592.
[20]
Sheng Zhou, Dan Guo, Xun Yang, Jianfeng Dong, and Meng Wang, “Graph pooling inference network for text-based vqa,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 20, no. 4, pp. 1–21, 2024.
[21]
Muhammad Jaleed Khan, John G Breslin, and Edward Curry, “Common sense knowledge infusion for visual understanding and reasoning: Approaches, challenges, and applications,” IEEE Internet Computing, vol. 26, no. 4, pp. 21–27, 2022.
[22]
Zhi-Qi Cheng, Qi Dai, Siyao Li, Teruko Mitamura, and Alexander Hauptmann, “Gsrformer: Grounded situation recognition transformer with alternate semantic attention refinement,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 3272–3281.
[23]
Ali Vosoughi, Shijian Deng, Songyang Zhang, Yapeng Tian, Chenliang Xu, and Jiebo Luo, “Cross modality bias in visual question answering: A causal view with possible worlds vqa,” IEEE Transactions on Multimedia, 2024.
[24]
Keren Ye and Adriana Kovashka, “A case study of the shortcut effects in visual commonsense reasoning,” in Proceedings of the AAAI conference on artificial intelligence, 2021, vol. 35, pp. 3181–3189.
[25]
Jungeun Kim, Jinwoo Park, Jaekwang Seok, and Junyeong Kim, “Dynamic debiasing network for visual commonsense generation,” IEEE Access, 2023.
[26]
Wei Li and Zhixin Li, “Causal-setr: A segmentation transformer variant based on causal intervention,” in Proceedings of the Asian Conference on Computer Vision, 2022, pp. 756–772.
[27]
Tan Wang, Jianqiang Huang, Hanwang Zhang, and Qianru Sun, “Visual commonsense representation learning via causal inference,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 378–379.
[28]
Xi Zhang, Feifei Zhang, and Changsheng Xu, “Reducing vision-answer biases for multiple-choice vqa,” IEEE Transactions on Image Processing, 2023.
[29]
Jiayun Zheng and Maggie Makar, “Causally motivated multi-shortcut identification and removal,” Advances in Neural Information Processing Systems, vol. 35, pp. 12800–12812, 2022.
[30]
Zaixi Zhang, Qi Liu, Zhicai Wang, Zepu Lu, and Qingyong Hu, “Backdoor defense via deconfounded representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12228–12238.
[31]
Jiaxin Qi, Yulei Niu, Jianqiang Huang, and Hanwang Zhang, “Two causal principles for improving visual dialog,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10860–10869.
[32]
Judea Pearl and Dana Mackenzie, The book of why: the new science of cause and effect, Basic books, 2018.
[33]
Jana Krejčí and Jan Stoklasa, “Aggregation in the analytic hierarchy process: Why weighted geometric mean should be used instead of weighted arithmetic mean,” Expert Systems with Applications, vol. 114, pp. 97–106, 2018.
[34]
Zhenyang Li, Yangyang Guo, Kejie Wang, Fan Liu, Liqiang Nie, and Mohan Kankanhalli, “Learning to agree on vision attention for visual commonsense reasoning,” IEEE Transactions on Multimedia, 2023.
[35]
Matthew Honnibal and Ines Montani, “spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing,” To appear, vol. 7, no. 1, pp. 411–420, 2017.
[36]
Xi Zhang, Feifei Zhang, and Changsheng Xu, “Next-ood: Overcoming dual multiple-choice vqa biases,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 4, pp. 1913–1931, 2023.
[37]
Chenji Lu, Ge Bai, Shilong Li, Ying Liu, Xiyan Liu, Zerong Zeng, and Ruifang Liu, “Causalme: Balancing bi-modalities in visual question answering,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 10556–10560.
[38]
Xinpeng Lv, Wanrong Huang, Haotian Wang, Ruochun Jin, Xueqiong Li, Zhipeng Lin, Shuman Li, Yongquan Feng, and Yuhua Tang, “Modality re-balance for visual question answering: A causal framework,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 5650–5654.

  1. *Corresponding author.

    This work was supported by the National Natural Science Foundation of China under Grants No. 62325206 and 62306150, and by the Key Research and Development Program of Jiangsu Province under Grant BE2023016-4.