mChartQA: A universal benchmark for multimodal Chart Question Answer based on Vision-Language Alignment and Reasoning


Multimodal chart question-answering, crucial for applications such as financial report analysis, decision support, and invoice parsing, confronts significant challenges with intricate color patterns, structural complexities, and implicit numerical data in charts. Traditional methods, mainly involving chart-to-text conversion followed by processing with Large Language Models (LLMs) or direct multimodal processing, often falter in these complex scenarios. To overcome these hurdles, this paper introduces mChartQA, a groundbreaking framework tailored for advanced multimodal chart question-answering. mChartQA innovatively merges sophisticated language processing capabilities of LLMs with a state-of-the-art table-to-text engine, facilitating effective processing and integration of complex visual and textual information. This framework stands out for its ability to align visual and textual data accurately and is further refined for deep reasoning and contextual understanding within charts. The AI contribution of this work lies in its novel integration of multimodal data processing techniques, significantly enhancing the accuracy of chart question-answering. Demonstrated through experimental results on three distinct datasets, mChartQA showcases superior performance in tackling complex, multimodal chart question-answering tasks, especially in scenarios that have posed challenges for existing methods.

Multimodal Chart Question-Answering,Vision-Language Alignment,Natural Language Processing,Large Language Models (LLMs),Two-Stage Training

1 Introduction

The goal of multimodal chart question answering is to automatically answer a natural language question about a chart to facilitate visual data analysis  [1], where the ability to understand and interact with visual data is essential  [2]. It has emerged as a crucial intersection of computer vision and natural language processing, addressing the growing demand for intelligent systems capable of interpreting complex visual data in charts  [2]. Beyond its general applications, multimodal chart question-answering plays a pivotal role in sectors requiring precise and rapid analysis of visual data. In the financial domain, it is indispensable for tasks such as financial report analysis  [3], decision support [4], invoice parsing  [5], and contract review [6]. Similarly, in the medical field, it significantly contributes to the digitization of patient records [7], medical insurance review [8], diagnostic assistance [9], and quality control [10] of medical records.

Figure 1: Examples of Color, Structure, and Textless Charts

Because natural language is rich and ambiguous and visual reasoning is complex, multimodal chart question answering sits at the intersection of information visualization, natural language processing, and human-computer interaction  [1].

Early approaches applied natural language processing techniques that depended largely on heuristics or grammar-based parsing  [11]–[14]. Because these approaches handled complex linguistic phenomena poorly, relied heavily on grammatical rules, and understood natural language only shallowly, deep learning models were introduced for understanding natural language queries about visualizations  [15]–[17].

Recently, given the strong performance of large language models (LLMs) in natural language inference, researchers have proposed a pipeline approach that converts chart information into a textual format, known as chart-to-text, and then processes this text with LLMs  [18]. Although it performs well in text-based scenarios (see Fig. 1 (A)), this two-stage approach tends to struggle in complex scenarios involving colors, structures, or textless information (see Fig. 1 (B)(C)(D)). The main challenge is that crucial visual details may be lost or misrepresented during the chart-to-text conversion, leading to misunderstandings or errors in interpretation. Hence, researchers have further explored multimodal alignment frameworks based on pre-trained vision-language (VL) models that directly process the visual form of charts, such as mPLUG-DocOwl [19] and Qwen-VL  [20]. Despite their strong alignment capabilities in VL tasks, these models reveal limitations in complex reasoning scenarios within chart question answering, particularly in handling questions about color patterns, structural details, and numerical data, especially when charts do not explicitly display numerical information.

Given the limitations of existing methods in multimodal chart question answering, particularly in scenarios that require intricate understanding of color patterns, structural complexities, and interpretation of charts with implicit numerical data, our research is focused on developing a more adaptive and comprehensive solution. This need underscores the importance of a model capable of effectively processing and integrating diverse visual and textual information. In response, we propose a multimodal chart question answering framework, mChartQA, which leverages vision-language alignment and reasoning. This framework distinctively integrates advanced language processing techniques from large language models (LLMs) with a sophisticated table-to-text engine, enabling the transformation of complex visual elements into analytically rich formats. Through this integration, mChartQA aims to overcome the limitations of current VL models, enhancing their ability to provide accurate and contextually rich answers to complex questions about multimodal charts. The main contributions of this paper are as follows:

  • We propose a universal benchmark for multimodal chart question answering based on vision-language alignment and reasoning, which provides reasoning ability and word-level interpretability for the multimodal chart question answering task.

  • Our proposed mChartQA aligns visual and textual data to ensure accurate interpretation of visual elements, and then applies advanced language processing techniques for deep reasoning and contextual understanding.

  • We compare mChartQA with existing state-of-the-art methods, and experimental results on three datasets demonstrate the effectiveness of our model in handling diverse and challenging chart question-answering tasks.

2 Related Work

The field of chart question-answering has evolved significantly, beginning with early reliance on natural language processing (NLP) techniques and advancing towards sophisticated multimodal approaches. Initially, chart question-answering systems like Eviza  [11], Orko  [12], Evizeon  [13] and DataTone  [14] employed heuristic or grammar-based parsing techniques. While these methods provided foundational insights, they struggled with complex linguistic phenomena and heavily relied on grammatical rules. To address these limitations, more advanced models such as LEAF-QA  [15], STL-CQA  [16], and FigureNet  [17] were developed. LEAF-QA introduced a dataset of figures/charts with question-answer pairs for figure question answering, constructed from real-world open data sources  [15]. STL-CQA and FigureNet, on the other hand, focused on deep learning models for question-answering on scientific plots, pushing the boundaries in reasoning and understanding tasks  [16], [17].

The emergence of benchmarks like FigureQA  [21], PlotQA  [22], and ChartQA  [2] further highlighted the need for advanced multimodal methods in chart interpretation. These benchmarks have been instrumental in driving research towards more integrated approaches that combine linguistic and visual analysis. In response to these challenges, two main approaches have emerged in the field of chart question-answering: 1) Chart-to-Text Conversion and LLM Processing: The chart-to-text conversion approach, exemplified by DePlot  [18], involves translating visual chart data into text for processing with large language models (LLMs). This method is effective for simpler visual scenarios but often struggles with the complexity of visual elements in more intricate charts. 2) Direct Multimodal Processing: In contrast, direct multimodal processing using vision-language (VL) models has gained prominence. Models like Pix2Struct  [23], PaLI-3  [24], BLIP-2  [25], Qwen-VL  [20], mPLUG-DocOwl  [19], and UniChart  [26] have been at the forefront of this research, exploring innovative architectures for chart comprehension and reasoning. However, despite their strong capabilities in VL tasks, these models often encounter limitations in complex reasoning scenarios within chart question answering.

3 Method

Figure 2: The training architecture and workflow of the mChartQA model.

3.1 Architecture

Our model, mChartQA (Multimodal Chart Question-Answering Model), is designed for aligning and reasoning with visual and textual information from chart images and corresponding questions. The architecture, illustrated in Figure 2, includes four main components:

Vision Encoder (\(E_v\)): The Vision Encoder processes a chart image \(I\) to produce visual features \(V\). This process is formalized as \(V = E_v(I)\), capturing detailed visual information from the chart image.

Connector (C): The Connector employs a cross-attention mechanism to align visual features \(V\) with the text encoder. The alignment process is crucial for correlating visual elements with corresponding textual data, enhancing the model’s interpretative capabilities. The cross-attention mechanism in the Connector is defined as: \[V' = C(V) = \text{softmax}\left(\frac{(W_qV)(W_kV)^T}{\sqrt{d_k}}\right)(W_vV),\] where \(W_qV\) is the query, \(W_kV\) and \(W_vV\) are the key and value projections, \(W_q\), \(W_k\), and \(W_v\) are the respective learnable weights, and \(d_k\) is the dimension of the key vectors.
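The attention formula above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes a single attention head, stores the visual tokens as rows of `V` (so the projections are applied on the right, `V @ W`), and uses hypothetical helper names (`matmul`, `softmax_rows`, `connector`).

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Swap rows and columns."""
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Apply a numerically stable softmax to every row of S."""
    out = []
    for row in S:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def connector(V, W_q, W_k, W_v):
    """Scaled dot-product attention over visual tokens (rows of V).

    Returns the attended features V' and the attention weights.
    """
    Q = matmul(V, W_q)      # queries, shape (n, d_k)
    K = matmul(V, W_k)      # keys,    shape (n, d_k)
    Val = matmul(V, W_v)    # values,  shape (n, d_v)
    d_k = len(K[0])
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, transpose(K))]
    weights = softmax_rows(scores)
    return matmul(weights, Val), weights

# Toy input: three visual tokens of dimension 2, identity projections.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
V_prime, attn = connector(V, identity, identity, identity)
```

Each row of `attn` is a probability distribution over the visual tokens, and each row of `V_prime` is the corresponding weighted average of the value vectors.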

Chart-to-Text Engine (T): This module converts the chart image \(I\) into a textual representation \(T'\), formalized as \(T' = T(I)\), extracting key textual elements from the chart. For instance, in Figure 2, if the bar chart lacks numerical annotations, the engine will only recognize and extract the definite textual information present, without performing predictions based on the bar chart.

Large Language Model (L): The Large Language Model processes the tokenized question \(Q_t\), enhanced visual features \(V'\), and tokenized textual representation \(T'_t\). The prediction process is formalized as \(A = L(Q_t, V', T'_t)\), where \(L\) integrates visual and textual information to predict the answer.
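The data flow between the four components can be summarized in a short sketch. The function below only wires the formulas \(V = E_v(I)\), \(V' = C(V)\), \(T' = T(I)\), and \(A = L(Q_t, V', T'_t)\) together; every callable passed in is a hypothetical stand-in, and the toy stubs exist only to make the example runnable.

```python
def answer_chart_question(image, question, vision_encoder, connector,
                          chart_to_text, tokenizer, llm):
    """Compose the four components in the order described above."""
    V = vision_encoder(image)        # V  = E_v(I)
    V_prime = connector(V)           # V' = C(V)
    T_prime = chart_to_text(image)   # T' = T(I)
    Q_t = tokenizer(question)        # tokenized question
    T_t = tokenizer(T_prime)         # tokenized chart text
    return llm(Q_t, V_prime, T_t)    # A  = L(Q_t, V', T'_t)

# Toy stand-ins (assumptions for illustration, not real models).
toy_answer = answer_chart_question(
    image="chart.png",
    question="Which year is highest?",
    vision_encoder=lambda img: [float(len(img))],
    connector=lambda feats: [2.0 * x for x in feats],
    chart_to_text=lambda img: "Year | Sales",
    tokenizer=str.split,
    llm=lambda q, v, t: (len(q), len(v), len(t)),
)
```

With these stubs the "answer" is just the triple of input lengths, which makes it easy to see that the language model receives all three inputs.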

3.2 Training

mChartQA training is conducted in two stages:

Stage 1 - Visual-Language Alignment: This stage focuses on training the Connector to optimize the alignment of visual and textual representations. The objective function for this stage is defined as: \[\min_{\theta_C} \mathcal{L}_{\text{alignment}}(V, Q_t; \theta_C) = -\sum_{i=1}^{N} \log P(y_i | V, Q_t; \theta_C),\] where \(\mathcal{L}_{\text{alignment}}\) is the cross-entropy loss, \(y_i\) are the true labels, and \(P(y_i | V, Q_t; \theta_C)\) is the predicted probability of the correct label.
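As a small worked illustration of this cross-entropy objective, the sketch below sums the negative log-probabilities that a model assigns to the true labels. The function name and the toy probabilities are assumptions made for the example; in the actual model the probabilities would come from the language model's output distribution.

```python
import math

def negative_log_likelihood(predicted, labels):
    """Return -sum_i log P(y_i), given one distribution per position.

    predicted: list of probability lists, one per target position
    labels:    list of indices of the true label at each position
    """
    return -sum(math.log(dist[y]) for dist, y in zip(predicted, labels))

# Two target positions with two candidate labels each.
predicted = [[0.5, 0.5], [0.25, 0.75]]
labels = [0, 1]
loss = negative_log_likelihood(predicted, labels)  # -(log 0.5 + log 0.75)
```

The loss is zero only when the model assigns probability 1 to every true label, and it grows as the probability of any true label shrinks.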

Stage 2 - Visual-Language Reasoning: In this stage, both the Connector and the Large Language Model are trained to enhance reasoning capabilities. The final optimization objective is defined as: \[\min_{\theta_C, \theta_L} \mathcal{L}_{\text{reasoning}}(Q_t, V', T'_t; \theta_C, \theta_L) = -\sum_{i=1}^{N} \log P(y_i | Q_t, V', T'_t; \theta_C, \theta_L),\] where \(\mathcal{L}_{\text{reasoning}}\) is the cross-entropy loss, \(y_i\) are the true labels, and \(P(y_i | Q_t, V', T'_t; \theta_C, \theta_L)\) is the predicted probability of the correct label.
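The two objectives differ in which parameters they optimize: \(\theta_C\) in Stage 1, and \(\theta_C\) together with \(\theta_L\) in Stage 2. A minimal sketch of that schedule follows. It assumes that every component absent from a stage's objective stays frozen, which the text implies but does not state outright; the component names are illustrative.

```python
# All components of the architecture, in the order they were introduced.
COMPONENTS = ("vision_encoder", "connector", "chart_to_text", "llm")

# Parameters updated in each stage, read off the two objectives.
TRAINABLE = {
    "alignment": {"connector"},          # min over theta_C
    "reasoning": {"connector", "llm"},   # min over theta_C and theta_L
}

def frozen_components(stage):
    """List the components that are not updated in the given stage (assumed frozen)."""
    return [name for name in COMPONENTS if name not in TRAINABLE[stage]]

stage1_frozen = frozen_components("alignment")
stage2_frozen = frozen_components("reasoning")
```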

4 Experiment

4.1 Datasets

Table 1: Details of stage one training data
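| Task | Dataset | Samples Used |
| --- | --- | --- |
| Captioning | COCO Caption [27] | 400K |
| Captioning | SBU [28] | 300K |
| Captioning | NoCaps [29] | 200K |
| Captioning | CC3M [30] | 200K |
| Captioning | ShareGPT4V [31] | 500K |
| Grounding | GRIT [32] | 150K |
| Grounding | Visual Genome [33] | 100K |
| Grounding | RefCOCO [34] | 50K |
| Grounding | RefCOCO+ [35] | 50K |
| Grounding | RefCOCOg [35] | 50K |
| Chart-to-text | ChartQA  [2] | 20K |
| Total | | 2.02M |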

Stage 1 - Visual-Language Alignment: In the initial stage, we mainly focus on achieving image-text alignment and improving the model’s fine-grained perception of images. Our model was trained using data specific to the Captioning, Grounding, and Chart-to-text tasks, as listed in Table 1. For the captioning task, we use 1,600,000 image-text pairs. For the grounding task, we use 400,000 samples, split into two subtasks: Caption with Grounding and Grounded Captioning. In the Caption with Grounding task, the model must identify and describe a specific object in the image while also labeling the object’s position (bounding box). In the Grounded Captioning task, the model must describe the object given its position (bounding box) in the image. These two tasks benefit each other and improve the model’s ability to describe objects within images in fine detail. Additionally, we incorporate the 20,882 charts provided by the ChartQA dataset for training on the chart-to-text task. Figure 3 shows the format of these tasks. With this training, the model learns chart-to-text alignment and a finer-grained perception of images, which yields more exact and comprehensive information for subsequent analysis and processing.
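The per-task totals quoted above can be checked against Table 1 with a few lines of arithmetic. The dictionary below simply transcribes the table, using its rounded 20K figure for ChartQA rather than the exact count of 20,882 charts.

```python
# Sample counts transcribed from Table 1, grouped by task.
STAGE1_DATA = {
    "Captioning": {
        "COCO Caption": 400_000, "SBU": 300_000, "NoCaps": 200_000,
        "CC3M": 200_000, "ShareGPT4V": 500_000,
    },
    "Grounding": {
        "GRIT": 150_000, "Visual Genome": 100_000, "RefCOCO": 50_000,
        "RefCOCO+": 50_000, "RefCOCOg": 50_000,
    },
    "Chart-to-text": {"ChartQA": 20_000},
}

# Sum within each task, then across tasks.
per_task = {task: sum(counts.values()) for task, counts in STAGE1_DATA.items()}
total = sum(per_task.values())
```

The captioning and grounding sums reproduce the 1,600,000 and 400,000 figures in the text, and the overall sum matches the 2.02M total in the table.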

Figure 3: Format for Stage 1 training data includes Captioning, Grounding, and Chart-to-Text tasks. The prefix sequence is in black text, while the correct label is in red text.

Stage 2 - Visual-Language Reasoning: In this stage, we use methods similar to  [36], [37]. We randomly extracted data from several training datasets for our study: 10,000 pairs from the ChartQA training dataset, 140,000 pairs from the FigureQA training dataset, and 150,000 pairs from the PlotQA training dataset, totaling 300,000 pairs. Table 2 summarizes the datasets used and the number of chart-question pairs originally available versus the number randomly extracted for our study.

Table 2: Summary of datasets and the number of chart-question pairs extracted.
| Dataset | Original Number of Chart-Question Pairs | Number of Pairs Randomly Extracted |
| --- | --- | --- |
| ChartQA | 28,299 | 10,000 |
| FigureQA | 2,388,698 | 140,000 |
| PlotQA | 28,952,641 | 150,000 |
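The random extraction summarized in Table 2 can be sketched as follows. The sampling procedure itself is not specified in the text, so the uniform sampling without replacement and the fixed seed used here are assumptions, and `subsample` is a hypothetical helper.

```python
import random

# (original pairs, extracted pairs) per dataset, transcribed from Table 2.
STAGE2_DATA = {
    "ChartQA": (28_299, 10_000),
    "FigureQA": (2_388_698, 140_000),
    "PlotQA": (28_952_641, 150_000),
}

def subsample(pairs, k, seed=0):
    """Draw k items uniformly without replacement (seed is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)

total_extracted = sum(extracted for _, extracted in STAGE2_DATA.values())
sampling_fraction = {
    name: extracted / original
    for name, (original, extracted) in STAGE2_DATA.items()
}

# Toy usage on a stand-in list of 100 chart-question pairs.
toy_subset = subsample(list(range(100)), 10)
```

The sampling fractions differ widely across datasets: roughly a third of ChartQA is used, but well under one percent of PlotQA.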

Test Datasets: We conduct tests on the public test sets of three datasets: ChartQA, PlotQA, and FigureQA. Similar to the approach in  [38], our test set composition is detailed below. We selected a set of representative problems from these datasets to evaluate the model’s performance on color, structure, and textless problems in more detail. These datasets present unique challenges, including the complexity of color patterns, structures, and the graphical interpretation of implicit numerical data. Evaluating on these subsets gives a more detailed picture of model performance. We identify three main types of questions, described below with example templates.

Color: This problem type requires an understanding of color theory, as well as the observation and analysis of color information on a chart. Example: "What is the least difference between the light blue bar and the dark blue bar?"

Structure: These problems relate to chart layout and structure, requiring an analysis and understanding of the components of a chart and its visual representation. Example: "What is the label of the third bar from the left in each group?"

Textless: This problem type involves interpreting charts that contain implicit numerical data, where the graphical elements lack precise numerical values and the model must reason about the numbers itself.

To construct the test dataset, we manually selected samples containing these three types of queries from the ChartQA, PlotQA, and FigureQA test datasets. Examples are shown in Figure 4. The diversity of the test dataset makes it possible to assess the model’s ability to address chart problems, especially those involving color, structure, and textless charts, and it serves as an important evaluation criterion.

Table 3: Comprehensive Distribution of QA Pairs and Problem Types Across Test Datasets
Dataset QA Pairs Color Structure Textless
ChartQA 2,500 264 385 209
PlotQA V1 10,000 966 1,165
PlotQA V2 10,000 310 662
FigureQA Val1 5,000 1,607
FigureQA Val2 5,000 1,627

Figure 4: Example test dataset extracted from the ChartQA, PlotQA, and FigureQA datasets, with the test set and example type displayed in green.

4.2 Baselines

In our experiments with mChartQA, we compare it against two categories of baseline models.

Few-Shot Learning Models: We include pioneering models such as GPT3 (1-Shot)  [38] and GPT4 (5-Shot)  [38], which leverage generative pre-trained transformers for Few-Shot learning. Additionally, FlanPaLM (540B) in both 1-Shot  [18] and 1-Shot with Self-Consistency (SC) decoding  [18], as well as LLaMA-2 (70B) in 1-Shot and 5-Shot configurations  [38], are evaluated to highlight their Few-Shot capabilities in the context of chart question-answering.

Fully-Supervised Models: This category includes specialized models such as Pix2Struct  [23] and MATCHA  [39], which are designed specifically for interpreting and answering questions from charts. Large Language Model (LLM)-Based Approaches like PaLI-X-OCR  [40] and PaLI-X-no_OCR  [40], mPLUG-DocOwl (LLaMA-7B)  [41], Qwen-VL [20], and Qwen-VL-Chat  [20] demonstrate the adaptability of pre-trained models to chart question-answering. Vision-Language Pretraining (VLP) models such as VL-T5-OCR  [18], T5-OCR  [18], and VisionTapas-OCR  [18], along with ChartReader  [42] and ChartT5  [43], showcase the integration of OCR capabilities for enhanced performance. Additionally, DOMINO (70B)  [38] is included to represent a state-of-the-art approach in fully supervised learning for chart question-answering. Through this array of baselines, we aim to provide a comprehensive evaluation of mChartQA’s performance, situating it within the broader landscape of chart question-answering research.

4.3 Experimental Setting

In our experiments, we have developed two main versions of the mChartQA model: mChartQA\(_{\text{Qwen}}\) and mChartQA\(_{\text{Intern-LM2}}\). The mChartQA\(_{\text{Qwen}}\) version is based on the Qwen-14B Large Language Model, while the mChartQA\(_{\text{Intern-LM2}}\) version utilizes the Intern-LM2-7B model. Both versions employ the clip-vit-large-patch14-336 as their vision encoder, ensuring advanced visual processing capabilities. The Chart-to-Text Engine across both versions is initialized with weights from DePlot, facilitating effective translation of chart images into descriptive text.

Training Configuration: 1) Visual-Language Alignment: In this stage, we use the AdamW optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.95\), a learning rate of \(1 \times 10^{-6}\), weight decay of \(0.001\), and a batch size of 16. Training spans 6 epochs and focuses on aligning visual and textual representations. 2) Visual-Language Reasoning: In the reasoning stage, we keep the same optimizer and weight decay settings but raise the learning rate to \(1 \times 10^{-5}\) and reduce the batch size to 8. This stage lasts 8 epochs and trains both the Connector and the Large Language Model to enhance reasoning capabilities.
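For reference, the two training configurations can be written down as a small config object. This is a sketch of the stated hyperparameters only. The class and field names are hypothetical, and it assumes that "the same optimizer settings" in the second stage includes the AdamW betas.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one training stage (AdamW optimizer)."""
    name: str
    trainable: tuple          # components updated in this stage
    learning_rate: float
    batch_size: int
    epochs: int
    beta1: float = 0.9        # shared across stages (assumed for stage 2)
    beta2: float = 0.95
    weight_decay: float = 0.001

ALIGNMENT = StageConfig("alignment", ("connector",), 1e-6, 16, 6)
REASONING = StageConfig("reasoning", ("connector", "llm"), 1e-5, 8, 8)
```

Keeping the shared values as defaults makes the differences between the stages explicit: the learning rate, batch size, epoch count, and set of trainable components.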

Evaluation Metrics: Our evaluation follows the metrics used in the ChartQA, PlotQA, and FigureQA datasets. For numeric answers, a tolerance of up to 5% error is allowed. However, for textual answers, exact matching is required. Only responses that precisely match the correct answer are considered correct, ensuring accuracy in evaluation.
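The scoring rule described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the official evaluation script of any of the three datasets: answers are treated as numeric when both strings parse as floats, the tolerance is relative to the target value, a zero target requires an exact zero, and no text normalization is applied before exact matching.

```python
def is_correct(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Numeric answers may deviate by up to 5%; text must match exactly."""
    try:
        pred_val = float(prediction)
        target_val = float(target)
    except ValueError:
        # At least one side is not numeric: fall back to exact string match.
        return prediction == target
    if target_val == 0:
        # Relative error is undefined for a zero target (assumed: exact match).
        return pred_val == 0
    return abs(pred_val - target_val) / abs(target_val) <= tolerance

def accuracy(pairs):
    """Fraction of (prediction, target) pairs judged correct."""
    return sum(is_correct(p, t) for p, t in pairs) / len(pairs)

# 4% error is accepted, 6% error is rejected, text must match exactly.
score = accuracy([("104", "100"), ("106", "100"), ("Canada", "Canada")])
```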

4.4 Main Results

Extended Comparative Results on the ChartQA Dataset: The columns represent the model name, performance on human-generated questions (ChartQA-human), performance on machine-generated questions (ChartQA-M), and the average performance across both types.

Table 4: Model Performance on FigureQA and PlotQA Datasets: Accuracy scores for Few-Shot and Supervised models on FigureQA v1, v2, and their average (FigureQA Avg), followed by PlotQA v1, v2, and their average (PlotQA Avg).
| Evaluation Type | Model | FigureQA v1 | FigureQA v2 | FigureQA Avg | PlotQA v1 | PlotQA v2 | PlotQA Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Few-Shot | GPT3 (1-Shot)  [18] | - | - | - | 31.60 | 42.20 | 36.90 |
| Few-Shot | FlanPaLM (540B) (1-Shot)  [18] | - | - | - | 51.30 | 44.90 | 48.10 |
| Few-Shot | FlanPaLM (540B) (1-Shot, SC)  [18] | - | - | - | 57.80 | 50.10 | 53.90 |
| Few-Shot | LLaMA-2 (70B) (1-Shot)  [38] | 55.60 | 55.70 | 55.65 | 32.50 | 43.40 | 37.90 |
| Few-Shot | LLaMA-2 (70B) (5-Shot)  [38] | 61.60 | 61.20 | 61.40 | 43.20 | 44.70 | 43.90 |
| Supervised | Pix2Struct  [23] | - | - | - | 73.20 | 71.90 | 72.60 |
| Supervised | MATCHA  [39] | 50.02 | 50.10 | 50.06 | 77.53 | 60.30 | 68.92 |
| Supervised | ChartReader  [42] | 95.50 | 95.80 | 95.65 | 78.10 | 59.30 | 68.70 |
| Supervised | DOMINO(70B)  [38] | 64.70 | 64.40 | 64.55 | 58.90 | 80.70 | 69.80 |
| Ours | mChartQA\(_{\text{Qwen}}\) | 90.32 | 92.75 | 91.54 | 78.00 | 62.95 | 70.48 |
| Ours | mChartQA\(_{\text{Intern-LM2}}\) | 96.06 | 96.30 | 96.18 | 78.25 | 74.79 | 76.52 |

In our analysis, we benchmark the mChartQA model against a diverse set of baselines, encompassing both Few-Shot Learning Models and Fully-Supervised Models. These baselines represent the state-of-the-art performance on the datasets in question, providing a rigorous context for evaluating the effectiveness of our model. The comparison includes models such as GPT3 and GPT4 for Few-Shot learning scenarios, specialized chart interpretation models like Pix2Struct and MATCHA, and LLM-Based Approaches such as PaLI-X-OCR. This diverse set of baselines ensures a comprehensive assessment of mChartQA’s performance across different learning paradigms and task complexities.

Performance on the ChartQA Dataset: As shown in the extended comparative results table for ChartQA, while mChartQA demonstrates superior performance in several aspects, particularly in handling machine-generated questions with its advanced understanding of chart elements, it faces stiff competition from models like DOMINO in the fully supervised setting. The effectiveness of mChartQA is particularly notable in scenarios requiring deep semantic understanding and contextual interpretation, where it outperforms traditional models. However, in tasks where extensive pre-training on chart-specific data is beneficial, models like MATCHA and Pix2Struct show competitive advantages. This suggests that while mChartQA excels in leveraging contextual cues and integrating multimodal information, there is room for improvement in direct chart data interpretation. Future improvements will focus on this direction.

Analysis on FigureQA and PlotQA Datasets: As shown in Table 4, mChartQA’s performance on the FigureQA and PlotQA datasets is robust, showcasing its generalizability across different chart types and question formats. It achieves remarkable accuracy, especially in the Intern-LM2 version, surpassing many baselines in both Few-Shot and Fully-Supervised categories. However, the model does not always lead, with certain baselines like ChartReader showing superior performance in specific tasks, indicating the potential benefits of integrating more targeted OCR and visual feature extraction techniques. This underscores the need for ongoing refinement in the model’s approach to visual data processing and interpretation.

Comparative Results in Complex Scenarios

Comparative Analysis in Complex Scenarios: As shown in the table of comparative results in complex scenarios, the results on color, structure, and textless tasks showcase the capabilities of the mChartQA model. The mChartQA model, especially the Intern-LM2 version, demonstrates a remarkable ability to outperform existing models across these challenging scenarios. Notably, the model achieves significant advancements in the textless category across different datasets, which underscores its proficiency in interpreting charts that lack explicit textual annotations. Despite these strengths, the analysis also uncovers areas where mChartQA could be further enhanced. For instance, while the model shows exceptional performance in textless scenarios, its performance in color and structure-related tasks, though superior, suggests there is still room for improvement. This is particularly evident when compared to the highest scores achieved by other models in specific categories, indicating that mChartQA’s approach to processing and understanding visual elements such as color and structural components can be refined. The comparative performance also highlights the potential for mChartQA to benefit from more targeted improvements in its handling of complex chart elements. For example, the slight underperformance in certain scenarios compared to the very best model outcomes suggests that integrating more sophisticated visual processing techniques or enhancing the model’s training on datasets with a wider variety of chart types could yield further improvements.

Overall Performance and Future Directions: In summary, mChartQA establishes a new standard in multimodal chart question-answering, showcasing strong performance in complex scenarios such as those involving color, structure, and charts lacking textual descriptions. Despite its notable achievements, the analysis identifies potential areas for further refinement, especially in enhancing the model’s interpretative accuracy across a broader range of chart types and scenarios. Future efforts will be directed towards deepening the model’s comprehension of complex visual elements and improving its interpretative capabilities, with the goal of extending mChartQA’s leadership in the field of chart question-answering.

4.5 Ablation Study

In this ablation study, we explored the impact of incorporating DePlot in different stages of training and testing in the mChartQA\(_{\text{Qwen}}\) model. We analyzed the performance variations by selectively applying DePlot, as shown in rows 1-3 of the ablation study results table.

The mChartQA\(_{\text{Qwen}}\) model with DePlot in both training and testing phases (row 1) demonstrates the highest performance, underscoring DePlot’s significance in enhancing chart comprehension. The absence of DePlot in either phase (rows 2 and 3) leads to a noticeable decline in performance, particularly in complex scenarios like textless charts. This confirms our hypothesis about DePlot’s crucial role in the model’s learning process and its effectiveness in handling multimodal data and complex chart structures.

Ablation Study Results on Multimodal Chart Datasets

4.6 Further Analysis

Effect of Connector Replacement (see row 4 of the ablation study results table): Replacing our cross-attention connector with an MLP-based approach (-Qformer + MLP) resulted in varied performance across different tasks. While the MLP connector showed some improvements in specific scenarios, such as color charts in ChartQA, our original connector generally outperformed the MLP in most tasks, especially in complex reasoning scenarios. This highlights the effectiveness of our cross-attention mechanism in integrating multimodal information.

Effect of Visual Encoder Variation (see row 5 of the ablation study results table): Experimenting with different visual encoders, we replaced our ViT-448 encoder with ViT-384 (-ViT448 + ViT384). The results indicate that while ViT-384 performs competitively in certain tasks, our ViT-448 encoder generally achieves superior results, particularly in handling complex chart structures and textless scenarios. This suggests the importance of a specialized visual encoder tailored for multimodal chart question-answering.

4.7 Case Study

Figure 5: Case study example.

In this case study, we present an example that highlights the superior performance of our mChartQA model, especially in complex scenarios involving overlapping visual elements. Figure 5 illustrates this case.

This example demonstrates mChartQA (Base)’s ability to accurately interpret complex charts with significant visual element overlap. mChartQA is the only model among its comparisons to correctly identify the chart’s details. We speculate that this success is mainly due to the multimodal architecture, where the Vision Encoder provides precise information to the language encoder. The heatmap visualization confirms our speculation. In contrast, other models exhibit severe hallucinations, either misinterpreting or providing arbitrary answers.

5 Error Analysis

We delve into a detailed error analysis of our model, categorizing errors into three main types: structure, color, and textless chart questions. We identified prevalent error patterns within each category, as demonstrated in Figure 6.

Figure 6: Examples of errors in structure, color, and textless chart questions.

Structure-Related Errors: In the first row of Figure 6, we present errors encountered in structure-related chart questions. For example, the first error was due to rounding off the answer, resulting in a slight deviation from the correct value. The second error involved a language model hallucination, leading to a misspelling. The third error was attributed to inadequate structural recognition, affecting the accuracy of the response.

Color-Related Errors: The second row in Figure 6 shows errors in color-related chart questions. The first error occurred in multi-step calculations, where a mistake at any stage could lead to an incorrect outcome. The second error was due to limited color recognition accuracy, leading to ambiguity in identifying the correct color. The third error highlighted the need for precise question interpretation, as the model’s response, although technically correct, did not align with the expected answer format.

Textless Chart Errors: The third row of Figure 6 illustrates errors in textless chart questions. These included slight deviations in numerical estimations, precision mismatches between the model’s output and the standard answer, and confusion in the model’s understanding of the question scope.

This error analysis sheds light on potential areas for refinement in multimodal chart question-answering models. The issues observed in numerical precision, language model accuracy, and recognition in structural and color aspects will be the focus of our continued optimization efforts in the future.

6 Conclusion

In this study, our mChartQA model, designed to address the complex challenges of multimodal chart question-answering, particularly excelled in scenarios involving complex color patterns, structural complexities, and interpreting charts with implicit numerical data. By leveraging a two-stage training strategy, mChartQA achieved state-of-the-art performance in diverse chart question-answering tasks, demonstrating the effectiveness of our approach. Looking ahead, we plan to further refine the visual encoder and connector components within mChartQA to enhance its multimodal chart question-answering capabilities.


Hoque, E., Kavehzadeh, P., Masry, A., 2022. Chart question answering: State of the art and future directions, in: Computer Graphics Forum, Wiley Online Library. pp. 555–572.
Masry, A., Do, X.L., Tan, J.Q., Joty, S., Hoque, E., 2022. ChartQA: A benchmark for question answering about charts with visual and logical reasoning, in: Muresan, S., Nakov, P., Villavicencio, A.(Eds.), Findings of the Association for Computational Linguistics: ACL 2022, Association for Computational Linguistics, Dublin, Ireland. pp. 2263–2279.
Wang, G., Ma, J., Chen, G., 2023a. Attentive statement fraud detection: Distinguishing multimodal financial data with fine-grained attention. Decision Support Systems167, 113913.
Kafle, K., Shrestha, R., Cohen, S., Price, B., Kanan, C., 2020. Answering questions about data visualizations using efficient bimodal fusion, in: Proceedings of the IEEE/CVF Winter conference on applications of computer vision, pp. 1498–1507.
Gerling, C., Lessmann, S., 2023. Multimodal document analytics for banking process automation. arXiv preprint arXiv:2307.11845.
Jie, W., Chen, Q., Wang, J., Koe, A.S.V., Li, J., Huang, P., Wu, Y., Wang, Y., 2023. A novel extended multimodal ai framework towards vulnerability detection in smart contracts. Information Sciences 636, 118907.
Xu, Z., So, D.R., Dai, A.M., 2021. Mufasa: Multimodal fusion architecture search for electronic health records, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 10532–10540.
Meskó, B., 2023. The impact of multimodal large language models on health care’s future. Journal of Medical Internet Research 25, e52865.
Othmani, A., Zeghina, A.O., 2022. A multimodal computer-aided diagnostic system for depression relapse prediction using audiovisual cues: A proof of concept. Healthcare Analytics 2, 100090.
Schilcher, J., Nilsson, A., Andlid, O., Eklund, A., 2024. Fusion of electronic health records and radiographic images for a multimodal deep learning prediction model of atypical femur fractures. Computers in Biology and Medicine 168, 107704.
Setlur, V., Battersby, S.E., Tory, M., Gossweiler, R., Chang, A.X., 2016. Eviza: A natural language interface for visual analysis, in: Proceedings of the 29th annual symposium on user interface software and technology, pp. 365–377.
Srinivasan, A., Stasko, J., 2017. Orko: Facilitating multimodal interaction for visual exploration and analysis of networks. IEEE transactions on visualization and computer graphics 24, 511–521.
Hoque, E., Setlur, V., Tory, M., Dykeman, I., 2017. Applying pragmatics principles for interaction with visual analytics. IEEE transactions on visualization and computer graphics 24, 309–318.
Gao, T., Dontcheva, M., Adar, E., Liu, Z., Karahalios, K.G., 2015. Datatone: Managing ambiguity in natural language interfaces for data visualization, in: Proceedings of the 28th annual acm symposium on user interface software & technology, pp. 489–500.
Chaudhry, R., Shekhar, S., Gupta, U., Maneriker, P., Bansal, P., Joshi, A., 2020. Leaf-qa: Locate, encode & attend for figure question answering, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3512–3521.
Singh, H., Shekhar, S., 2020. Stl-cqa: Structure-based transformers with localization and encoding for chart question answering, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 3275–3284.
Reddy, R., Ramesh, R., Deshpande, A., Khapra, M.M., 2019. Figurenet: A deep learning model for question-answering on scientific plots, in: 2019 International Joint Conference on Neural Networks (IJCNN), IEEE. pp. 1–8.
Liu, F., Eisenschlos, J., Piccinno, F., Krichene, S., Pang, C., Lee, K., Joshi, M., Chen, W., Collier, N., Altun, Y., 2023a. DePlot: One-shot visual language reasoning by plot-to-table translation, in: Rogers, A., Boyd-Graber, J., Okazaki, N. (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada. pp. 10381–10399.
Ye, J., Hu, A., Xu, H., Ye, Q., Yan, M., Dan, Y., Zhao, C., Xu, G., Li, C., Tian, J., et al., 2023a. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499.
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J., 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.
Kahou, S.E., Michalski, V., Atkinson, A., Kádár, Á., Trischler, A., Bengio, Y., 2017. Figureqa: An annotated figure dataset for visual reasoning. arXiv preprint arXiv:1710.07300.
Methani, N., Ganguly, P., Khapra, M.M., Kumar, P., 2020. Plotqa: Reasoning over scientific plots, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1527–1536.
Lee, K., Joshi, M., Turc, I.R., Hu, H., Liu, F., Eisenschlos, J.M., Khandelwal, U., Shaw, P., Chang, M.W., Toutanova, K., 2023. Pix2struct: Screenshot parsing as pretraining for visual language understanding, in: International Conference on Machine Learning, PMLR. pp. 18893–18912.
Chen, X., Wang, X., Beyer, L., Kolesnikov, A., Wu, J., Voigtlaender, P., Mustafa, B., Goodman, S., Alabdulmohsin, I., Padlewski, P., et al., 2023c. Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199.
Li, J., Li, D., Savarese, S., Hoi, S., 2023. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, in: International conference on machine learning, PMLR. pp. 19730–19742.
Masry, A., Kavehzadeh, P., Do, X.L., Hoque, E., Joty, S., 2023. Unichart: A universal vision-language pretrained model for chart comprehension and reasoning. arXiv preprint arXiv:2305.14761.
Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L., 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Ordonez, V., Kulkarni, G., Berg, T., 2011. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems 24.
Agrawal, H., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., Lee, S., Anderson, P., 2019. Nocaps: Novel object captioning at scale, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 8948–8957.
Sharma, P., Ding, N., Goodman, S., Soricut, R., 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565.
Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., Lin, D., 2023a. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793.
Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F., 2023. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al., 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision 123, 32–73.
Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T., 2014. Referitgame: Referring to objects in photographs of natural scenes, in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798.
Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K., 2016. Generation and comprehension of unambiguous object descriptions, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 11–20.
Xia, R., Zhang, B., Ye, H., Yan, X., Liu, Q., Zhou, H., Chen, Z., Dou, M., Shi, B., Yan, J., et al., 2024. Chartx & chartvlm: A versatile benchmark and foundation model for complicated chart reasoning. arXiv preprint arXiv:2402.12185.
Meng, F., Shao, W., Lu, Q., Gao, P., Zhang, K., Qiao, Y., Luo, P., 2024. Chartassisstant: A universal chart multimodal language model via chart-to-table pre-training and multitask instruction tuning. arXiv preprint arXiv:2401.02384.
Wang, P., Golovneva, O., Aghajanyan, A., Ren, X., Chen, M., Celikyilmaz, A., Fazel-Zarandi, M., 2023b. Domino: A dual-system for multi-step visual language reasoning. arXiv preprint arXiv:2310.02804.
Liu, F., Piccinno, F., Krichene, S., Pang, C., Lee, K., Joshi, M., Altun, Y., Collier, N., Eisenschlos, J., 2023b. MatCha: Enhancing visual language pretraining with math reasoning and chart derendering, in: Rogers, A., Boyd-Graber, J., Okazaki, N. (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada. pp. 12756–12770.
Chen, X., Djolonga, J., Padlewski, P., Mustafa, B., Changpinyo, S., Wu, J., Ruiz, C.R., Goodman, S., Wang, X., Tay, Y., et al., 2023b. Pali-x: On scaling up a multilingual vision and language model. arXiv preprint arXiv:2305.18565.
Ye, J., Hu, A., Xu, H., Ye, Q., Yan, M., Dan, Y., Zhao, C., Xu, G., Li, C., Tian, J., et al., 2023b. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499.
Cheng, Z.Q., Dai, Q., Hauptmann, A.G., 2023. Chartreader: A unified framework for chart derendering and comprehension without heuristic rules, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 22202–22213.
Zhou, M., Fung, Y., Chen, L., Thomas, C., Ji, H., Chang, S.F., 2023. Enhanced chart understanding via visual language pre-training on plot table pairs, in: The 61st Annual Meeting Of The Association For Computational Linguistics, p. 22202.