April 02, 2024
Multimodal chart question-answering, crucial for applications such as financial report analysis, decision support, and invoice parsing, confronts significant challenges posed by intricate color patterns, structural complexities, and implicit numerical data in charts. Traditional methods, which mainly involve chart-to-text conversion followed by processing with Large Language Models (LLMs) or direct multimodal processing, often falter in these complex scenarios. To overcome these hurdles, this paper introduces mChartQA, a framework tailored for advanced multimodal chart question-answering. mChartQA merges the sophisticated language processing capabilities of LLMs with a state-of-the-art table-to-text engine, enabling effective processing and integration of complex visual and textual information. The framework aligns visual and textual data accurately and is further refined for deep reasoning and contextual understanding within charts. Its main contribution lies in a novel integration of multimodal data processing techniques that significantly enhances the accuracy of chart question-answering. Experimental results on three distinct datasets show that mChartQA delivers superior performance on complex, multimodal chart question-answering tasks, especially in scenarios that have challenged existing methods.
Multimodal Chart Question-Answering, Vision-Language Alignment, Natural Language Processing, Large Language Models (LLMs), Two-Stage Training
The goal of multimodal chart question answering is to automatically answer a natural language question about a chart to facilitate visual data analysis [1], where the ability to understand and interact with visual data is essential [2]. It has emerged as a crucial intersection of computer vision and natural language processing, addressing the growing demand for intelligent systems capable of interpreting complex visual data in charts [2]. Beyond its general applications, multimodal chart question-answering plays a pivotal role in sectors requiring precise and rapid analysis of visual data. In the financial domain, it is indispensable for tasks such as financial report analysis [3], decision support [4], invoice parsing [5], and contract review [6]. Similarly, in the medical field, it significantly contributes to the digitization of patient records [7], medical insurance review [8], diagnostic assistance [9], and quality control [10] of medical records.
Due to the richness and ambiguity of natural language and the complexity of visual reasoning, the multimodal chart question answering task requires predicting the answer at the intersection of information visualization, natural language processing, and human-computer interaction [1].
Early approaches applied natural language processing techniques, relying largely on heuristics or grammar-based parsing [11]–[14]. Owing to their insufficient handling of complex linguistic phenomena, over-reliance on grammatical rules, and limited depth of natural language understanding, deep learning models were subsequently introduced for understanding natural language queries about visualizations [15]–[17].
Recently, with the outstanding performance of large language models (LLMs) in natural language inference, researchers have proposed a pipeline approach that first converts chart information into textual format, known as chart-to-text, and then processes this text with LLMs [18]. Although it performs well in text-based scenarios (see Fig. 1(A)), this two-stage approach tends to struggle with complex scenarios involving colors, structures, or textless information (see Fig. 1(B)(C)(D)). The main challenge is that crucial visual details may be lost or misrepresented during the chart-to-text conversion, leading to misunderstandings or errors in interpretation. Hence, researchers have further explored multimodal alignment frameworks based on pre-trained vision-language (VL) models that directly process the visual form of charts, such as mPLUG-DocOwl [19] and Qwen-VL [20]. Despite their strong alignment capabilities on VL tasks, these models reveal limitations in complex reasoning scenarios within chart question answering, particularly when handling questions about color patterns, structural details, and numerical data, especially when charts do not explicitly display numerical values.
Given the limitations of existing methods in multimodal chart question answering, particularly in scenarios that require intricate understanding of color patterns, structural complexities, and interpretation of charts with implicit numerical data, our research is focused on developing a more adaptive and comprehensive solution. This need underscores the importance of a model capable of effectively processing and integrating diverse visual and textual information. In response, we propose a multimodal chart question answering framework, mChartQA, which leverages vision-language alignment and reasoning. This framework distinctively integrates advanced language processing techniques from large language models (LLMs) with a sophisticated table-to-text engine, enabling the transformation of complex visual elements into analytically rich formats. Through this integration, mChartQA aims to overcome the limitations of current VL models, enhancing their ability to provide accurate and contextually rich answers to complex questions about multimodal charts. The main contributions of this paper are as follows:
We propose a universal benchmark for multimodal chart question answering based on vision-language alignment and reasoning, which provides reasoning ability and word-level interpretability for the multimodal chart question-answering task.
Our proposed mChartQA aligns visual and textual data to ensure accurate interpretation of visual elements, and then applies advanced language processing techniques for deep reasoning and contextual understanding.
We compare mChartQA with existing state-of-the-art methods, and experimental results on three datasets demonstrate the effectiveness of our model in handling diverse and challenging chart question-answering tasks.
The field of chart question-answering has evolved significantly, beginning with early reliance on natural language processing (NLP) techniques and advancing towards sophisticated multimodal approaches. Initially, chart question-answering systems like Eviza [11], Orko [12], Evizeon [13] and DataTone [14] employed heuristic or grammar-based parsing techniques. While these methods provided foundational insights, they struggled with complex linguistic phenomena and heavily relied on grammatical rules. To address these limitations, more advanced models such as LEAF-QA [15], STL-CQA [16], and FigureNet [17] were developed. LEAF-QA introduced a dataset of figures/charts with question-answer pairs for figure question answering, constructed from real-world open data sources [15]. STL-CQA and FigureNet, on the other hand, focused on deep learning models for question-answering on scientific plots, pushing the boundaries in reasoning and understanding tasks [16], [17].
The emergence of benchmarks like FigureQA [21], PlotQA [22], and ChartQA [2] further highlighted the need for advanced multimodal methods in chart interpretation. These benchmarks have been instrumental in driving research towards more integrated approaches that combine linguistic and visual analysis. In response to these challenges, two main approaches have emerged in the field of chart question-answering: 1) Chart-to-Text Conversion and LLM Processing: The chart-to-text conversion approach, exemplified by DePlot [18], involves translating visual chart data into text for processing with large language models (LLMs). This method is effective for simpler visual scenarios but often struggles with the complexity of visual elements in more intricate charts. 2) Direct Multimodal Processing: In contrast, direct multimodal processing using vision-language (VL) models has gained prominence. Models like Pix2Struct [23], PaLI-3 [24], BLIP-2 [25], Qwen-VL [20], mPLUG-DocOwl [19], and UniChart [26] have been at the forefront of this research, exploring innovative architectures for chart comprehension and reasoning. However, despite their strong capabilities in VL tasks, these models often encounter limitations in complex reasoning scenarios within chart question answering.
Our model, mChartQA (Multimodal Chart Question-Answering Model), is designed for aligning and reasoning with visual and textual information from chart images and corresponding questions. The architecture, illustrated in Figure 2, includes four main components:
Vision Encoder (Ev): The Vision Encoder processes a chart image \(I\) to produce visual features \(V\). This process is formalized as \(V = E_v(I)\), capturing detailed visual information from the chart image.
Connector (C): The Connector employs a cross-attention mechanism to align the visual features \(V\) with the text encoder. This alignment is crucial for correlating visual elements with the corresponding textual data, enhancing the model’s interpretative capabilities. The cross-attention mechanism in the Connector is defined as: \[ V' = C(V) = \text{softmax}\left(\frac{(W_q V)(W_k V)^{\top}}{\sqrt{d_k}}\right)(W_v V), \] where the query \(Q = W_q V\), key \(K = W_k V\), and value \(W_v V\) are projections of \(V\) with learnable weights \(W_q\), \(W_k\), and \(W_v\), and \(d_k\) is the dimension of the key vectors (a minimal sketch of this connector is given after the component list).
Chart-to-Text Engine (T): This module converts the chart image \(I\) into a textual representation \(T'\), formalized as \(T' = T(I)\), extracting key textual elements from the chart. For instance, in Figure 2, if the bar chart lacks numerical annotations, the engine extracts only the textual information explicitly present, without estimating values from the bars.
Large Language Model (L): The Large Language Model processes the tokenized question \(Q_t\), enhanced visual features \(V'\), and tokenized textual representation \(T'_t\). The prediction process is formalized as \(A = L(Q_t, V', T'_t)\), where \(L\) integrates visual and textual information to predict the answer.
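To make the Connector concrete, the following is a minimal PyTorch sketch of the cross-attention defined above; the module name, tensor sizes, and the usage snippet are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Sketch of the Connector C: V' = softmax((W_q V)(W_k V)^T / sqrt(d_k)) (W_v V)."""

    def __init__(self, vis_dim: int, d_k: int):
        super().__init__()
        self.W_q = nn.Linear(vis_dim, d_k, bias=False)
        self.W_k = nn.Linear(vis_dim, d_k, bias=False)
        self.W_v = nn.Linear(vis_dim, d_k, bias=False)

    def forward(self, V: torch.Tensor) -> torch.Tensor:          # V: (B, N, vis_dim)
        Q, K, Val = self.W_q(V), self.W_k(V), self.W_v(V)
        scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)   # (B, N, N)
        return torch.softmax(scores, dim=-1) @ Val               # V': (B, N, d_k)

# Illustrative use: 577 patch tokens of width 1024, as produced by a CLIP ViT-L/14-336
# encoder; V' is then fed to the LLM together with Q_t and T'_t.
V = torch.randn(2, 577, 1024)
V_prime = Connector(vis_dim=1024, d_k=1024)(V)
```

In a Q-Former-style variant (cf. the "-Qformer" ablation later), a fixed set of learnable query tokens would replace \(W_q V\), shrinking \(V'\) to a constant number of tokens regardless of image resolution.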
mChartQA training is conducted in two stages:
Stage 1 - Visual-Language Alignment: This stage focuses on training the Connector to optimize the alignment of visual and textual representations. The objective function for this stage is defined as: \[ \min_{\theta_C} \mathcal{L}_{\text{alignment}}(V, Q_t; \theta_C) = -\sum_{i=1}^{N} \log P(y_i \mid V, Q_t; \theta_C), \] where \(\mathcal{L}_{\text{alignment}}\) is the cross-entropy loss, \(y_i\) are the true labels, and \(P(y_i \mid V, Q_t; \theta_C)\) is the predicted probability of the correct label.
Stage 2 - Visual-Language Reasoning: In this stage, both the Connector and the Large Language Model are trained to enhance reasoning capabilities. The final optimization objective is defined as: \[ \min_{\theta_C, \theta_L} \mathcal{L}_{\text{reasoning}}(Q_t, V', T'_t; \theta_C, \theta_L) = -\sum_{i=1}^{N} \log P(y_i \mid Q_t, V', T'_t; \theta_C, \theta_L), \] where \(\mathcal{L}_{\text{reasoning}}\) is the cross-entropy loss, \(y_i\) are the true labels, and \(P(y_i \mid Q_t, V', T'_t; \theta_C, \theta_L)\) is the predicted probability of the correct label.
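As an illustration of the two objectives, the snippet below sketches how the trainable parameter sets differ between the stages and how the shared cross-entropy loss is computed; the placeholder modules and sizes are our assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the Connector (theta_C) and the LLM (theta_L); widths are illustrative.
connector = nn.Linear(1024, 4096)      # theta_C: maps visual features to the LLM space
llm_head = nn.Linear(4096, 32000)      # theta_L: stands in for the language model

def trainable_parameters(stage: int):
    """Stage 1 updates only theta_C; stage 2 updates both theta_C and theta_L."""
    for p in llm_head.parameters():
        p.requires_grad_(stage == 2)
    return [p for p in (*connector.parameters(), *llm_head.parameters()) if p.requires_grad]

def answer_loss(visual_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over answer tokens, i.e. -sum_i log P(y_i | ...), shared by both stages."""
    logits = llm_head(connector(visual_feats))   # stands in for L(Q_t, V', T'_t)
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1))
```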
Task | Dataset | Samples Used
---|---|---
Captioning | COCO Caption [27] | 400K
Captioning | SBU [28] | 300K
Captioning | NoCaps [29] | 200K
Captioning | CC3M [30] | 200K
Captioning | ShareGPT4V [31] | 500K
Grounding | GRIT [32] | 150K
Grounding | Visual Genome [33] | 100K
Grounding | RefCOCO [34] | 50K
Grounding | RefCOCO+ [35] | 50K
Grounding | RefCOCOg [35] | 50K
Chart-to-text | ChartQA [2] | 20K
Total | | 2.02M
Stage 1 - Visual-Language Alignment: In the initial stage, we mainly focus on image-text alignment and on improving the model's fine-grained perception of images. The model is trained on data for the Captioning, Grounding, and Chart-to-text tasks, as summarized in Table 1. For the captioning task, we use 1,600,000 image-text pairs. For the grounding task, we use 400,000 samples, split into two subtasks, Caption with Grounding and Grounded Captioning. In Caption with Grounding, the model must accurately identify and describe a specific object in the image while simultaneously labeling its position (box); in Grounded Captioning, the model must describe the object given the position (box) provided in the image. These two subtasks complement each other and greatly enhance the model's ability to describe objects in images at a fine-grained level. Additionally, we incorporate the 20,882 charts provided by the ChartQA dataset for training on the chart-to-text task. Figure 3 illustrates the structure of these tasks. With this training, the model attains strong chart-to-text alignment and a deeper understanding of charts, and its fine-grained perception of images improves markedly, yielding more precise and comprehensive information for subsequent analysis and processing.
Stage 2 - Visual-Language Reasoning: In this stage, we follow methods similar to [36], [37]. We randomly sample data from several training sets: 10,000 pairs from the ChartQA training set, 140,000 pairs from the FigureQA training set, and 150,000 pairs from the PlotQA training set, totaling 300,000 pairs. Table 2 summarizes the datasets used, listing the number of chart-question pairs originally available and the number randomly extracted for our study.
Dataset | Original Number of Chart-Question Pairs | Number of Pairs Randomly Extracted |
---|---|---|
ChartQA | 28,299 | 10,000 |
FigureQA | 2,388,698 | 140,000 |
PlotQA | 28,952,641 | 150,000 |
Test Datasets: We conduct tests on the public test sets of three datasets: ChartQA, PlotQA, and FigureQA. Following the approach in [38], our test set composition is detailed below. We select a set of representative problems from these datasets to evaluate the model's performance on color, structure, and textless questions in more detail. These datasets present unique challenges, including complex color patterns, structural variety, and the graphical representation of implicit numerical data. This evaluation provides a more detailed understanding of model performance. We identify three main types of questions, with their types and example templates described below.
Color: This problem type requires an understanding of color theory, as well as the observation and analysis of color information on a chart. Example: "What is the least difference between the light blue bar and the dark blue bar?"
Structure: These problems relate to chart layout and structure, requiring an analysis and understanding of the components of a chart and its visual representation. Example: "What is the label of the third bar from the left in each group?"
Textless: These problems involve interpreting charts that contain implicit numerical data: the graphical elements lack explicit numerical values, so the model must perform numerical reasoning. To construct the test dataset, we manually selected samples containing these three types of queries from the ChartQA, PlotQA, and FigureQA test sets. Examples are shown in Figure 4. The diversity of the test dataset makes it an important criterion for evaluating the model's ability to address chart problems, especially those involving color, structure, and textless charts.
Dataset | QA Pairs | Color | Structure | Textless |
---|---|---|---|---|
ChartQA | 2,500 | 264 | 385 | 209 |
PlotQA V1 | 10,000 | — | 966 | 1,165 |
PlotQA V2 | 10,000 | — | 310 | 662 |
FigureQA Val1 | 5,000 | — | — | 1,607 |
FigureQA Val2 | 5,000 | — | — | 1,627 |
In our experiments with mChartQA, we compare it against two categories of baseline models: Few-Shot Learning Models: We include pioneering models such as GPT3 (1-Shot) [38] and GPT4 (5-Shot) [38], which leverage generative pre-trained transformers for Few-Shot learning. Additionally, FlanPaLM (540B) in both 1-Shot [18] and 1-Shot with Self-Consistency (SC) decoding [18], as well as LLaMA-2 (70B) in 1-Shot and 5-Shot configurations [38], are evaluated to highlight their Few-Shot capabilities in the context of chart question-answering. Fully-Supervised Models: This category includes specialized models such as Pix2Struct [23] and MATCHA [39], which are designed specifically for interpreting and answering questions from charts. Large Language Model (LLM)-Based Approaches like PaLI-X-OCR [40] and PaLI-X-no_OCR [40], mPLUG-DocOwl (LLaMA-7B) [41], Qwen-VL [20], and Qwen-VL-Chat [20] demonstrate the adaptability of pre-trained models to chart question-answering. Vision-Language Pretraining (VLP) models such as VL-T5-OCR [18], T5-OCR [18], and VisionTapas-OCR [18], along with ChartReader [42] and ChartT5 [43], showcase the integration of OCR capabilities for enhanced performance. Additionally, the DOMINO (70B) [38] is included to represent a state-of-the-art approach in fully supervised learning for chart question-answering. Through this array of baselines, we aim to provide a comprehensive evaluation of mChartQA’s performance, situating it within the broader landscape of chart question-answering research.
In our experiments, we have developed two main versions of the mChartQA model: mChartQA\(_{\text{Qwen}}\) and mChartQA\(_{\text{Intern-LM2}}\). The mChartQA\(_{\text{Qwen}}\) version is based on the Qwen-14B Large Language Model, while the mChartQA\(_{\text{Intern-LM2}}\) version utilizes the Intern-LM2-7B model. Both versions employ the clip-vit-large-patch14-336 as their vision encoder, ensuring advanced visual processing capabilities. The Chart-to-Text Engine across both versions is initialized with weights from DePlot, facilitating effective translation of chart images into descriptive text.
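For reference, the backbones can be instantiated roughly as follows with the Hugging Face transformers library; the repository IDs are our assumptions of the public checkpoints named above, and the glue code connecting the components is omitted.

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer, CLIPVisionModel,
                          Pix2StructForConditionalGeneration, Pix2StructProcessor)

# Vision encoder shared by both mChartQA versions (clip-vit-large-patch14-336).
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# Language backbone: Qwen-14B for mChartQA_Qwen (InternLM2-7B for mChartQA_Intern-LM2).
llm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-14B", trust_remote_code=True)

# Chart-to-Text engine initialized from DePlot (a Pix2Struct-based chart-to-table model).
deplot = Pix2StructForConditionalGeneration.from_pretrained("google/deplot")
deplot_processor = Pix2StructProcessor.from_pretrained("google/deplot")
```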
Training Configuration: 1) Visual-Language Alignment: In this stage, we use the AdamW optimizer with \(\beta_1 = 0.9\), \(\beta_2 = 0.95\), a learning rate of \(1e-6\), a weight decay of \(0.001\), and a batch size of 16. Training spans 6 epochs and focuses on aligning visual and textual representations. 2) Visual-Language Reasoning: In the reasoning stage, we keep the same optimizer and weight decay settings but adjust the learning rate to \(1e-5\) and reduce the batch size to 8. This stage lasts 8 epochs and trains both the Connector and the Large Language Model to enhance reasoning capabilities.
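A hedged sketch of the corresponding optimizer setup is shown below; `params` stands for the Connector-only (stage 1) or Connector-plus-LLM (stage 2) parameter group described earlier and is assumed to be collected elsewhere.

```python
import torch

def make_optimizer(params, stage: int) -> torch.optim.AdamW:
    """AdamW with beta1=0.9, beta2=0.95, weight decay 0.001; learning rate
    1e-6 for alignment (stage 1) and 1e-5 for reasoning (stage 2)."""
    lr = 1e-6 if stage == 1 else 1e-5
    return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.95), weight_decay=0.001)

# Batch size 16 for 6 epochs in stage 1; batch size 8 for 8 epochs in stage 2.
```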
Evaluation Metrics: Our evaluation follows the metrics used in the ChartQA, PlotQA, and FigureQA datasets. For numeric answers, a tolerance of up to 5% error is allowed. However, for textual answers, exact matching is required. Only responses that precisely match the correct answer are considered correct, ensuring accuracy in evaluation.
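The following is a small sketch of this metric under our reading of the protocol (exact match for text, up to 5% relative error for numbers); it is not the official evaluation script, and the string normalization is our assumption.

```python
def is_correct(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    """Exact match for textual answers; <=5% relative error for numeric answers."""
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        # Textual answer: exact match (whitespace/case normalization assumed here).
        return prediction.strip().lower() == target.strip().lower()
    if gold == 0.0:
        return pred == 0.0
    return abs(pred - gold) / abs(gold) <= tolerance
```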
Extended Comparative Results on the ChartQA Dataset: The columns represent the model name, performance on human-generated questions (ChartQA-human), performance on machine-generated questions (ChartQA-M), and the average performance across both types.
Evaluation Type | Model | FigureQA v1 | FigureQA v2 | FigureQA Avg | PlotQA v1 | PlotQA v2 | PlotQA Avg
---|---|---|---|---|---|---|---
Few-Shot | GPT3 (1-Shot) [18] | - | - | - | 31.60 | 42.20 | 36.90
Few-Shot | FlanPaLM (540B) (1-Shot) [18] | - | - | - | 51.30 | 44.90 | 48.10
Few-Shot | FlanPaLM (540B) (1-Shot, SC) [18] | - | - | - | 57.80 | 50.10 | 53.90
Few-Shot | LLaMA-2 (70B) (1-Shot) [38] | 55.60 | 55.70 | 55.65 | 32.50 | 43.40 | 37.90
Few-Shot | LLaMA-2 (70B) (5-Shot) [38] | 61.60 | 61.20 | 61.40 | 43.20 | 44.70 | 43.90
Supervised | Pix2Struct [23] | - | - | - | 73.20 | 71.90 | 72.60
Supervised | MATCHA [39] | 50.02 | 50.10 | 50.06 | 77.53 | 60.30 | 68.92
Supervised | ChartReader [42] | 95.50 | 95.80 | 95.65 | 78.10 | 59.30 | 68.70
Supervised | DOMINO (70B) [38] | 64.70 | 64.40 | 64.55 | 58.90 | 80.70 | 69.80
Ours | mChartQA\(_{\text{Qwen}}\) | 90.32 | 92.75 | 91.54 | 78.00 | 62.95 | 70.48
Ours | mChartQA\(_{\text{Intern-LM2}}\) | 96.06 | 96.30 | 96.18 | 78.25 | 74.79 | 76.52
In our analysis, we benchmark the mChartQA model against a diverse set of baselines, encompassing both Few-Shot Learning Models and Fully-Supervised Models. These baselines represent the state-of-the-art performance on the datasets in question, providing a rigorous context for evaluating the effectiveness of our model. The comparison includes models such as GPT3 and GPT4 for Few-Shot learning scenarios, specialized chart interpretation models like Pix2Struct and MATCHA, and LLM-Based Approaches such as PaLI-X-OCR. This diverse set of baselines ensures a comprehensive assessment of mChartQA’s performance across different learning paradigms and task complexities.
Performance on the ChartQA Dataset: As shown in Table [tab:extended_chartqa], while mChartQA demonstrates superior performance in several aspects, particularly in handling machine-generated questions with its advanced understanding of chart elements, it faces stiff competition from models like DOMINO in the fully supervised setting. The effectiveness of mChartQA is particularly notable in scenarios requiring deep semantic understanding and contextual interpretation, where it outperforms traditional models. However, in tasks where extensive pre-training on chart-specific data is beneficial, models like MATCHA and Pix2Struct show competitive advantages. This suggests that while mChartQA excels at leveraging contextual cues and integrating multimodal information, there is room for improvement in direct chart data interpretation. Future improvements will focus on this direction.
Analysis on FigureQA and PlotQA Datasets: As shown in Table 4, mChartQA’s performance on the FigureQA and PlotQA datasets is robust, showcasing its generalizability across different chart types and question formats. It achieves remarkable accuracy, especially in the Intern-LM2 version, surpassing many baselines in both Few-Shot and Fully-Supervised categories. However, the model does not always lead, with certain baselines like ChartReader showing superior performance in specific tasks, indicating the potential benefits of integrating more targeted OCR and visual feature extraction techniques. This underscores the need for ongoing refinement in the model’s approach to visual data processing and interpretation.
Comparative Results in Complex Scenarios
Comparative Analysis in Complex Scenarios: As shown in Table [tab:chartqa_mul], the comparative results in complex scenarios such as color, structure, and textless tasks showcase the capabilities of the mChartQA model. mChartQA, especially the Intern-LM2 version, outperforms existing models across these challenging scenarios. Notably, it achieves significant gains in the textless category across the different datasets, underscoring its proficiency in interpreting charts that lack explicit textual annotations. Despite these strengths, the analysis also identifies areas for further enhancement: while the model excels in textless scenarios, its performance on color- and structure-related tasks, though superior overall, still trails the highest scores achieved by other models in specific categories. This indicates that mChartQA's processing and understanding of visual elements such as color and structural components can be refined, for example by integrating more sophisticated visual processing techniques or by training on datasets with a wider variety of chart types.
Overall Performance and Future Directions: In summary, mChartQA establishes a new standard in multimodal chart question-answering, showcasing strong performance in complex scenarios such as those involving color, structure, and charts lacking textual descriptions. Despite its notable achievements, the analysis identifies potential areas for further refinement, especially in enhancing the model’s interpretative accuracy across a broader range of chart types and scenarios. Future efforts will be directed towards deepening the model’s comprehension of complex visual elements and improving its interpretative capabilities, with the goal of extending mChartQA’s leadership in the field of chart question-answering.
In this ablation study, we explore the impact of incorporating DePlot in different stages of training and testing of the mChartQA\(_{\text{Qwen}}\) model. We analyze the performance variations when DePlot is selectively applied, as shown in rows 1-3 of Table [tab:chartqa_comparison].
The mChartQA\(_{\text{Qwen}}\) model with DePlot in both the training and testing phases (row 1) achieves the highest performance, underscoring DePlot's significance in enhancing chart comprehension. The absence of DePlot in either phase (rows 2 and 3) leads to a noticeable decline in performance, particularly in complex scenarios such as textless charts. This confirms our hypothesis about DePlot's crucial role in the model's learning process and its effectiveness in handling multimodal data and complex chart structures.
Ablation Study Results on Multimodal Chart Datasets
Effect of Connector Replacement (see row 4 in Table [tab:chartqa_comparison]): Replacing our cross-attention connector with an MLP-based approach (-Qformer + MLP) results in varied performance across tasks. While the MLP connector shows improvements in specific scenarios, such as color charts in ChartQA, our original connector generally outperforms the MLP in most tasks, especially in complex reasoning scenarios. This highlights the effectiveness of our cross-attention mechanism in integrating multimodal information.
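For concreteness, the "-Qformer + MLP" variant can be thought of as replacing the cross-attention Connector with a per-token projection such as the sketch below; the widths are illustrative assumptions.

```python
import torch.nn as nn

# Illustrative MLP connector: each visual token is projected independently into the
# LLM embedding space, with no cross-token attention.
mlp_connector = nn.Sequential(
    nn.Linear(1024, 4096),  # assumed vision width -> assumed LLM width
    nn.GELU(),
    nn.Linear(4096, 4096),
)
```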
Effect of Visual Encoder Variation (see row 5 in Table [tab:chartqa_comparison]): To examine different visual encoders, we replace our ViT-448 encoder with ViT-384 (-ViT-448 + ViT-384). The results indicate that while ViT-384 is competitive on certain tasks, our ViT-448 encoder generally achieves superior results, particularly in handling complex chart structures and textless scenarios. This suggests the importance of a specialized visual encoder tailored for multimodal chart question-answering.
In this case study, we present an example that highlights the superior performance of our mChartQA model, especially in complex scenarios involving overlapping visual elements. Figure 5 illustrates this case.
This example demonstrates the ability of mChartQA (Base) to accurately interpret complex charts with significant overlap of visual elements. mChartQA is the only model among those compared that correctly identifies the chart's details. We attribute this success mainly to the multimodal architecture, in which the Vision Encoder provides precise information to the language model. The heatmap visualization confirms this. In contrast, the other models exhibit severe hallucinations, either misinterpreting the chart or providing arbitrary answers.
We delve into a detailed error analysis of our model, categorizing errors into three main types: structure, color, and textless chart questions. We identified prevalent error patterns within each category, as demonstrated in Figure 6.
Structure-Related Errors: In the first row of Figure 6, we present errors encountered in structure-related chart questions. For example, the first error was due to rounding off the answer, resulting in a slight deviation from the correct value. The second error involved a language model hallucination, leading to a misspelling. The third error was attributed to inadequate structural recognition, affecting the accuracy of the response.
Color-Related Errors: The second row in Figure 6 shows errors in color-related chart questions. The first error occurred in multi-step calculations, where a mistake at any stage could lead to an incorrect outcome. The second error was due to limited color recognition accuracy, leading to ambiguity in identifying the correct color. The third error highlighted the need for precise question interpretation, as the model’s response, although technically correct, did not align with the expected answer format.
Textless Chart Errors: The third row of Figure 6 illustrates errors in textless chart questions. These included slight deviations in numerical estimations, precision mismatches between the model’s output and the standard answer, and confusion in the model’s understanding of the question scope.
This error analysis sheds light on potential areas for refinement in multimodal chart question-answering models. The issues observed in numerical precision, language model accuracy, and recognition in structural and color aspects will be the focus of our continued optimization efforts in the future.
In this study, our mChartQA model, designed to address the complex challenges of multimodal chart question-answering, particularly excelled in scenarios involving complex color patterns, structural complexities, and interpreting charts with implicit numerical data. By leveraging a two-stage training strategy, mChartQA achieved state-of-the-art performance in diverse chart question-answering tasks, demonstrating the effectiveness of our approach. Looking ahead, we plan to further refine the visual encoder and connector components within mChartQA to enhance its multimodal chart question-answering capabilities.