Exploring OCR-augmented Generation for Bilingual VQA

JoonHo Lee1, Sunho Park

KL-Net, South Korea


Abstract

We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.

1 Introduction

Optical Character Recognition (OCR) interprets text from visual inputs for applications such as accessibility, business automation, and robotics. The task requires understanding the spatial layout, semantic content, and inter-component relationships of text [1], [2]. Despite the progress, traditional OCR pipelines based on text detection and recognition exhibit limitations in scalability and human-level understanding [3].

In this work, we explore the limits of OCR-augmented generation with Vision Language Models (VLMs). Recent advancements in VLMs show OCR performance competitive with traditional pipelines, and their semantic knowledge offers promising avenues toward end-to-end, OCR-capable agents [4]–[8]. In comparison to OCR-free document understanding models [9], [10], VLMs can also use their conversational abilities to directly address the downstream task at hand.

Figure 1: OCR model comparison on the validation set of the KLOCR data. KLOCR not only achieves state-of-the-art accuracy on the benchmark, but also exhibits the best accuracy-speed tradeoff.

We investigate OCR-augmented generation for Visual Question Answering (VQA) in English and Korean, aiming to promote research on multilingual models. We provide KOCRBench, a novel Korean OCR benchmark, and KLOCR, a robust bilingual OCR baseline. Our contribution lies in exploring the impact of OCR in providing additional context to VLMs, and we anticipate that the benchmark and OCR model will encourage further research. Extensive experiments show that OCR significantly boosts performance, indicating room for further improvement by VLMs. Overall, our findings show that the presence of character-accurate key information is the most crucial factor for model success. Model and code are available at https://github.com/JHLee0513/KLOCR.

2 Related Work

2.1 Text Recognition

Text recognition [11]–[15] forms the core algorithm behind OCR. [14] demonstrated scaling laws in OCR on common English benchmarks [16]–[21]. We follow this insight to collect large-scale training data for KLOCR.

2.2 Scene Text Detection

Scene Text Detection [22]–[26] identifies text regions as bounding boxes, assisting recognition and improving spatial understanding. We integrate KLOCR with the PaddleOCR [27], [28] implementation of DBNet [25] for our experiments.
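For concreteness, the two-stage pipeline can be organized roughly as in the sketch below. This is a minimal illustration rather than our exact implementation: the detector (e.g. the PaddleOCR DBNet model) and the KLOCR recognizer are treated as injected callables, since their exact APIs depend on the deployment, and the axis-aligned (left, top, right, bottom) box format is an assumption.

from PIL import Image

# Minimal two-stage OCR sketch: a scene text detector proposes boxes and a
# recognizer (e.g. KLOCR) transcribes each cropped region. Both models are
# passed in as callables; the box format is assumed, not prescribed.
def run_ocr(image_path, detect_boxes, recognize_text):
    image = Image.open(image_path).convert("RGB")
    results = []
    for (x1, y1, x2, y2) in detect_boxes(image):  # assumed (left, top, right, bottom)
        crop = image.crop((x1, y1, x2, y2))
        results.append(((x1, y1, x2, y2), recognize_text(crop)))
    return results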

2.3 Document Structure Analysis

Document Structure Analysis enhances OCR by identifying the structure of the text, such as reading order, text types, and layout. Prior work includes structure analysis [29], [30], table detection and recognition [31]–[34], reading order detection [35], and semantic structure analysis [36]. Despite their strong in-domain accuracy, these models require a significant amount of densely annotated data and show limited performance on out-of-domain samples [37].

2.4 Key Information Extraction

Key Information Extraction (KIE) [38], [39] focuses on extracting queried information rather than converting the entire visual input to text. Public benchmarks such as FUNSD [40] and SROIE [41] verify the extraction capabilities of pipelines and models on receipts, records, and other documents. As many applications rely on this task, we include it in KOCRBench.

2.5 Vision Language Models

Vision Language Models are general-purpose models trained on large amounts of image and text data for conversational vision-language tasks [42]–[48]. Their recent applications in vision-language tasks and even embodied AI demonstrate their wide range of capabilities [49], [50].

3 KLOCR: Open Source Bilingual OCR Model

[14] demonstrated scaling laws in OCR, achieving state-of-the-art performance on six common English benchmarks by training a transformer-based model on a large-scale dataset. Following this insight, we train the Korean Language Optical Character Recognition (KLOCR) model on a bilingual dataset of 100M instances2, achieving competitive performance on English and state-of-the-art accuracy on Korean.

3.1 Data

We curate a diverse mixture of English and Korean OCR data, varying in text length and image domain. Table 1 describes our final composition, where most of the data is sourced from multilingual datasets made publicly available on AIHub. We combine SynthTIGER-v1.1 [51] and PixParse [52], and generate 3M additional multi-line, multi-word samples to increase data variety. We split the final collection into an approximate 80-20 split for training and testing. Figure 2 highlights several samples from our mixture. We share the AIHub dataset details, licensing information, and pre-processing steps in Appendix 8.

Table 1: KLOCR Data Mixture. \(\dagger\)We generate additional data by running the SynthTIGER data engine with text from the AIHub datasets. After the validation split, we have approximately 100M training samples.
Type Lang Dataset Instances
Real Ko+En AIHub 100M
Real En PixParse 7.2M
Real En Union14M 3.2M
Synth En SynthTIGER 10M
Synth Ko+En SynthTIGER\(\dagger\) 3M
Real En UberText 0.1M
Real En TextOcr 0.7M
Real En CocoText 0.07M
Mixed Ko+En Total 124.3M
Figure 2: Samples from KLOCR data mixture. The data collection is bilingual and varies across multiple domains (e.g. documents, road signs, handwriting).

3.2 Model

We finetune TrOCR [12] pretrained on a custom synthetic dataset generated with the SynthTIGER engine [53]. The model uses DeiT [54] as its encoder and RoBERTa [55] as its decoder. At 55M parameters, the model runs in real time (20+ FPS) on a desktop GPU.
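As an illustration, a TrOCR-style checkpoint such as KLOCR can be run for single-crop inference with the Hugging Face transformers API roughly as follows. The checkpoint path is a placeholder for the released weights, and we assume a compatible processor (image processor and tokenizer) is bundled with them; this is a minimal sketch, not the exact inference code.

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Placeholder path: substitute the released KLOCR weights from the repository.
checkpoint = "path/to/klocr-checkpoint"
processor = TrOCRProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint)

# Recognize a single cropped text region.
image = Image.open("text_crop.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values, max_new_tokens=64)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])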

3.3 Training

We trained KLOCR for two epochs using two RTX A6000 GPUs, with a batch size of 64 per GPU. We use the AdamW [56] optimizer with a fixed learning rate of \(5\times10^{-7}\) to avoid drifting too far from the initialized weights. Since most of the samples are clear, high-quality images, we found data augmentation (random rotation, random brightness, CoarseDropOut [57], [58]) beneficial to model generalization. The training run finished in approximately 500 GPU-hours, and we estimate the total development cost of the model to have been below 3000 GPU-hours.
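A minimal sketch of the augmentation pipeline described above is shown below using torchvision; the parameter values are illustrative assumptions rather than the exact settings used to train KLOCR, and RandomErasing stands in for CoarseDropOut [57], [58].

from torchvision import transforms

# Random rotation and brightness jitter operate on PIL images; RandomErasing
# (a CoarseDropOut-style occlusion) operates on tensors, so it follows ToTensor.
train_transforms = transforms.Compose([
    transforms.RandomRotation(degrees=5),
    transforms.ColorJitter(brightness=0.2),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.3, scale=(0.01, 0.05)),
])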

4 Visual Question Answering with OCR-Augmented Reasoning

We consider Base as a baseline method, where we prompt the VLM with the input image and query without any additional context. In comparison, OCR-based prompting, which we denote as OCR, prompts the VLM with OCR-extracted text as additional context. We follow a format similar to [46] but omit the bounding box coordinates.
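The exact prompt wording is not critical to our findings; the sketch below illustrates how OCR-extracted text can be prepended to the query in the OCR setting, following the format of [46] without bounding boxes. The template text itself is an illustrative assumption, not the verbatim prompt.

def build_ocr_prompt(question: str, ocr_tokens: list[str]) -> str:
    # Prepend OCR-extracted text as additional context, without coordinates.
    ocr_context = " ".join(ocr_tokens)
    return "Reference OCR tokens: " + ocr_context + "\n" + question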

Figure 3: Sample images from the KOCRBench dataset. We collect various samples from KLOCR data mixture and repurpose samples from KVQA to create (image, question, answer) triplets. The dataset covers various scenarios with road signs, product images, and documents. Images have been resized for visualization purposes.

4.1 KOCRBench: Korean VQA Benchmark

We curated KOCRBench to test VLMs’ ability to handle visual question answering in Korean. Following the design of prior work on English OCR [4], [6], [59], [60], we collected 250 questions from public sources spanning 248 input images. Specifically, a portion of the benchmark is repurposed from the Korean Localization of Visual Question Answering for Blind People (KVQA) [61] dataset with reinforced annotations. We generated the majority of samples by selecting raw images from the holdout data of the KLOCR data mixture and annotating them manually. We created annotations for 4 tasks: text recognition (22 samples), scene VQA (70 samples), document VQA (29 samples), and key information extraction (129 samples). The number of samples per task was based on our internal assessment of the importance of each task in real business processes.
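An illustrative benchmark entry is shown below; the field names and file layout are assumptions meant only to convey the (image, question, answer) structure, and the released benchmark schema may differ (see the repository).

# Hypothetical KOCRBench entry; see the repository for the released schema.
sample = {
    "image": "kocrbench/images/receipt_0042.png",   # hypothetical path
    "task": "kie",                                   # recognition / scene / document / kie
    "question": "영수증의 총 결제 금액은 얼마인가요?",  # "What is the total amount paid on the receipt?"
    "answer": "23,000원",                            # "23,000 won"
}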

5 Experiments

Table 2: Character Error Rate and word accuracy on the Korean OCR benchmark. KLOCR demonstrates significantly better performance than other open source models. \(\dagger\) denotes a variant trained with the additional Union14M-L dataset, matching its data distribution more closely to the common English benchmarks. *The model from [62] is trained only on English data and therefore shows high error.
Method CER\(\downarrow\) Word Accuracy\(\uparrow\)
CLIP4STR-L* 125.2% 9.0%
Surya 60.9% 55.3%
PaddleOCR 49.6% 32.6%
PORORO 30.0% 53.1%
TrOCR 27.0% 49.4%
KLOCR 2.34% 94.6%
Table 3: Word accuracy on English benchmarks. Avg is the total average accuracy across all samples from the benchmarks. The CLIP4STR-L* model trained by [62] includes the training splits of the benchmark data in its training set. Despite not targeting the English benchmarks and using a much smaller model, KLOCR remains competitive.
Method IC13 IIIT5k SVT CUTE80 IC15 SVTP Avg
TrOCR 66.86 59.07 60.43 45.83 49.48 49.46 55.19
PORORO 78.30 64.30 56.57 47.57 45.33 46.05 56.35
Surya 82.73 71.50 74.19 44.79 64.00 64.19 69.48
KLOCR 95.92 86.50 93.20 91.67 84.87 87.91 88.13
CLIP4STR-L* 99.42 99.13 98.61 99.65 92.6 98.13 97.42

5.1 Implementation Details

We used vLLM [63] to host the VLMs on our hardware and hosted the OCR models on the same machine or on a separate machine with an RTX A1000 GPU. We conducted our experiments in PyTorch [64].
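As a reference for reproduction, OCR-augmented queries can be issued against a vLLM server through its OpenAI-compatible endpoint roughly as in the sketch below. The model name, port, image path, and prompt text are illustrative assumptions, not our exact configuration.

import base64
from openai import OpenAI

# Query a vLLM OpenAI-compatible server with an image plus OCR context.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("document.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Reference OCR tokens: ...\nWhat is the total amount on the receipt?"},
        ],
    }],
)
print(response.choices[0].message.content)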

5.2 OCR Benchmarks

Table 2 provides an evaluation of currently available open source OCR models on Korean OCR. KLOCR outperforms prior models by a significant margin, achieving 94.6% word accuracy and a 2.34% character error rate. The performance gap between TrOCR and KLOCR, despite the two sharing the same architecture, highlights the importance of scaling up OCR data. As expected, the CLIP4STR model by [62] does not handle Korean and therefore achieves low accuracy.

Table 3 provides an evaluation on the six common English benchmarks. KLOCR demonstrates comparable performance without any in-domain training data, demonstrating the scale and variety of its training mixture. KLOCR significantly outperforms prior OCR models focusing on Korean. As a reference point, we include the CLIP4STR-L model trained by [62], which includes the training subset of the benchmark data in its training and evidently achieves the highest performance.
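For clarity, the character error rate (CER) and word accuracy in Tables 2 and 3 can be computed as in the minimal sketch below, assuming the standard definitions (CER as character-level Levenshtein distance divided by reference length, word accuracy as exact-match rate over samples); this is an illustrative implementation rather than the exact evaluation script.

def cer(pred: str, ref: str) -> float:
    # Character error rate: Levenshtein distance over characters / reference length.
    dp = list(range(len(ref) + 1))
    for i, p in enumerate(pred, 1):
        prev, dp[0] = dp[0], i
        for j, r in enumerate(ref, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (p != r))
    return dp[-1] / max(len(ref), 1)

def word_accuracy(preds: list[str], refs: list[str]) -> float:
    # Fraction of samples whose prediction exactly matches the reference.
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)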

5.3 Multilingual VQA

Table 4: KOCRBench performance comparison. For models with both base and instruction-tuned variants available, the instruction-tuned variant is tested. \(\dagger\) Due to memory constraints, we run the AWQ-quantized model.
Model Prompt Recognition Scene Document KIE Total
Qwen2.5-VL-7B Base 22 66 16 94 198
Qwen2.5-VL-7B OCR 21 65 22 104 212
InternVL2.5-7B Base 16 46 5 20 87
InternVL2.5-7B OCR 19 52 10 81 162
Qwen2.5-VL-32B-Instruct\(\dagger\) Base 21 60 20 75 176
Qwen2.5-VL-32B-Instruct\(\dagger\) OCR 20 61 21 103 205
Gemini 2.0 Flash Base 20 65 22 93 200
Gemini 2.0 Flash OCR 19 64 23 97 203
gemini-2.5-flash-preview-04-17 Base 21 70 20 71 182
gemini-2.5-flash-preview-04-17 OCR 19 69 22 102 212

As described in Section 4, we compare Base and OCR prompting. Table 4 shows the benchmark results across 5 models: Qwen-VL 2.5 7B and 32B [42], InternVL 2.5 7B [65], Gemini 2.0 Flash, and Gemini 2.5 Flash [45]. The chosen models have shown competitive performance on the English benchmarks and also provide multilingual support. The Gemini models are included to provide a reference point for commercially available models.

The addition of OCR-extracted information significantly improves accuracy for all models, aligning with the findings of [62]. The largest improvements are observed for smaller models with weaker base performance, such as InternVL, indicating that the models use the OCR information to correct their responses. Notably, we observe very strong base performance from Qwen-VL 2.5 7B despite its smaller size, suggesting that the Qwen training mixture likely contains substantial multilingual data.

Our results indicate the largest performance improvement in Key Information Extraction, highlighting the usefulness of OCR’s accurate character recognition. This also implies that VLMs have yet to resolve spelling errors, especially on unusual or semantically meaningless words and obscure jargon.

6 Discussion

We further discuss the applicability of OCR-augmented generation with a set of ablation studies. When is OCR useful? While KLOCR has shown robust performance and significantly boosted VLMs’ performance in VQA, the trade-off between training OCR models and finetuning VLMs to improve their OCR ability should be weighed carefully. Results on English [62] and Korean indicate that OCR can play a crucial role in assisting VLMs, especially for models with low base performance. It is also possible to finetune the VLMs directly on the OCR data, albeit with potential forgetting of other abilities. Meanwhile, training large-scale OCR models for low-resource languages is difficult, and resolving this issue remains a challenge for both VLMs and OCR models.

Impact of OCR accuracy on VLMs We verify the effectiveness of OCR-augmented generation by testing Qwen-VL 2.5 7B and InternVL 2.5 7B using KLOCR and TrOCR as the OCR extraction model. Results in Table 5 clearly indicate that an improvement in OCR also leads to an improvement in VLM responses, while stronger models such as Qwen 2.5 show greater robustness against OCR errors.

Table 5: Ablation study on the OCR model. Using a more powerful OCR model (KLOCR) improves the overall score. R, S, D, and K denote Recognition, Scene, Document, and KIE, respectively.
VLM OCR R S D K Total
InternVL TrOCR 18 54 8 47 127
InternVL KLOCR 19 52 10 81 162
Qwen 2.5 TrOCR 19 68 23 92 202
Qwen 2.5 KLOCR 21 65 22 104 212

KOCRBench error analysis Our results on KOCRBench reveal the following VLM weaknesses:

  1. Counting: Counting has been a challenging task for both LLMs and VLMs [66], and it is no exception in this case. As illustrated by the example in Figure 4, counting is a common source of error.

  2. Character-level precision: Observations show that misspelling and punctuation errors are the most common sources of error. While OCR-augmented generation generally alleviates this issue as observed in Table 4, the approach may still struggle with edge cases.

  3. Refusing to answer: We observe several instances of refusal to answer, where the VLM determines the question is unanswerable; such cases are more frequent with long context.

Figure 4: Example failure case of miscounting. Blue text indicates translated text for context. Boxed areas with red text highlight the three applications written on the form. When asked to count the number of applicants in the form, VLMs often mistakenly list 5 valid applicants instead of 3.
Table 6: Ablation study on applying test-time scaling. Both methods are fed the OCR tokens as additional context. Scores in parentheses indicate what the model would have received if punctuation errors were not counted. Column abbreviations follow Table 5.
Gemini 2.5 R S D K Total
Flash 19 69 22 102 212
Thinking 21 70 23 70(95) 184(209)

Does test-time scaling improve OCR-augmented generation? We investigate whether test-time scaling [67]–[69] improves OCR-augmented generation. Open source vision language models did not yet support reasoning in conjunction with vision at the time of our experiments, so we run our experiments on gemini-2.5-flash-preview-04-17, which supports reasoning with its "thinking" option. Results in Table 6 indicate that reasoning does not improve VQA capabilities, in particular due to a significant drop in KIE performance. Closer analysis showed that the model made more punctuation and spelling errors with thinking enabled, and ignored the OCR information more often than the non-thinking variant. The punctuation errors in this case are mostly spacing errors specific to the Korean language. We manually checked the incorrect KIE answers and found that 25 errors were caused by spacing alone. Had the score not accounted for this type of error, the model would have scored 209, much closer to the non-thinking variant. Our findings therefore indicate that reasoning models still have room for improvement in multilingual VQA.

7 Conclusion

We introduced KOCRBench, a collection of text-oriented visual question answering data for benchmarking Korean VQA toward multilingual visual understanding. Using the benchmark and our released KLOCR model, we ran extensive experiments to explore the benefits and limitations of OCR-augmented generation for VQA. We observe that OCR most benefits the models by assisting them with precise character recognition. Our results indicate room for improving VLMs toward more precise recognition and an accurate representation of documents.

Limitations

KLOCR While the 100M dataset is large-scale and publicly sourced, it relies heavily on AIHub and SynthTIGER. Although AIHub is only a data platform and its underlying data sources are independent, we expect more robustness if other sources (e.g. the web) could be used and if more synthetic data and other large datasets (e.g. REBU-Syn) were integrated. Due to the increasing scale and compute requirements, we leave this to future work. Additionally, as the focus of KLOCR is on its bilingual ability, no tuning has been done to achieve state-of-the-art performance on English. Lastly, we leave expansion to other languages, especially low-resource ones, to future work.

KOCRBench KOCRBench captures various tasks in different domain scenarios, but its modest size of 250 questions does not capture model performance as comprehensively as the massive English VQA benchmarks. We aim to continue curating data to expand the benchmark and to experiment with synthetic dataset creation to reduce the burden of manual labeling. We hope our efforts encourage other researchers to contribute to expanding multilingual VQA benchmarks.

8 KLOCR Data Details

8.1 Mixture

We report the exact datasets used from AIHub in the table below.

Dataset Source
Public Administrative Documents Link
OCR Data (Public Services) Link
Finance Documents Data Link
Korean Font Images Link
OCR Data (Handwriting OCR Data) Link
Various Korean Characters OCR Link
OCR Data (Financial and Logistics) Link

8.2 Data Processing

Figure 5: KLOCR data processing.

Figure 5 illustrates the pre-processing pipeline. We preprocess a dataset only if its images are not already cropped into regions of interest (ROIs). Given the annotation JSON with bounding boxes and corresponding text labels, we acquire the cropped images and save the processed (image, text) pairs.
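A minimal sketch of this cropping step is shown below; the annotation field names ("bbox", "text") and the [x, y, width, height] box format are assumptions for illustration, as the exact JSON schema varies across the source datasets.

import json
from PIL import Image

def crop_pairs(image_path: str, annotation_path: str):
    # Crop annotated text regions and pair each crop with its text label.
    image = Image.open(image_path).convert("RGB")
    with open(annotation_path, encoding="utf-8") as f:
        annotations = json.load(f)
    pairs = []
    for ann in annotations:
        x, y, w, h = ann["bbox"]            # assumed [x, y, width, height]
        crop = image.crop((x, y, x + w, y + h))
        pairs.append((crop, ann["text"]))   # processed (image, text) pair
    return pairs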

For train-test splits, we used existing splits for the public datasets and generated a random split if the dataset did not provide one.

8.3 AIHub Data License Details

Disclaimer: the authors are not affiliated with AIHub or with any data from AIHub.

The data from AIHub has been released for open public use, including but not limited to commercial and non-commercial purposes in the research and development of AI. To control data usage, downloading the data from AIHub requires an account. For further information, please refer to their policy page.

References

[1]
Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, and Ron Litman. 2024. https://arxiv.org/abs/2412.08746. Preprint, arXiv:2412.08746.
[2]
Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. 2024. https://doi.org/10.18653/v1/2024.acl-long.463. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8529–8548, Bangkok, Thailand. Association for Computational Linguistics.
[3]
Haoran Wei, Chenglong Liu, Jinyue Chen, Jia Wang, Lingyu Kong, Yanming Xu, Zheng Ge, Liang Zhao, Jianjian Sun, Yuang Peng, Chunrui Han, and Xiangyu Zhang. 2024. https://openreview.net/forum?id=3LOcwfB4JX.
[4]
Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. 2021. https://doi.org/10.1109/WACV48630.2021.00225. In 2021 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2199–2208.
[5]
Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. https://doi.org/10.18653/v1/2022.findings-acl.177. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland. Association for Computational Linguistics.
[6]
Yuliang Liu, Zhang Li, Mingxin Huang, Biao Yang, Wenwen Yu, Chunyuan Li, Xu-Cheng Yin, Cheng-Lin Liu, Lianwen Jin, and Xiang Bai. 2024. https://doi.org/10.1007/s11432-024-4235-6. Science China Information Sciences, 67(12).
[7]
Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, Chunhui Lin, Wanqing Li, Mohamad Fitri Faiz Bin Mahmood, Hao Feng, Zhen Zhao, Yanjie Wang, Yuliang Liu, Hao Liu, Xiang Bai, and Can Huang. 2024. https://arxiv.org/abs/2405.11985. Preprint, arXiv:2405.11985.
[8]
Alan Thomas, Robert Gaizauskas, and Haiping Lu. 2024. https://aclanthology.org/2024.lt4hala-1.14/. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pages 116–121, Torino, Italia. ELRA and ICCL.
[9]
Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. 2022. Ocr-free document understanding transformer. In European Conference on Computer Vision (ECCV).
[10]
Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. 2023. https://arxiv.org/abs/2308.13418. Preprint, arXiv:2308.13418.
[11]
Baoguang Shi, Xiang Bai, and Cong Yao. 2017. https://doi.org/10.1109/TPAMI.2016.2646371. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2298–2304.
[12]
Minghao Li, Tengchao Lv, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. 2021. https://arxiv.org/abs/2109.10282. Preprint, arXiv:2109.10282.
[13]
Yongkun Du, Zhineng Chen, Caiyan Jia, Xiaoting Yin, Tianlun Zheng, Chenxia Li, Yuning Du, and Yu-Gang Jiang. 2022. https://arxiv.org/abs/2205.00159. Preprint, arXiv:2205.00159.
[14]
Miao Rang, Zhenni Bi, Chuanjian Liu, Yunhe Wang, and Kai Han. 2024. An empirical study of scaling law for scene text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15619–15629.
[15]
Shuai Zhao, Ruijie Quan, Linchao Zhu, and Yi Yang. 2024. https://doi.org/10.1109/TIP.2024.3512354. IEEE Transactions on Image Processing, pages 1–1.
[16]
Kai Wang, Boris Babenko, and Serge Belongie. 2011. https://doi.org/10.1109/ICCV.2011.6126402. In Proceedings of the 2011 International Conference on Computer Vision, ICCV ’11, page 1457–1464, USA. IEEE Computer Society.
[17]
Anand Mishra, Karteek Alahari, and C. Jawahar. 2012. https://doi.org/10.5244/C.26.127.
[18]
Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez I. Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. 2013. https://doi.org/10.1109/ICDAR.2013.221. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR 2013), pages 1484–1493.
[19]
Trung Phan, Palaiahnakote Shivakumara, Shuangxuan Tian, and Chew Lim Tan. 2013. https://doi.org/10.1109/ICCV.2013.76. pages 569–576.
[20]
Anhar Risnumawan, Palaiahnakote Shivakumara, Chee Seng Chan, and Chew Lim Tan. 2014. https://api.semanticscholar.org/CorpusID:15559857. Expert Syst. Appl., 41:8027–8048.
[21]
Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, Faisal Shafait, Seiichi Uchida, and Ernest Valveny. 2015. https://doi.org/10.1109/ICDAR.2015.7333942. In 13th IAPR International Conference on Document Analysis and Recognition (ICDAR 2015), pages 1156–1160. IEEE Computer Society.
[22]
Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019. https://arxiv.org/abs/1904.01941. CoRR, abs/1904.01941.
[23]
Minghui Liao, Zhaoyi Wan, Cong Yao, Kai Chen, and Xiang Bai. 2020. Real-time scene text detection with differentiable binarization. In Proc. AAAI.
[24]
Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao. 2022. https://arxiv.org/abs/2207.04491. Preprint, arXiv:2207.04491.
[25]
Minghui Liao, Zhisheng Zou, Zhaoyi Wan, Cong Yao, and Xiang Bai. 2022. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[26]
Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao. 2023. Dptext-detr: Towards better scene text detection with dynamic points in transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 3241–3249.
[27]
Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, and Haoshuang Wang. 2020. https://arxiv.org/abs/2009.09941. CoRR, abs/2009.09941.
[28]
Chenxia Li, Weiwei Liu, Ruoyu Guo, Xiaoting Yin, Kaitao Jiang, Yongkun Du, Yuning Du, Lingfeng Zhu, Baohua Lai, Xiaoguang Hu, Dianhai Yu, and Yanjun Ma. 2022. https://arxiv.org/abs/2206.03001. Preprint, arXiv:2206.03001.
[29]
Birgit Pfitzmann, Christoph Auer, Michele Dolfi, Ahmed S Nassar, and Peter W J Staar. 2022. https://doi.org/10.1145/3534678.353904.
[30]
Cheng Da, Chuwei Luo, Qi Zheng, and Cong Yao. 2023. Vision grid transformer for document layout analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 19462–19472.
[31]
Brandon Smock, Rohith Pesala, and Robin Abraham. 2022. PubTables-1M: Towards comprehensive table extraction from unstructured documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4634–4642.
[32]
Anthony Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari Raji Balasubramaniyan, and Duen Horng Chau. 2023. High-performance transformers for table structure recognition need early convolutions. In NeurIPS 2023 Second Table Representation Learning Workshop.
[33]
ShengYun Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, and Duen Horng Chau. 2024. Unitable: Towards a unified framework for table structure recognition via self-supervised pretraining. arXiv preprint.
[34]
ShengYun Peng, Seongmin Lee, Xiaojing Wang, Rajarajeswari Balasubramaniyan, and Duen Horng Chau. 2024. Self-supervised pretraining for table structure recognition transformer. arXiv preprint.
[35]
Zilong Wang, Yiheng Xu, Lei Cui, Jingbo Shang, and Furu Wei. 2021. https://arxiv.org/abs/2108.11591. Preprint, arXiv:2108.11591.
[36]
Xiao Yang, Ersin Yumer, Paul Asente, Mike Kraley, Daniel Kifer, and C. Lee Giles. 2017. https://doi.org/10.1109/CVPR.2017.462. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4342–4351.
[37]
Xu Zhong, Jianbin Tang, and Antonio Jimeno Yepes. 2019. https://arxiv.org/abs/1908.07836. Preprint, arXiv:1908.07836.
[38]
Jiapeng Wang, Chongyu Liu, Lianwen Jin, Guozhi Tang, Jiaxin Zhang, Shuaitao Zhang, Qianying Wang, Yaqiang Wu, and Mingxiang Cai. 2021. https://doi.org/10.1609/aaai.v35i4.16378. Proceedings of the AAAI Conference on Artificial Intelligence, 35:2738–2745.
[39]
Zhibo Yang, Rujiao Long, Pengfei Wang, Sibo Song, Humen Zhong, Wenqing Cheng, Xiang Bai, and Cong Yao. 2023. https://doi.org/10.1109/CVPR52729.2023.01474. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15358–15367.
[40]
Guillaume Jaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In Accepted to ICDAR-OST.
[41]
Zheng Huang, Kai Chen, Jianhua He, Xiang Bai, Dimosthenis Karatzas, Shijian Lu, and C. V. Jawahar. 2019. https://doi.org/10.1109/icdar.2019.00244. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE.
[42]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, and 8 others. 2025. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923.
[43]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, and 8 others. 2022. https://arxiv.org/abs/2204.14198. Preprint, arXiv:2204.14198.
[44]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. https://api.semanticscholar.org/CorpusID:246411402. In International Conference on Machine Learning.
[45]
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, and 1331 others. 2024. https://arxiv.org/abs/2312.11805. Preprint, arXiv:2312.11805.
[46]
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf. In Advances in Neural Information Processing Systems, volume 36, pages 34892–34916. Curran Associates, Inc.
[47]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
[48]
Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2024. Nvlm: Open frontier-class multimodal llms. arXiv preprint.
[49]
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alex Herzog, Jasmine Hsu, Brian Ichter, and 35 others. 2023. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In arXiv preprint arXiv:2307.15818.
[50]
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. 2024. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
[51]
Moonbin Yim, Yoonsik Kim, Han-Cheol Cho, and Sungrae Park. 2021. Synthtiger: Synthetic text image generator towards better text recognition models. In International Conference on Document Analysis and Recognition, pages 109–124. Springer.
[52]
Pixparse. 2024. dataset. https://huggingface.co/datasets/pixparse/idl-wds. Accessed: 2025-04-01.
[53]
team-lucid. 2023. trocr-small-korean model. https://huggingface.co/team-lucid/trocr-small-korean. Accessed: 2025-04-01.
[54]
Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Herve Jegou. 2021. https://proceedings.mlr.press/v139/touvron21a.html. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 10347–10357. PMLR.
[55]
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. https://arxiv.org/abs/1907.11692. CoRR, abs/1907.11692.
[56]
Ilya Loshchilov and Frank Hutter. 2019. https://openreview.net/forum?id=Bkg6RiCqY7. In International Conference on Learning Representations.
[57]
Terrance Devries and Graham W. Taylor. 2017. https://api.semanticscholar.org/CorpusID:23714201. ArXiv, abs/1708.04552.
[58]
Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. https://doi.org/10.1609/aaai.v34i07.7000. Proceedings of the AAAI Conference on Artificial Intelligence, 34(07):13001–13008.
[59]
Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8317–8326.
[60]
Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. 2022. https://arxiv.org/abs/2203.10244. Preprint, arXiv:2203.10244.
[61]
Jin-Hwa Kim, Soohyun Lim, Jaesun Park, and Hansu Cho. 2019. https://aiforsocialgood.github.io/neurips2019/accepted/track1/pdfs/44_aisg_neurips2019.pdf. In Proceedings of the AI for Social Good Workshop at NeurIPS.
[62]
Miao Rang, Zhenni Bi, Chuanjian Liu, Yunhe Wang, and Kai Han. 2024. https://arxiv.org/abs/2401.00028. Preprint, arXiv:2401.00028.
[63]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
[64]
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, and 2 others. 2019. https://arxiv.org/abs/1912.01703. Preprint, arXiv:1912.01703.
[65]
Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, and 1 others. 2024. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198.
[66]
Mahtab Bigverdi, Zelun Luo, Cheng-Yu Hsieh, Ethan Shen, Dongping Chen, Linda G Shapiro, and Ranjay Krishna. 2024. Perception tokens enhance visual reasoning in multimodal language models. arXiv preprint arXiv:2412.03548.
[67]
OpenAI, :, Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, Alex Iftimie, Alex Karpenko, Alex Tachard Passos, Alexander Neitz, Alexander Prokofiev, Alexander Wei, Allison Tam, and 244 others. 2024. https://arxiv.org/abs/2412.16720. Preprint, arXiv:2412.16720.
[68]
DeepSeek-AI. 2025. https://arxiv.org/abs/2501.12948. Preprint, arXiv:2501.12948.
[69]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. 2025. https://arxiv.org/abs/2501.19393. Preprint, arXiv:2501.19393.

  1. Work done while at KL-Net.

  2. The total dataset size is over 120M instances; we hold out \(\sim\)20M as validation.