October 02, 2025
We investigate OCR-augmented generation with Vision Language Models (VLMs), exploring tasks in Korean and English toward multilingualism. To support research in this domain, we train and release KLOCR, a strong bilingual OCR baseline trained on 100M instances to augment VLMs with OCR ability. To complement existing VQA benchmarks, we curate KOCRBench for Korean VQA, and analyze different prompting methods. Extensive experiments show that OCR-extracted text significantly boosts performance across open source and commercial models. Our work offers new insights into OCR-augmented generation for bilingual VQA. Model, code, and data are available at https://github.com/JHLee0513/KLOCR.
Optical Character Recognition (OCR) interprets text from visual inputs for applications such as accessibility, business automation, and robotics. The task requires understanding the spatial layout, semantic content, and inter-component relationships of text [1], [2]. Despite the progress, traditional OCR pipelines based on text detection and recognition exhibit limitations in scalability and human-level understanding [3].
In this work, we explore the limits of OCR-augmented generation with Vision Language Models (VLMs). Recent advancements in VLMs show OCR performance competitive with traditional pipelines, and their semantic knowledge offers promising avenues toward end-to-end, OCR-capable agents [4]–[8]. In comparison to OCR-free document understanding models [9], [10], VLMs can also use their conversational abilities to directly address the downstream task at hand.
We investigate OCR-augmented generation for Visual Question Answering (VQA) in English and Korean, aiming to promote research on multilingual models. We provide KOCRBench, a novel Korean OCR benchmark, and KLOCR, a robust bilingual OCR baseline. Our contribution lies in exploring the impact of OCR in providing additional context to VLMs, and we anticipate the benchmark and OCR model will encourage further research. Extensive experiments show that OCR significantly boosts performance, indicating room for further improvement by VLMs. Overall, our findings show that the presence of character-accurate key information is the most crucial factor in model success. Model and code are available at https://github.com/JHLee0513/KLOCR.
Text recognition [11]–[15] forms the core algorithm behind OCR. [14] demonstrated scaling laws in OCR on common English benchmarks [16]–[21]. We follow this insight to collect large-scale training data for KLOCR.
Scene Text Detection [22]–[26] identifies text regions as bounding boxes, assisting recognition and improving spatial understanding. We integrate KLOCR with the PaddleOCR [27], [28] implementation of DBNet [25] for our experiments.
Document Structure Analysis enhances OCR by identifying the structure of the text such as reading order, text types, and layout. Prior work includes structure analysis [29], [30], table detection and recognition [31]–[34], reading order detection [35], and semantic structure analysis [36]. Despite their strong in-domain accuracy, the models require a significant amount of densely annotated data and show limited performance for out-of-domain samples [37].
Key Information Extraction (KIE) [38], [39] focuses on extracting queried information rather than converting the entire visual input to text. Public benchmarks such as FUNSD [40] and SROIE [41] verify the extraction capabilities of pipelines and models in receipts, records, and other documents. As many applications rely on this task, we include it in KOCRBench.
Vision Language Models are general-purpose models trained on large amounts of image and text data for conversational vision language tasks [42]–[48]. Their recent applications in vision language tasks and even embodied AI demonstrate their wide range of capabilities [49], [50].
[14] demonstrated scaling laws in OCR, achieving state-of-the-art performance on six common English benchmarks by training a transformer-based model on a large-scale dataset. Following this insight, we train the Korean Language Optical Character Recognition (KLOCR) model on a bilingual dataset of 100M instances, achieving competitive performance on English and state-of-the-art accuracy on Korean.
We curate a diverse mixture of English and Korean OCR data, varying in text length and image domain. Table 1 describes the final composition, where most of the data is sourced from multilingual datasets made publicly available on AI-Hub. We combine SynthTIGER-v1.1 [51] and PixParse [52], and additionally generate 3M multi-line, multi-word samples to increase data variety. We split the final collection into an approximate 80-20 split for training and testing (a loading sketch follows Table 1). Figure 2 highlights several samples from our mixture. We share the AIHub dataset details, licensing information, and pre-processing steps in Appendix 8.
Type | Lang | Dataset | Instances |
---|---|---|---|
Real | Ko+En | AIHub | 100M |
Real | En | PixParse | 7.2M |
Real | En | Union14M | 3.2M |
Synth | En | SynthTIGER | 10M |
Synth | Ko+En | SynthTIGER\(\dagger\) | 3M |
Real | En | UberText | 0.1M |
Real | En | TextOcr | 0.7M |
Real | En | CocoText | 0.07M |
Mixed | Ko+En | Total | 124.3M |
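As a rough illustration of how the mixture above can be assembled into the approximate 80-20 split, the sketch below reads per-source manifests and shuffles them into train and test partitions. The JSONL manifest format, file names, and directory layout are assumptions for illustration, not the released KLOCR data pipeline.

```python
# Assemble per-source manifests into an approximate 80-20 train/test split.
# The JSONL manifest format ({"image": ..., "text": ...}) is an assumption.
import json
import random
from pathlib import Path

def load_pairs(manifest: Path) -> list[tuple[str, str]]:
    """Read (image_path, text) pairs from a JSONL manifest."""
    pairs = []
    with open(manifest, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            pairs.append((record["image"], record["text"]))
    return pairs

sources = ["aihub", "pixparse", "union14m", "synthtiger",
           "synthtiger_multiline", "ubertext", "textocr", "cocotext"]
all_pairs = []
for name in sources:
    all_pairs.extend(load_pairs(Path("data") / f"{name}.jsonl"))

random.seed(0)
random.shuffle(all_pairs)
cut = int(0.8 * len(all_pairs))  # approximate 80-20 split
train_pairs, test_pairs = all_pairs[:cut], all_pairs[cut:]
```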
We finetune TrOCR [12] pretrained on a custom synthetic dataset generated with the SynthTIGER engine [53]. The model uses DeiT [54] as its encoder and RoBERTa [55] as its decoder. At 55M parameters, the model runs in real time (20+ FPS) on a desktop GPU.
We train KLOCR for two epochs on two RTX A6000 GPUs with a batch size of 64 per GPU. We use the AdamW [56] optimizer with a fixed learning rate of \(5 \times 10^{-7}\) to avoid drifting too far from the initialized weights. Since most of the samples are clear, high-quality images, we found data augmentation (random rotation, random brightness, CoarseDropout [57], [58]) beneficial to model generalization. The training run finishes in approximately 500 GPU hours, and we estimate the total development cost of the model at below 3000 GPU hours.
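The following is a condensed sketch of this fine-tuning setup using the Hugging Face TrOCR interfaces: an encoder-decoder model trained with AdamW at a fixed \(5 \times 10^{-7}\) learning rate and light augmentation. The checkpoint name and the single-sample training step are placeholders for illustration, not the released KLOCR training code.

```python
# Minimal fine-tuning sketch for a TrOCR-style model (placeholder checkpoint).
import albumentations as A
import torch
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

device = "cuda" if torch.cuda.is_available() else "cpu"
checkpoint = "microsoft/trocr-base-printed"  # placeholder, not the KLOCR init weights
processor = TrOCRProcessor.from_pretrained(checkpoint)
model = VisionEncoderDecoderModel.from_pretrained(checkpoint).to(device).train()

augment = A.Compose([
    A.Rotate(limit=5, p=0.5),               # random rotation
    A.RandomBrightnessContrast(p=0.5),       # random brightness
    A.CoarseDropout(p=0.3),                  # CoarseDropout occlusion
])

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-7)  # fixed LR near the init weights

def training_step(image, text):
    """One optimization step on a single (numpy image, string label) pair."""
    image = augment(image=image)["image"]
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    labels = processor.tokenizer(text, return_tensors="pt").input_ids.to(device)
    loss = model(pixel_values=pixel_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```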
We consider Base as the baseline method, where we prompt the VLM with the input image and query without any additional context. In comparison, OCR-based prompting, which we denote as OCR, additionally provides the VLM with OCR-extracted text as context. We follow a format similar to [46] but omit the bounding box coordinates; a sketch of both prompt formats is given below.
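The exact wording below is illustrative; the only requirement is that the OCR variant prepends the extracted text (without bounding boxes) as additional context.

```python
# Sketch of the Base and OCR prompting modes.
def build_prompt(question: str, ocr_lines: list[str] | None = None) -> str:
    if ocr_lines is None:                    # Base: image + question only
        return question
    ocr_block = "\n".join(ocr_lines)         # OCR: prepend extracted text as context
    return f"OCR-extracted text from the image:\n{ocr_block}\n\nQuestion: {question}"

base_prompt = build_prompt("What is the total amount on the receipt?")
ocr_prompt = build_prompt("What is the total amount on the receipt?",
                          ocr_lines=["Store ABC", "Total 12,000 KRW"])
```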
We curated KOCRBench to test VLMs' ability to handle visual question answering in Korean. Following the design of prior work on English OCR benchmarks [4], [6], [59], [60], we collected 250 questions from public sources spanning 248 input images. Specifically, a portion of the benchmark is repurposed from the Korean Localization of Visual Question Answering for Blind People (KVQA) [61] dataset with reinforced annotations. We generated the majority of samples by selecting raw images from the holdout split of the KLOCR data mixture and annotating them manually. We created annotations for four tasks: text recognition (22 samples), scene VQA (70 samples), document VQA (29 samples), and key information extraction (129 samples). The number of samples per task was based on our internal assessment of the importance of each task in real business processes.
Method | CER\(\downarrow\) | Word Accuracy\(\uparrow\) |
---|---|---|
CLIP4STR-L* | 125.2% | 9.0% |
Surya | 60.9% | 55.3% |
PaddleOCR | 49.6% | 32.6% |
PORORO | 30.0% | 53.1% |
TrOCR | 27.0% | 49.4% |
KLOCR | 2.34% | 94.6% |
Method | IC13 | IIIT5k | SVT | CUTE80 | IC15 | SVTP | Avg |
---|---|---|---|---|---|---|---|
TrOCR | 66.86 | 59.07 | 60.43 | 45.83 | 49.48 | 49.46 | 55.19 |
PORORO | 78.30 | 64.30 | 56.57 | 47.57 | 45.33 | 46.05 | 56.35 |
Surya | 82.73 | 71.50 | 74.19 | 44.79 | 64.00 | 64.19 | 69.48 |
KLOCR | 95.92 | 86.50 | 93.20 | 91.67 | 84.87 | 87.91 | 88.13 |
CLIP4STR-L* | 99.42 | 99.13 | 98.61 | 99.65 | 92.60 | 98.13 | 97.42 |
We used vLLM [63] to host the VLMs on our hardware, and hosted the OCR models either on the same machine or on a separate machine with an RTX A1000 GPU. We conducted our experiments in PyTorch [64].
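For reference, the sketch below shows how a vLLM-hosted model can be queried through its OpenAI-compatible endpoint (e.g., started with `vllm serve Qwen/Qwen2.5-VL-7B-Instruct`). The port, model name, and helper function are assumptions for illustration rather than our exact harness.

```python
# Query a locally hosted VLM through vLLM's OpenAI-compatible API.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def ask(image_path: str, prompt: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="Qwen/Qwen2.5-VL-7B-Instruct",  # assumed served model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
                {"type": "text", "text": prompt},
            ],
        }],
        temperature=0.0,
    )
    return response.choices[0].message.content
```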
Table 2 provides an evaluation of currently available open source OCR models on Korean OCR. KLOCR outperforms prior models by a significant margin, achieving 94.6% word accuracy and a 2.34% character error rate. The performance gap between TrOCR and KLOCR, despite the two sharing the same architecture, highlights the importance of scaling up OCR data. As expected, the CLIP4STR model by [62] does not handle Korean and therefore achieves low accuracy.
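For clarity, the two metrics in Table 2 can be computed as follows under their common definitions: CER is the character-level edit distance normalized by the total reference length, and word accuracy is the fraction of exact matches. This is a generic implementation, not our exact evaluation script.

```python
# Character error rate (CER) and word accuracy under common definitions.
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(preds: list[str], refs: list[str]) -> float:
    return sum(edit_distance(p, r) for p, r in zip(preds, refs)) / sum(len(r) for r in refs)

def word_accuracy(preds: list[str], refs: list[str]) -> float:
    return sum(p == r for p, r in zip(preds, refs)) / len(refs)
```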
Table 3 provides an evaluation on the six common English benchmarks. KLOCR achieves comparable performance without any in-domain training data, reflecting the scale and variety of its training mixture, and significantly outperforms prior Korean-focused OCR models. As a reference point, we include the CLIP4STR-L model trained by [62], which includes the training subsets of the benchmark data and, as expected, achieves the highest performance.
Model | Prompt | Recognition | Scene | Document | KIE | Total |
---|---|---|---|---|---|---|
Qwen2.5-VL-7B | Base | 22 | 66 | 16 | 94 | 198 |
Qwen2.5-VL-7B | OCR | 21 | 65 | 22 | 104 | 212 |
InternVL2.5-7B | Base | 16 | 46 | 5 | 20 | 87 |
InternVL2.5-7B | OCR | 19 | 52 | 10 | 81 | 162 |
Qwen2.5-VL-32B-Instruct\(\dagger\) | Base | 21 | 60 | 20 | 75 | 176 |
Qwen2.5-VL-32B-Instruct\(\dagger\) | OCR | 20 | 61 | 21 | 103 | 205 |
Gemini 2.0 Flash | Base | 20 | 65 | 22 | 93 | 200 |
Gemini 2.0 Flash | OCR | 19 | 64 | 23 | 97 | 203 |
gemini-2.5-flash-preview-04-17 | Base | 21 | 70 | 20 | 71 | 182 |
gemini-2.5-flash-preview-04-17 | OCR | 19 | 69 | 22 | 102 | 212 |
As described in Section 4, we compare Base and OCR prompting. Table 4 shows the benchmark results across five models: Qwen2.5-VL 7B and 32B [42], InternVL 2.5 7B [65], Gemini 2.0 Flash, and Gemini 2.5 Flash [45]. The chosen models have shown competitive performance on English benchmarks and also provide multilingual support. The Gemini models are included as a reference point for commercially available models.
The addition of OCR-extracted information significantly improves accuracy for all models, aligning with the findings of [62]. The largest improvements are observed for smaller models with weaker base performance, such as InternVL, indicating that the models use the OCR information to correct their responses. Notably, we observe very strong base performance from Qwen2.5-VL 7B despite its smaller size, suggesting that the Qwen training mixture contains substantial multilingual data.
Our results indicate the largest performance improvement in Key Information Extraction, highlighting the usefulness of OCR's accurate character recognition. This also implies that VLMs have yet to resolve spelling errors, especially on unusual or semantically meaningless words and obscure jargon.
We further discuss the applicability of OCR-augmented generation with a set of ablation studies. When is OCR useful? While KLOCR shows robust performance and significantly boosts VLMs' performance on VQA, the trade-off between training OCR models and finetuning VLMs to improve their OCR ability should be weighed carefully. Results on English [62] and Korean indicate that OCR can play a crucial role in assisting VLMs, especially for models with low base performance. It is also possible to finetune the VLMs directly on the OCR data, albeit with potential forgetting of other abilities. Meanwhile, training a large-scale OCR model for low-resource languages is difficult, and supporting such languages remains an open challenge for both VLMs and OCR models.
Impact of OCR accuracy on VLMs. We verify the effectiveness of OCR-augmented generation by testing Qwen2.5-VL 7B and InternVL 2.5 7B with KLOCR and TrOCR as the OCR extraction model. Results in Table 5 clearly indicate that improvements in OCR also lead to improvements in VLM responses, while stronger models such as Qwen2.5-VL show greater robustness to OCR errors.
VLM | OCR | Recognition | Scene | Document | KIE | Total |
---|---|---|---|---|---|---|
InternVL | TrOCR | 18 | 54 | 8 | 47 | 127 |
InternVL | KLOCR | 19 | 52 | 10 | 81 | 162 |
Qwen 2.5 | TrOCR | 19 | 68 | 23 | 92 | 202 |
Qwen 2.5 | KLOCR | 21 | 65 | 22 | 104 | 212 |
KOCRBench error analysis. Our results on KOCRBench reveal several VLM weaknesses:
Counting: Counting has been a challenging task for both LLMs and VLMs [66], and it is no exception in this case. As illustrated by the example in Figure 4, counting is a common source of error.
Character-level precision: Observations show that misspelling and punctuation errors are the most common sources of error. While OCR-augmented generation generally alleviates this issue as observed in Table 4, the approach may still struggle with edge cases.
Refusing to answer: We observe several instances of refusal, where the VLM determines the question to be unanswerable; such cases are more frequent with long context.
Gemini 2.5 | Recognition | Scene | Document | KIE | Total |
---|---|---|---|---|---|
Flash | 19 | 69 | 22 | 102 | 212 |
Thinking | 21 | 70 | 23 | 70(95) | 184(209) |
Does test-time scaling improve OCR-augmented generation? We investigate whether test-time scaling [67]–[69] improves OCR-augmented generation. At the time of our experiments, open source vision language models did not yet support reasoning in conjunction with vision, so we run our experiments on gemini-2.5-flash-preview-04-17, which supports reasoning via its "thinking" option. Results in Table 6 indicate that reasoning does not improve VQA capabilities, in particular due to a significant drop in KIE performance. Closer analysis showed that the model made more punctuation and spelling errors with thinking enabled, and ignored the OCR information more often than the non-thinking variant. The punctuation errors in this case are mostly errors specific to Korean word spacing. We manually checked incorrect KIE answers and found that 25 were caused solely by spacing errors. Had the scoring not penalized this type of error, the thinking variant would have scored 209, much closer to the non-thinking variant. Our findings therefore indicate that reasoning models still have room for improvement in multilingual VQA.
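The space-insensitive re-check behind the parenthesized numbers in Table 6 can be approximated as below. This is an illustrative helper rather than the benchmark's official scorer, since our actual re-check was manual.

```python
# Exact match that ignores Korean word-spacing differences.
def space_insensitive_match(prediction: str, reference: str) -> bool:
    normalize = lambda s: "".join(s.split())  # drop all whitespace
    return normalize(prediction) == normalize(reference)

assert space_insensitive_match("서울 특별시", "서울특별시")
```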
We introduced KOCRBench, a collection of text-oriented visual question answering data for benchmarking Korean VQA towards multilingual visual understanding. Using the benchmark and our released KLOCR model, we ran extensive experiments to explore the benefits and limitations of OCR-augmented generation for VQA. We observe that OCR most benefits the models by assisting them with precise character recognition. Our results indicate room for improving VLMs in more precise recognition and in building accurate representations of documents.
KLOCR. While the 100M-instance dataset is large-scale and publicly sourced, it relies heavily on AIHub and SynthTIGER. Although AIHub is only a data platform and its constituent data sources are independent, we expect greater robustness from incorporating additional sources (e.g., the web), more synthetic data, and other large datasets such as REBU-Syn. Due to the increasing scale and compute requirements, we leave this to future work. Additionally, as the focus of KLOCR is its bilingual ability, no tuning has been done to achieve state-of-the-art performance on English. Lastly, we leave expansion to other languages, especially low-resource ones, to future work.
KOCRBench. KOCRBench captures various tasks across different domains, but its modest size of 250 questions does not characterize model performance as thoroughly as the massive English VQA benchmarks. We aim to continue curating data to expand the benchmark and to experiment with synthetic dataset creation to reduce the burden of manual labeling. We anticipate that our efforts will encourage other researchers to contribute to expanding multilingual VQA benchmarks.
We report the exact datasets used from AIHub in the table below.
Dataset | Source |
---|---|
Public Administrative Documents | Link |
OCR Data (Public Services) | Link |
Finance Documents Data | Link |
Korean Font Images | Link |
OCR Data (Handwriting OCR Data) | Link |
Various Korean Characters OCR | Link |
OCR Data (Financial and Logistics) | Link |
Figure 5 illustrates the pre-processing pipeline. We preprocess the data only when the images are not already cropped into ROIs. Given the annotation JSON with bounding boxes and corresponding text labels, we crop the regions of interest and save the processed (image, text) pairs; a sketch of this step is given below.
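The annotation schema below (a list of records with a bbox and a text field) is an assumption for illustration; each AIHub dataset uses its own field names.

```python
# Crop labeled ROIs from a full-page image and save (image, text) pairs.
import json
from pathlib import Path
from PIL import Image

def crop_rois(image_path: str, annotation_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    image = Image.open(image_path)
    annotations = json.loads(Path(annotation_path).read_text(encoding="utf-8"))
    pairs = []
    for i, ann in enumerate(annotations):
        x, y, w, h = ann["bbox"]                  # assumed [x, y, width, height]
        crop = image.crop((x, y, x + w, y + h))
        crop_name = f"{Path(image_path).stem}_{i}.png"
        crop.save(out / crop_name)
        pairs.append({"image": crop_name, "text": ann["text"]})
    (out / "labels.jsonl").write_text(
        "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs),
        encoding="utf-8")
```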
For train-test splits, we used existing splits for the public datasets and generated a random split if the dataset did not provide one.
Disclaimer: the authors are not affiliated with AIHub or with any data from AIHub.
The data from AIHub has been released for open public use, including but not limited to commercial and non-commercial purposes in the research and development of AI. To control data usage, downloading the data from AIHub requires an account. For further information, please refer to their policy page.