CantoNLU: A benchmark for Cantonese natural language understanding


1 Introduction↩︎

Mandarin Chinese is considered a high-resource language, abundant with pre-trained language model (PLM) support [1][3], corpora, evaluation benchmarks [4], [5], and, most recently, commercial large language models [6], [7]. However, the same cannot be said of other varieties of the Sinitic language family, including Cantonese. Mandarin is the official state language and the prestige language in media, business, and academia. Owing to Mandarin's status, Cantonese, along with other Sinitic languages, remains a primarily vernacular language without written standardization [8], [9].

Such lack of written standardization, prestige, and official status results in a shortage of language resources in Cantonese. It is frequently described as a low-resource language [10][12] despite having millions of speakers [13]. As a result, Cantonese language processing systems often rely on datasets, models, and corpora adapted from Mandarin [14], despite a lack of mutual intelligibility between the two languages [15], [16].

Figure 1: An overview of the tasks in CantoNLU, and our framework for investigating when and how cross-lingual transfer learning from Mandarin is effective for natural language understanding in Cantonese.
Table 1: An overview of the 7 NLU tasks that make up CantoNLU, what the task requires, their size and source. L, F, M stand for lexicon, form (syntax), and meaning (semantics), respectively.
Task Requires Size Source
Sentence-level tasks
Acceptability L, F, M 1.6k MT error dataset [17] adapted
Lang Detection L, F 47k Parallel corpus [18] aligned and perturbed
NLI M 570k Machine-translated English NLI [19]
Sentiment M 12k Hong Kong restaurant reviews [20]
Word-level tasks
WSD L, M 109 Manual compilation
POS, Dep parsing F 14k Universal Dependencies dataset [21]

Cross-lingual transfer from Mandarin for Cantonese language processing has proven effective, with empirical success across a range of tasks, including language modeling and reading comprehension [12], translation [10], [22], and speech recognition [23], [24]. However, despite empirical success, there remains a gap in formal investigations into the best practices for Cantonese natural language understanding (NLU). This is attributable to a lack of a centralized evaluation framework for Cantonese language processing or understanding.

In this paper, we introduce CantoNLU, a GLUE-like [25] natural language understanding benchmark in Cantonese. CantoNLU provides an in-depth evaluation of syntactic, lexical, and semantic understanding, comprising 7 tasks: word sense disambiguation (WSD), linguistic acceptability judgment (LAJ), language detection (LD), natural language inference (NLI), sentiment analysis (SA), part-of-speech tagging (POS), and dependency parsing (DEPS). In particular, the WSD dataset is entirely novel, providing the first resource for sense-level lexical understanding in Cantonese. LAJ and LD represent novel adaptations of existing datasets, while the NLI, SA, POS, and DEPS datasets are direct adoptions of existing resources [19][21]. We describe each task and its underlying dataset in Section 4; an overview is given in Table 1.

Using CantoNLU, we evaluate three types of models – a Mandarin model1 without explicit training on Cantonese, Cantonese-adapted transfer models with additional Cantonese training on a Mandarin model, and a monolingual Cantonese model, as outlined in Section 5. In Section 6, we discuss how monolingual Cantonese, Cantonese-adapted, and Mandarin models compare across various aspects in Cantonese NLU.

In spite of limited training data in Cantonese, we demonstrate that our monolingual Cantonese model excels in syntactic tasks such as POS and DEPS. On the other hand, Cantonese-adapted models excel in semantic tasks such as NLI, LD, WSD, and SA. At the same time, Mandarin models offer a competitive alternative to additional or monolingual training on Cantonese data, excelling in NLI and LAJ.

Based on our results, we recommend monolingual Cantonese models for syntactic tasks, and Cantonese-adapted models for semantic tasks. In domains where Cantonese corpora are scarce, Mandarin models without Cantonese adaptation may be sufficient. In addition to the benchmark and analysis, we publicly release our code and model weights at github.com/aatlantise/sinitic-nlu.

2 Background↩︎

Cantonese, also known as Yue [13], is the second most widely used Sinitic language after Mandarin, spoken by an estimated 85 million people worldwide. However, it remains primarily a spoken language [15], with most speakers defaulting to written Standard Chinese [14]. This diglossic situation and Cantonese's short history as a written language [8] have contributed to the scarcity of high-quality textual resources for Cantonese NLP, in stark contrast to Mandarin Chinese, a related language rich in resources across corpora, models [1][3], [6], [7], and evaluation benchmarks [4], [5]. Thus, as highlighted in Section 1, prior work has used varying degrees of transfer from models with Mandarin knowledge to perform downstream tasks in Cantonese [12], [23], [26].

2.0.0.1 Cross-lingual transfer.

Transfer learning is a common strategy in deep learning (DL) to address data sparsity, where features or signals learned from a resource-rich domain are used to augment a low-resource domain, task, or dataset [27]. In NLP, a common application of transfer learning is cross-lingual transfer for low-resource languages [3], [28]. Two types of cross-lingual transfer exist: model transfer and data transfer, as described in [29]. Data transfer involves translating a dataset or corpus from a high-resource language to a lesser-resourced target language, as performed in yue-all-nli [19], the MMLU [30] portion of the HKCanto-Eval benchmark [31], and the Yue benchmarks from [32].

On the other hand, model transfer involves adapting models trained on a high-resource source language to a lesser-resourced target language, such as Mandarin to Cantonese and Danish to Faroese [33]. Model transfer relies on the lexical or typological similarity between the two languages, thereby enabling the transfer of linguistic knowledge [34]. While some implementations of cross-lingual transfer explicitly designate a source language for cross-lingual transfer [33], [35], others rely on multilingual models. Rather than transferring from a specific source language, such models are thought to capture general cross-linguistic patterns that extend beyond typological or lexical similarity, allowing them to perform reasonably well even on unseen or low-resource languages [3], [36]. Hybrid approaches combine both paradigms: they begin with a multilingual model, then continue pre-training or fine-tuning on a specific target language before applying the resulting model to downstream tasks [37], [38].

Due to the richness of Mandarin language resources and the strong performance of Mandarin PLMs and LLMs, most cross-lingual transfer to Cantonese uses Mandarin as the source language [10], [22][24], [26], with exceptions using a multilingual model [12].

However, emerging work suggests limitations in such models' ability to capture Cantonese idiosyncrasies. Previously identified factors include substantial dissimilarities in lexicon, syntax, and writing systems [22], along with the prevalence of colloquial phrases and code-switching in more recent Cantonese corpora [32]. These issues hinder the performance of Mandarin-trained models on Cantonese [14] due to over-reliance on Mandarin linguistic knowledge [31].

2.0.0.2 Cantonese grammar.

Cantonese diverges from Mandarin in word order, particle and grammatical word inventory, and morphology, as highlighted in theoretical linguistics work [39][41] as well as prior work in Cantonese-Mandarin machine translation and corpus linguistics [17], [42], [43]. For example, in double object constructions, Mandarin takes the direct-indirect order as seen in (1), while Cantonese takes the indirect-direct order as seen in (2) [41].

gei3 ni3 qian2 // give you money // ‘(I) give you money’ //

bei2 cin2 nei5 // give money you // ‘(I) give you money’ //

Cantonese also features a substantially larger and more diverse inventory of particles and aspect markers than Mandarin, enabling speakers to encode subtle distinctions of tense, stance, and speaker attitude [40], [41]. Cantonese morphology is more flexible, with more frequent verb serialization [39] and reduplication [43], [44] than in Mandarin. This allows sequences such as (3), sourced from [45], where three verbs combine to describe one scene, and (4), where reduplicating the classifier (counting noun) yields a universal ('every') quantifier. Given such unique properties of Cantonese grammar, the availability of high-quality Cantonese data is critical to performing well on Cantonese linguistic tasks.

keoi5 jap6 heoi3 co5 // 3SG enter come sit // ‘He went in and sat down’ //

zek3 zek3 gau2 // CL CL dog // ‘every dog’ //

3 Related Work↩︎

3.0.0.1 Cantonese language resources.

Although Cantonese is widely considered low-resource, recent efforts have begun to address the scarcity of its language resources. Machine translations of non-Cantonese resources offer potential utility as data transfer, as provided by [19] for Cantonese natural language inference (NLI) and by [31], [32] for LLM knowledge evaluation. Foundational tools such as PyCantonese [46] provide NLTK-like [47] essential language processing utilities, while the Cantonese Universal Dependencies dataset [21] offers a small yet significant syntactically annotated treebank. Multilingual resources such as Wikipedia [48], SIB-200 [49], and NLLB [50] include Cantonese portions, which may be useful for representation learning or for evaluation in topic classification and multilingual translation. Larger-scale corpora such as YueData [12] have also emerged, though they often contain substantial portions of non-Cantonese text.

Beyond corpora, work has extended to specific applications, including sentiment analysis [20], automatic speech recognition [23], and machine translation [10], [17], [18], [22]. Most recently, [31] introduced HKCanto-Eval, a comprehensive benchmark for evaluating linguistic and cultural understanding in Cantonese, and [32] presents another benchmark for evaluating LLM reasoning, knowledge, and logic in Cantonese.

On the modeling side, prior work has explored both general-purpose and Cantonese-specific pretrained models. While commercial multilingual models such as Qwen [6] and DeepSeek [7] provide Cantonese support, their Cantonese proficiency lags behind their proficiency in English or Mandarin [32]. [12] use their YueData corpus to train YueTung-7b, a continually pre-trained model based on Qwen-7B [6], which exhibits improved Cantonese performance compared to other open-source and commercial LLMs. On the smaller end in terms of parameter count, [26] is, to our knowledge, the only encoder-only Cantonese model: bert-base-chinese continually pre-trained on Cantonese news articles, social media posts, and web pages. Implementation details, training recipes, and corpus selection for bert-base-chinese and bert-base-cantonese are both obscure, as neither model is accompanied by a description paper or technical report. In particular, [1] describes the BERT architecture in general but not the Chinese model.

3.0.0.2 CantoNLU and other Cantonese benchmarks.

Although we have described the Cantonese benchmarks HKCanto-Eval [31] and Yue-Benchmark [32] as related work, we wish to clarify our contributions to Cantonese benchmarking in comparison to these two benchmarks. HKCanto-Eval includes OpenRice [20] as an SA dataset, which overlaps with CantoNLU, along with smaller datasets on Cantonese phonology and orthography; however, HKCanto-Eval and Yue-Benchmark primarily target generative LLMs on general world knowledge and reasoning, in a way comparable to MMLU [30]. We highlight CantoNLU's focus on discriminative NLU tasks, explicitly evaluating a model's ability to understand the Cantonese lexicon (WSD), syntax (POS, DEPS), semantics (NLI, SA), and overall well-formedness with respect to both syntax and semantics (LAJ). This focus is similar to that of GLUE [25] and its derivatives in Korean [51], Chinese [4], Vietnamese [52], and Indonesian [53].

4 Building CantoNLU↩︎

We introduce a benchmark of seven Cantonese NLU tasks, encompassing word sense disambiguation (WSD), linguistic acceptability judgment (LAJ), language detection (LD), natural language inference (NLI), sentiment analysis (SA), part-of-speech tagging (POS), and dependency parsing (DEPS). While all tasks require Cantonese proficiency, tasks such as LD may reward Mandarin knowledge, whereas tasks such as DEPS and WSD may penalize Mandarin knowledge. WSD, LAJ, and LD datasets are novel contributions to Cantonese NLU.

4.0.0.1 Novel WSD dataset via manual compilation.

First, we manually compile Cantonese words with more than one attested meaning to create the first Cantonese word sense disambiguation dataset. We collect each word's multiple senses and two example sentences for each sense, resulting in 41 multi-sense words with a total of 109 senses. Each sense has at least 2 example sentences containing the word. The dataset does not require fine-tuning for evaluation: model predictions are obtained by masking the target word in each sentence and comparing cosine similarities between the hidden representations at the mask position. For each target word \(w\) with two contexts \(s_i\) and \(s_j\), we obtain hidden representations \(\mathbf{h}_i\) and \(\mathbf{h}_j\) at the masked position from the model.

Using cosine similarity, we define the same-sense score \(s_{\text{same}}\) and different-sense score \(s_{\text{diff}}\) as: \[s_{\text{same}} = \frac{1}{|P_{\text{same}}|} \sum_{(i,j) \in P_{\text{same}}} \cos(\mathbf{h}_i, \mathbf{h}_j),\] \[s_{\text{diff}} = \frac{1}{|P_{\text{diff}}|} \sum_{(i,j) \in P_{\text{diff}}} \cos(\mathbf{h}_i, \mathbf{h}_j),\] where \(P_{\text{same}}\) and \(P_{\text{diff}}\) are the sets of sentence pairs containing the same and different senses of \(w\), respectively. A model prediction is considered correct if \(s_{\text{same}} > s_{\text{diff}}\).
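
To make the procedure concrete, the following is a minimal sketch of the masking-and-similarity evaluation, assuming a Hugging Face BERT-style encoder. The dataset structure, helper names, and the simplification of replacing a (possibly multi-character) word with a single [MASK] are illustrative rather than our exact implementation.

```python
# Minimal sketch of the WSD evaluation; assumes a Hugging Face BERT-style encoder.
# Replacing a multi-character word with a single [MASK] is a simplification.
import torch
from itertools import combinations
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")
model.eval()

def masked_representation(sentence: str, target: str) -> torch.Tensor:
    """Mask the target word and return the hidden state at the mask position."""
    masked = sentence.replace(target, tokenizer.mask_token, 1)
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, hidden_dim)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0, 0]
    return hidden[mask_pos]

def wsd_prediction_correct(word: str, senses: dict[str, list[str]]) -> bool:
    """`senses` maps each sense label of `word` to its example sentences (illustrative format)."""
    cos = torch.nn.functional.cosine_similarity
    reps = {s: [masked_representation(x, word) for x in exs] for s, exs in senses.items()}
    same = [cos(a, b, dim=0) for exs in reps.values() for a, b in combinations(exs, 2)]
    diff = [cos(a, b, dim=0) for s1, s2 in combinations(reps, 2)
            for a in reps[s1] for b in reps[s2]]
    # Correct if same-sense contexts are, on average, more similar than different-sense ones.
    return bool(torch.stack(same).mean() > torch.stack(diff).mean())
```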

4.0.0.2 Novel LAJ dataset adapted from error span dataset.

LAJ is a classification task where the model predicts whether a given sequence is linguistically acceptable. This often aligns with grammaticality judgments, but may also include judgments of semantic plausibility or pragmatic felicity. We compile the first Cantonese LAJ dataset by adapting the Cantonese portion of SiniticMTError [17], a dataset of Sinitic translation error span annotations, where each datapoint consists of a well-formed reference sentence ref, a machine-translated sentence mt, and annotations of errors in the mt sentence. We treat the error-free ref as acceptable and the mt with error annotations as unacceptable, yielding pairs that contain one acceptable and one unacceptable version of the same sentence. Unlike previous LAJ datasets such as CoLA [54], which ask for a binary acceptable/unacceptable judgment, we adopt a more robust setup: the model is presented with both versions of the same sentence and asked to identify the more acceptable one, as is preferred in psycholinguistics and cognitive science [55], [56]. The dataset consists of 1.6k pairs.
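
Since LAJ is evaluated with model surprisal and without fine-tuning (Section 5), one plausible realization of the pairwise judgment is to compare pseudo-log-likelihoods under a masked language model and prefer the higher-scoring version. The sketch below illustrates this idea under that assumption; it is not necessarily our exact scoring procedure.

```python
# Sketch: pick the more acceptable sentence by pseudo-log-likelihood under a masked LM.
# One plausible realization of the surprisal-based setup; details may differ.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it alone is masked."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):                     # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

def more_acceptable(ref: str, mt: str) -> str:
    """Return whichever version receives the higher pseudo-log-likelihood (lower surprisal)."""
    return ref if pseudo_log_likelihood(ref) >= pseudo_log_likelihood(mt) else mt
```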

4.0.0.3 Novel LD dataset from word-aligning and perturbing parallel corpus.

Language detection (LD) is a three-label classification task that identifies whether a given sentence is written in Cantonese, Mandarin, or a mixture of the two. We construct this novel dataset from the parallel translation corpus of [18], selecting the first 10,000 sentence pairs. To create mixed-language examples, we randomly replace tokens in Cantonese and Mandarin sentences with their counterparts from the other language with a set probability: in a given mixed sentence, 15%, 33%, or 50% of the tokens may come from the other language, producing up to 6 mixed sentences for each pair. Word-level alignments are obtained using SimAlign [57], and all text is converted to traditional orthography using HanziConv [58] to prevent script-based shallow heuristics. The resulting dataset contains 47,578 sentences, of which 27,578 are mixed. We reserve 5% each for the validation and test splits. We note that this task requires proficiency in both Mandarin and Cantonese to perform well.
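
A minimal sketch of the mixing step is shown below, assuming word-level alignments have already been produced by SimAlign; the token lists, alignment format, and toy sentence pair are illustrative.

```python
# Sketch: build a mixed-language sentence from an aligned Cantonese-Mandarin pair.
# Alignments are assumed to come from SimAlign; inputs here are illustrative.
import random

def mix_sentence(src_tokens: list[str], tgt_tokens: list[str],
                 alignment: list[tuple[int, int]], p: float, seed: int = 0) -> list[str]:
    """Replace each aligned source token with its target counterpart with probability p."""
    rng = random.Random(seed)
    mixed = list(src_tokens)
    for src_idx, tgt_idx in alignment:
        if rng.random() < p:
            mixed[src_idx] = tgt_tokens[tgt_idx]
    return mixed

# Hypothetical usage: the three mixing rates yield up to three mixed variants per direction.
yue = ["佢", "今日", "好", "攰"]
cmn = ["他", "今天", "很", "累"]
align = [(0, 0), (1, 1), (2, 2), (3, 3)]
for p in (0.15, 0.33, 0.50):
    print(p, "".join(mix_sentence(yue, cmn, align, p)))
```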

4.0.0.4 Machine translated English NLI datasets.

NLI is a classification task where the model predicts whether the premise entails, or implies the truth of, the hypothesis. The NLI portion of our benchmark is the yue-all-nli dataset [19], a machine translation of the English NLI datasets SNLI [59] and MNLI [60] into Cantonese. The dataset comprises 557k training examples, 6.6k development examples, and 6.6k test examples. Each example includes a reference premise and two hypotheses, one entailing and one contradicting, from which we create two examples, one for each label. This results in a balanced, two-label classification dataset.
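
The expansion from one premise plus two hypotheses into two binary-labeled examples can be sketched as follows; the field names and the toy example are hypothetical and may not match the released yue-all-nli schema.

```python
# Sketch: turn one (premise, entailing hypothesis, contradicting hypothesis) triple into
# two binary-labeled NLI examples. Field names and the example sentences are hypothetical.
def expand_triple(triple: dict) -> list[dict]:
    return [
        {"premise": triple["premise"], "hypothesis": triple["entailing"], "label": "entailment"},
        {"premise": triple["premise"], "hypothesis": triple["contradicting"], "label": "contradiction"},
    ]

example = {"premise": "佢喺公園跑步。", "entailing": "佢喺出面。", "contradicting": "佢喺屋企瞓覺。"}
print(expand_triple(example))
```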

4.0.0.5 Sentiment analysis dataset from restaurant reviews.

Sentiment analysis is a sentence-level classification task of predicting the sentiment of a given sentence. We use the OpenRice dataset [20] compiled from restaurant reviews in Hong Kong, with the 3-way label space of positive, neutral, and negative. The dataset is balanced, with 10k datapoints in its train split, 1k in its development split, and 1k in its test split.
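
As an illustration of how such a sentence-level classification task is fine-tuned in our setup (Section 5), the following sketch fine-tunes a BERT-style model on OpenRice-style data with Hugging Face Trainer. The file names and column layout are assumptions; the learning rate, batch size, and epoch count mirror the SA row of Table 2.

```python
# Sketch: fine-tune a BERT-style model for 3-way sentiment classification.
# Assumes CSV files with a "text" column and an integer "label" column (0-2); names are illustrative.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=3)

data = load_dataset("csv", data_files={"train": "openrice_train.csv", "test": "openrice_test.csv"})
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128), batched=True)

args = TrainingArguments(output_dir="sa-canto", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
Trainer(model=model, args=args,
        train_dataset=data["train"], eval_dataset=data["test"]).train()
```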

4.0.0.6 POS and DEPS from Cantonese UD.

Finally, POS and DEPS are both token-level classification tasks. A POS model predicts the part-of-speech tag (e.g. noun, verb, etc.) of a given word, while a DEPS model predicts the dependency head (which word is the syntactic head of this word?) and dependency relation (what is the relationship between this word and its head?) of a given word. For POS and DEPS, we use the Cantonese-HK Universal Dependencies dataset [21], which comprises 1k sentences and 14k tokens. We split the dataset 9:1 for training and testing, respectively. The dataset contains 15 POS tags and 48 dependency relations, of which 17 are specific to Cantonese. We report both unlabeled (UAS) and labeled (LAS) attachment scores to measure DEPS performance.
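
For clarity, UAS counts a token as correct when its predicted head matches the gold head, while LAS additionally requires the predicted dependency relation to match. A minimal sketch, with a hypothetical three-token example and no special handling of punctuation:

```python
# Sketch: unlabeled (UAS) and labeled (LAS) attachment scores from per-token
# (head index, dependency relation) pairs. Inputs are illustrative.
def attachment_scores(gold: list[tuple[int, str]], pred: list[tuple[int, str]]) -> tuple[float, float]:
    assert len(gold) == len(pred)
    uas_hits = sum(g_head == p_head for (g_head, _), (p_head, _) in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))   # head and relation both correct
    n = len(gold)
    return uas_hits / n, las_hits / n

# Hypothetical 3-token sentence: all heads correct, one relation wrong -> UAS 1.0, LAS 0.67.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "compound")]
print(attachment_scores(gold, pred))
```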

5 Experimental Benchmark Setup↩︎

Using the compiled Cantonese NLU benchmark, we evaluate three types of models–Cantonese monolingual, Cantonese-adapted from Mandarin, and Mandarin.

5.0.0.1 Pre-training corpora.

For models requiring training or adaptation on Cantonese, we use two corpora: the Cantonese Wikipedia [48] and a list of 30 million Cantonese sentences compiled by [61]. The two corpora constitute the publicly available, open-source subset of YueData [12]. The Wikipedia dump includes empty parentheses left over from its pre-processing stage, which removes text in other languages. For example, (5) becomes (6); this process often yields parentheses that are either empty or contain only punctuation marks, which we remove (see the sketch after the examples below). The corpora are then divided into excerpts of maximum length 128 for pre-training. The Cantonese Wikipedia contains 137k articles, totaling 40M characters, while [61] contains 30M sentences, totaling 660M characters. In total, the pre-training corpora comprise 700M characters.


  1. 電影(英語:movie/ film),特点是运动/移动的画面(英語:motion/ moving picture)2

  2. 电影(/ ),特点是运动/移动的画面()
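
A minimal sketch of the parenthesis cleanup and excerpting described above; the regular expression and the chunking by raw character count are illustrative and may not match our exact pre-processing.

```python
# Sketch: drop parentheses left empty (or punctuation-only) by the removal of non-Cantonese
# text, then cut the cleaned text into excerpts of at most 128 characters. Pattern is illustrative.
import re

EMPTY_PARENS = re.compile(r"[（(][\s，,、/：:；;]*[）)]")

def clean_line(line: str) -> str:
    """Remove parentheses whose contents are empty or punctuation-only."""
    return EMPTY_PARENS.sub("", line)

def to_excerpts(text: str, max_len: int = 128) -> list[str]:
    """Split cleaned text into fixed-size excerpts for pre-training."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]

example = "电影(/ ),特点是运动/移动的画面()"
print(clean_line(example))   # 电影,特点是运动/移动的画面
```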

5.0.0.2 Cantonese-adapted model.

The Cantonese-adapted model represents the most common text processing approach in Cantonese: taking an existing model with Mandarin support and performing continued pre-training on Cantonese before applying the adapted text or speech model to downstream tasks [12], [24], [26]. In our implementation, we take the off-the-shelf bert-base-chinese model [1]3 and continually pre-train it on the Cantonese text described above. We do not make any changes to the model's tokenizer.
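
A sketch of this continued pre-training step with the Hugging Face Trainer, assuming the cleaned excerpts are stored one per line; the file name and hyperparameters are illustrative, not our exact training recipe.

```python
# Sketch: continued masked-language-model pre-training of bert-base-chinese on Cantonese text.
# Assumes one 128-character excerpt per line; file name and hyperparameters are illustrative.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # tokenizer left unchanged
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

corpus = load_dataset("text", data_files={"train": "cantonese_excerpts.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="canto-adapted-bert", per_device_train_batch_size=32,
                         num_train_epochs=1, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```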

In addition, we offer a comparison to concurrent BERT-based Cantonese work, bert-base-cantonese [26], which is also a continually pre-trained model, but trained on a closed-source corpus of news articles, social media posts, and web content.

5.0.0.3 Mandarin model without Cantonese adaptation.

Direct transfer from Mandarin without conditioning on Cantonese also represents a small subset of Cantonese NLP efforts [10], [62], including evaluating non-Cantonese trained multi-lingual LLMs such as Llama models on Cantonese knowledge benchmarks [31]. We employ bert-base-chinese [1] as a model representing direct transfer from Mandarin, and fine-tune the model on the downstream tasks without additional conditioning on Cantonese text. We do not make any changes to the model’s tokenizer.

5.0.0.4 Monolingual Cantonese model.

We train a monolingual Cantonese model from scratch using the BERT architecture [1]. While previous work in Cantonese language modeling has incorporated additional datasets [12], many of them are proprietary, may include non-Cantonese text, or may not be freely used, as pointed out by [14]. Thus, similar to the Cantonese-adapted model, we train the monolingual Cantonese model on the publicly available Cantonese Wikipedia dump [48] and the cantonese-sentences corpus [61]. The amount of data, totaling 700M characters, is an order of magnitude smaller than the 3.3B words used to train bert-base-uncased [1]. We are unable to make an exact comparison to bert-base-chinese, as its training details are not publicly available, although we suspect the Mandarin Wikipedia dump, which comprised around 1 million articles at the time of bert-base-chinese training (cf. 137k articles for Cantonese), was used for part of its training.

For the monolingual model, we train a SentencePiece byte-pair encoding tokenizer on the same data to obtain a Cantonese-only tokenizer. Unlike the bert-base-chinese tokenizer, which includes only 8k single-character tokens, the Cantonese tokenizer captures subword structure with around 32k tokens, of which 27k are multi-character tokens. Moreover, a high lexical overlap is maintained, with 4.8k characters represented in both the bert-base-chinese tokenizer and ours. While we expected greater coverage of Cantonese-only characters from the Hong Kong Supplementary Character Set (HKSCS), this is not the case: the bert-base-chinese tokenizer covers 603 HKSCS characters, while ours covers only 233. This result reflects the relative scarcity of Cantonese-specific data compared to the abundant Mandarin data available.
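
A sketch of the tokenizer training step, assuming the SentencePiece library and a single corpus file; the file names, vocabulary size, and character coverage setting are illustrative.

```python
# Sketch: train a SentencePiece BPE tokenizer on the Cantonese corpus.
# The ~32k vocabulary mirrors the figure above; file names and settings are illustrative.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="cantonese_corpus.txt",       # one sentence or excerpt per line
    model_prefix="canto_bpe",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=0.9995,          # retain rare CJK / HKSCS characters where possible
)

sp = spm.SentencePieceProcessor(model_file="canto_bpe.model")
print(sp.encode("佢今日好攰", out_type=str))
```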

Each model is fine-tuned on the training split of a downstream NLU task, then evaluated on the test split of the same task, with the exception of LAJ and WSD, which are evaluated without fine-tuning as described in Section 4. We report accuracy for NLI, LAJ, and WSD; F1 for POS, LD, and SA; and UAS and LAS for DEPS. Following NLU convention [1], we do not freeze model weights and allow them to be updated during fine-tuning. We report task-specific hyperparameter choices for fine-tuning in Table 2.

Table 2: Hyperparameters used for fine-tuning across tasks. LR = learning rate. Patience refers to the number of epochs before early stopping.
Task LR Batch Epochs
NLI 2e-5 16 3
POS 2e-5 32 3
DEP 3e-5 16 20
LD 2e-6 16 3
SA 2e-5 16 3
LAJ No fine-tuning performed
WSD No fine-tuning performed
Table 3: Performance of Mandarin, Cantonese-adapted, and monolingual models across Cantonese NLU tasks. For comparison, we include the open-weight but closed-source models bert-base-chinese [1] and bert-base-cantonese [26].
WSD LAJ LD NLI SA POS DEPS Avg.
Model Acc. F1 F1 Acc. F1 F1 UAS LAS
No Cantonese adaptation
bert-base-chinese 78.9 91.7 76.4 93.2 70.2 74.6 29.1 25.9 67.5
Cantonese-adapted
bert-base-cantonese 92.7 89.2 78.7 93.2 71.9 72.2 30.2 26.8 69.4
Our transfer model 85.3 89.6 78.4 87.5 71.3 74.8 30.0 26.8 68.0
Monolingual Cantonese
Our monolingual model 70.6 85.7 73.3 82.6 70.1 78.2 32.4 27.9 65.1

6 Results and Discussion↩︎

Our results across the 7 Cantonese NLU tasks are shown in Table 3: our monolingual Cantonese model demonstrates the highest performance on POS and DEPS; the Cantonese-adapted models on NLI, LD, and WSD; and the Mandarin model without Cantonese adaptation on NLI and LAJ. Our transfer model's performance slightly trails that of bert-base-cantonese. We highlight three main findings. First, Cantonese monolingual models excel in syntactic tasks, while Cantonese-adapted Mandarin models are currently the most effective approach for Cantonese semantic tasks. Second, Mandarin-only models can still perform competitively on some tasks. Finally, despite the relative success of monolingual and transfer models, there remains substantial room for improvement in representing the Cantonese lexicon, syntax, and semantics.

6.0.0.1 Transfer from Mandarin is the most effective for Cantonese NLU.

Supporting the empirical success of Mandarin-to-Cantonese transfer seen in contemporary Cantonese NLP, Cantonese-adapted Mandarin models offer the strongest performance, with task-averaged scores of 69.4 (bert-base-cantonese) and 68.0 (our open-source transfer model). While the monolingual model excels in the syntactic tasks POS and DEPS, its average score of 65.1 lags behind those of the Cantonese-adapted and Mandarin models. We attribute this primarily to the small size of our Cantonese pretraining corpus (roughly 700M characters). As discussed in Section 5, this is an order of magnitude smaller than the datasets used to train comparable English or Mandarin models, and therefore insufficient for learning robust linguistic representations from scratch.

6.0.0.2 A well-trained Mandarin model may be sufficient for some Cantonese NLP.

Interestingly, bert-base-chinese achieves comparable, and in some cases superior, performance to its Cantonese-adapted and Cantonese-monolingual counterparts, with an average score of 67.5. When scholars discuss mutual intelligibility among Sinitic languages, they typically refer to the spoken form [16], [63]. However, the written Sinitic languages are far more mutually intelligible, given the historical influence of Mandarin as the dominant language of education and literacy across Chinese-speaking regions [8], [14]. This suggests that effective written Cantonese understanding can emerge from training on Mandarin text without explicit exposure to Cantonese text, as also observed in [12], where LLMs without explicit Cantonese support perform relatively well on Cantonese and Hong Kong-related knowledge benchmarks.

6.0.0.3 Despite effective transfer, Cantonese representations remain limited.

While models achieve respectable performance on NLI, POS tagging, and lexical disambiguation, dependency parsing remains notably weak—likely due to both the small fine-tuning dataset of around 1k sentences and the inherent difficulty of the task. Nonetheless, dependency parsing for other low-resource languages such as Buryat [64] and Old English [65] has reached higher performance with smaller datasets, suggesting that current Cantonese representations still lack sufficient syntactic and semantic grounding.

Taken together, these results point to the promise of continued pre-training for Cantonese NLP and to the need for richer, higher-quality Cantonese corpora to close the representational gap.

7 Limitations↩︎

While we propose a novel evaluation framework and offer insights into Cantonese representation learning, several limitations remain. First, the size and coverage of available Cantonese corpora significantly constrain our results. Our monolingual Cantonese model was trained on roughly 700M characters, an order of magnitude smaller than corpora typically used to pre-train large language models in other major languages [1]. This data sparsity likely limits the model's ability to capture the full lexical and semantic diversity of written Cantonese, and may explain the comparatively weaker performance of the monolingual model on semantic tasks. Future work would benefit from expanding and diversifying Cantonese text resources, especially in informal and user-generated domains that reflect contemporary usage.

Second, our evaluation benchmark, though designed to cover multiple aspects of Cantonese NLU, is itself constrained by data availability. Some tasks, such as dependency parsing (DEPS), rely on small fine-tuning datasets [21], making the results more susceptible to statistical noise and limiting generalization. In addition, the benchmark focuses on written Cantonese and does not address spoken or colloquial aspects of the language, such as code-switching [66] or informal registers, in depth; these aspects are integral to Cantonese, which remains a primarily spoken language [8].

As a result, while our findings suggest that Mandarin-to-Cantonese transfer is effective for semantic tasks and Mandarin models perform sufficiently well, they should be interpreted as a reflection of current data and resource disparities, rather than representative features of Mandarin and Cantonese.

8 Conclusion↩︎

In this paper, we describe a benchmark of Cantonese NLU tasks, and evaluate a monolingual Cantonese model, two transfer models from Mandarin, and a Mandarin model on said benchmark to investigate the contexts in which each type of model is most effective. Our results indicate that both Cantonese monolingual models and Cantonese-adapted models with cross-lingual transfer from Mandarin have merit in today's landscape of language resources. In addition, direct transfer from a Mandarin model without Cantonese representation learning may suffice for some tasks. Our findings also suggest that the existing open-source Cantonese corpora are insufficient for training reliable representations of the Cantonese lexicon, syntax, and semantics.

Our benchmark and analyses provide the first systematic investigation into and evidence of whether and how transfer from Mandarin is effective for performing Cantonese linguistic tasks. By establishing a framework and pipeline for training monolingual or transfer models and evaluating them, we hope to catalyze broader progress in the space of Cantonese NLP.

Transfer may light the way, but data will pave the road.

Responsible Research Statement↩︎

While we do not collect human annotations for our work, we acknowledge that we make use of datasets that were annotated by humans or otherwise adapted from a human source.

We make use of a variety of Cantonese NLP and language resources. In addition to our discussion of them in Sections 3, 4, and 5, we acknowledge their use and organize them below.

Finally, we disclose that we use ChatGPT4 and PyCharm’s AI5 as a writing and coding assistant during the project’s implementation and the paper’s writeup.

Bibliographical References↩︎

References↩︎

[1]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. https://doi.org/10.18653/v1/N19-1423. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
[2]
Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. 2020. https://doi.org/10.1162/tacl_a_00343. Transactions of the Association for Computational Linguistics, 8:726–742.
[3]
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
[4]
Liang Xu, Hai Hu, Xuanwei Zhang, Lu Li, Chenjie Cao, Yudong Li, Yechen Xu, Kai Sun, Dian Yu, Cong Yu, Yin Tian, Qianqian Dong, Weitang Liu, Bo Shi, Yiming Cui, Junyi Li, Jun Zeng, Rongzhao Wang, Weijian Xie, Yanting Li, Yina Patterson, Zuoyu Tian, Yiwen Zhang, He Zhou, Shaoweihua Liu, Zhe Zhao, Qipeng Zhao, Cong Yue, Xinrui Zhang, Zhengliang Yang, Kyle Richardson, and Zhenzhong Lan. 2020. https://doi.org/10.18653/v1/2020.coling-main.419. In Proceedings of the 28th International Conference on Computational Linguistics, pages 4762–4772, Barcelona, Spain (Online). International Committee on Computational Linguistics.
[5]
Beilei Xiang, Changbing Yang, Yu Li, Alex Warstadt, and Katharina Kann. 2021. https://doi.org/10.18653/v1/2021.eacl-main.242. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2784–2790, Online. Association for Computational Linguistics.
[6]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023. Qwen technical report. arXiv preprint arXiv:2309.16609.
[7]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
[8]
Don Snow. 2004. Cantonese as written language: The growth of a written Chinese vernacular, volume 1. Hong Kong University Press.
[9]
David C. S. Li. 2006. https://doi.org/10.1017/S0267190506000080. Annual Review of Applied Linguistics, 26:149–176.
[10]
Evelyn Kai-Yan Liu. 2022. https://aclanthology.org/2022.vardial-1.4/. In Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 28–40, Gyeongju, Republic of Korea. Association for Computational Linguistics.
[11]
Rong Xiang, Hanzhuo Tan, Jing Li, Mingyu Wan, and Kam-Fai Wong. 2022. https://doi.org/10.18653/v1/2022.aacl-tutorials.3. In Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Tutorial Abstracts, pages 16–21, Taipei. Association for Computational Linguistics.
[12]
Jiyue Jiang, Alfred Kar Yin Truong, Yanyu Chen, Qinghang Bao, Sheng Wang, Pengan Chen, Jiuming Wang, Lingpeng Kong, Yu Li, and Chuan Wu. 2025. Developing and utilizing a large-scale cantonese dataset for multi-tasking in large language models. arXiv preprint arXiv:2503.03702.
[13]
David M. Eberhard, Gary F. Simons, and Charles D. Fennig, editors. 2023. http://www.ethnologue.com, 26 edition. SIL International, Dallas.
[14]
Rong Xiang, Ming Liao, and Jing Li. 2024. https://aclanthology.org/2024.sighan-1.8/. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), pages 69–79, Bangkok, Thailand. Association for Computational Linguistics.
[15]
Jerry Norman. 1988. Chinese. Cambridge University Press.
[16]
Chaoju Tang and Vincent J van Heuven. 2007. Mutual intelligibility and similarity of chinese dialects: Predicting judgments from objective measures. Linguistics in the Netherlands, 24(1):223–234.
[17]
Hannah Liu, Junghyun Min, Ethan Yue Heng Cheung, Shou-Yi Hung, Syed Mekael Wasti, Runtong Liang, Shiyao Qian, Shizhao Zheng, Elsie Chan, Ka Ieng Charlotte Lo, et al. 2025. Siniticmterror: A machine translation dataset with error annotations for sinitic languages. arXiv preprint arXiv:2509.20557.
[18]
Yuqian Dai, Chun Fai Chan, Ying Ki Wong, and Tsz Ho Pun. 2025. https://aclanthology.org/2025.loreslm-1.32/. In Proceedings of the First Workshop on Language Models for Low-Resource Languages, pages 427–436, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
[19]
Cheng Chung Shing. 2025. https://huggingface.co/datasets/hon9kon9ize/yue-all-nli.
[20]
Ziqiong Zhang, Qiang Ye, Zili Zhang, and Yijun Li. 2011. https://doi.org/https://doi.org/10.1016/j.eswa.2010.12.147. Expert Systems with Applications, 38(6):7674–7682.
[21]
Tak-sum Wong, Kim Gerdes, Herman Leung, and John Lee. 2017. https://aclanthology.org/W17-6530/. In Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), pages 266–275, Pisa, Italy. Linköping University Electronic Press.
[22]
King Yiu Suen, Rudolf Chow, and Albert Y.S. Lam. 2024. https://doi.org/10.18653/v1/2024.loresmt-1.8. In Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), pages 74–84, Bangkok, Thailand. Association for Computational Linguistics.
[23]
Bryan Li, Xinyue Wang, and Homayoon Beigi. 2019. Cantonese automatic speech recognition using transfer learning from mandarin. arXiv preprint arXiv:1911.09271.
[24]
Jian Luo, Jianzong Wang, Ning Cheng, Edward Xiao, Jing Xiao, Georg Kucsko, Patrick O’Neill, Jagadeesh Balam, Slyne Deng, Adriana Flores, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, and Jason Li. 2021. https://doi.org/10.1109/ICME51207.2021.9428334. In 2021 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6.
[25]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/W18-5446. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
[26]
Cheng Chung Shing. 2024. https://huggingface.co/hon9kon9ize/bert-base-cantonese.
[27]
Yoshua Bengio. 2012. https://proceedings.mlr.press/v27/bengio12a.html. In Proceedings of ICML Workshop on Unsupervised and Transfer Learning, volume 27 of Proceedings of Machine Learning Research, pages 17–36, Bellevue, Washington, USA. PMLR.
[28]
Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. https://doi.org/10.18653/v1/N19-5004. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.
[29]
Iker García-Ferrero, Rodrigo Agerri, and German Rigau. 2022. https://doi.org/10.18653/v1/2022.findings-emnlp.478. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 6403–6416, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
[30]
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021. https://openreview.net/forum?id=d7KBjmI3GmQ. In International Conference on Learning Representations.
[31]
Tsz Chung Cheng, Chung Shing Cheng, Chaak-ming Lau, Eugene Lam, Wong Chun Yat, Hoi On Yu, and Cheuk Hei Chong. 2025. https://doi.org/10.18653/v1/2025.conll-1.1. In Proceedings of the 29th Conference on Computational Natural Language Learning, pages 1–11, Vienna, Austria. Association for Computational Linguistics.
[32]
Jiyue Jiang, Pengan Chen, Liheng Chen, Sheng Wang, Qinghang Bao, Lingpeng Kong, Yu Li, and Chuan Wu. 2025. https://doi.org/10.18653/v1/2025.findings-naacl.253. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 4464–4505, Albuquerque, New Mexico. Association for Computational Linguistics.
[33]
Vésteinn Snæbjarnarson, Annika Simonsen, Goran Glavaš, and Ivan Vulić. 2023. https://aclanthology.org/2023.nodalida-1.74/. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 728–737, Tórshavn, Faroe Islands. University of Tartu Library.
[34]
Aditya Khan, Mason Shipton, David Anugraha, Kaiyao Duan, Phuong H. Hoang, Eric Khiu, A. Seza Doğruöz, and En-Shiun Annie Lee. 2025. https://aclanthology.org/2025.coling-main.463/. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6937–6952, Abu Dhabi, UAE. Association for Computational Linguistics.
[35]
Harish Thangaraj, Ananya Chenat, Jaskaran Singh Walia, and Vukosi Marivate. 2024. http://arxiv.org/abs/2409.10965.
[36]
Wesley Scivetti, Lauren Levine, and Nathan Schneider. 2025. https://aclanthology.org/2025.coling-main.247/. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3655–3669, Abu Dhabi, UAE. Association for Computational Linguistics.
[37]
Vitaly Protasov, Elisei Stakovskii, Ekaterina Voloshina, Tatiana Shavrina, and Alexander Panchenko. 2024. https://doi.org/10.18653/v1/2024.loresmt-1.10. In Proceedings of the Seventh Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2024), pages 94–108, Bangkok, Thailand. Association for Computational Linguistics.
[38]
Shadi Manafi and Nikhil Krishnaswamy. 2024. https://aclanthology.org/2024.lrec-main.372/. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 4174–4184, Torino, Italia. ELRA and ICCL.
[39]
Stephen Matthews. 2006. On serial verb constructions in cantonese. Serial verb constructions: A cross-linguistic typology, 2.
[40]
Foong Ha Yap and Winnie Chor. 2011. https://homepage.ntu.edu.tw/ gilntu/icpeal2018cldc/files/cldc2011_program.pdf. 2011 The 5th Conference on Language, Discourse and Cognition (CLDC-5) ; Conference date: 29-04-2011 Through 01-05-2011.
[41]
Stephen Matthews and Virginia Yip. 2011. https://doi.org/https://doi.org/10.4324/9780203835012, 2nd edition. Routledge.
[42]
Xiaoheng Zhang. 1998. https://aclanthology.org/C98-2233/. In COLING 1998 Volume 2: The 17th International Conference on Computational Linguistics.
[43]
Charles Lam. 2020. https://aclanthology.org/2020.paclic-1.64/. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, pages 562–567, Hanoi, Vietnam. Association for Computational Linguistics.
[44]
Peppina Po-lun Lee. 2020. https://doi.org/10.1017/S0022226720000110. Journal of Linguistics, 56(4):701–743.
[45]
T.A. O’Melia. 1965. https://books.google.com/books?id=I6MKAAAAMAAJ. Number v. 1 in First Year Cantonese. Catholic Truth Society.
[46]
Jackson Lee, Litong Chen, Charles Lam, Chaak Ming Lau, and Tsz-Him Tsui. 2022. https://aclanthology.org/2022.lrec-1.711/. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 6607–6611, Marseille, France. European Language Resources Association.
[47]
Steven Bird and Edward Loper. 2004. https://aclanthology.org/P04-3031/. In Proceedings of the ACL Interactive Poster and Demonstration Sessions, pages 214–217, Barcelona, Spain. Association for Computational Linguistics.
[48]
Wikimedia Foundation. 2023. https://dumps.wikimedia.org.
[49]
David Ifeoluwa Adelani, Hannah Liu, Xiaoyu Shen, Nikita Vassilyev, Jesujoba O. Alabi, Yanke Mao, Haonan Gao, and En-Shiun Annie Lee. 2024. https://doi.org/10.18653/v1/2024.eacl-long.14. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 226–245, St. Julians, Malta. Association for Computational Linguistics.
[50]
NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. 2024. https://doi.org/10.1038/s41586-024-07335-x. Nature, 630(8018):841–846.
[51]
Sungjoon Park, Jihyung Moon, Sungdong Kim, Won Ik Cho, Ji Yoon Han, Jangwon Park, Chisung Song, Junseong Kim, Youngsook Song, Taehwan Oh, Joohong Lee, Juhyun Oh, Sungwon Lyu, Younghoon Jeong, Inkwon Lee, Sangwoo Seo, Dongjun Lee, Hyunwoo Kim, Myeonghwa Lee, Seongbo Jang, Seungwon Do, Sunkyoung Kim, Kyungtae Lim, Jongwon Lee, Kyumin Park, Jamin Shin, Seonghyun Kim, Lucy Park, Alice Oh, Jung-Woo Ha, and Kyunghyun Cho. 2021. https://openreview.net/forum?id=q-8h8-LZiUm. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[52]
Phong Nguyen-Thuan Do, Son Quoc Tran, Phu Gia Hoang, Kiet Van Nguyen, and Ngan Luu-Thuy Nguyen. 2024. https://doi.org/10.18653/v1/2024.findings-naacl.15. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 211–222, Mexico City, Mexico. Association for Computational Linguistics.
[53]
Bryan Wilie, Karissa Vincentio, Genta Indra Winata, Samuel Cahyawijaya, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti. 2020. https://doi.org/10.18653/v1/2020.aacl-main.85. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 843–857, Suzhou, China. Association for Computational Linguistics.
[54]
Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2019. https://doi.org/10.1162/tacl_a_00290. Transactions of the Association for Computational Linguistics, 7:625–641.
[55]
Kyle Mahowald, Peter Graff, Jeremy Hartman, and Edward Gibson. 2016. Snap judgments: A small n acceptability paradigm (snap) for linguistic acceptability judgments. Language, 92(3):619–635.
[56]
Tal Linzen and Yohei Oseki. 2018. https://doi.org/https://doi.org/10.5334/gjgl.528. Glossa: a journal of general linguistics, 3(1).
[57]
Masoud Jalili Sabet, Philipp Dufter, François Yvon, and Hinrich Schütze. 2020. https://www.aclweb.org/anthology/2020.findings-emnlp.147. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 1627–1643, Online. Association for Computational Linguistics.
[58]
Bernard Yue and Andrew Gallant. 2014. https://pypi.org/project/hanziconv/0.2.2/. Version 0.2.2.
[59]
Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. https://doi.org/10.18653/v1/D15-1075. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
[60]
Adina Williams, Nikita Nangia, and Samuel Bowman. 2018. https://doi.org/10.18653/v1/N18-1101. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
[61]
Raptor Kwok. 2024. https://huggingface.co/datasets/raptorkwok/cantonese_sentences.
[62]
Qing Li. 2024. Fine-tuning guangdong cantonese based on wav2vec 2.0 xlrs model pretrained on mandarin chinese to improve asr performance. Master’s thesis, University of Groningen, Campus Fryslân, July. Supervised by Dr. Shekhar Nayak; second reader: Associate Professor Matt Coler.
[63]
Charlotte Gooskens and Vincent J Van Heuven. 2021. Mutual intelligibility. Similar languages, varieties, and dialects: A computational perspective, pages 51–95.
[64]
Elena Badmaeva and Francis M. Tyers. 2017. Dependency treebank for buryat. In Proceedings of the 15th International Workshop on Treebanks and Linguistic Theories (TLT15), pages 1–12.
[65]
Lauren Levine, Junghyun Min, and Amir Zeldes. 2025. https://aclanthology.org/2025.udw-1.10/. In Proceedings of the Eighth Workshop on Universal Dependencies (UDW, SyntaxFest 2025), pages 97–104, Ljubljana, Slovenia. Association for Computational Linguistics.
[66]
Odilla Yim and Ellen Bialystok. 2012. https://doi.org/10.1017/S1366728912000478. Bilingualism: Language and Cognition, 15(4):873–883.
[67]
Kang-Kwong Luke and May LY Wong. 2015. The hong kong cantonese corpus: design and uses. Journal of Chinese Linguistics, pages 309–330.
[68]
Chaak-ming Lau, Grace Wing-yan Chan, Raymond Ka-wai Tse, and Lilian Suet-ying Chan. 2022. Words. hk: a comprehensive cantonese dictionary dataset with definitions, translations and transliterated examples. In Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages within the 13th Language Resources and Evaluation Conference, pages 53–62.

  1. bert-base-chinese↩︎

  2. The illustrated example is from Mandarin, due to LaTeX compilers’ difficulty with Cantonese text.↩︎

  3. The citation is attributable to the bert language model as a whole; it does not include implementation details on bert-base-chinese.↩︎

  4. https://chat.openai.com↩︎

  5. https://www.jetbrains.com/pycharm/features/ai/↩︎