October 23, 2025
Mandarin Chinese is considered a high-resource language, abundant with pre-trained language model (PLM) support [1]–[3], corpora, and evaluation benchmarks [4], [5], and most recently commercial large language models [6], [7]. However, the same cannot be said about other variants of the Sinitic language family, including Cantonese. Mandarin is the official state language and the prestige language in media, business, and academia. Owing to this status, Cantonese, along with other Sinitic languages, remains a primarily vernacular language without written standardization [8], [9].
Such lack of written standardization, prestige, and official status results in a shortage of language resources in Cantonese. It is frequently described as a low-resource language [10]–[12] despite having millions of speakers [13]. As a result, Cantonese language processing systems often rely on datasets, models, and corpora adapted from Mandarin [14], despite a lack of mutual intelligibility between the two languages [15], [16].
| Task | Requires | Size | Source |
|---|---|---|---|
| Sentence-level tasks | | | |
| Acceptability | L, F, M | 1.6k | MT error dataset [17] adapted |
| Lang Detection | L, F | 47k | Parallel corpus [18] aligned and perturbed |
| NLI | M | 570k | Machine-translated English NLI [19] |
| Sentiment | M | 12k | Hong Kong restaurant reviews [20] |
| Word-level tasks | | | |
| WSD | L, M | 109 | Manual compilation |
| POS, Dep parsing | F | 14k | Universal Dependencies dataset [21] |
Cross-lingual transfer from Mandarin for Cantonese language processing has proven effective, with empirical success across a range of tasks, including language modeling and reading comprehension [12], translation [10], [22], and speech recognition [23], [24]. However, despite empirical success, there remains a gap in formal investigations into the best practices for Cantonese natural language understanding (NLU). This is attributable to a lack of a centralized evaluation framework for Cantonese language processing or understanding.
In this paper, we introduce CantoNLU, a GLUE-like [25] natural language understanding benchmark in Cantonese. CantoNLU provides an in-depth evaluation of syntax, lexicon and semantic understanding, comprising 7 tasks: word sense disambiguation (WSD), linguistic acceptability judgment (LAJ), language detection (LD), natural language inference (NLI), sentiment analysis (SA), part-of-speech tagging (POS), and dependency parsing (DEPS). In particular, the WSD dataset is entirely novel, providing the first resource for sense-level lexical understanding in Cantonese. LAJ and LD represent novel adaptations of existing datasets, while NLI, SA, POS, and DEPS datasets are direct adoptions of existing datasets [19]–[21]. We describe each task and underlying datasets in Section 4, whose overview is outlined in Table 1.
Using CantoNLU, we evaluate three types of models: a Mandarin model without explicit training on Cantonese, Cantonese-adapted transfer models obtained by additional Cantonese training of a Mandarin model, and a monolingual Cantonese model, as outlined in Section 5. In Section 6, we discuss how monolingual Cantonese, Cantonese-adapted, and Mandarin models compare across various aspects of Cantonese NLU.
Despite limited training data in Cantonese, we demonstrate that our monolingual Cantonese model excels in syntactic tasks such as POS and DEPS. Cantonese-adapted models, on the other hand, excel in semantic tasks such as NLI, LD, WSD, and SA. Meanwhile, Mandarin models offer a competitive alternative to additional or monolingual training on Cantonese data, excelling in NLI and LAJ.
Based on our results, we recommend monolingual Cantonese models for syntactic tasks, and Cantonese-adapted models for semantic tasks. In domains where Cantonese corpora are scarce, Mandarin models without Cantonese adaptation may be sufficient. In addition to the benchmark and analysis, we publicly release our code and model weights at github.com/aatlantise/sinitic-nlu.
Cantonese, also known as Yue [13], is the second most widely used Sinitic language after Mandarin, spoken by an estimated 85 million people worldwide. However, it remains primarily a spoken language [15], with most speakers defaulting to written Standard Chinese [14]. This diglossic situation and its short history as a written language [8] have contributed to the scarcity of high-quality textual resources for Cantonese NLP, in stark contrast to Mandarin Chinese, a related language rich in resources across corpora, models [1]–[3], [6], [7], and evaluation resources [4], [5]. Thus, as highlighted in Section 1, prior work has used varying degrees of transfer from models with Mandarin knowledge to perform downstream tasks in Cantonese [12], [23], [26].
Transfer learning is a common strategy in deep learning (DL) to address data sparsity, where features or signals learned from a resource-rich domain are used to augment a low-resource domain, task, or dataset [27]. In the context of NLP, a common application of transfer learning is cross-lingual transfer for low-resource languages [3], [28]. Two types of cross-lingual transfer exist: model transfer and data transfer, as described in [29]. Data transfer involves translating a dataset or corpus from a high-resource language to a lesser-resourced target language, as performed in yue-all-nli [19], the MMLU [30] portion of the HKCanto-Eval benchmark [31], and Yue benchmarks from [32].
On the other hand, model transfer involves adapting models trained on a high-resource source language to a lesser-resourced target language, such as Mandarin to Cantonese and Danish to Faroese [33]. Model transfer relies on the lexical or typological similarity between the two languages, thereby enabling the transfer of linguistic knowledge [34]. While some implementations of cross-lingual transfer explicitly designate a source language for cross-lingual transfer [33], [35], others rely on multilingual models. Rather than transferring from a specific source language, such models are thought to capture general cross-linguistic patterns that extend beyond typological or lexical similarity, allowing them to perform reasonably well even on unseen or low-resource languages [3], [36]. Hybrid approaches combine both paradigms: they begin with a multilingual model, then continue pre-training or fine-tuning on a specific target language before applying the resulting model to downstream tasks [37], [38].
Due to the richness of Mandarin language resources and the strong performance of Mandarin PLMs and LLMs, most cross-lingual transfer to Cantonese uses Mandarin as the source language [10], [22]–[24], [26], with exceptions using a multilingual model [12].
However, there is emerging work suggesting limitations in models’ ability to capture Cantonese idiosyncrasies. Factors previously identified include substantial dissimilarities in lexicon, syntax and writing systems [22], along with the prevalence of colloquial phrases and code-switching in more recent Cantonese corpora [32]. These issues hinder the ability of Mandarin-trained models in Cantonese [14] due to over-reliance on Mandarin linguistic knowledge [31].
Cantonese diverges from Mandarin in word order, particle and grammatical word inventory, and morphology, as highlighted in theoretical linguistics work [39]–[41] as well as prior work in Cantonese-Mandarin machine translation and corpus linguistics [17], [42], [43]. For example, in double object constructions, Mandarin takes the indirect-direct order as seen in (1), while Cantonese takes the direct-indirect order as seen in (2) [41].
(1) gei3 ni3 qian2 // give you money // ‘(I) give you money’
(2) bei2 cin2 nei5 // give money you // ‘(I) give you money’
Cantonese also features a substantially larger and more diverse inventory of particles and aspect markers than Mandarin, enabling speakers to encode subtle distinctions of tense, stance, and speaker attitude [40], [41]. Cantonese morphology is more flexible, with more frequent verb serialization [39] and reduplication [43], [44] than in Mandarin. This allows sequences such as (3), sourced from [45], where three verbs combine to describe one scene, and (4), where reduplicating the classifier (counting noun) yields a universal quantifier (‘every’). Given such unique properties of Cantonese grammar, the availability of high-quality Cantonese data is critical to performing well on Cantonese linguistic tasks.
(3) keoi5 jap6 heoi3 co5 // 3SG enter go sit // ‘He went in and sat down’
(4) zek3 zek3 gau2 // CL CL dog // ‘every dog’
Although Cantonese is widely considered low-resource, recent efforts have begun to address the scarcity of Cantonese language resources. Machine translations of non-Cantonese resources may offer utility as a form of data transfer, as provided by [19] for Cantonese natural language inference (NLI) and [31], [32] for LLM knowledge evaluation. Foundational tools such as PyCantonese [46] provide essential NLTK-like [47] language processing utilities, while a Cantonese Universal Dependencies dataset [21] offers a small yet significant syntactically annotated treebank. Multilingual resources such as Wikipedia [48], SIB-200 [49], and NLLB [50] include Cantonese portions, which may be useful for representation learning or evaluation in topic classification and multilingual translation. Larger-scale corpora such as YueData [12] have also emerged, though they often contain substantial portions of non-Cantonese text.
Beyond corpora, work has extended to specific applications, including sentiment analysis [20], automatic speech recognition [23], and machine translation [10], [17], [18], [22]. Most recently, [31] introduced HKCanto-Eval, a comprehensive benchmark for evaluating linguistic and cultural understanding in Cantonese. [32] presents another benchmark for evaluating LLM reasoning, knowledge, and logic in Cantonese.
On the modeling side, prior work has explored both general-purpose and Cantonese-specific pretrained models. While commercial multilingual models such as Qwen [6] and DeepSeek [7] provide Cantonese support, their Cantonese proficiency lags behind that in English or Mandarin [32]. [12] use their YueData corpus to train YueTung-7b, a continually pre-trained model based on a Qwen-7B model [6], which exhibits improved Cantonese performance compared to other open-source and commercial LLMs. At the smaller end in terms of parameter count, bert-base-cantonese [26] is, to our knowledge, the only encoder-only Cantonese model. It is obtained by continually pre-training bert-base-chinese on Cantonese news articles, social media posts, and web pages. Implementation details, training recipes, and corpus selection for both bert-base-chinese and bert-base-cantonese are obscure, as neither model provides a description paper or technical report. In particular, [1] describes the BERT architecture in general but not the Chinese model.
Although we have described the Cantonese benchmarks HKCanto-Eval [31] and Yue-Benchmark [32] as related work, we wish to clarify our contributions to Cantonese benchmarking in comparison to these two benchmarks. While HKCanto-Eval includes OpenRice [20] as an SA dataset, which overlaps with CantoNLU, along with minor datasets in Cantonese phonology and orthography, both HKCanto-Eval and Yue-Benchmark primarily target generative LLMs on general world knowledge and reasoning, in a manner comparable to MMLU [30]. We highlight CantoNLU’s focus on discriminative NLU tasks, explicitly evaluating a model’s ability to understand the Cantonese lexicon (WSD), syntax (POS, DEPS), semantics (NLI, SA), and overall well-formedness with respect to both syntax and semantics (LAJ). The focus is similar to that of GLUE [25] and its derivatives in Korean [51], Chinese [4], Vietnamese [52], and Indonesian [53].
We introduce a benchmark of seven Cantonese NLU tasks, encompassing word sense disambiguation (WSD), linguistic acceptability judgment (LAJ), language detection (LD), natural language inference (NLI), sentiment analysis (SA), part-of-speech tagging (POS), and dependency parsing (DEPS). While all tasks require Cantonese proficiency, tasks such as LD may reward Mandarin knowledge, whereas tasks such as DEPS and WSD may penalize Mandarin knowledge. WSD, LAJ, and LD datasets are novel contributions to Cantonese NLU.
First, we manually compile Cantonese words with more than one attested meaning to create the first Cantonese word sense disambiguation dataset. We collect the words’ multiple senses and at least two example sentences containing the word for each sense, resulting in 41 multi-sense words with a total of 109 senses. The dataset does not require fine-tuning for evaluation: model predictions are obtained by masking the target word in each sentence and comparing cosine similarities between the hidden representations at the mask position. For each target word \(w\) with two contexts \(s_i\) and \(s_j\), we obtain from the model the hidden representations \(\mathbf{h}_i\) and \(\mathbf{h}_j\) at the masked position of each context.
Using cosine similarity, we define the same-sense score \(s_{\text{same}}\) and the different-sense score \(s_{\text{diff}}\) as \[s_{\text{same}} = \frac{1}{|P_{\text{same}}|} \sum_{(i,j) \in P_{\text{same}}} \cos(\mathbf{h}_i, \mathbf{h}_j),\] \[s_{\text{diff}} = \frac{1}{|P_{\text{diff}}|} \sum_{(i,j) \in P_{\text{diff}}} \cos(\mathbf{h}_i, \mathbf{h}_j),\] where \(P_{\text{same}}\) and \(P_{\text{diff}}\) are the sets of sentence pairs containing the same and different senses of \(w\), respectively. A model prediction is considered correct if \(s_{\text{same}} > s_{\text{diff}}\).
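The scoring rule above can be illustrated with a minimal sketch. It assumes that all sentence pairs for one target word are pooled into \(P_{\text{same}}\) and \(P_{\text{diff}}\); the function names and the toy two-dimensional vectors are hypothetical stand-ins for real hidden representations at the mask position.

```python
from itertools import combinations

import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def wsd_scores(hidden, senses):
    """Return (s_same, s_diff) for one target word.

    hidden: one vector per example sentence (representation at the mask).
    senses: the sense label of the target word in each sentence.
    Assumes at least one same-sense pair and one different-sense pair exist.
    """
    same, diff = [], []
    for i, j in combinations(range(len(hidden)), 2):
        sim = cosine(hidden[i], hidden[j])
        (same if senses[i] == senses[j] else diff).append(sim)
    return sum(same) / len(same), sum(diff) / len(diff)


# Toy example: two senses, two sentences each (vectors are made up).
toy_hidden = [np.array([1.0, 0.0]), np.array([0.9, 0.1]),
              np.array([0.0, 1.0]), np.array([0.1, 0.9])]
toy_senses = ["A", "A", "B", "B"]
s_same, s_diff = wsd_scores(toy_hidden, toy_senses)
prediction_correct = s_same > s_diff
```

In this toy case the same-sense vectors point in nearly the same direction, so the prediction counts as correct.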
LAJ is a classification task, where the model predicts whether the given sequence is linguistically acceptable. This often aligns with grammaticality judgment, but may also include judgments on semantic plausibility or pragmatic felicity. We compile the first Cantonese LAJ dataset by adapting the Cantonese portion of SiniticMTError [17], a dataset of Sinitic translation error span annotations, where each datapoint consists of a well-formed reference sentence ref, a machine-translated sentence mt, and annotations of errors in the mt sentence. We consider an error-free ref acceptable and an mt with error annotations not acceptable, creating pairs with one acceptable and one unacceptable version of the same sentence. Unlike previous LAJ datasets such as CoLA [54], which ask for a binary acceptable or not-acceptable judgment, we implement a more robust setup that provides two versions of the same sentence and asks for the more acceptable one, as is preferred in psycholinguistics and cognitive science [55], [56]. The dataset consists of 1.6k pairs.
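The forced-choice setup can be sketched as follows. This is an illustration under assumptions: the `surprisal` callable is a hypothetical stand-in for a model-based sentence score, and the sketch assumes that the sentence with lower surprisal is taken as the more acceptable one.

```python
def pairwise_accuracy(pairs, surprisal):
    """Fraction of (acceptable, unacceptable) pairs ranked correctly.

    pairs: iterable of (ref, mt) sentence pairs, ref being the acceptable one.
    surprisal: callable mapping a sentence to a score (lower = more acceptable).
    """
    pairs = list(pairs)
    correct = sum(surprisal(ref) < surprisal(mt) for ref, mt in pairs)
    return correct / len(pairs)


# Toy scores standing in for a real model (values are made up).
toy_scores = {"ref a": 1.0, "mt a": 2.5, "ref b": 3.0, "mt b": 2.0}
toy_pairs = [("ref a", "mt a"), ("ref b", "mt b")]
toy_accuracy = pairwise_accuracy(toy_pairs, toy_scores.get)
```

Here the first pair is ranked correctly and the second is not, so the toy accuracy is one half.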
Language Detection (LD) is a three-label classification task that identifies whether a given sentence is written in Cantonese, Mandarin, or a mix of the two. We construct the novel dataset from the parallel translation corpus of [18], selecting the first 10,000 sentence pairs. To create mixed-language examples, we randomly replace tokens in Cantonese and Mandarin sentences with their counterparts from the other language with a set probability. In a given mixed sentence, 15%, 33%, or 50% of the sentence may be from the other language, thus producing up to 6 mixed sentences for each pair. Word-level alignments are obtained using SimAlign [57], and all text is converted to traditional orthography using HanziConv [58] to prevent script-based shallow heuristics. The resulting dataset contains 47,578 sentences, of which 27,578 are mixed. We reserve 5% each for the validation and testing splits. We note that this task requires proficiency in both Mandarin and Cantonese to perform well.
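A minimal sketch of the mixing step is given below. It assumes that word alignments are available as `(source_index, target_index)` pairs and that each aligned token is swapped independently with the given probability; the function names, the filtering of unchanged outputs, and the toy sentence pair are illustrative assumptions rather than the exact procedure.

```python
import random


def mix_sentence(src_tokens, tgt_tokens, alignment, rate, rng):
    """Swap each aligned source token for its counterpart with probability `rate`."""
    mixed = list(src_tokens)
    for i, j in alignment:
        if rng.random() < rate:
            mixed[i] = tgt_tokens[j]
    return mixed


def mixed_variants(yue, cmn, alignment, rates=(0.15, 0.33, 0.50), seed=0):
    """Build up to 6 mixed sentences: 3 rates x 2 replacement directions."""
    rng = random.Random(seed)
    flipped = [(j, i) for i, j in alignment]
    variants = []
    for rate in rates:
        for src, tgt, align in ((yue, cmn, alignment), (cmn, yue, flipped)):
            mixed = mix_sentence(src, tgt, align, rate, rng)
            if mixed != src and mixed != tgt:  # keep only genuinely mixed text
                variants.append(mixed)
    return variants


# Toy aligned pair in traditional script ('he is a student').
yue = ["佢", "係", "學生"]
cmn = ["他", "是", "學生"]
alignment = [(0, 0), (1, 1), (2, 2)]
variants = mixed_variants(yue, cmn, alignment)
```

Outputs identical to either original sentence are dropped in this sketch, which is one way a pair could yield fewer than six mixed sentences.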
NLI is a classification task where the model is asked to predict whether the premise entails, or implies the truth of, the hypothesis. The NLI portion of our benchmark is the yue-all-nli dataset [19], which is a machine translation of the English NLI datasets SNLI [59] and MNLI [60] into Cantonese. The dataset comprises 557k train examples, 6.6k development examples, and 6.6k test examples. Each example includes a reference premise and two hypotheses, one entailing and one contradicting, from which we create two examples, one for each label. This results in a balanced, two-label classification dataset.
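The conversion from a premise with two hypotheses into two labeled examples can be sketched as below. The field names and the English toy sentences are hypothetical and chosen only for illustration.

```python
def expand_triplet(premise, entailed, contradicted):
    """Turn one (premise, entailed, contradicted) triplet into two labeled pairs."""
    return [
        {"premise": premise, "hypothesis": entailed, "label": "entailment"},
        {"premise": premise, "hypothesis": contradicted, "label": "contradiction"},
    ]


toy_examples = expand_triplet(
    "A man is playing a guitar.",
    "A person is making music.",
    "Nobody is playing an instrument.",
)
toy_labels = [ex["label"] for ex in toy_examples]
```

Because every triplet contributes exactly one example per label, the resulting dataset is balanced by construction.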
Sentiment analysis is a sentence-level classification task of predicting the sentiment of a given sentence. We use the OpenRice dataset [20] compiled from restaurant reviews in Hong Kong, with the 3-way label space of positive, neutral, and negative. The dataset is balanced, with 10k datapoints in its train split, 1k in its development split, and 1k in its test split.
Finally, POS and DEPS are both token-level classification tasks. A POS model predicts the part-of-speech tag (e.g., noun or verb) of a given word, while a DEPS model predicts the dependency head (which word is the syntactic head of this word?) and the dependency type (what is the relationship between this word and its head?) of the given word. For POS and DEPS, we use the Cantonese-HK Universal Dependencies dataset [21], which comprises 1k sentences and 14k tokens. We split the dataset 9:1 for training and testing, respectively. The dataset contains 15 POS tags and 48 dependency relations, of which 17 are specific to Cantonese. We report both unlabeled (UAS) and labeled attachment scores (LAS) to measure DEPS performance.
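For readers unfamiliar with the two parsing metrics, a minimal sketch follows. UAS counts tokens whose predicted head is correct, while LAS additionally requires the correct relation label. The function name and the toy annotations are illustrative.

```python
def attachment_scores(gold, pred):
    """Return (UAS, LAS) in percent.

    gold, pred: per-token lists of (head_index, relation) tuples.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same tokens")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head only
    las_hits = sum(g == p for g, p in zip(gold, pred))        # head and label
    return 100 * uas_hits / total, 100 * las_hits / total


# Toy four-token sentence: one wrong label, one wrong head.
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "punct")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
```

LAS can never exceed UAS, since every token counted for LAS is also counted for UAS.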
Using the compiled Cantonese NLU benchmark, we evaluate three types of models: Cantonese monolingual, Cantonese-adapted from Mandarin, and Mandarin.
For models requiring training or adaptation into Cantonese, we use two corpora: the Cantonese Wikipedia [48] and a list of 30 million Cantonese sentences compiled by [61]. The two corpora are the publicly available and open-source subset of YueData [12]. The Wikipedia dump includes empty parentheses left by a pre-processing stage that removes text in other languages. For example, (5) becomes (6); this process often yields parentheses that are either empty or contain only punctuation marks. We remove them. Then, the corpora are divided into excerpts of maximum length 128 for pre-training. The Cantonese Wikipedia contains 137k articles, totaling 40M characters, while [61] contains 30M sentences, totaling 660M characters. In total, the pre-training corpora consist of 700M characters.
(5) 電影(英語:movie/ film),特点是运动/移动的画面(英語:motion/ moving picture)
(6) 电影(/ ),特点是运动/移动的画面()
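The clean-up and chunking steps can be sketched as follows. This is a minimal illustration under assumptions: the set of punctuation treated as removable inside parentheses is a guess, and the maximum length of 128 is taken here to be measured in characters, which the text does not specify.

```python
import re

# Parentheses (ASCII or full-width) that enclose only whitespace or punctuation.
EMPTY_PARENS = re.compile(r"[(（][\s/,，;；:：、。.]*[)）]")


def strip_empty_parens(text):
    """Remove parentheses that are empty or contain only punctuation."""
    return EMPTY_PARENS.sub("", text)


def chunk(text, max_len=128):
    """Split text into consecutive excerpts of at most `max_len` characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


cleaned = strip_empty_parens("电影（/ ），特点是运动/移动的画面（）")
excerpt_lengths = [len(piece) for piece in chunk("a" * 300)]
```

Applied to example (6), the pattern removes both leftover parenthesis pairs while keeping the slash that belongs to the running text.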
The Cantonese-adapted model represents the most common text processing effort in Cantonese: taking an existing model with Mandarin support and performing continued pre-training on Cantonese before applying the adapted text or speech model to downstream tasks [12], [24], [26]. In our implementation, we take the off-the-shelf bert-base-chinese model [1] and continually pre-train it on the Cantonese text described above. We do not make any changes to the model’s tokenizer.
In addition, we offer a comparison to a concurrent BERT-based Cantonese model, bert-base-cantonese [26], which is also continually pre-trained, but on a closed-source corpus of news articles, social media posts, and web content.
Direct transfer from Mandarin without conditioning on Cantonese also represents a small subset of Cantonese NLP efforts [10], [62], including the evaluation of multilingual LLMs not trained on Cantonese, such as Llama models, on Cantonese knowledge benchmarks [31]. We employ bert-base-chinese [1] as a model representing direct transfer from Mandarin, and fine-tune the model on the downstream tasks without additional conditioning on Cantonese text. We do not make any changes to the model’s tokenizer.
We train a monolingual Cantonese model from scratch using the BERT architecture [1]. While previous work in Cantonese language modeling has incorporated additional datasets [12], many of them are proprietary, may include non-Cantonese text, or may not be freely usable, as pointed out by [14]. Thus, as with the Cantonese-adapted model, we train the monolingual Cantonese model using the publicly available Cantonese Wikipedia dump [48] and the cantonese-sentences corpus [61]. The amount of data, totaling 700M characters, is an order of magnitude smaller than the 3.3B words used to train bert-base-uncased [1]. We are unable to make an exact comparison to bert-base-chinese as its training details are not publicly available, although we suspect that part of its training used the Mandarin Wikipedia dump, which consisted of around 1 million articles at the time of bert-base-chinese training (cf. 137k articles for Cantonese).
For the monolingual model, we train a SentencePiece byte-pair encoding tokenizer on the same data to obtain a Cantonese-only tokenizer. Unlike the tokenizer from bert-base-chinese, which only includes 8k tokens of character length 1, the Cantonese tokenizer captures subword structure by including around 32k tokens, of which 27k are multi-character tokens. Moreover, a high overlap in lexicons is maintained, with 4.8k characters represented in both bert-base-chinese and our Cantonese tokenizer. While we expected our tokenizer to offer greater coverage of Cantonese-only characters from the Hong Kong Supplementary Character Set (HKSCS), this is not the case: bert-base-chinese’s tokenizer covers 603 HKSCS characters, while ours covers only 233. This result reflects the relative scarcity of Cantonese-specific data compared to the abundant Mandarin data available.
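The overlap and coverage counts above amount to set operations over single-character vocabulary entries. The sketch below shows one way to compute them, assuming that subword markers such as `##` (WordPiece) and `▁` (SentencePiece) are stripped first; the toy vocabularies and the toy target character set are made up and do not reflect the real tokenizers or the real HKSCS list.

```python
def strip_marker(token):
    """Drop a leading WordPiece ('##') or SentencePiece ('\u2581') marker."""
    for marker in ("##", "\u2581"):
        if token.startswith(marker):
            return token[len(marker):]
    return token


def single_chars(vocab):
    """Set of vocabulary entries that are exactly one character long."""
    stripped = (strip_marker(tok) for tok in vocab)
    return {tok for tok in stripped if len(tok) == 1}


def coverage(vocab, target_chars):
    """Number of target characters present as single-character tokens."""
    return len(single_chars(vocab) & set(target_chars))


# Toy vocabularies (hypothetical, for illustration only).
vocab_a = ["我", "##們", "嘅", "[CLS]"]
vocab_b = ["\u2581我", "嘅", "我哋", "咗"]
toy_targets = {"嘅", "咗", "哋"}

shared = single_chars(vocab_a) & single_chars(vocab_b)
coverage_a = coverage(vocab_a, toy_targets)
coverage_b = coverage(vocab_b, toy_targets)
```

Multi-character tokens such as `我哋` are ignored here, so a character that only ever appears inside longer tokens does not count as covered.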
Each model is fine-tuned on the training split of the downstream NLU task, then evaluated on the test split of the same task, with the exception of LAJ and WSD, which are evaluated without fine-tuning (LAJ using model surprisal, and WSD using the hidden-state similarity described in Section 4). We report accuracy metrics for NLI, LAJ, WSD; F1 metrics for POS, LD and SA; and UAS and LAS for DEPS. Following NLU convention [1], we do not freeze model weights and allow them to be updated during fine-tuning. We report task-specific hyperparameter choices during fine-tuning in Table 2.
| Task | LR | Batch | Epochs |
|---|---|---|---|
| NLI | 2e-5 | 16 | 3 |
| POS | 2e-5 | 32 | 3 |
| DEPS | 3e-5 | 16 | 20 |
| LD | 2e-6 | 16 | 3 |
| SA | 2e-5 | 16 | 3 |
| LAJ | No fine-tuning performed | | |
| WSD | No fine-tuning performed | | |
| Model | WSD (Acc.) | LAJ (F1) | LD (F1) | NLI (Acc.) | SA (F1) | POS (F1) | DEPS (UAS) | DEPS (LAS) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| No Cantonese adaptation | | | | | | | | | |
| bert-base-chinese | 78.9 | 91.7 | 76.4 | 93.2 | 70.2 | 74.6 | 29.1 | 25.9 | 67.5 |
| Cantonese-adapted | | | | | | | | | |
| bert-base-cantonese | 92.7 | 89.2 | 78.7 | 93.2 | 71.9 | 72.2 | 30.2 | 26.8 | 69.4 |
| Our transfer model | 85.3 | 89.6 | 78.4 | 87.5 | 71.3 | 74.8 | 30.0 | 26.8 | 68.0 |
| Monolingual Cantonese | | | | | | | | | |
| Our monolingual model | 70.6 | 85.7 | 73.3 | 82.6 | 70.1 | 78.2 | 32.4 | 27.9 | 65.1 |
Our results across the 7 Cantonese NLU tasks are shown in Table 3. Our Cantonese monolingual model demonstrates the highest performance for POS and DEPS; the Cantonese-adapted models for NLI, LD, WSD, and SA; and the Mandarin model without Cantonese adaptation for NLI and LAJ. On average, our transfer model slightly trails bert-base-cantonese. We highlight three main findings. First, Cantonese monolingual models excel in syntactic tasks, while Cantonese-adapted Mandarin models are currently the most effective approach for Cantonese semantic tasks. Second, Mandarin-only models can still perform competitively on some tasks. Finally, despite the relative success of monolingual and transfer models, there remains substantial room for improvement in representing Cantonese lexicon, syntax, and semantics.
Supporting the empirical success of Mandarin-to-Cantonese transfer seen in contemporary Cantonese NLP, Cantonese-adapted Mandarin models offer the strongest performance, with task-averaged scores of 69.4 (bert-base-cantonese) and 68.0 (our open-source transfer model). While the monolingual model excels in the syntactic tasks POS and DEPS, its average score of 65.1 lags behind those of the Cantonese-adapted and Mandarin models. We attribute this primarily to the small size of our Cantonese pretraining corpus (roughly 700M characters). As discussed in Section 5, this is an order of magnitude smaller than the datasets used to train comparable English or Mandarin models, and therefore insufficient for learning robust linguistic representations from scratch.
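As a sanity check on the task-averaged scores, the sketch below recomputes them from the per-task results in Table 3. It assumes that the average is the unweighted mean over the eight reported columns, with UAS and LAS counted separately; this assumption reproduces the values in the Avg. column.

```python
# Per-model scores in column order: WSD, LAJ, LD, NLI, SA, POS, UAS, LAS.
TABLE3 = {
    "bert-base-chinese": [78.9, 91.7, 76.4, 93.2, 70.2, 74.6, 29.1, 25.9],
    "bert-base-cantonese": [92.7, 89.2, 78.7, 93.2, 71.9, 72.2, 30.2, 26.8],
    "Our transfer model": [85.3, 89.6, 78.4, 87.5, 71.3, 74.8, 30.0, 26.8],
    "Our monolingual model": [70.6, 85.7, 73.3, 82.6, 70.1, 78.2, 32.4, 27.9],
}


def task_average(scores):
    """Unweighted mean over all reported columns, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


averages = {model: task_average(row) for model, row in TABLE3.items()}
```

Counting UAS and LAS as separate columns gives DEPS twice the weight of the other tasks, which pulls every average down given the low parsing scores.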
Interestingly, bert-base-chinese achieves comparable, or in some cases superior, performance to its Cantonese-adapted and Cantonese-monolingual counterparts, with an average score of 67.5. When scholars discuss mutual intelligibility among Sinitic languages, they typically refer to the spoken form [16], [63]. However, written Sinitic languages are far more mutually intelligible, given the historical influence of Mandarin as the dominant language of education and literacy across Chinese-speaking regions [8], [14]. This suggests that effective written Cantonese understanding can emerge from training on Mandarin text without explicit exposure to Cantonese text, as also observed in [12], where LLMs without explicit Cantonese support perform relatively well on Cantonese and Hong Kong-related knowledge benchmarks.
While models achieve respectable performance on NLI, POS tagging, and lexical disambiguation, dependency parsing remains notably weak, likely due to both the small fine-tuning dataset of around 1k sentences and the inherent difficulty of the task. Nonetheless, dependency parsing for other low-resource languages such as Buryat [64] and Old English [65] has reached higher performance with smaller datasets, suggesting that current Cantonese representations still lack sufficient syntactic and semantic grounding.
Taken together, these results point to the promise of continued pre-training for Cantonese NLP and to the need for richer, higher-quality Cantonese corpora to close the representational gap.
While we propose a novel evaluation framework and offer insights into Cantonese representation learning, several limitations remain. First, the size and coverage of available Cantonese corpora significantly constrain our results. Our monolingual Cantonese model was trained on roughly 700M characters, an order of magnitude smaller than corpora typically used to pre-train large language models in other major languages [1]. This data sparsity likely limits the model’s ability to capture the full lexical and semantic diversity of written Cantonese, and may explain the comparatively weaker performance of the monolingual model in semantic tasks. Future work would benefit from expanding and diversifying Cantonese text resources, especially in informal and user-generated domains that reflect contemporary usage.
Second, our evaluation benchmark, though designed to cover multiple aspects of Cantonese NLU, is itself constrained by data availability. Some tasks, such as dependency parsing (DEPS), rely on small fine-tuning datasets [21], making the results more susceptible to statistical noise and limiting generalization. In addition, the benchmark focuses on written Cantonese and does not address in depth the spoken or colloquial aspects of the language, such as code-switching [66] or informal register; these aspects are integral to Cantonese, as it is a primarily spoken language [8].
As a result, while our findings suggest that Mandarin-to-Cantonese transfer is effective for semantic tasks and Mandarin models perform sufficiently well, they should be interpreted as a reflection of current data and resource disparities, rather than representative features of Mandarin and Cantonese.
In this paper, we describe a benchmark of Cantonese NLU tasks, and evaluate a monolingual Cantonese model, two transfer models from Mandarin, and a Mandarin model on this benchmark to investigate the contexts where each type of model is most effective. Our results indicate that Cantonese monolingual models and Cantonese-adapted models with cross-lingual transfer from Mandarin both have merit for Cantonese NLP in today’s landscape of language resources. In addition, direct transfer from a Mandarin model without Cantonese representation learning may suffice for some tasks. Our findings also suggest that the existing open-source Cantonese corpora are insufficient to train a reliable representation of Cantonese lexicon, syntax, and semantics.
Our benchmark and analyses provide the first systematic investigation into and evidence of whether and how transfer from Mandarin is effective for performing Cantonese linguistic tasks. By establishing a framework and pipeline for training monolingual or transfer models and evaluating them, we hope to catalyze broader progress in the space of Cantonese NLP.
Transfer may light the way, but data will pave the road.
While we do not collect human annotations for our work, we acknowledge that we make use of datasets that were annotated by humans or otherwise adapted from a human source.
We make use of a variety of Cantonese NLP and language resources. In addition to our discussion of them in Sections 3, 4, and 5, we acknowledge their use and organize them below.
bert-base-chinese huggingface.co/google-bert/bert-base-chinese [1]
Cantonese Wikipedia huggingface.co/datasets/wikimedia/wikipedia/viewer/20231101.zh-yue [48]
Cantonese-HK UD Corpus github.com/UniversalDependencies/UD_Cantonese-HK [21]
Cantonese Sentences huggingface.co/datasets/raptorkwok/cantonese_sentences [61]
Yue-All-NLI huggingface.co/datasets/hon9kon9ize/yue-all-nli [19]
SiniticMTError [17]
OpenRice www.openrice.com/en/hongkong, collected by [20]
Parallel corpus from [18]
cantonese-chinese-parallel-corpus huggingface.co/datasets/HKAllen/cantonese-chinese-parallel-corpus
Hong Kong Cantonese Corpus github.com/fcbond/hkcancor [67]
CC-Canto cantonese.org
CEDict cedict.org
KaiFang CiDian kaifangcidian.com
Tatoeba tatoeba.org
Hong Kong Supplementary Character Set (HKSCS) www.ccli.gov.hk/en/hkscs/what_is_hkscs.html
Finally, we disclose that we use ChatGPT and PyCharm’s AI as writing and coding assistants during the project’s implementation and the paper’s writeup.