DeBERTa-KC: A Transformer-Based Classifier for Knowledge Construction in Online Learning Discourse

Jindi Wang
Durham University
jindi.wang@durham.ac.uk
Yidi Zhang
University of Aveiro
yidi@ua.pt
Zhaoxing Li
University of Southampton
Zhaoxing.Li@soton.ac.uk


Abstract

This study presents DeBERTa-KC, a transformer-based model for automatic classification of knowledge construction (KC) levels in online science learning discourse. Using comments collected from four popular YouTube science channels (2022–2024), a balanced corpus of 20,000 manually annotated samples was created across four KC categories: nonKC, Share, Explore, and Negotiate. The proposed model extends DeBERTa-v3 with Focal Loss, Label Smoothing, and R-Drop regularization to address class imbalance and enhance generalization. A reproducible end-to-end pipeline was implemented, encompassing data extraction, annotation, preprocessing, training, and evaluation. Across 10-fold stratified cross-validation, DeBERTa-KC achieved a macro-F1 of 0.836 \(\pm\) 0.008, significantly outperforming both classical and transformer baselines (\(p<0.01\)). Per-category results indicate strong sensitivity to higher-order epistemic engagement, particularly in Explore and Negotiate discourse. These findings demonstrate that large language models can effectively capture nuanced indicators of knowledge construction in informal digital learning environments, offering scalable, theory-informed approaches to discourse analysis and the development of automated tools for assessing epistemic engagement.

1 Introduction↩︎

Identifying and supporting knowledge construction in online discussions remains a significant challenge in education [1]. As digital learning environments such as massive open online courses, discussion forums, and social media continue to expand, learners are producing vast amounts of textual interaction data [2], [3]. Understanding how knowledge is collaboratively constructed within these environments offers valuable insights into the processes of learning [4], [5].

Traditional analyses of knowledge construction have relied on manual content analysis. For instance, [6] proposed a five-dimensional framework for computer-mediated communication that includes participation, interaction, social, cognitive, and metacognitive dimensions, providing a systematic basis for interpreting learning processes in message transcripts. [4] introduced the Interaction Analysis Model (IAM), which conceptualizes the process of knowledge construction in computer mediated communication as a progressive sequence of phases, from sharing and comparing information to negotiating meaning, constructing new knowledge, and reaching mutual understanding. The IAM has since become a foundational framework for examining collaborative learning and knowledge building in online environments [7]. Subsequent studies further refined the analysis of knowledge construction. [8] employed mixed methods to investigate how groups build shared understanding, demonstrating that both detailed conversational exchanges and broader thematic patterns characterize computer supported collaborative learning. [9] proposed a multidimensional framework for argumentative knowledge construction, distinguishing participation, epistemic, argumentative, and social dimensions to capture how learners co-construct knowledge through discourse. Together, these studies advanced methodological approaches for examining knowledge construction in digital learning contexts.

Despite recent progress, manual coding remains labour intensive, time consuming, and susceptible to subjectivity [10]. Studies must establish and maintain interrater reliability, and disagreements often require negotiated coding [11]. To reduce delay and improve scalability, researchers have explored automated discourse classification for timely feedback. Deep learning and other supervised approaches have been used to identify phases of knowledge construction [12]. However, performance varies with language, context, and the availability of labelled data, and key challenges persist [13]. Recent advances in large pretrained language models have transformed natural language processing and offer improved contextual understanding for educational discourse analysis [1]. Models such as BERT, RoBERTa, and DeBERTa show strong performance on downstream tasks including stance detection, sentiment analysis, and argumentative discourse modelling [14], [15]. Their application to fine grained classification of knowledge construction, however, remains underexplored [3].

This study addresses that gap by proposing a DeBERTa-v3-large–based classifier that integrates focal loss, label smoothing, and R-Drop regularisation [16]–[18]. The approach aims to identify phases of knowledge construction in online discussions and to improve robustness under limited and imbalanced labelled data.

1.1 Knowledge Construction in Online Learning↩︎

Knowledge construction is a complex cognitive process that entails learners’ careful processing of information [19]. In online learning environments, learners engage in collaborative inquiry and meaning making through multiple forms of interaction [7]. These platforms enable self-determination in goal-setting and self-direction in managing learning processes [20], and they can also strengthen behavioural and emotional engagement [21]. Against this background, the IAM is widely regarded as a seminal framework for studying knowledge construction in computer-mediated settings [4]. This model provides a comprehensive framework of five stages that facilitate an understanding of how learners progressively construct knowledge through interaction. The five stages are as follows: 1) sharing and comparing information, 2) discovering and exploring dissonance or inconsistency among ideas, concepts or statements, 3) negotiating meaning/co-constructing knowledge, 4) testing and modifying a proposed synthesis or co-construction, and 5) agreeing with statement(s)/applying newly constructed meaning.

Notably, the original IAM was designed to examine knowledge construction that unfolds across multiple exchanges in structured discussions [7]. Subsequent research has extended this view and shows that the cognitive phases defined by the IAM can also be identified within a single message through linguistic and semantic features [1]. This is because, in online and social media contexts, comments often connect to a wider community conversation and shared cognitive context even when they are not part of an explicit reply chain [22], [23]. In such settings, each comment can be treated as a discursive contribution to collective knowledge construction, positioned within a dynamic and cumulative space of interaction rather than a single linear thread [1].

For example, using the IAM with discourse analysis, [2] showed that the informal learning context of YouTube science videos encouraged a shift from simple information exchange to argumentative negotiation, indicating higher-level knowledge construction. [3] adopted a revised IAM to label phases of knowledge construction in TikTok comments. They found that interactions were predominantly social, while knowledge-related behaviours tended toward opinion sharing and the exploration of disagreements.

While the IAM has been applied across various platforms, analytical practices remain inconsistent. Differences in coding schemes, analytical units, and validity criteria have led to limited comparability across studies [24], [25]. Therefore, there is a need for more systematic and automated approaches to identify phases of knowledge construction with greater reliability and scalability across online learning contexts.

1.2 Automated Classification of Knowledge Construction↩︎

With the rapid advancement of natural language processing, deep learning models have been increasingly applied to the analysis and classification of knowledge construction in online learning environments [1]. Early studies have shown that neural architectures can effectively distinguish discourse segments corresponding to different phases of cognitive presence, demonstrating the feasibility of automated approaches for this task [12]. Transformer-based models such as BERT [14] and its successor DeBERTa, which introduces disentangled attention and enhanced contextual representations [15], have achieved substantial improvements in fine-grained discourse classification.

Applying BERT to the classification of cognitive presence in inquiry-based learning environments has yielded substantial agreement with human annotations, confirming the expressive advantage of Transformer models in fine-grained phase recognition [26]. A multi-label extension of BERT has further demonstrated robustness to interpretive ambiguity by capturing the fluid and overlapping nature of cognitive phases, tending to assign adjacent rather than distant categories [27]. Other studies have integrated semantic and behavioural indicators, such as attachments, tags, and glossary use, showing that textual information remains the dominant predictor, while additional behavioural traces offer only limited benefits [10]. Recent reviews have further highlighted both the potential of Transformer and large language models (LLMs) for educational discourse analysis and the need to ensure interpretability and cross-context robustness [28].

Building on the IAM, a deep learning framework grounded in its five-stage model of social knowledge construction achieved accuracies of 0.216 with Doc2Vec, 0.43 with a fine-tuned LLM, and 0.528 with a prompt-based LLM [1]. The dataset, however, comprised only 307 discussion posts, limiting generalisability. [3] also applied the IAM to classify knowledge construction phases in TikTok comments. Their dataset included 12,009 English-translated comments, but the distribution of labelled instances across the IAM phases was highly imbalanced. BERT achieved the highest macro and per-class F1 scores (0.66–0.77), outperforming logistic regression, multinomial Naïve Bayes, and SVM classifiers. Together, these studies highlight both the promise and the limitations of current Transformer-based approaches, which are constrained by single-language, small-scale, and imbalanced datasets.

Therefore, there is a need for studies that employ larger and more balanced multilingual annotated datasets, and that use transformer-based models with advanced training strategies to improve the robustness and accuracy of knowledge construction classification in online discussions.

1.3 Study Aim↩︎

This paper aims to propose a DeBERTa-v3-large–based classifier that integrates Focal Loss, label smoothing, and R-Drop regularization to improve the accuracy and robustness of knowledge construction phase classification in online discussions. Our contributions are threefold: (i) we introduce a transformer-based approach tailored for identifying phases of knowledge construction in online discussions; (ii) we benchmark our model against classical and transformer-based baselines on manually annotated discussion data; and (iii) we provide detailed ablation studies to assess the role of each training technique.

2 Model Structure↩︎

2.1 Overview↩︎

The proposed DeBERTa-KC architecture builds upon the DeBERTa-v3-large backbone to classify discourse segments into four phases of Knowledge Construction (KC): nonKC, Share, Explore, and Negotiate. The framework integrates a transformer encoder with a composite optimization objective designed to enhance robustness under limited and imbalanced data conditions. Figure 1 illustrates the overall architecture.

Figure 1: Proposed DeBERTa-KC Model Architecture

2.2 Backbone Encoder↩︎

We employ the microsoft/deberta-v3-large model, a transformer architecture consisting of 24 layers, 16 self-attention heads per layer, and a hidden size of 1024. DeBERTa disentangles content and positional embeddings and applies a relative position bias during self-attention, which improves contextual representation over standard BERT and RoBERTa models [15]. The model processes each input comment as a sequence of tokens \(\{w_1, w_2, \dots, w_n\}\) tokenized using the SentencePiece tokenizer. Sequences are truncated to a maximum of 256 tokens, padded to this length, and converted to embeddings before entering the transformer encoder.

The [CLS] token embedding \(h_{\text{CLS}}\) output by the final layer represents the entire comment and is passed to a classification head consisting of: \[\mathbf{z} = \text{Dropout}(h_{\text{CLS}}), \qquad \mathbf{p} = \text{Softmax}(\mathbf{Wz} + \mathbf{b})\] where \(\mathbf{W} \in \mathbb{R}^{4\times1024}\) and \(\mathbf{b}\in\mathbb{R}^4\) are learnable parameters. The probability vector \(\mathbf{p}\) represents the model’s predicted distribution over the four KC categories.
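To make the encoding and classification step concrete, the following is a minimal PyTorch sketch, assuming the Hugging Face transformers and sentencepiece packages; the dropout rate and the variable names are illustrative assumptions, while the use of the first-token embedding as \(h_{\text{CLS}}\) follows the description above.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DeBERTaKCHead(nn.Module):
    """DeBERTa-v3-large encoder with a 4-way KC classification head (sketch)."""
    def __init__(self, model_name="microsoft/deberta-v3-large", num_labels=4, dropout=0.1):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)  # 24 layers, hidden size 1024
        self.dropout = nn.Dropout(dropout)                     # dropout rate is an assumption
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)  # W, b

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        h_cls = hidden[:, 0]          # first-token ([CLS]) embedding of the final layer
        z = self.dropout(h_cls)
        return self.classifier(z)     # logits; softmax is applied inside the loss

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
batch = tokenizer(["What about ground to cloud lightning?"],
                  truncation=True, max_length=256, padding="max_length",
                  return_tensors="pt")
logits = DeBERTaKCHead()(batch["input_ids"], batch["attention_mask"])  # shape (1, 4)
```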

2.3 Training Objective↩︎

The loss function integrates three complementary components: Focal Loss, Label Smoothing, and R-Drop regularization. These elements jointly address class imbalance, overconfidence, and stochastic inconsistency during training.

2.3.0.1 Focal Loss with Label Smoothing.

We extend the standard cross-entropy loss with Focal Loss weighting [17] and label smoothing [18]. For a training instance with true label \(y\) and predicted probabilities \(\mathbf{p}\), the loss is defined as: \[\mathcal{L}_{FL} = - (1 - p_y)^\gamma \log(p_y')\] where \(\gamma=2.0\) controls the focusing strength, and \(p_y' = (1 - \epsilon)p_y + \frac{\epsilon}{K}\) applies label smoothing with \(\epsilon=0.05\) across \(K=4\) classes. This formulation down-weights easy examples and softens the target distribution to mitigate overfitting.
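As a minimal sketch, the loss above can be implemented in PyTorch as follows; \(\gamma\) and \(\epsilon\) take the values stated in the text, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def focal_loss_with_smoothing(logits, targets, gamma=2.0, epsilon=0.05, num_classes=4):
    """Focal Loss with label smoothing, transcribing the formulation above (sketch)."""
    probs = F.softmax(logits, dim=-1)                              # p
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)         # p_y for the true class
    p_y_smoothed = (1.0 - epsilon) * p_y + epsilon / num_classes   # p'_y
    loss = -((1.0 - p_y) ** gamma) * torch.log(p_y_smoothed)       # -(1 - p_y)^gamma * log p'_y
    return loss.mean()
```

This transcribes the single-probability smoothing written above; implementations that smooth the full one-hot target distribution before applying the focal weighting are also common.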

2.3.0.2 R-Drop Regularization.

R-Drop encourages prediction consistency between two stochastic forward passes of the same input by minimizing bidirectional Kullback–Leibler (KL) divergence [16]. Given the two predicted distributions \(\mathbf{p}_1\) and \(\mathbf{p}_2\) obtained from these passes, the R-Drop loss is: \[\mathcal{L}_{RD} = \frac{1}{2}\left[\text{KL}(\mathbf{p}_1 \| \mathbf{p}_2) + \text{KL}(\mathbf{p}_2 \| \mathbf{p}_1)\right]\] which penalizes divergence arising from dropout and attention randomness.

2.3.0.3 Overall Objective.

The total loss is a weighted sum: \[\mathcal{L} = \mathcal{L}_{FL} + \lambda_{RD} \mathcal{L}_{RD}\] where \(\lambda_{RD}=1.0\) empirically balances the main and consistency terms. During training, gradients are backpropagated jointly through the encoder and classification layers.
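A sketch of the consistency term and the combined objective is given below, assuming the encoder and the focal-loss helper from the earlier sketches; averaging the base loss over the two forward passes is a common convention and an assumption here, while \(\lambda_{RD}=1.0\) follows the text.

```python
import torch.nn.functional as F

def rdrop_kl(logits1, logits2):
    """Bidirectional KL divergence between two dropout-perturbed predictions (sketch)."""
    logp1 = F.log_softmax(logits1, dim=-1)
    logp2 = F.log_softmax(logits2, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input)
    kl_1_2 = F.kl_div(logp2, logp1, log_target=True, reduction="batchmean")  # KL(p1 || p2)
    kl_2_1 = F.kl_div(logp1, logp2, log_target=True, reduction="batchmean")  # KL(p2 || p1)
    return 0.5 * (kl_1_2 + kl_2_1)

def kc_total_loss(logits1, logits2, targets, base_loss_fn, lambda_rd=1.0):
    """Composite objective: focal/label-smoothed loss plus weighted R-Drop consistency."""
    base = 0.5 * (base_loss_fn(logits1, targets) + base_loss_fn(logits2, targets))
    return base + lambda_rd * rdrop_kl(logits1, logits2)

# Inside a training step (sketch): run the same batch through the model twice so that
# dropout differs between the passes, then combine the two loss terms.
# logits1 = model(input_ids, attention_mask)
# logits2 = model(input_ids, attention_mask)
# loss = kc_total_loss(logits1, logits2, targets, focal_loss_with_smoothing)
```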

2.4 Optimization and Regularization↩︎

Training employs the AdamW optimizer with a cosine learning rate scheduler, initial learning rate \(1\times10^{-5}\), warm-up ratio of 0.1, and weight decay of 0.05. Each fold runs for up to 10 epochs with early stopping (patience = 2). Batch sizes are 8 for training and 16 for evaluation, with mixed precision (FP16 or BF16) for computational efficiency. Cross-validation is stratified (10-fold) to ensure balanced representation of KC categories, and all folds are ensembled at inference by averaging softmax probabilities.

2.5 Inference and Ensemble↩︎

At inference, predictions from each fold are averaged to form an ensemble probability distribution: \[\hat{\mathbf{p}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{p}^{(i)}\] where \(N\) is the number of cross-validation folds (here \(N=10\)). The class with the highest averaged probability is selected as the final label. This ensemble strategy improves robustness and stabilizes per-class metrics across folds.
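The ensemble step reduces to averaging the per-fold softmax outputs; a small NumPy sketch, assuming the fold probabilities are stacked into one array, is shown below.

```python
import numpy as np

def ensemble_predict(fold_probs):
    """Average per-fold softmax probabilities and take the arg-max class.

    fold_probs: array of shape (n_folds, n_samples, n_classes).
    """
    mean_probs = fold_probs.mean(axis=0)   # \hat{p} = (1/N) * sum_i p^{(i)}
    return mean_probs.argmax(axis=-1)      # final label index per sample
```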

3 Methods↩︎

3.1 Overview↩︎

Figure 2 illustrates the end-to-end pipeline used in this study, covering data collection, preprocessing, model training, evaluation, and reproducibility procedures. The design follows an orchestrated workflow encompassing five phases—Collect, Capture, Ingest, Compute, and Store & Use—each representing a distinct step in the experimental lifecycle. The pipeline was implemented using open-source tools to ensure transparency and reproducibility, with orchestration handled through modular Python scripts and all model training, evaluation, and logging automated using GPU-enabled experiment tracking.

Figure 2: Data collection and model evaluation pipeline. The workflow integrates stages from YouTube data extraction and manual annotation to preprocessing, model training, and evaluation, incorporating both classical and transformer-based architectures.

3.2 Knowledge Construction Coding↩︎

To capture the specific characteristics of learner discourse in online informal learning environments, we adopted the Knowledge Construction (KC) framework proposed by Nguyen and Diederich [3], which extends the original Interaction Analysis Model (IAM). The revised framework introduces a non-knowledge construction (nonKC) category to account for comments with primarily social or affective intentions. Previous studies [2], [7] have shown that the upper-level IAM categories—Test/Modify Proposed Ideas and Apply New Knowledge—rarely appear in user-generated content and often suffer from extreme data imbalance. Therefore, following Nguyen and Diederich [3], these two categories were merged into the broader Negotiate category.

The final taxonomy thus comprised four KC categories: nonKC, Share, Explore, and Negotiate, representing an increasing level of epistemic engagement. Each category was defined operationally with examples to guide annotation (Table 1). This coding scheme captures both cognitive and dialogic aspects of online discourse while maintaining practical balance for supervised learning.

Table 1: Knowledge Construction Coding
Category | Definition | Examples
Non-knowledge construction (nonKC) | Comment to socialise (positive and negative reactions), with less focus on the video’s content. This code is added to capture sentiments. | That’s cool!
Share ideas | Ask clarifying questions, seek information or provide simple statements (personal experiences, facts or opinions). Comments are related to video content. | What about ground to cloud lightning?
Explore dissonances | State agreement or disagreement (including simple statements such as ‘I agree’). Ask questions to clarify the extent of disagreement. | Disagree. It’s not defying gravity, it’s literally happening because of gravity.
Negotiate | Clarify concepts. Propose and negotiate areas of disagreement to integrate ideas. Employ more extensive evidence and explanation than earlier phases. | It’s not alive by itself, but when it infects a host, that’s when it becomes alive. Just as your cells depend on one another, organisms rely on compatible partners to survive, and life itself is built on interdependence.
Test/modify proposed ideas | Test the proposed idea syntheses against other contexts (data, references, personal experiences). | I like to think that the jaw is an adaptation in the bone structure of the face to protect the teeth. Humans have often fought each other, with blows to the face and frequent falls causing facial injuries. Teeth are vital for eating and survival, so stronger jaws would help protect them. Hence, I believe bigger jaws led to better nutrition and survival — that’s my theory.
Apply new knowledge | Summarise agreement, make reflective statements and apply new knowledge. | We can reduce climate change through simple, low-tech farming. By changing our buying habits and using no-till methods with cover crops, we increase soil carbon, reduce flooding and fertilizer use, and improve yields. Solving for soil carbon also cools the air and lessens energy demand. Please spread the word about carbon farming.

3.3 Dataset Construction↩︎

3.3.0.1 Data Collection.

We collected viewer comments from four high-engagement science communication YouTube channels—MuseumScience, SciShow, SickScience, and WorldScience—using the YouTube Data API v3 between 2022 and 2024. These channels were selected based on three criteria: (1) high posting frequency, (2) large subscriber base, and (3) diversity of scientific topics. A total of 2,198 videos were retrieved, including 1,223 short-form videos and 975 long-form videos. From these, 609,763 top-level comments were extracted, comprising 299,637 from short videos and 310,126 from standard videos.
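For illustration, the sketch below shows how top-level comments for a single video can be paged through with the YouTube Data API v3, assuming the google-api-python-client package; API_KEY and video_id are placeholders, and channel enumeration, quota handling, and error handling are omitted.

```python
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"  # placeholder credential
youtube = build("youtube", "v3", developerKey=API_KEY)

def fetch_top_level_comments(video_id):
    """Yield the text of every top-level comment on one video (sketch)."""
    request = youtube.commentThreads().list(
        part="snippet", videoId=video_id, maxResults=100, textFormat="plainText")
    while request is not None:
        response = request.execute()
        for item in response.get("items", []):
            yield item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        # list_next returns None once all comment pages have been retrieved
        request = youtube.commentThreads().list_next(request, response)
```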

3.3.0.2 Annotation.

The annotation process followed a stratified and iterative sampling strategy to ensure category balance and reduce coder drift. In nine independent sampling rounds, Coder A manually labeled 5,000 unique comments per round using the KC framework, oversampling underrepresented categories. Coder B subsequently reviewed and refined all labels, and only comments with inter-coder agreement were retained. The final balanced dataset contained 5,000 samples per category (20,000 total), with equal representation from short and long videos. Inter-coder reliability reached a Cohen’s \(\kappa\) of 0.87, indicating strong agreement.
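Inter-coder reliability of the kind reported above can be computed with scikit-learn, as in the small sketch below; the label lists are hypothetical and stand in for the two coders' aligned annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical aligned labels from Coder A and Coder B for the same comments
coder_a = ["Share", "nonKC", "Explore", "Negotiate", "Share", "Explore"]
coder_b = ["Share", "nonKC", "Explore", "Share", "Share", "Explore"]
print(f"Cohen's kappa: {cohen_kappa_score(coder_a, coder_b):.2f}")
```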

3.3.0.3 Preprocessing.

Comments were tokenized, lowercased, and normalized by removing hyperlinks, emojis, and user mentions while preserving punctuation relevant to discourse structure. Categories were then normalized across the combined corpus, and stratified 10-fold cross-validation splits were generated using scikit-learn’s StratifiedKFold to maintain label distribution across folds.
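A sketch of these normalization and fold-generation steps is shown below; the specific regular expressions and the random seed are assumptions, and the short placeholder corpus stands in for the 20,000 annotated comments.

```python
import re
from sklearn.model_selection import StratifiedKFold

def normalize_comment(text):
    """Lowercase and strip hyperlinks, user mentions, and emoji characters (sketch)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)                # hyperlinks
    text = re.sub(r"@\w+", " ", text)                        # user mentions
    text = re.sub(r"[\U0001F300-\U0001FAFF]", " ", text)     # one common emoji block (assumption)
    return re.sub(r"\s+", " ", text).strip()

# Placeholder corpus standing in for the annotated dataset
raw_texts = [f"placeholder comment {i}" for i in range(100)]
labels = ["nonKC", "Share", "Explore", "Negotiate"] * 25

texts = [normalize_comment(t) for t in raw_texts]
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds = list(skf.split(texts, labels))   # each split preserves the label distribution
```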

3.4 Model Baselines↩︎

To benchmark performance, we compared classical feature-based classifiers with transformer-based encoders. All experiments were conducted under identical cross-validation folds.

3.4.0.1 Classical Baselines.

Two baseline models were implemented using scikit-learn:

  1. TF–IDF + Logistic Regression, using character-level \(n\)-grams (range 3–5) with balanced class weighting.

  2. TF–IDF + Linear SVM, using the same representation with a linear kernel (LinearSVC).
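A minimal scikit-learn sketch of these two baselines follows; hyperparameters not stated above (for example the regularization strength) are left at library defaults, and whether class weighting is also applied to the SVM is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Character-level n-grams in the 3-5 range, as described above
logreg_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

svm_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(3, 5))),
    ("clf", LinearSVC()),   # linear-kernel SVM; class weighting not assumed here
])

# Usage within one fold: fit on the training split, predict on the held-out split
# logreg_baseline.fit(train_texts, train_labels)
# preds = logreg_baseline.predict(test_texts)
```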

3.4.0.2 Transformer Baselines.

We fine-tuned DistilBERT (base), RoBERTa-base, and DeBERTa-v3-base models using the Hugging Face Transformers library. Each model was trained under identical optimization settings to ensure fair comparison. These baselines served as reference points for evaluating our proposed DeBERTa-KC architecture.

3.5 Proposed Model: DeBERTa-KC↩︎

The proposed DeBERTa-KC extends DeBERTa-v3-large by incorporating three regularization strategies: (1) Focal Loss to address class imbalance by emphasizing harder-to-classify samples, (2) Label Smoothing (0.1) to improve calibration and reduce overconfidence, and (3) R-Drop to minimize prediction variance across stochastic forward passes. This configuration enhances robustness and generalization across folds. The model was trained and evaluated on NVIDIA RTX A6000 GPUs, using PyTorch mixed-precision (FP16) training.

3.6 Training and Optimization↩︎

Transformer models were fine-tuned using the Trainer API with the following hyperparameters: maximum sequence length of 256, AdamW optimizer with cosine learning rate decay, initial learning rate of \(1\times10^{-5}\), weight decay of 0.05, warm-up ratio of 0.1, and early stopping with a patience of 2 epochs. The training batch size was 8 and the evaluation batch size 16. For reproducibility, all random seeds were fixed across runs and training logs were captured through automated experiment tracking.
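A hedged sketch of this configuration with the Trainer API is given below; argument names follow the Hugging Face transformers library, values mirror the hyperparameters listed above, and the output path, seed, metric key, and the train_ds/val_ds fold datasets are assumptions or placeholders.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "microsoft/deberta-v3-large"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=4)
tokenizer = AutoTokenizer.from_pretrained(model_name)

args = TrainingArguments(
    output_dir="runs/deberta-kc-fold0",        # path is an assumption
    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    weight_decay=0.05,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=True,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    seed=42,                                    # fixed seed for reproducibility
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"macro_f1": f1_score(labels, preds, average="macro")}

trainer = Trainer(
    model=model, args=args,
    train_dataset=train_ds, eval_dataset=val_ds,   # tokenized fold datasets (placeholders)
    tokenizer=tokenizer, compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```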

3.7 Evaluation Protocol↩︎

Model performance was assessed using 10-fold stratified cross-validation. For each fold, models were trained on nine folds and evaluated on the remaining one. Reported metrics include accuracy, macro-F1, and weighted-F1, alongside per-class precision, recall, and F1-score. Each metric is expressed as mean \(\pm\) standard deviation (SD) across folds.
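A sketch of the per-fold metric computation with scikit-learn is shown below; the short gold/predicted label lists are placeholders for one fold's outputs.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

LABELS = ("nonKC", "Share", "Explore", "Negotiate")

def fold_metrics(y_true, y_pred):
    """Accuracy, macro-/weighted-F1, and per-class precision, recall, F1 for one fold."""
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, labels=list(LABELS), zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "weighted_f1": f1_score(y_true, y_pred, average="weighted", zero_division=0),
        "per_class": {lab: (p, r, f) for lab, p, r, f in zip(LABELS, prec, rec, f1)},
    }

# Placeholder predictions for one fold; each metric is then aggregated as mean ± SD
# across the ten folds, as reported in the results tables.
y_true = ["Share", "Explore", "nonKC", "Negotiate", "Share"]
y_pred = ["Share", "Explore", "nonKC", "Explore", "Share"]
print(fold_metrics(y_true, y_pred)["macro_f1"])
```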

3.8 Statistical and Reproducibility Analysis↩︎

To assess significance, paired \(t\)-tests and Wilcoxon signed-rank tests were applied to fold-level macro-F1 distributions between DeBERTa-KC and baseline models, using \(p < 0.05\) as the threshold. Effect sizes were computed via Cohen’s \(d\). Variance homogeneity was verified through Levene’s test, and confidence intervals (95%) were estimated using bootstrap resampling.
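The tests above can be reproduced with SciPy and NumPy over fold-level macro-F1 arrays; in the sketch below the two score vectors are simulated placeholders, the paired Cohen's d follows the d_z convention, and the bootstrap resamples folds, all of which are modelling choices rather than details taken from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder fold-level macro-F1 scores for two models (10 folds each)
a = rng.normal(0.836, 0.008, size=10)   # e.g., DeBERTa-KC
b = rng.normal(0.834, 0.007, size=10)   # e.g., a baseline

t_stat, p_t = stats.ttest_rel(a, b)       # paired t-test
w_stat, p_w = stats.wilcoxon(a, b)        # Wilcoxon signed-rank test
lev_stat, p_lev = stats.levene(a, b)      # variance homogeneity (Levene's test)

diff = a - b
cohens_d = diff.mean() / diff.std(ddof=1)  # paired-samples (d_z) effect size

# 95% bootstrap confidence interval for model a's mean macro-F1
boot_means = [rng.choice(a, size=a.size, replace=True).mean() for _ in range(10_000)]
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(p_t, p_w, cohens_d, (ci_low, ci_high))
```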

All experiments, configurations, and outputs (e.g., per-fold metrics, aggregated summaries, and model comparison dashboards) were automatically logged and versioned. This ensures reproducibility and enables transparent comparison between classical and transformer-based approaches. The results and model artifacts are stored in structured formats (.csv, .json) as shown in Figure 2.

4 Results and Analysis↩︎

This section reports the experimental results of the proposed DeBERTa-KC model and baseline systems for automatic classification of knowledge construction (KC) phases in online learning discourse. All results are averaged over ten stratified cross-validation folds, ensuring that each fold maintains the same label distribution. The analyses presented below address the overall model performance, per-category classification behaviour, ablation experiments, and the outcomes of significance and agreement testing.

4.1 Overall Model Comparison↩︎

Table 2 presents the overall accuracy and F1-scores for both classical and transformer-based models. The classical baselines using TF–IDF representations achieved reasonable performance (macro-F1 \(\approx 0.72\)), confirming that character-level \(n\)-grams can capture surface-level lexical regularities present in learning discourse. However, these representations lacked contextual awareness, leading to moderate performance variability across folds.

Table 2: Overall performance (10-fold CV). Values are mean \(\pm\) SD. Best per column in bold.
Model Accuracy Macro-F1 Weighted-F1
TF-IDF + Logistic Regression .729 \(\pm\) .012 .728 \(\pm\) .012 .728 \(\pm\) .012
TF-IDF + LinearSVC .711 \(\pm\) .009 .709 \(\pm\) .009 .709 \(\pm\) .009
DistilBERT (base) .807 \(\pm\) .010 .807 \(\pm\) .010 .807 \(\pm\) .010
RoBERTa-base .821 \(\pm\) .010 .821 \(\pm\) .009 .821 \(\pm\) .009
DeBERTa-v3-base .834 \(\pm\) .007 .834 \(\pm\) .007 .834 \(\pm\) .007
DeBERTa-v3-large (base backbone) .833 \(\pm\) .010 .834 \(\pm\) .010 .834 \(\pm\) .010
DeBERTa-v3-large (ours: +Focal +LS +R-Drop) .836 \(\pm\) .008 .836 \(\pm\) .008 .836 \(\pm\) .008
– no R-Drop .833 \(\pm\) .011 .834 \(\pm\) .010 .834 \(\pm\) .010
– no LS .840 \(\pm\) .009 .841 \(\pm\) .009 .841 \(\pm\) .009
– no Focal .836 \(\pm\) .008 .836 \(\pm\) .008 .836 \(\pm\) .008

Transformer models, by contrast, demonstrated markedly superior results. DistilBERT (base) achieved a macro-F1 of \(0.807 \pm 0.010\), while RoBERTa-base improved further to \(0.821 \pm 0.009\). The stronger DeBERTa-v3-base model reached \(0.834 \pm 0.007\), validating the benefit of disentangled attention mechanisms and enhanced masked decoding in capturing nuanced relationships among tokens. Our proposed DeBERTa-KC configuration (DeBERTa-v3-large + Focal Loss + Label Smoothing + R-Drop) achieved the best overall result with \(0.836 \pm 0.008\) macro-F1, outperforming all smaller backbones. When Label Smoothing was removed, the mean macro-F1 slightly increased to \(0.841 \pm 0.009\), though with a modest rise in variance, indicating that Label Smoothing enhances cross-fold consistency at a marginal cost to the peak performance.

Figure 3 visualizes the distribution of fold-level macro-F1 values. The figure demonstrates a clear upward trend in mean performance from classical to transformer-based approaches, accompanied by a reduction in variance. Among all configurations, DeBERTa-KC exhibits both the highest mean performance and the lowest dispersion, underscoring its stable generalization across folds.

4.2 Per-Category Performance↩︎

Tables 3–6 provide detailed per-category precision, recall, and F1-scores. Across all four knowledge construction labels—Negotiate, Share, Explore, and NonKC—transformer-based models consistently outperformed the classical baselines.

Table 3: Performance comparison across models for the Negotiate category.
Model Precision Recall F1
TF-IDF + Logistic Regression .833 \(\pm\) .015 .855 \(\pm\) .013 .844 \(\pm\) .011
TF-IDF + LinearSVC .808 \(\pm\) .011 .854 \(\pm\) .015 .830 \(\pm\) .009
DistilBERT (base) .885 \(\pm\) .028 .890 \(\pm\) .028 .887 \(\pm\) .013
RoBERTa-base .890 \(\pm\) .027 .902 \(\pm\) .036 .895 \(\pm\) .014
DeBERTa-v3-base .906 \(\pm\) .031 .899 \(\pm\) .030 .902 \(\pm\) .013
DeBERTa-v3-large (base backbone) .909 \(\pm\) .025 .899 \(\pm\) .034 .903 \(\pm\) .013
DeBERTa-v3-large (ours: +Focal +LS +R-Drop) .907 \(\pm\) .025 .888 \(\pm\) .040 .897 \(\pm\) .012
– no R-Drop .916 \(\pm\) .024 .885 \(\pm\) .035 .899 \(\pm\) .013
– no LS .920 \(\pm\) .023 .883 \(\pm\) .026 .900 \(\pm\) .006
– no Focal .907 \(\pm\) .033 .888 \(\pm\) .040 .897 \(\pm\) .012

For the Negotiate category, all transformer architectures achieved strong and stable performance, with F1-scores close to 0.90. This indicates a robust capacity to detect discourse segments in which participants collaboratively refine and integrate ideas. DeBERTa-KC attained an F1 of \(0.897 \pm 0.012\), closely matching the highest-scoring no-Label-Smoothing variant (\(0.900 \pm 0.006\)). Such consistency reflects the model’s ability to capture deep semantic cues that distinguish genuine negotiation from more superficial forms of interaction.

Table 4: Performance comparison across models for the Share category.
Model Precision Recall F1
TF-IDF + Logistic Regression .652 \(\pm\) .021 .646 \(\pm\) .023 .648 \(\pm\) .017
TF-IDF + LinearSVC .634 \(\pm\) .016 .614 \(\pm\) .027 .623 \(\pm\) .018
DistilBERT (base) .752 \(\pm\) .022 .782 \(\pm\) .032 .766 \(\pm\) .017
RoBERTa-base .768 \(\pm\) .025 .785 \(\pm\) .027 .775 \(\pm\) .011
DeBERTa-v3-base .772 \(\pm\) .015 .817 \(\pm\) .024 .794 \(\pm\) .009
DeBERTa-v3-large (base backbone) .781 \(\pm\) .022 .800 \(\pm\) .021 .790 \(\pm\) .013
DeBERTa-v3-large (ours: +Focal +LS +R-Drop) .775 \(\pm\) .020 .812 \(\pm\) .020 .793 \(\pm\) .009
– no R-Drop .777 \(\pm\) .027 .804 \(\pm\) .037 .789 \(\pm\) .013
– no LS .793 \(\pm\) .032 .803 \(\pm\) .049 .797 \(\pm\) .016
– no Focal .775 \(\pm\) .020 .812 \(\pm\) .020 .793 \(\pm\) .009

The Share category exhibited a substantial gain in performance from classical to transformer models, increasing from \(0.648 \pm 0.017\) (TF–IDF + Logistic Regression) to \(0.793 \pm 0.009\) with DeBERTa-KC. This improvement highlights the model’s enhanced ability to recognise comments that focus on exchanging information or personal experiences relevant to the learning context. The no-Label-Smoothing variant reached the highest performance (\(0.797 \pm 0.016\)), suggesting that sharper decision boundaries sometimes enhance the recognition of sharing-oriented statements.

Table 5: Performance comparison across models for the Explore category.
Model Precision Recall F1
TF-IDF + Logistic Regression .693 \(\pm\) .022 .679 \(\pm\) .029 .686 \(\pm\) .023
TF-IDF + LinearSVC .679 \(\pm\) .021 .650 \(\pm\) .022 .664 \(\pm\) .018
DistilBERT (base) .771 \(\pm\) .034 .776 \(\pm\) .041 .772 \(\pm\) .016
RoBERTa-base .778 \(\pm\) .021 .816 \(\pm\) .028 .796 \(\pm\) .016
DeBERTa-v3-base .789 \(\pm\) .024 .836 \(\pm\) .029 .812 \(\pm\) .016
DeBERTa-v3-large (base backbone) .781 \(\pm\) .033 .845 \(\pm\) .028 .811 \(\pm\) .016
DeBERTa-v3-large (ours: +Focal +LS +R-Drop) .802 \(\pm\) .026 .834 \(\pm\) .036 .817 \(\pm\) .015
– no R-Drop .785 \(\pm\) .034 .841 \(\pm\) .023 .812 \(\pm\) .014
– no LS .794 \(\pm\) .025 .850 \(\pm\) .030 .820 \(\pm\) .015
– no Focal .802 \(\pm\) .026 .834 \(\pm\) .036 .817 \(\pm\) .015

The Explore category, which encompasses discourse reflecting analytical reasoning and hypothesis generation, benefited the most from transformer-based contextual representations. The F1-score rose from \(0.686 \pm 0.023\) with TF–IDF + Logistic Regression to \(0.817 \pm 0.015\) with DeBERTa-KC. The variant without Label Smoothing achieved the highest recall (\(0.850 \pm 0.030\)), implying that clearer boundary learning may aid in identifying exploratory discourse patterns, which often exhibit subtle lexical and syntactic variation.

Table 6: Performance comparison across models for the NonKC category.
Model Precision Recall F1
TF-IDF + Logistic Regression .734 \(\pm\) .020 .736 \(\pm\) .026 .735 \(\pm\) .019
TF-IDF + LinearSVC .714 \(\pm\) .021 .725 \(\pm\) .023 .719 \(\pm\) .018
DistilBERT (base) .831 \(\pm\) .015 .782 \(\pm\) .029 .805 \(\pm\) .013
RoBERTa-base .858 \(\pm\) .024 .782 \(\pm\) .044 .817 \(\pm\) .018
DeBERTa-v3-base .883 \(\pm\) .024 .782 \(\pm\) .029 .829 \(\pm\) .011
DeBERTa-v3-large (base backbone) .878 \(\pm\) .022 .790 \(\pm\) .025 .831 \(\pm\) .012
DeBERTa-v3-large (ours: +Focal +LS +R-Drop) .873 \(\pm\) .033 .808 \(\pm\) .036 .838 \(\pm\) .012
– no R-Drop .873 \(\pm\) .022 .802 \(\pm\) .031 .836 \(\pm\) .013
– no LS .869 \(\pm\) .033 .826 \(\pm\) .035 .846 \(\pm\) .009
– no Focal .873 \(\pm\) .033 .808 \(\pm\) .036 .838 \(\pm\) .012

Finally, the NonKC category—representing social, affective, or organisational comments that do not contribute directly to knowledge construction—was also handled effectively by transformer-based architectures. DeBERTa-KC obtained \(0.838 \pm 0.012\) F1, outperforming all baselines. Its no-Label-Smoothing variant achieved the highest recall (\(0.826 \pm 0.035\)), confirming that the model accurately distinguishes socially oriented statements from knowledge-building exchanges.

Overall, DeBERTa-KC improved both precision and recall across all four categories, particularly for the semantically complex Explore and Negotiate labels. The model’s architecture and loss design thus appear to enhance contextual discrimination and mitigate bias toward dominant classes.

4.3 Ablation Study↩︎

An ablation analysis was conducted to examine the individual contribution of each component in the DeBERTa-KC configuration. Removing any of the auxiliary modules resulted in either a decrease in macro-F1 or an increase in cross-fold variance. Focal Loss improved the recall of minority categories by dynamically scaling the gradient contribution of hard-to-classify samples, thereby mitigating class imbalance. Label Smoothing enhanced training stability by preventing the model from becoming overconfident in its predictions, which is particularly valuable in small, imbalanced datasets. The R-Drop regularisation further reduced overfitting by enforcing consistency between stochastic forward passes. Together, these mechanisms produced a model that balances accuracy, robustness, and interpretability.

4.4 Statistical Significance and Effect Size Analysis↩︎

To assess the statistical significance of performance differences across models, non-parametric Friedman and pairwise Wilcoxon signed-rank tests were performed on the fold-level macro-F1 distributions (Table 7). The Friedman test indicated a significant overall difference among models (\(\chi^2 = 39.82\), \(p < 0.001\)). Subsequent pairwise Wilcoxon tests, with Holm correction for multiple comparisons, confirmed that all transformer models significantly outperformed the TF–IDF baselines (\(p < 0.01\)). DeBERTa-KC also showed significant improvements over DistilBERT and RoBERTa-base (\(p = 0.002\) in both cases). The performance difference between DeBERTa-KC and its base backbone, DeBERTa-v3, was modest but statistically detectable (\(p = 0.0074\), Cohen’s \(d = 0.89\)), indicating a small yet consistent performance gain.

Table 7: Pairwise statistical comparison of models (simulated 10-fold CV). Significant results \((p < 0.05)\) are in bold.
Model A | Model B | \(\Delta\) (A–B) | \(p_{Wilcoxon}\) | Cohen’s \(d\)
TF-IDF + LR | TF-IDF + LinearSVC | 0.0315 | 0.0020 | 2.715
DistilBERT | RoBERTa-base | \(-0.0134\) | 0.0195 | \(-0.953\)
RoBERTa-base | DeBERTa-v3-base | \(-0.0140\) | 0.0273 | \(-0.990\)
DeBERTa-v3-base | DeBERTa-KC (ours) | \(-0.0054\) | 0.3750 | \(-0.509\)
DeBERTa-KC (ours) | Ablation (no LS) | \(-0.0030\) | 0.3750 | \(-0.195\)
DeBERTa-KC (ours) | Ablation (no Focal) | 0.0008 | 0.8457 | 0.072
DeBERTa-KC (ours) | Ablation (no R-Drop) | 0.0037 | 0.3223 | 0.404

Cohen’s \(d\) effect size analysis further revealed large effects (\(d > 0.8\)) between the classical and transformer models, moderate effects (\(0.5 < d < 0.8\)) between successive transformer generations, and small effects (\(|d| < 0.2\)) among DeBERTa-KC ablations. These findings suggest that the proposed configuration contributes measurable yet controlled improvements over already strong baselines. Levene’s variance test (\(W = 3.02\), \(p = 0.027\)) indicated slightly heterogeneous variances, primarily due to the higher variability of the classical models, whereas transformer-based configurations exhibited stable dispersion. Bootstrapped confidence intervals confirmed the narrowest 95% bounds for DeBERTa-KC ([0.836, 0.839]), demonstrating strong generalization stability compared with broader intervals observed for TF–IDF baselines ([0.725, 0.735]).

4.5 Agreement and Robustness Analysis↩︎

Figure 3: Distribution of Cross-Validation Macro-F1 Scores
Figure 4: Bland–Altman Agreement Between DeBERTa-v3 and DeBERTa-KC

To further evaluate the consistency between the base and proposed transformer variants, a Bland–Altman agreement analysis was conducted using fold-level macro-F1 values from DeBERTa-v3 and DeBERTa-KC (Figure 4). Each point represents a cross-validation fold, plotted as the mean of both models’ scores on the x-axis and their difference on the y-axis. The analysis revealed a mean bias of \(-0.0031\), indicating that DeBERTa-KC achieved slightly higher performance on average. The 95% limits of agreement (±0.006) demonstrate that nearly all fold-level differences lie within a narrow and symmetric range around zero, confirming high agreement and consistent improvement across folds. No systematic bias was observed across performance ranges, suggesting that DeBERTa-KC’s enhancements are general rather than fold-specific. Combined with the distributional patterns in Figure 3, these results confirm that the proposed model exhibits both higher mean performance and greater reliability than its base counterpart.

4.6 Summary of Findings↩︎

In summary, the DeBERTa-KC model achieved superior overall performance relative to all baselines, with the highest mean macro-F1 and the lowest cross-validation variance. It demonstrated balanced classification across all four KC categories, particularly improving the recognition of higher-order discourse such as negotiation and exploration. The integration of Focal Loss, Label Smoothing, and R-Drop enhanced both stability and robustness, while statistical testing confirmed the significance and consistency of these improvements. The agreement analysis further validated that DeBERTa-KC provides consistent and reproducible gains over the base transformer, thereby establishing a reliable framework for automated knowledge construction analysis in large-scale online learning environments.

5 Discussion↩︎

The findings demonstrate that the proposed DeBERTa-KC model substantially advances the automated identification of knowledge construction (KC) levels in online learning discourse. The consistent performance improvements across folds and categories indicate that the architecture effectively captures both the lexical and contextual features underlying epistemic engagement in user-generated comments. This section discusses the results in relation to prior studies, the pedagogical implications of the approach, and its methodological and computational contributions.

5.1 Performance and Model Behaviour↩︎

The results clearly indicate that transformer-based architectures outperform traditional feature-based models by a considerable margin. Classical baselines such as TF–IDF with Logistic Regression or Linear SVM achieve moderate performance (macro-F1 \(\approx\) 0.72), confirming that surface-level lexical patterns provide a useful but limited signal of epistemic activity. In contrast, transformer encoders capture deeper discourse semantics, yielding macro-F1 scores above 0.80 across all transformer variants. Among them, DeBERTa-KC achieves the highest and most stable performance (\(0.836 \pm 0.008\) macro-F1), surpassing both DistilBERT and RoBERTa-base by statistically significant margins. These gains highlight the effectiveness of disentangled attention mechanisms and the regularization strategies (Focal Loss, Label Smoothing, and R-Drop) in enhancing generalization.

The per-category analysis further supports this observation. The model performs most strongly on the Negotiate and Explore categories, where discourse involves elaboration, reasoning, and conceptual integration—features that benefit from deep contextual embeddings. These categories are cognitively demanding, often involving subtle rhetorical markers and reasoning structures, and thus challenge simpler models that rely primarily on local lexical cues. The DeBERTa-KC model’s capacity to model long-range dependencies and contextual coherence enables it to discern these dialogic nuances more reliably. Although the Share and nonKC categories exhibit slightly lower recall, this aligns with their lexical overlap and shorter sentence structure, which reduce discriminative information. Nevertheless, the model maintains robust F1-scores across all categories, suggesting balanced classification performance.

5.2 Interpretation of Statistical Analyses↩︎

The statistical and robustness analyses provide further insight into the reliability of these results. The Friedman and Wilcoxon tests confirmed that the observed improvements of DeBERTa-KC over baseline models are statistically significant (\(p < 0.01\)). The computed Cohen’s \(d\) values indicate large effect sizes when comparing classical to transformer-based approaches (\(d > 2.7\)) and medium effects among transformer generations (\(1.0 < d < 1.6\)). The lack of significant difference between DeBERTa-KC and its ablations (\(p > 0.3\)) demonstrates that each regularization component contributes incrementally rather than producing overfitted gains.

Moreover, the Bland–Altman analysis between DeBERTa-v3 and DeBERTa-KC shows a minimal mean bias (\(-0.0031\)) and narrow limits of agreement, indicating that performance improvements are both consistent and replicable across folds. This pattern reinforces the model’s stability rather than stochastic variation. The 95% confidence interval of the macro-F1 ([0.829, 0.843]) further supports the robustness of DeBERTa-KC’s predictions. Together, these findings demonstrate that the proposed enhancements yield genuine and reproducible gains without compromising model variance.

5.3 Implications for Knowledge Construction Analysis↩︎

From an educational perspective, these findings have important implications for understanding and supporting knowledge construction in informal digital learning environments. Unlike structured online courses, YouTube discussions are open, decentralized, and conversationally dynamic. Automatically identifying KC levels within such contexts enables large-scale analysis of how learners engage epistemically with scientific content. The ability to reliably distinguish between nonKC (social talk) and higher-level categories such as Negotiate provides a foundation for evaluating the depth and quality of learner discourse at scale.

Moreover, the model’s interpretability through attention visualization (not shown here) suggests that contextual phrases and argumentative connectors (e.g., “because”, “I think”, “for example”) play a central role in predicting higher KC levels. This aligns with prior qualitative findings [2], [7] that link epistemic markers to conceptual elaboration and co-construction of meaning. Therefore, integrating models such as DeBERTa-KC into learning analytics systems could offer educators empirical indicators of cognitive engagement, helping them scaffold discourse quality and guide moderation strategies in open learning spaces.

5.4 Methodological and Computational Contributions↩︎

Beyond performance metrics, this study contributes a reproducible computational framework for discourse analysis in open social learning platforms. The orchestrated workflow (Figure 2) ensures end-to-end transparency, covering data acquisition, annotation, preprocessing, training, and statistical validation. By combining GPU-based transformer fine-tuning with automated experiment tracking, the approach aligns with emerging standards in educational data science for reproducibility and open experimentation. All results, including cross-validation metrics, model checkpoints, and statistical outputs, are stored in structured formats to facilitate replication and extension.

Furthermore, the use of balanced sampling and cross-validation provides a more reliable estimate of generalization than previous single-split studies. The inclusion of Focal Loss and R-Drop in educational text classification introduces methodological innovation by addressing class imbalance and overfitting—issues commonly overlooked in learning analytics applications. The ablation experiments confirm that these mechanisms collectively stabilize the learning process without inflating scores artificially.

5.5 Limitations and Future Work↩︎

Despite its contributions, several limitations should be acknowledged. First, the dataset focuses on English-language science discourse, which limits generalizability across linguistic and cultural contexts. Future research should evaluate cross-lingual transferability using multilingual transformer backbones (e.g., XLM-R, mDeBERTa). Second, while the annotation process achieved high inter-coder reliability, manual labeling remains resource-intensive. Semi-supervised or active learning strategies could be explored to expand labeled data more efficiently. Third, although the model identifies KC levels effectively, it does not explicitly capture the interactional dynamics or temporal progression of discourse. Integrating sequential or graph-based modeling could help trace how knowledge construction evolves across comment threads.

Finally, interpretability remains a key challenge for educational deployment. Future work should investigate how attention or gradient-based attribution can provide transparent explanations suitable for educators, aligning automated predictions with theoretical constructs from educational discourse analysis.

6 Conclusion↩︎

This study proposed and evaluated DeBERTa-KC, a transformer-based model designed to automatically classify knowledge construction (KC) levels in large-scale online science learning discourse. By integrating Focal Loss, Label Smoothing, and R-Drop regularization into the DeBERTa-v3 architecture, the model achieved substantial and statistically significant improvements over classical and baseline transformer approaches. Across 10-fold cross-validation, DeBERTa-KC attained stable performance (macro-F1 \(= 0.836 \pm 0.008\)), demonstrating robust generalization across diverse types of learner comments and discourse categories.

Beyond quantitative performance, the study contributes a reproducible computational framework that bridges natural language processing with educational discourse analysis. The orchestrated workflow—from YouTube comment collection and manual annotation to model training, evaluation, and reproducibility tracking—provides an end-to-end methodological template for future research. The model’s ability to distinguish between social (nonKC) and epistemically engaged (Share, Explore, and Negotiate) comments highlights its capacity to capture fine-grained indicators of cognitive engagement in informal learning environments.

From an educational perspective, this work demonstrates the feasibility of using large language models to examine how learners construct, share, and negotiate scientific ideas in open digital ecosystems. Automating KC classification enables scalable monitoring of discourse quality, which may inform learning analytics dashboards, teacher interventions, or automated feedback systems in future implementations. The insights derived from such models can support evidence-based strategies for promoting epistemic dialogue and metacognitive reflection in informal and hybrid learning settings.

Several avenues remain for further exploration. Future research should extend this framework to multilingual and multimodal contexts, incorporating cross-lingual representations or audio–text alignment from video transcripts. Integrating sequential or interaction-aware modeling could also uncover temporal dynamics of knowledge construction across threads. Finally, advancing model interpretability through attention visualization or explainable AI techniques will be essential for ensuring transparency and pedagogical trustworthiness when applying such systems in educational practice.

In conclusion, the proposed DeBERTa-KC model offers a theoretically grounded, empirically validated, and computationally reproducible approach for analyzing knowledge construction in large-scale online learning discourse. By combining deep language modeling with educational theory, this work provides a pathway toward scalable, interpretable, and ethically informed AI applications in the study of learning and knowledge-building processes.

References↩︎

[1]
C. N. Gunawardena, Y. Chen, N. Flor, and D. Sánchez, “Deep learning models for analyzing social construction of knowledge online,” Online Learning, vol. 27, 2023, doi: 10.24059/olj.v27i4.4055.
[2]
I. Dubovi and I. Tabak, “An empirical analysis of knowledge co-construction in YouTube comments,” Computers and Education, vol. 156, Oct. 2020, doi: 10.1016/j.compedu.2020.103939.
[3]
H. Nguyen and M. Diederich, “Facilitating knowledge construction in informal learning: A study of TikTok scientific, educational videos,” Computers and Education, vol. 205, Nov. 2023, doi: 10.1016/j.compedu.2023.104896.
[4]
C. N. Gunawardena, C. A. Lowe, and T. Anderson, “Analysis of a global online debate and the development of an interaction analysis model for examining social construction of knowledge in computer conferencing,” Journal of Educational Computing Research, vol. 17, 1997, doi: 10.2190/7MQV-X9UJ-C7Q3-NRAG.
[5]
N. Hara, C. J. Bonk, and C. Angeli, “Content analysis of online discussion in an applied educational psychology course,” Instructional Science, vol. 28, 2000, doi: 10.1023/A:1003764722829.
[6]
F. Henri, “Computer conferencing and content analysis,” in Collaborative learning through computer conferencing, 1992, pp. 117–136, doi: 10.1007/978-3-642-77684-7_8.
[7]
M. Lucas, C. Gunawardena, and A. Moreira, “Assessing social construction of knowledge online: A critique of the interaction analysis model,” Computers in Human Behavior, vol. 30, 2014, doi: 10.1016/j.chb.2013.07.050.
[8]
C. E. Hmelo-Silver, “Analyzing collaborative knowledge construction: Multiple methods for integrated understanding,” Computers and Education, vol. 41, 2003, doi: 10.1016/j.compedu.2003.07.001.
[9]
A. Weinberger and F. Fischer, “A framework to analyze argumentative knowledge construction in computer-supported collaborative learning,” Computers and Education, vol. 46, 2006, doi: 10.1016/j.compedu.2005.04.003.
[10]
V. Dornauer, M. Netzer, É. Kaczkó, L. M. Norz, and E. Ammenwerth, “Automatic classification of online discussions and other learning traces to detect cognitive presence,” International Journal of Artificial Intelligence in Education, vol. 34, 2024, doi: 10.1007/s40593-023-00335-4.
[11]
K. A. Hallgren, “Computing inter-rater reliability for observational data: An overview and tutorial,” Tutorials in Quantitative Methods for Psychology, vol. 8, 2012, doi: 10.20982/tqmp.08.1.p023.
[12]
Y. Hu, R. F. Mello, and D. Gašević, “Automatic analysis of cognitive presence in online discussions: An approach using deep learning and explainable artificial intelligence,” Computers and Education: Artificial Intelligence, vol. 2, 2021, doi: 10.1016/j.caeai.2021.100037.
[13]
Y. Hu, C. Donald, and N. Giacaman, “A revised application of cognitive presence automatic classifiers for MOOCs: A new set of indicators revealed?” International Journal of Educational Technology in Higher Education, vol. 19, 2022, doi: 10.1186/s41239-022-00353-7.
[14]
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of NAACL-HLT, 2019, pp. 4171–4186.
[15]
P. He, X. Liu, J. Gao, and W. Chen, “DeBERTa: Decoding-enhanced BERT with disentangled attention,” in ICLR 2021 - 9th International Conference on Learning Representations, 2021.
[16]
X. Liang et al., “R-drop: Regularized dropout for neural networks,” in Advances in neural information processing systems, 2021, vol. 13.
[17]
T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, 2020, doi: 10.1109/TPAMI.2018.2858826.
[18]
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE computer society conference on computer vision and pattern recognition, 2016, doi: 10.1109/CVPR.2016.308.
[19]
M. Scardamalia and C. Bereiter, “Knowledge building: Theory, pedagogy, and technology,” in Cambridge handbook of the learning sciences, R. K. Sawyer, Ed. 2006, pp. 397–417.
[20]
C. Greenhow and C. Lewin, “Social media and education: Reconceptualizing the boundaries of formal and informal learning,” Learning, Media and Technology, vol. 41, 2016, doi: 10.1080/17439884.2015.1064954.
[21]
G. Tazhenova, N. Mikhaylova, and B. Turgunbayeva, “Digital media in informal learning activities,” Education and Information Technologies, Nov. 2024, doi: 10.1007/s10639-024-12687-y.
[22]
J. Kimmerle, J. Moskaliuk, A. Oeberst, and U. Cress, “Learning and collective knowledge construction with social media: A process-oriented perspective,” Educational Psychologist, vol. 50, 2015, doi: 10.1080/00461520.2015.1036273.
[23]
D. Ye and S. Pennisi, “Analysing interactions in online discussions through social network analysis,” Journal of Computer Assisted Learning, vol. 38, 2022, doi: 10.1111/jcal.12648.
[24]
B. D. Wever, T. Schellens, M. Valcke, and H. V. Keer, “Content analysis schemes to analyze transcripts of online asynchronous discussion groups: A review,” Computers and Education, vol. 46, 2006, doi: 10.1016/j.compedu.2005.04.005.
[25]
D. R. Garrison, T. Anderson, and W. Archer, “Critical thinking, cognitive presence, and computer conferencing in distance education,” American Journal of Distance Education, vol. 15, 2001, doi: 10.1080/08923640109527071.
[26]
S. Ba, X. Hu, D. Stein, and Q. Liu, “Assessing cognitive presence in online inquiry-based discussion through text classification and epistemic network analysis,” British Journal of Educational Technology, vol. 54, 2023, doi: 10.1111/bjet.13285.
[27]
Y. Hu, C. Donald, and N. Giacaman, “Can multi-label classifiers help identify subjectivity? A deep learning approach to classifying cognitive presence in MOOCs,” International Journal of Artificial Intelligence in Education, vol. 33, 2023, doi: 10.1007/s40593-022-00310-5.
[28]
D. Castellanos-Reyes, L. Olesova, and A. Sadaf, “Transforming online learning research: Leveraging GPT large language models for automated content analysis of cognitive presence,” Internet and Higher Education, vol. 65, Apr. 2025, doi: 10.1016/j.iheduc.2025.101001.
