November 18, 2023
In recent years, graph contrastive learning (GCL) has emerged as one of the most effective approaches for various node-level supervised tasks. However, for unsupervised and structure-related tasks such as community detection, current GCL algorithms have difficulty acquiring the necessary community-level information, resulting in poor performance. In addition, general contrastive learning algorithms improve the performance of downstream tasks by increasing the number of negative samples, which leads to severe class collision and unfairness in community detection. To address the above issues, we propose a novel Community-aware Efficient Graph Contrastive Learning framework (CEGCL) that jointly learns the community partition and node representations in an end-to-end manner. Specifically, we first design a personalized self-training (PeST) strategy for unsupervised scenarios, which enables our model to capture precise community-level personalized information in a graph. With the benefit of PeST, we alleviate class collision and unfairness without sacrificing overall model performance. Furthermore, an aligned graph clustering (AlGC) module is employed to obtain the community partition. In this module, we align the clustering space of the downstream task with that of PeST to achieve more consistent node embeddings. Finally, we demonstrate the effectiveness of our model for community detection both theoretically and experimentally. Extensive experimental results also show that our CEGCL exhibits state-of-the-art performance on three benchmark datasets of different scales.
With the development of the information age, data with complicated relationships, such as social networks [1], [2], have emerged in large quantities. Meanwhile, detecting community structures on user-generated content websites such as Twitter, Facebook, and Weibo has become a research hotspot. Community detection [3]–[5], also known as graph clustering, is an essential task in complex network analysis. Specifically, it is devoted to dividing the nodes of a network into clusters without the help of any labels, where each cluster corresponds to a community. This division needs to ensure that nodes within the same community are densely connected, while the connections between different communities are sparse [6]. Mining the community structure is the key to revealing and understanding the organizational principles of complex network systems. In a nutshell, community detection has gradually evolved into an increasingly important cross-cutting research area.
In recent years, graph representation learning has emerged as an important backbone for community detection that incorporates information about the network topology and node attributes [7]–[10]. In particular, graph contrastive learning is a powerful technique that enhances the discriminability of nodes in graph representation learning by comparing positive and negative samples. It has demonstrated remarkable success in many supervised or semi-supervised downstream tasks such as node classification [11]–[13], graph classification [14], [15], and link prediction [16], [17]. Nonetheless, within the context of graph representation learning with GCL, two significant challenges persist:
Challenge 1. There are some unsupervised learning tasks in network analysis that also attract much attention, such as community detection. Most graph contrastive learning frameworks have difficulty learning the structural information required by these specific problems. They focus only on node-level similarity but fail to consider the community structure in the graph when sampling positive and negative node pairs. To further illustrate the importance of community-level information, we use t-SNE [18] to visualize node representations from gCooL and our CEGCL in Figure 1, and report the accuracy (ACC) of their community detection. gCooL [13] is a recent graph communal contrastive learning algorithm that learns both the node representations and the community partition. Comparing Figure 1 (b) with Figure 1 (c), we observe that the node representation space of gCooL does not have good community properties, and its community detection accuracy is significantly lower than ours. Specifically, our model pays attention to learning the personalized features of each community in the graph, so as to obtain a compact embedding space within each cluster, whereas existing methods clearly fail to do so.
Neg_Sam Size | 50 | 100 | 500 | 1000 | 1500
---|---|---|---|---|---
Micro-F1 | 32.3 | 23.4 | 18.1 | 6.9 | 4.6
Macro-F1 | 31.3 | 25.5 | 19.3 | 5.4 | 3.9
Number of nodes in \(4^{th}com\) | 380 | 343 | 54 | 34 | 5
Time per epoch (s) | 0.6 | 0.7 | 0.9 | 1.0 | 1.2
Challenge 2. Additionally, contrastive-based methods typically require a large number of negative samples, allowing the representations of all samples to be separated uniformly. However, contrasting negative samples inevitably leads to the class collision issue, especially as the number of negative samples increases [19]. Specifically, class collision means that different samples from the same class are treated as negative sample pairs and pushed apart incorrectly. On the one hand, we find that class collision induces unfairness in community detection. To further explain, we conduct community detection with a general GCL model on Citeseer and report the experimental results in Table 1. We empirically observe that as the negative sample size (Neg_Sam Size) increases, the F1 score decreases substantially. Furthermore, the number of nodes that the model assigns to the fourth community (\(4^{th}com\)), named ‘ML’, on Citeseer keeps decreasing, although the true size of this community is 596. This means that almost all nodes in the ‘ML’ community (class) are pulled into other communities by the contrastive-based model because they are incorrectly identified as negative sample pairs. Based on the above two experimental phenomena, we conclude that the contrastive-based model improves the overall performance by sacrificing the detection accuracy of one class, which is unfair and unjustified. On the other hand, large-scale negative sampling tends to cause high time consumption.
To address the above problems, we propose a novel graph contrastive learning method, called Community-aware Efficient Graph Contrastive Learning (CEGCL). Firstly, based on a common contrastive learning framework [20], we design a personalized self-training (PeST) strategy for unsupervised scenarios that allows our model to learn community-level personalized features by incrementally sampling and learning the medoids of each community. Afterward, CEGCL contains an aligned graph clustering (AlGC) module to obtain the community detection results. In this module, we align the learnable cluster centers of the downstream task with the community medoids in PeST via self-training to obtain a more consistent node embedding space. To reduce the number of false negative samples, we filter them by pseudo-labels obtained from the clustering layer. These modules are mutually beneficial during iteration.
The main contributions of our paper are summarized as follows:
We present an end-to-end efficient graph contrastive learning framework, called CEGCL, that jointly trains community detection results and node embeddings.
We propose a personalized self-training (PeST) strategy for unsupervised scenarios that allows our model to actively learn the personalized feature of each community. Moreover, it is theoretically demonstrated that our PeST strategy adaptively results in a more compact internal structure of each community.
Furthermore, we point out that our PeST module is actually a more unbiased and efficient form of negative sampling, thereby effectively alleviating the class collision problem and reducing the learning time cost of negative samples by an order of magnitude compared to existing contrastive learning methods.
Experimental results over the three benchmark datasets indicate that our CEGCL outperforms the state-of-the-art. Moreover, the effectiveness of our model for mitigating the unfairness problem is verified.
Complex networks in the real world often manifest themselves as multiple, tightly connected sub-networks, i.e., communities. In recent years, community detection has become a very popular research topic. There are some traditional methods, such as K-means [21] and spectral clustering [22]. More recently, graph representation learning combined with clustering algorithms has become a popular trend in the field of community detection. ARVGA [23] uses a graph autoencoder to map the nodes of an attributed graph into compact representations, and combines an adversarial training module to make them conform to a prior distribution. VGAER [24] proposes a variational graph autoencoder reconstruction-based community detection method for the first time, and gives its non-probabilistic version. However, such reconstruction-based algorithms extract features at a fine granularity, so they are not well suited for downstream node-level tasks, e.g., node classification and clustering. CommDGI [3] introduces a trainable clustering layer and community-oriented objectives to DGI [11], which learns community-related node representations through joint optimization. GDCL [4] combines a clustering layer with a contrastive learning framework and utilizes a debiased negative sampling strategy to reduce false negative samples. gCooL [13] proposes a reweighted self-supervised cross-contrastive training module to reduce the bias in contrastive learning using community information. \(S^{3}\)-CL [25] infers node clusters and prototypes, and enforces nodes in the same cluster to group around their corresponding prototypes in the latent space. However, graph prototypical contrastive learning does not account for the sampling bias in positives. Moreover, strictly enforcing intra-cluster compactness may reduce the continuity and generalization capability of the representation space. In general, existing graph representation algorithms based on contrastive learning do not solve the class collision and unfairness problems well, especially when the number of negative samples is large, which hampers the effect of community detection.
Graph contrastive learning is a powerful self-supervised learning technique that focuses on learning representations of nodes in the semantic space by comparing positive and negative samples. Contrasts between positive samples enable representations to capture consistency across different views of similar samples, while contrasts with negative samples emphasize the differences between different classes of samples. DGI [11] uses an objective function based on mutual information rather than random walks, which applies the ideas of DIM [26] to graph data so that the node representations contain global graph information. MVGRL [12] introduces different structural views of a graph for contrastive learning based on DGI, learning both node-level and graph-level representations. GRACE [20] proposes a simplified contrastive framework using the InfoNCE loss [27] with both inter-view and intra-view negative pairs. However, the above-mentioned contrastive-based methods are not directly suitable for community detection, as they have difficulty learning the specific community structure of the network.
Graph self-training is an important technique in semi-supervised learning that aims to fully exploit the value of unlabeled nodes in the graph. The authors in [28] propose self-training approaches, including Co-training and Union, to train a graph convolutional network (GCN), which samples predicted labels of nodes with high confidence and adds them to the training set. M3S [29] adopts a multi-stage self-training framework with a deep clustering algorithm. DR-GST [30] considers recovering the distribution of the original labeled graph dataset when introducing the augmented data from self-training. Graph self-training methods can enhance the performance of the model on downstream tasks by improving the generalization of the GCN, but there are few self-training mechanisms for unsupervised graph representation learning. Furthermore, due to their limited representation modeling ability, they do not perform as well as our model, even though they use a small number of labels.
In this section, we detail how our proposed CEGCL achieves community detection. Specifically, we first construct a basic Graph Contrastive Learning framework in Section 3.1. Next, based on the node embeddings encoded by GCN in Graph Contrastive Learning framework, we design a Personalized Self-Training (PeST) in Section 3.2. Meanwhile, we introduce a clustering layer on the embeddings from GCN to achieve the clustering of the downstream task (Section 3.3.1), and align the clustering space of the downstream task with the clustering space of the PeST (Section 3.3.2). Finally, we debias the negative samples and modify the loss function of graph contrastive learning in Section 3.1 (Section 3.3.3). The general framework of our model is shown in Figure 2.
Firstly, we introduce some notation used in this paper. Given a network \(\mathcal{G=(V,E,X)}\), \(\mathcal{V}=\{v_1,v_2,\ldots,v_n\}\) denotes the set of nodes in the network, \(\mathcal{E}=\{e_{ij}\}\subseteq \mathcal{V} \times \mathcal{V}\) denotes the set of edges between nodes, and \(\mathcal{X}=\{\boldsymbol{x_1}, \boldsymbol{x_2},\ldots,\boldsymbol{x_n}\}\) denotes the set of node attributes, where \(\boldsymbol{x_i} \in \mathbb{R}^k\) \((i=1,2, \ldots, n)\) is the feature of \(v_i\) and \(k\) is the dimension of the node attributes.
Graph contrastive learning obtains high-quality semantic representations by pulling positive sample pairs \((\boldsymbol{x_i},\boldsymbol{x_i^{+}})\) closer and negative sample pairs \((\boldsymbol{x_i},\boldsymbol{x_k^{-}})\) further away in the embedding space. Thus, CEGCL minimizes the following objective function for every node in the graph:
\[\begin{align} & \mathcal{L}_{c l}\left(\boldsymbol{x}_i\right) \\ & =-\ln \frac{e^{{g_{\Theta}\left(\boldsymbol{x_i}\right)^T g_{\Theta}\left(\boldsymbol{x_i^{+}}\right)} / {\tau}}}{e^{{g_{\Theta}\left(\boldsymbol{x_i}\right)^T g_{\Theta}\left(\boldsymbol{x_i^{+}}\right)} / {\tau}}+\sum\limits_{k \in \widetilde{\mathcal{N}}_i} e^{{g_{\Theta}\left(\boldsymbol{x_i}\right)^T g_{\Theta}\left(\boldsymbol{x}_{k}^{-}\right)} / {\tau}}}, \end{align}\] where \(g_{\boldsymbol{\Theta}}(\cdot)\) is a GCN [31] with its parameter \(\boldsymbol{\Theta}\) and \(\tau\) is temperature parameter. We mask node feature \(\boldsymbol{x}_i\) randomly and obtain the positive sample \(\boldsymbol{x}_i^{+}\), which is a kind of graph augmentation strategy. Here, \(\widetilde{\mathcal{N}}_i\) is the negative sample set of node \(v_i\). We utilize the pseudo-labels to filter false negative samples, the details of which are in Section 3.3.3. Benefiting from the PeST module, our model only needs far fewer negative samples, thus building an efficient graph contrastive learning framework.
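To make the objective concrete, the following is a minimal PyTorch sketch of this per-node loss; the function name, tensor shapes, and the way negatives are indexed are illustrative choices rather than the released implementation.

```python
import torch

def contrastive_loss(z, z_pos, neg_idx, tau=0.5):
    """InfoNCE-style loss over precomputed embeddings.

    z:       (n, d) embeddings g_Theta(x_i) of the original view
    z_pos:   (n, d) embeddings g_Theta(x_i^+) of the feature-masked view
    neg_idx: (n, m) indices of the m negatives kept for each node
             (the filtered set from Section 3.3.3)
    """
    pos = torch.exp((z * z_pos).sum(dim=1) / tau)                       # (n,)
    z_neg = z[neg_idx]                                                  # (n, m, d)
    neg = torch.exp(torch.einsum('nd,nmd->nm', z, z_neg) / tau).sum(dim=1)
    return -torch.log(pos / (pos + neg)).mean()
```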
Self-training intends to extract semantic information from unlabeled data by adding high-confidence prediction nodes and their labels from each class to the training set. Inspired by it, we propose an incremental sampling and personalized self-training (PeST) strategy in unsupervised scenarios. Moreover, we demonstrate its effectiveness in experimental and theoretical aspects, respectively.
In contrast to general self-training algorithms that exploit high-confidence node prediction results [29], we choose to sample the medoids of each community in the embedding space for self-training and add them to the training set incrementally. Particularly, every \(t\) epochs (i.e., the sampling interval of medoids), one sampling is performed in the embedding space, and the obtained \(K\) medoid vectors are added to the training set, where \(K\) refers to the number of communities in the graph and can be obtained by some non-parametric methods [32], [33]. We use the K-medoids [34] algorithm to obtain the actual node \(v_j\) at the center of each community with the following objective function: \[\sum_{i=1}^n \min _{j \in U_l} d\left(v_i, v_j\right)=\sum_{i=1}^n \min _{j \in U_l}\left\|g_{\Theta}\left(\boldsymbol{x}_i\right)-g_{\Theta}\left(\boldsymbol{x}_j\right)\right\|_2,\] where \(U_l=U_{l-1} \setminus M_{l-1}\) is the set of nodes not yet selected as medoids at the \((l*t)\)-th epoch, \(t\) is the sampling interval of medoids, and \(l\) is the number of sampling rounds. Also, we obtain \(M_l=\{m_{l,0},m_{l,1},\ldots,m_{l,K-1}\}\) as the \(K\) medoids selected at the \((l*t)\)-th epoch. In the community detection scenario, we do not have any label information in advance. Thus, each medoid is directly assigned its index as its label (i.e., \(label(m_{l, i})=i\)), and the label set is denoted as \(N_l=\{0,1,\ldots,K-1\}\). Therefore, at the \(T\)-th (i.e., \(T=L*t\)) epoch, we obtain the training set \(M^T=\bigcup_{l=1}^L{M_l}\) and its label set \(N^T=\bigcup_{l=1}^L{N_l}\). In order to learn the distribution of personalized features for each community, our CEGCL minimizes the following cross-entropy objective function \(\mathcal{L}_{st}\): \[\mathcal{L}_{st}=-\frac{1}{|M^T|}\sum_{i \in M^T}\sum_{n=0}^{K-1}y_{i, n}\ln{\hat{y}_{i, n}}. \label{eq:Lst}\tag{1}\] Here, \(y_{i,n}=1\) if node \(i\) is the medoid of community \(n\) and 0 otherwise, i.e., the \(n\)-th element of the one-hot label vector \(\boldsymbol{y}_i\). \(\hat{y}_{i, n}\) is the \(n\)-th element of the prediction label vector \(\boldsymbol{\hat{y}}_i\), which is computed as: \[\boldsymbol{\hat{y}}_i=MLP(g_\Theta(\boldsymbol{x}_i)),\] where \(MLP\) is a two-layer fully connected neural network whose role is to generate the predicted label with semantic information. In addition, the incremental sampling strategy allows the more central medoids to be added to the training set earlier, so our model can capture the personalized features of each community more accurately, as shown in Figure 3.
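Below is a sketch of the incremental medoid sampling and the resulting self-training loss \(\mathcal{L}_{st}\). For simplicity, we approximate the K-medoids step by taking, in each K-means cluster, the not-yet-selected node closest to the centroid; the paper uses a proper K-medoids solver, so this routine and all names in it are illustrative only.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def sample_medoids(z_np, K, already_selected):
    """Return one representative node index per community (pseudo-label = community index).

    z_np: (n, d) numpy array of detached node embeddings;
    already_selected: set of node ids sampled in previous rounds (U_l = U_{l-1} minus M_{l-1}).
    Assumes every cluster still contains at least one unsampled node.
    """
    km = KMeans(n_clusters=K, n_init=10).fit(z_np)
    medoids = []
    for k in range(K):
        members = [i for i in range(len(z_np))
                   if km.labels_[i] == k and i not in already_selected]
        dists = ((z_np[members] - km.cluster_centers_[k]) ** 2).sum(axis=1)
        medoids.append(members[int(dists.argmin())])
    return medoids

def pest_loss(mlp, z, train_ids, train_labels):
    """Cross-entropy L_st over the accumulated medoid training set M^T."""
    logits = mlp(z[train_ids])               # (|M^T|, K) predicted label distribution
    return F.cross_entropy(logits, train_labels)
```

In the training loop, `train_ids` and `train_labels` (a LongTensor) grow every \(t\) epochs by appending the newly sampled medoids and their pseudo-labels \(0,\ldots,K-1\).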
The reason why we choose to sample the medoids of each community for self-training is that this approach satisfies the following three essential principles:
The set of nodes is selected to be representative. General self-training algorithms tend to choose high-confidence nodes to join the training set, yet these nodes do not necessarily contain the critical information to train our model, especially the community-level semantic information. K-medoids can obtain better community centers, as theoretically demonstrated in [35], thus allowing our model to capture community-level personalized information effectively.
The class information of the selected nodes is relatively precise. We select the medoids of each community since the confidence of their class is higher than that of other nodes. As shown in Figure 3, our PeST module can be regarded as contrastive learning with debiased negative samples, since the selected medoids carry less bias. It achieves a more accurate community-level discriminative role compared to general contrastive learning algorithms. Thus, our model needs only a small number of negative samples in GCL.
The set of nodes is selected to ensure class balance. The balanced sampling of medoids allows our model to learn more equally the information of each community in the graph, which guarantees the fairness of community detection to a degree.
In the following, we will demonstrate experimentally and theoretically the effectiveness of our self-training strategy.
In Figure 4, we study the effectiveness of our PeST module through the modularity metric, which measures the compactness of the community structure [36]. It can be observed that when the self-training module is removed from our CEGCL, the modularity tends to decrease during training, which validates that our PeST module leads to a more compact structure within the communities. More specifically, our model learns the personalized features of each community, so its embedding space has better community properties than that of the general contrastive learning method.
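For reference, the modularity of a hard partition can be computed directly from the detected communities, for example with NetworkX; this is a standard computation and not necessarily the authors' exact evaluation code.

```python
import networkx as nx
from networkx.algorithms.community import modularity

def partition_modularity(edges, labels):
    """Modularity of the partition induced by cluster labels (higher = more compact communities).

    edges: iterable of (u, v) node pairs; labels: list or dict mapping node id -> community id.
    """
    G = nx.Graph(edges)
    K = max(labels) + 1
    communities = [{v for v in G.nodes if labels[v] == k} for k in range(K)]
    return modularity(G, [c for c in communities if c])
```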
In theory, our PeST strategy is equivalent to adaptively drawing nodes of the same community closer, as formally stated in Theorem 3.1.
Theorem 1. An undirected graph \(\mathcal{G}\) has \(n\) nodes \(\mathcal{V}=\left\{v_1, \ldots, v_n\right\}\), and \(\boldsymbol{H}_i^{(t)}\) \(\left(\boldsymbol{H}_i^{(0)}=\boldsymbol{x}_i\right)\) is the embedding of node \(v_i\) after \(t\) rounds of message propagation and aggregation. \(\boldsymbol{y}_k \in \mathbb{R}^K\) is a one-hot label vector whose \(k\)-th element \(y^{(k)}\) equals 1. Assume that node \(v_i\) belongs to community \(k\). Then \(\forall \epsilon >0\), \(\exists T \in \mathbb{N}^{+}\) such that \(\forall t>T\), \(\left\|\boldsymbol{H}_i^{(t)}-\boldsymbol{y}_k\right\|_2 \leq \epsilon\).
Proof. Please refer to Appendix 6.1 in the supplementary material for the complete proof. ◻
The above theorem demonstrates that the nodes of the same community all adaptively converge to a specified center. These center vectors are pairwise orthogonal and span an orthonormal space. Both experiment and theory verify that our PeST module induces a more compact community structure.
In conclusion, we use the K-medoids algorithm to incrementally extract the medoid nodes of each community and converge their semantic features to a set of standard orthogonal bases in the Euclidean space by optimizing the cross-entropy loss. The PeST module in our CEGCL allows us to obtain a node embedding space that is tight within the community and sparse between the communities.
In this subsection, we propose an aligned graph clustering layer and jointly optimize it with the graph contrastive learning and personalized self-training modules. Based on the DEC algorithm [37], we add the learnable cluster centers of the downstream task to the self-training, in order to align them with the community medoids in PeST. In addition, to reduce the number of false negative samples, we filter them using pseudo-labels derived from the clustering layer.
To obtain the community detection results in the downstream task, we develop a clustering layer. First, we compute a soft assignment between the node embedding \(g_\Theta(\boldsymbol{x}_i)\) and the cluster center parameter \(\boldsymbol{\mu}_k\). The similarity between them is measured by the Student’s \(t\)-distribution \(Q\) as follows: \[q_{ik}=\frac{\left(1+\left\|g_{\boldsymbol{\Theta}}\left(\boldsymbol{x}_i\right)-\boldsymbol{\mu}_k\right\|^2 / \alpha \right)^{-\frac{\alpha + 1}{2}}}{\sum_j\left(1+\left\|g_{\boldsymbol{\Theta}}\left(\boldsymbol{x}_i\right)-\boldsymbol{\mu}_j\right\|^2 / \alpha \right)^{-\frac{\alpha + 1}{2}}},\] where \(\alpha\) is the degree of freedom of the \(t\)-distribution and \(q_{ik}\) can be interpreted as the probability of assigning node \(i\) to cluster \(k\). We normally set \(\alpha=1\) and initialize the cluster centers \(\{\boldsymbol{\mu}_k\}_{k=1}^{K}\) by K-means. Secondly, we define \(\mathcal{L}_{clus}\) as the KL divergence between the soft assignments \(Q\) and the target distribution \(P\): \[\mathcal{L}_{clus}=K L(P \| Q)=\sum_i \sum_k p_{i k} \log \frac{p_{i k}}{q_{i k}},\] where \(p_{i k}\) is the target distribution \(P\) with higher-confidence assignments, defined as: \[p_{i k}=\frac{q_{i k}^2 / f_k}{\sum_{k^{\prime}} q_{i k^{\prime}}^2 / f_{k^{\prime}}},\] where \(f_k = \sum_i q_{ik}\) is the soft cluster frequency.
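A sketch of this clustering layer under the formulas above (with \(\alpha=1\) by default); the helper names are ours.

```python
import torch

def soft_assignment(z, mu, alpha=1.0):
    """Student's t soft assignment q_ik between node embeddings z and cluster centers mu."""
    dist2 = torch.cdist(z, mu).pow(2)                       # (n, K) squared distances
    q = (1.0 + dist2 / alpha).pow(-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target p_ik = (q_ik^2 / f_k) / sum_k' (q_ik'^2 / f_k')."""
    weight = q.pow(2) / q.sum(dim=0)                        # f_k = soft cluster frequency
    return weight / weight.sum(dim=1, keepdim=True)

def clustering_loss(q, p):
    """KL(P || Q) objective L_clus."""
    return (p * (p / q).log()).sum()
```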
The clustering space of the downstream task is inconsistent with that of PeST, which may cause the information in PeST to be useless or even harmful to our downstream task. We therefore align these two spaces to reduce the interference of this inconsistent information with downstream tasks. Specifically, we add \(C = \{\boldsymbol{\mu}_k\}_{k=1}^{K}\) to the training set for self-training, similar to Section 3.2. Then, we minimize the following cross-entropy objective function \(\mathcal{L}_{al}\): \[\label{Lal} \mathcal{L}_{al}=-\frac{1}{|C|}\sum_{i \in C}\sum_{n=0}^{K-1}y_{i,n}\ln{\hat{y}_{i,n}}.\tag{2}\] Here, \(y_{i,n}=1\) if \(i = n\) and 0 otherwise, i.e., the \(n\)-th element of the one-hot label vector \(\boldsymbol{y}_i\). \(\hat{y}_{i,n}\) is the \(n\)-th element of the prediction label vector \(\boldsymbol{\hat{y}}_i\), which is computed as: \[\boldsymbol{\hat{y}}_i=MLP(g_\Theta(\boldsymbol{\mu }_i)).\] We align the clustering space of the downstream task with that of PeST, which yields more consistent node embeddings. Moreover, the cluster centers \(\{\boldsymbol{\mu}_k\}_{k=1}^{K}\) are learnable, which means they can be optimized continuously during training. This provides additional training signals for our CEGCL to learn the personalized characteristics of each community, thereby enhancing the generalization capability of our model.
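The alignment loss in Eq. (2) reduces to a cross-entropy that assigns the \(k\)-th learnable center the label \(k\); a minimal sketch (names are illustrative) follows.

```python
import torch
import torch.nn.functional as F

def alignment_loss(mlp, mu):
    """L_al: push the k-th cluster center toward pseudo-label k through the shared MLP,
    aligning the downstream clustering space with the PeST label space."""
    targets = torch.arange(mu.size(0), device=mu.device)    # label(mu_k) = k
    return F.cross_entropy(mlp(mu), targets)
```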
Inspired by the debiased strategy in [4], we further reduce the bias in negative samples. Specifically, we remove all nodes with the same pseudo-label as \(v_i\) and obtain its negative sample set \(\widetilde{\mathcal{N}}_i\) as follows: \[\widetilde{\mathcal{N}}_i = \{v_m \mid c_m^* \neq c_i^*\},\] where \(\left\{c_i^*\right\}_{i=1}^n\) are the pseudo-labels of nodes obtained from the soft assignments \(Q\).
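A sketch of this pseudo-label filtering step; for each node we sample a fixed small number of negatives from \(\widetilde{\mathcal{N}}_i\) (sampling with replacement here, purely for simplicity).

```python
import torch

def sample_filtered_negatives(pseudo_labels, num_neg):
    """Return an (n, num_neg) index tensor of negatives whose pseudo-label differs from the anchor's."""
    n = pseudo_labels.size(0)
    neg_idx = torch.empty(n, num_neg, dtype=torch.long)
    for i in range(n):
        candidates = (pseudo_labels != pseudo_labels[i]).nonzero(as_tuple=True)[0]
        pick = torch.randint(candidates.size(0), (num_neg,))
        neg_idx[i] = candidates[pick]
    return neg_idx
```

Here `pseudo_labels = q.argmax(dim=1)` from the clustering layer above.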
Moreover, our proposed PeST actually performs a more unbiased and efficient contrastive learning of negative samples (as shown in Figure 3). The unbiasedness stems from the reliability of the class information. For example, the accuracy of the medoids’ class information used for self-training on PubMed remained at 100% throughout the entire training process. In addition, the time complexity of negative sample learning in general GCL algorithms is \(O(n^2)\), whereas our model’s time complexity for negative sample learning is \(O(n*N_{neg})+ O(n*K) \approx O(n)\), where \(N_{neg}\) represents the number of negatives sampled from \(\widetilde{\mathcal{N}}_i\) and \(K\) represents the number of communities in the network. Benefiting from the PeST module, only a few negative samples are necessary in our contrastive learning framework, so both \(N_{neg}\) and \(K\) are much smaller than the number of nodes in the network (i.e., \(n\)), particularly for large-scale graphs.
Consequently, we jointly optimize graph contrastive learning, personalized self-training, and aligned graph clustering in an end-to-end trainable framework for more salient community structures. The overall objective function of the proposed CEGCL is defined as: \[\label{L} \min \mathcal{L} = \frac{1}{n}\sum_{i=1}^n\mathcal{L}_{cl}\left(\boldsymbol{x}_i\right) + \gamma_{st}\mathcal{L}_{st} + \gamma_{clus}\mathcal{L}_{clus} + \gamma_{al}\mathcal{L}_{al},\tag{3}\] where \(\gamma_{st}, \gamma_{clus}, \gamma_{al} > 0\) are trade-off parameters that balance the magnitudes of the different loss terms.
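Putting the pieces together, one training step might look as follows, reusing the helper functions sketched earlier; the wiring and default trade-off values are illustrative placeholders only.

```python
def train_step(encoder, mlp, mu, x, x_aug, adj, neg_idx, train_ids, train_labels,
               optimizer, gamma_st=0.01, gamma_clus=1.0, gamma_al=0.001, tau=0.5):
    """One optimization step of the joint objective in Eq. (3) (illustrative wiring)."""
    z = encoder(x, adj)            # g_Theta on the original view
    z_aug = encoder(x_aug, adj)    # g_Theta on the feature-masked view

    l_cl = contrastive_loss(z, z_aug, neg_idx, tau)
    l_st = pest_loss(mlp, z, train_ids, train_labels)
    q = soft_assignment(z, mu)
    l_clus = clustering_loss(q, target_distribution(q).detach())
    l_al = alignment_loss(mlp, mu)

    loss = l_cl + gamma_st * l_st + gamma_clus * l_clus + gamma_al * l_al
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```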
On the one hand, the node representations obtained by contrastive learning have better discriminative properties and improve the accuracy of class information, thus facilitating the PeST module. On the other hand, PeST makes the contrastive learning of negative samples more accurate and efficient. With the mutual promotion of the various modules in our CEGCL, we can both obtain discriminative node representations and fair community detection results.
Dataset | Nodes | Edges | Features | Communities |
---|---|---|---|---|
Cora | 2708 | 5429 | 1433 | 7 |
Citeseer | 3312 | 4732 | 3703 | 6 |
PubMed | 19717 | 44338 | 500 | 3 |
Type | Method | Input | Cora ACC | Cora NMI | Cora ARI | Citeseer ACC | Citeseer NMI | Citeseer ARI | PubMed ACC | PubMed NMI | PubMed ARI
---|---|---|---|---|---|---|---|---|---|---|---
TA | SC | \(\mathbf{A}\) | \(36.7\) | \(12.6\) | \(3.1\) | \(23.8\) | \(5.5\) | \(1.0\) | \(52.8\) | \(9.7\) | \(6.2\)
TA | \(K\)-means | \(\mathbf{X}\) | \(49.2\) | \(32.1\) | \(22.9\) | \(54.0\) | \(30.5\) | \(27.8\) | \(59.5\) | \(31.5\) | \(28.1\)
GA | VGAE | \((\mathbf{A}, \mathbf{X})\) | \(50.2\) | \(32.9\) | \(25.4\) | \(46.7\) | \(26.0\) | \(20.5\) | \(63.0\) | \(22.9\) | \(21.3\)
GA | ARGA | \((\mathbf{A}, \mathbf{X})\) | \(64.0\) | \(44.9\) | \(35.2\) | \(57.3\) | \(35.0\) | \(34.1\) | \(66.8\) | \(30.5\) | \(29.5\)
GA | ARVGA | \((\mathbf{A}, \mathbf{X})\) | \(64.0\) | \(45.0\) | \(37.4\) | \(54.4\) | \(26.1\) | \(24.5\) | \(69.0\) | \(29.0\) | \(30.6\)
GA | VGAER | \((\mathbf{A}, \mathbf{X})\) | \(45.3\) | \(29.7\) | \(21.8\) | \(30.2\) | \(21.7\) | \(7.04\) | \(30.1\) | \(22.3\) | \(8.53\)
CA | CommDGI | \((\mathbf{A}, \mathbf{X})\) | \(66.1\) | \(52.8\) | \(44.6\) | \(62.1\) | \(39.8\) | \(38.6\) | \(63.2\) | \(30.6\) | \(25.7\)
CA | MVGRL | \((\mathbf{A}, \mathbf{X})\) | \(73.6\) | \(58.8\) | \(52.7\) | \(68.1\) | \(43.2\) | \(43.4\) | \(\underline{69.2}\) | \(32.0\) | \(\underline{31.0}\)
CA | GDCL | \((\mathbf{A}, \mathbf{X})\) | \(\underline{74.7}\) | \(\underline{58.9}\) | \(\underline{53.3}\) | \(\underline{69.7}\) | \(\underline{44.8}\) | \(43.3\) | \(68.4\) | \(\underline{32.7}\) | \(\underline{31.0}\)
CA | gCooL | \((\mathbf{A}, \mathbf{X})\) | \(72.3\) | \(55.8\) | \(49.8\) | \(63.8\) | \(34.6\) | \(34.9\) | \(67.8\) | \(30.3\) | \(29.2\)
CA | \(S^{3}\)-CL | \((\mathbf{A}, \mathbf{X})\) | \(73.9\) | \(58.0\) | \(53.0\) | \(68.8\) | \(43.7\) | \(\underline{45.5}\) | \(69.0\) | \(32.3\) | \(29.8\)
CA | CEGCL (Ours) | \((\mathbf{A}, \mathbf{X})\) | \(\mathbf{77.6}\) | \(\mathbf{60.8}\) | \(\mathbf{61.2}\) | \(\mathbf{71.4}\) | \(\mathbf{47.1}\) | \(\mathbf{47.4}\) | \(\mathbf{71.3}\) | \(\mathbf{36.8}\) | \(\mathbf{35.0}\)
 | Gain |  | \(\mathbf{+2.9}\) | \(\mathbf{+1.9}\) | \(\mathbf{+7.9}\) | \(\mathbf{+1.7}\) | \(\mathbf{+2.3}\) | \(\mathbf{+1.9}\) | \(\mathbf{+2.1}\) | \(\mathbf{+4.1}\) | \(\mathbf{+4.0}\)
In this paper, we perform experiments on three real-world networks of different scales that are widely used in community detection tasks [4], [38], [39]. The three benchmark datasets are Cora [40], Citeseer [41], and PubMed [42]. Their statistical information is shown in Table 2, and more details are given in Appendix 6.2.
In order to verify the superiority of our model, we compare it with the current state-of-the-art algorithms as follows:
Unsupervised Algorithms
Semi-Supervised Algorithms
In order to measure the performance of unsupervised algorithms, three evaluation metrics are used, namely Accuracy (ACC), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI). For these three metrics, higher values indicate better performance for community detection [43]. For semi-supervised algorithms, we only use a common metric: ACC. In addition, we adopt the F1 score including Micro-F1 and Macro-F1 to evaluate the fairness of community detection, which can comprehensively reflect the detection accuracy for each community.
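ACC, NMI, and ARI can be computed in a standard way, matching ground-truth classes to detected communities with the Hungarian algorithm for ACC; a minimal sketch (not necessarily the authors' evaluation script) follows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one mapping between predicted and true communities."""
    K = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((K, K), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[t, p] += 1
    row, col = linear_sum_assignment(count.max() - count)   # maximize matched nodes
    return count[row, col].sum() / len(y_true)

def evaluate(y_true, y_pred):
    return {"ACC": clustering_accuracy(y_true, y_pred),
            "NMI": normalized_mutual_info_score(y_true, y_pred),
            "ARI": adjusted_rand_score(y_true, y_pred)}
```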
Our experiments are run on a machine with two NVIDIA GeForce RTX 3090 Ti GPUs, and the details of the hyperparameters are provided in Appendix 6.3.
We conduct the community detection experiment with our CEGCL and the various comparison methods on three datasets 20 times and report the average results. We show the experimental results in Tables 3, 4, and 5, where bold and underlined text indicates the best and second-best results, respectively. It can be seen that our model outperforms all the comparison methods across the different evaluation metrics. The specific analysis is as follows:
The results in Table 3 show that deep learning methods based on GCN for community detection are significantly better than traditional clustering algorithms (i.e., K-means, SC), since GCN makes better use of the graph topology and node attributes. Additionally, we observe that the contrastive-based approaches (i.e., CommDGI, MVGRL, GDCL, gCooL, and \(S^{3}\)-CL) generally outperform the generative-based approaches (i.e., VGAE, ARGA, ARVGA, and VGAER). This can be attributed to the fact that the generative-based approaches tend to learn pixel-level information, whereas the contrastive-based approaches learn more node-level information, making them better suited for downstream community detection tasks.
Our proposed CEGCL outperforms all comparison methods, including contrastive-based algorithms, on all three evaluation metrics. For example, on the Cora dataset, our model significantly outperforms the previous SOTA GDCL by 2.9%, 1.9%, and 7.9% in ACC, NMI, and ARI, respectively. This may be because GDCL does not consider the community structure of the graph during training, while our CEGCL explicitly extracts community-level information through the integration of PeST and AlGC. Based on the experimental results in Figure 4 and Table 3, it can be concluded that our CEGCL better addresses Challenge 1 raised in the introduction.
To verify the fairness of contrastive learning methods in community detection, we compare our CEGCL with the previous contrastive-based SOTA GDCL in Table 4. As the negative sample size increases and class collision worsens, the fairness of community detection with GDCL is affected to different degrees on the three datasets. Note that since the PubMed dataset has only three classes (communities), GDCL is only slightly affected compared with the other two datasets. In contrast, our model reaches significantly higher F1 scores than GDCL, especially when the negative sample size is large. This indicates that our model achieves considerable precision on each community equally and solves the problem in Challenge 2 that arises when applying the contrastive framework to community detection.
Co-training, Union, and M3S are semi-supervised self-training algorithms. During their training, we use label rates of 2%, 3%, and 0.1% for Cora, Citeseer, and PubMed, respectively. It is worth mentioning that the common label rates for these three datasets in node classification tasks are 5.2%, 3.6%, and 0.3%. As shown in Table 5, our model is superior to the three other methods in terms of ACC on almost all datasets when applied to community detection, even though we do not use any label information. This is because our contrastive-based CEGCL fully utilizes the information of the abundant unlabeled nodes. Besides, the nodes we sample in the PeST module are more representative of each community than those chosen by other self-training methods.
Dataset | Method | Micro-F1 (Neg_Sam Size 50) | Macro-F1 (Neg_Sam Size 50) | Micro-F1 (Neg_Sam Size 1500) | Macro-F1 (Neg_Sam Size 1500)
---|---|---|---|---|---
Cora | GDCL | 36.4 | 34.4 | 11.9 | 9.4
Cora | Ours | \(35.6\) | \(\mathbf{39.5}\) | \(\mathbf{31.8}\) | \(\mathbf{36.9}\)
Citeseer | GDCL | 32.3 | 31.3 | 4.6 | 3.9
Citeseer | Ours | \(\mathbf{48.5}\) | \(\mathbf{40.3}\) | \(\mathbf{48.4}\) | \(\mathbf{40.1}\)
PubMed | GDCL | 29.0 | 30.0 | 27.1 | 29.2
PubMed | Ours | \(\mathbf{39.2}\) | \(\mathbf{30.9}\) | \(\mathbf{40.4}\) | \(\mathbf{32.0}\)
Method | Label | Cora | Citeseer | PubMed
---|---|---|---|---
Co-training | ✓ | 73.5 | 62.5 | 72.7
Union | ✓ | 75.9 | 66.7 | 70.7
M3S | ✓ | 75.6 | 70.3 | 70.6
Ours | ✗ | \(\mathbf{77.6}\) | \(\mathbf{71.4}\) | \(\underline{71.3}\)
To verify the effect of each module in our CEGCL, we conduct ablation experiments as shown in Table 6.
It is evident that our proposed personalized self-training and alignment modules significantly improve the performance of our model for community detection. Comparing rows 1 and 2 of this table, we can see a performance increase of 5 to 6 points on Cora when the PeST module is applied. This indicates that the module indeed allows our model to learn the personalized features of each cluster. Moreover, the change in modularity shown in Figure 4 further validates the superiority of our PeST module, which leads to more compact community structures in the embedding space.
Method | Cora ACC | Cora NMI | Cora ARI | Citeseer ACC | Citeseer NMI | Citeseer ARI | PubMed ACC | PubMed NMI | PubMed ARI
---|---|---|---|---|---|---|---|---|---
w/o \(\mathcal{L}_{st}\) & \(\mathcal{L}_{al}\) | \(69.9\) | \(55.1\) | \(48.9\) | \(69.9\) | \(44.4\) | \(43.7\) | \(69.0\) | \(35.2\) | \(32.3\)
w/o \(\mathcal{L}_{al}\) | \(75.1\) | \(59.9\) | \(55.1\) | \(70.8\) | \(46.0\) | \(45.8\) | \(70.2\) | \(35.2\) | \(33.8\)
w/o \(\mathcal{L}_{st}\) | \(75.0\) | \(59.5\) | \(57.1\) | \(70.6\) | \(46.2\) | \(45.3\) | \(70.1\) | \(34.5\) | \(33.3\)
Ours | \(\mathbf{77.6}\) | \(\mathbf{60.8}\) | \(\mathbf{61.2}\) | \(\mathbf{71.4}\) | \(\mathbf{47.1}\) | \(\mathbf{47.4}\) | \(\mathbf{71.3}\) | \(\mathbf{36.8}\) | \(\mathbf{35.0}\)
To investigate the effect of each hyperparameter on our model, including the number of negative samples \(N_{neg}\), the sampling interval \(t\) of medoids, the dimension \(d\) of the \(MLP\), and the trade-off parameters \(\gamma_{st}\) and \(\gamma_{al}\), we perform experiments and analysis on the Cora dataset. Due to page limitations, the details of the latter two experiments are provided in Appendix 6.4.
Our CEGCL is a community detection method based on GCL, so we analyze the effect of \(N_{neg}\) and compare it with GDCL. As can be seen from Figure 5, the performance of GDCL is sensitive to changes in \(N_{neg}\). In general, the larger the number of negative samples, the better the performance of GDCL, which is in line with the general behavior of contrastive learning models. However, an excessive number of negative samples not only increases the training time but, even worse, also leads to the class collision problem. On the contrary, our CEGCL is robust to \(N_{neg}\) on both ACC and NMI. In particular, our model achieves optimal results with \(N_{neg}=10\) on the Cora dataset. This is attributed to the PeST module, which is equivalent to contrastive learning with debiased negative samples. Also, we calculate and compare the average accuracy of the class information of the medoids, which is highest at 86% for \(N_{neg}=10\).
In this paper, we propose a community-aware efficient graph contrastive learning framework (CEGCL). The personalized self-training strategy can enhance the ability of our model to perceive communities. By jointly training the three modules of graph contrastive learning, personalized self-training, and aligned graph clustering, we finally obtain an ideal community embedding space that is compact within the same cluster and sparse between the distinct clusters. Extensive experiments also demonstrate the superiority of our CEGCL.
To further elaborate on the ideas and results in our paper, this supplementary material provides additional details omitted from the main text. Specifically, we first provide a complete proof of Theorem 3.1; then we give more information about the datasets used in the experiments, including the data sources and feature extraction; next, we list the specific hyperparameter settings for our experiments. Finally, additional hyperparameter experiments further demonstrate the advantages of our proposed model. We hope these supplementary contents make the details of our research work clearer and more complete.
Theorem 2 (Theorem 3.1 Restated). An undirected graph \(\mathcal{G}\) has \(n\) nodes \(\mathcal{V}=\left\{v_1, \ldots, v_n\right\}\), and \(\boldsymbol{H}_i^{(t)}\) \(\left(\boldsymbol{H}_i^{(0)}=\boldsymbol{x}_i\right)\) is the embedding of node \(v_i\) after \(t\) rounds of message propagation and aggregation. \(\boldsymbol{y}_k \in \mathbb{R}^K\) is a one-hot label vector whose \(k\)-th element \(y^{(k)}\) equals 1. Assume that node \(v_i\) belongs to community \(k\). Then \(\forall \epsilon >0\), \(\exists T \in \mathbb{N}^{+}\) such that \(\forall t>T\), \(\left\|\boldsymbol{H}_i^{(t)}-\boldsymbol{y}_k\right\|_2 \leq \epsilon\).
Proof. The layer-wise propagation rule of GCN is \[\boldsymbol{H}^{(t+1)}=\phi(\tilde{\boldsymbol{D}}^{-\frac{1}{2}} \tilde{\boldsymbol{A}} \tilde{\boldsymbol{D}}^{-\frac{1}{2}} \boldsymbol{H}^{(t)}\boldsymbol{W}),\] where \(\tilde{\mathrm{\boldsymbol{A}}}={\boldsymbol{A}}+\boldsymbol{I}\) is the adjacency matrix of \(\mathcal{G}\) with a self-loop added to each node and \(\boldsymbol{I}\) is the identity matrix. \(\tilde{\boldsymbol{D}}_{i i}=\sum_j \tilde{\boldsymbol{A}}_{ij}\) and \(\boldsymbol{W}\) is the weight matrix of the GCN. \(\boldsymbol{H}^{(t)} \in \mathbb{R}^{n \times d}\) is the node embedding matrix in the \(t\)-th layer, with \(\boldsymbol{H}^{(0)}=\boldsymbol{X}\). \(\phi\) is the activation function, such as ReLU. Here, for ease of exposition, we assume that \(\phi(x) = x\) and \(\boldsymbol{W} = \boldsymbol{I}\). We can thus rewrite the GCN layer as
\[\boldsymbol{H}^{(t+1)}=\tilde{\boldsymbol{D}}^{-\frac{1}{2}} \tilde{\boldsymbol{A}} \tilde{\boldsymbol{D}}^{-\frac{1}{2}} \boldsymbol{H}^{(t)}=\left(\boldsymbol{I}-\boldsymbol{L}_{\text{sym}}\right) \boldsymbol{H}^{(t)},\] where \(\boldsymbol{L}_{sym}=\tilde{\boldsymbol{D}}^{-\frac{1}{2}} \tilde{\boldsymbol{L}} \tilde{\boldsymbol{D}}^{-\frac{1}{2}}\) is the symmetrically normalized Laplacian matrix and \(\tilde{\boldsymbol{L}}=\tilde{\boldsymbol{D}}-\tilde{\boldsymbol{A}}\).
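As a quick numerical illustration of this simplified propagation (identity activation and weights, as assumed above), one can iterate \(\boldsymbol{H}^{(t+1)}=(\boldsymbol{I}-\boldsymbol{L}_{\text{sym}})\boldsymbol{H}^{(t)}\) directly; the helper below is illustrative only.

```python
import numpy as np

def simplified_propagation(A, H, steps=1):
    """Iterate H <- (I - L_sym) H, i.e. multiply by D~^{-1/2} A~ D~^{-1/2} with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    P = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # equals I - L_sym
    for _ in range(steps):
        H = P @ H
    return H
```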
Given that each node in the graph \(\mathcal{G}\) has a self-loop, there are no bipartite components in the graph. Consequently, the eigenvalues of \(\boldsymbol{L}_{sym}\) are all contained within the interval [0,2) [44]. As \(\boldsymbol{L}_{sym}\) is a real symmetric matrix, it has \(n\) real eigenvalues, possibly with repetitions. Additionally, we can derive a set of \(n\) orthogonal eigenvectors, denoted as \(\{\boldsymbol{e_j}\}_{j=1}^n\).
Assume that \(\mathcal{G}\) has \(w\) connected components \(\{C_i\}_{i=1}^w\) and define the indicator vector \(\mathbf{1}_i\) as follows: \[\mathbf{1}_i^{(j)}=\left\{\begin{array}{l} 1, v_j \in C_i \\ 0, v_j \notin C_i \end{array}\right..\] From [45], we know that the eigenspace of \(\boldsymbol{L}_{sym}\) corresponding to the eigenvalue 0 is spanned by \(\left\{\boldsymbol{D}^{-\frac{1}{2}} \mathbf{1}_i\right\}_{i=1}^w\). It is easy to see that if nodes \(v_i\) and \(v_j\) are in the same connected component, we have \[\forall k \in \{1,..,w\}, \quad \mathbf{1}_{k}^{(i)} - \mathbf{1}_{k}^{(j)} = 0.\] Let \(\left(\lambda_1, \ldots, \lambda_n\right)\) denote the eigenvalues of the matrix \(\boldsymbol{I}-\boldsymbol{L}_{sym}\). Since the eigenvalues of \(\boldsymbol{L}_{sym}\) lie in [0,2), we have \[-1<\lambda_1<\lambda_2<\cdots<\lambda_{n-w}<\lambda_{n-w+1}=\cdots=\lambda_n=1.\]
If node \(v_i\) and node \(v_j\) are in the same connected component, we can rewrite \(\boldsymbol{H}_i^{(t)}-\boldsymbol{H}_j^{(t)}\) as, \[\begin{align} \begin{array}{l} \boldsymbol{H}_i^{(t)}-\boldsymbol{H}_j^{(t)} \\ =\left[\left(\boldsymbol{I}-\boldsymbol{L}_{sym}\right)^{t} \boldsymbol{H}^{(0)}\right]_i-\left[\left(\boldsymbol{I}-\boldsymbol{L}_{sym}\right)^{t} \boldsymbol{H}^{(0)}\right]_j \\ =\left[\left(\boldsymbol{I}-\boldsymbol{L}_{sym}\right)^{t} (\boldsymbol{e}_1,...,\boldsymbol{e}_n)\hat{\boldsymbol{H}}\right]_i-\left[\left(\boldsymbol{I}-\boldsymbol{L}_{sym}\right)^{t} (\boldsymbol{e}_1,...,\boldsymbol{e}_n)\hat{\boldsymbol{H}}\right]_j \\ =\left[\lambda_1^{t}\left(\boldsymbol{e}_1^{(i)}-\boldsymbol{e}_1^{(j)}\right), \ldots, \lambda_{n-w+1}^{t}\times0, \ldots, \lambda_{n}^{t}\times0\right] \hat{\boldsymbol{H}}, \end{array} \end{align}\] where \(\{\boldsymbol{e_j}\}_{j=1}^n\) are the eigenvectors of \(\boldsymbol{L}_{sym}\) (also \(\boldsymbol{I}-\boldsymbol{L}_{sym}\)) and \(\boldsymbol{e}_j^{(i)}\) is the \(i\)-th element of \(\boldsymbol{e}_j\). \(\hat{\boldsymbol{H}}\) is the coordinate matrix of \(\boldsymbol{H}^{(0)}\) in the \(n\)-dimensional orthogonal space spanned by eigenvectors of \(\boldsymbol{L}_{sym}\).
We set \(\boldsymbol{H}_j^{(t)} = \boldsymbol{M}_k^{(t)}\) and \(\boldsymbol{M}_k^{(t)}\) is the embedding of the medoid of the community \(k\). Thus, \[\left\|\boldsymbol{H}_i^{(t)}-\boldsymbol{M}_k^{(t)}\right\|_2=\sqrt{\sum_{j=1}^n\left[\sum_{m=1}^{n-w} \lambda_m^{t}\left(\boldsymbol{e}_m^{(i)}-\boldsymbol{e}_m^{(k)}\right) \hat{\boldsymbol{H}}_{m j}\right]^2}.\] Since \(-1<\lambda_1<\lambda_2<\cdots<\lambda_{n-w}<1\), we have \[\forall \epsilon >0, \exists T_1 \in \mathbb{N}^{+}, \text{s.t. } \forall t>T_1, \text{then} \left\|\boldsymbol{H}_i^{(t)}-\boldsymbol{M}_k^{(t)}\right\|_2 \leq \frac{\epsilon}{2}.\]
According to the equivalence of cross entropy and L2 norm in the gradient optimization [46], when we minimize the cross entropy loss \(\mathcal{L}_{st}\) in Eq. 1 from the main text, it is actually equivalent to optimizing the following L2 norm, \[\min \left\|\boldsymbol{M}_k^{(t)}-\boldsymbol{y}_k\right\|_2.\] So, we have, \[\forall \epsilon >0, \exists T_2 \in \mathbb{N}^{+}, \text{s.t. } \forall t>T_2, \text{then} \left\|\boldsymbol{M}_k^{(t)}-\boldsymbol{y}_k\right\|_2 \leq \frac{\epsilon}{2}.\]
In summary, according to the triangle inequality of norm, we have, \[\begin{align} & \forall \epsilon >0, \exists T=\max(T_1,T_2), \text{ s.t. } \forall t>T, \text{then} \\ & \left\|\boldsymbol{H}_i^{(t)}-\boldsymbol{y}_k\right\|_2 \leq \left\|\boldsymbol{H}_i^{(t)}-\boldsymbol{M}_k^{(t)}\right\|_2 + \left\|\boldsymbol{M}_k^{(t)}-\boldsymbol{y}_k\right\|_2 \leq \epsilon. \text{ } \end{align}\] ◻
We conduct experiments on three commonly used datasets for community detection, using the standard dataset splits. The datasets are obtained from https://linqs.org/datasets/ and their details are as follows:
Cora [40]: A citation network with 2708 nodes and 5429 edges. Each node represents a paper in the field of machine learning, and each edge represents a citation relationship between two papers. The feature of a node is represented as a 1433-dimension binary vector, which indicates the presence of the corresponding word. Each node in this graph is classified into one of the following seven classes: Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, Theory.
Citeseer [41]: A citation network with 3312 nodes and 4732 edges. Each node represents a paper in the field of computer science, and each edge represents a citation relationship between two papers. The feature of a node is represented as a 3703-dimension binary vector. Each node in this graph is classified into one of the following six classes: Agents, AI, DB, IR, ML, HCI.
PubMed [42]: A large-scale citation network with 19717 nodes and 44338 edges. Each node represents a paper in the field of diabetes, and each edge represents a citation relationship between two papers. The feature of a node is represented as a 500-dimension Term Frequency Inverse Document Frequency (TF-IDF) vector. Each node in this graph is classified into one of the following three classes: Diabetes Mellitus, Experimental; Diabetes Mellitus Type 1; Diabetes Mellitus Type 2.
Following the suggestions in DEC [37], we pre-train the encoder of our framework. And the important hyperparameters for the community detection task are shown in Table 7.
Dataset | Backbone | Hyperparameter |
---|---|---|
Cora | GCN | epochs: 1000, lr: 5e-5, \(hidden_{GCN}\): 512, layers: 1, \(\tau\): 0.5, \(N_{neg}\): 10, t: 50, d: 256, \(\gamma_{st}\): 0.01, \(\gamma_{al}\): 0.001 |
Citeseer | GCN | epochs: 1000, lr: 5e-5, \(hidden_{GCN}\): 512, layers: 1, \(\tau\): 0.5, \(N_{neg}\): 50, t: 150, d: 256, \(\gamma_{st}\): 0.0005, \(\gamma_{al}\): 0.0001 |
PubMed | GCN | epochs: 300, lr: 5e-5, \(hidden_{GCN}\): 220, layers: 1, \(\tau\): 0.2, \(N_{neg}\): 10, t: 30, d: 256, \(\gamma_{st}\): 10, \(\gamma_{al}\): 0.001 |
In the PeST module, we sample the medoid vectors of each community every \(t\) epochs and generate their prediction labels using the MLP. The dimension of the hidden layer in the MLP is denoted by \(d\). We then conduct community detection with \(t\) chosen from [25, 50, 75, 100, 150, 200] and \(d\) chosen from [32, 64, 128, 256, 1024], and report the results in Figure 6. As can be seen, the performance of our model remains mostly stable. Furthermore, we observe that the performance of our CEGCL is slightly higher at \(t = 25\) than for the other values of \(t\). This is because more medoid vectors are added to the training set, which allows our model to learn more information about each community in the graph.
First, we perform experiments and analyses on the trade-off parameters \(\gamma_{st}\) and \(\gamma_{al}\), as shown in Figure 7. We observe that our CEGCL achieves optimal performance on Cora when \(\gamma_{st}\) and \(\gamma_{al}\) are set to 0.01 and 0.001, respectively. For different \(\gamma_{st}\), our model is essentially stable on both ACC and NMI. In addition, as \(\gamma_{al}\) gradually increases from \(10^{-5}\), the performance of our model shows an increasing trend. This also demonstrates the importance of the alignment module in AlGC, which indeed improves the generalization ability of our model. Overall, our CEGCL achieves commendable performance with a reasonable combination of trade-off parameters and is robust to both \(\gamma_{st}\) and \(\gamma_{al}\).