Joint Channel and Semantic-aware Grouping for Effective Collaborative Edge Inference


Abstract

We focus on collaborative edge inference over wireless, which enables multiple devices to cooperate to improve inference performance in the presence of corrupted data. Exploiting a key-query mechanism for selective information exchange (or, group formation for collaboration), we account for the effect of wireless channel impairments on feature communication. We argue and show that a disjoint approach, which considers only either the semantic relevance or the channel state between devices, performs poorly, especially in harsh propagation conditions. Based on these findings, we propose a joint approach that takes into account both semantic information relevance and channel states when grouping devices for collaboration, by making the general attention weights dependent on the channel information. Numerical simulations show the superiority of the joint approach against local inference on corrupted data, as well as against collaborative inference with disjoint decisions that consider either application or physical layer parameters when forming groups.

Collaborative inference, Semantic and goal-oriented communications, edge artificial intelligence.

1 Introduction↩︎

Edge intelligence (or, edge AI) systems are increasingly composed of lightweight, resource-constrained devices equipped with pre-trained ML/AI models capable of performing local inference on data captured from their surroundings. These edge nodes, often bandwidth-constrained, are designed to operate autonomously, making decisions in real time without needing to offload data to centralized servers. Nonetheless, their reliability hinges on several factors: the quality of sensory data, the robustness of the deployed models, and the conditions of the wireless channel, especially when communication is involved to enable collaboration, horizontally and vertically.

In isolation, a single device’s inference capability can deteriorate under such uncertain conditions. It is in this scenario that distributed collaboration provides a promising avenue, leveraging the network of edge systems. When devices share intermediate representations or contextual cues, they can collectively compensate for gaps in individual perception. However, this cooperative process must contend with constraints across multiple layers, ranging from fluctuating wireless link quality to the heterogeneity of device roles and data relevance. Effective collaboration, therefore, hinges not just on what is being shared, but with whom and under what conditions.

While this strategy offers notable benefits, including fault tolerance and reduced decision latency, it introduces a complex design space. Challenges span multiple layers: selecting which devices to involve, identifying what information to share, and managing communication cost—all while respecting device heterogeneity and varying link conditions. This work approaches collaborative inference from a cross-layer perspective, where both application-level relevance and physical-layer channel quality guide information exchange. Our methodology builds on the principles of semantic and goal-oriented communications [1], aiming to align data sharing with task-specific utility—an emerging requirement in 6G and AI-native wireless systems.

Related Work A key obstacle in collaborative inference lies in determining both what features to share and which peers to collaborate with. Our work is inspired by approaches in collaborative perception [2], though we assume the inference model is fixed and pre-trained, rather than jointly learned during deployment. For instance, [3] proposes a query-key mechanism to learn communication graphs dynamically based on feature relevance. Similarly, semantic-based data sourcing in [4] and its extension to random access protocols in [5] exploit query-key matching to guide information exchange.

However, these methods’ grouping decisions rely solely on semantic matching, without factoring in the potential degradation introduced by imperfect communication links. While in [6] we study the effect of communication impairments and of the collaboration layer on the overall performance, we do not incorporate the link quality into the semantic method.

Contributions This paper considers a scenario where inference is executed at the edge, in line with the collaborative communication framework of [3], but explicitly accounts for wireless transmission impairments. We extend the semantic matching procedure by incorporating channel state information into the learning pipeline, enabling the formation of communication graphs that are robust to both semantic and physical constraints. Specifically, our contributions include:

  • A novel joint grouping mechanism that leverages both semantic relevance and wireless channel quality;

  • An in-depth evaluation of the trade-off between communication cost and inference accuracy under realistic channel impairments.

The remainder of the paper is organized as follows: 2 describes the system model, and 3 introduces the framework for communication grouping based on semantic matching. 4 details the experiments performed and discusses their results. The work is concluded in 5.

2 System Model↩︎

We use the system model of [6], but we recall it for completeness. The system is composed of \(\gls{not:nUE}\) devices empowered with AI capabilities, in this case, inference models. Each device \(i\) performs inference using a pre-trained model \(\gls{not:full-model}\) on input \(x_i\), e.g., an image collected through a camera or data of another modality. The model is split into two parts, a feature extractor \(\gls{not:encoder-model}\) and a decision model \(\gls{not:decoder-model}\), as in 1, such that: \[\gls{not:full-model} (x) = \left( \gls{not:decoder-model} \circ \gls{not:encoder-model} \right) \, (x) \, \text{.}\]
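As a minimal sketch of this split, the toy linear encoder and decoder below are illustrative stand-ins (not the actual pre-trained network used in the experiments); they only show that the full model is the composition of the two parts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the split model: the feature extractor maps a raw input
# to an intermediate representation, and the decision model maps that
# representation to class scores. All dimensions are illustrative.
W_enc = rng.standard_normal((16, 64))  # encoder: input dim 64 -> feature dim 16
W_dec = rng.standard_normal((10, 16))  # decoder: feature dim 16 -> 10 classes

def encoder(x):
    """Feature extractor: raw input -> intermediate representation."""
    return np.tanh(W_enc @ x)

def decoder(z):
    """Decision model: intermediate representation -> class scores."""
    return W_dec @ z

def full_model(x):
    """The full model is the composition (decoder o encoder)."""
    return decoder(encoder(x))

x = rng.standard_normal(64)
assert np.allclose(full_model(x), decoder(encoder(x)))
```

Splitting the model this way lets a device transmit the compact intermediate representation instead of the raw input.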

We consider a scenario where the \(\gls{not:nUE}\) devices are organized into \(\gls{not:nGroups}\) groups. Devices belonging to the same group are assumed to observe identical input data. That is, if devices \(i\) and \(j\) are both members of group \(g\), then \(x_i = x_j = x_g\). For generality, we assume that the group assignment is random and, importantly, that each device lacks knowledge of which other devices belong to its group. The main objective is for devices to autonomously discover their group peers to improve local inference performance, with low overhead for the network, i.e., without the need to directly share raw data.

Each device observes either a noisy version \(\hat{x}_i = M^i(x_g)\) of the true group data \(x_g\) with probability \(\gls{not:p-patch}\), or the full observation \(x_g\) with probability \(1 - \gls{not:p-patch}\). Devices then transmit the output of their local encoder, \(\gls{not:encoder-model}(\hat{x}_i)\), to peers via a generic communication system \(\gls{not:comm-system}\). This communication channel is treated abstractly—it could represent a direct wireless link, an edge-bound transmission, or any form of data transformation and transfer. Due to potential imperfections in the channel, the data received by other devices may be corrupted or degraded compared to the original transmission.

For a device \(i\), the information shared by other devices within the same group plays a crucial role in enhancing its local inference, particularly when its own observation is noisy or degraded. However, this requires discovering the suitable devices for collaboration, in terms of both semantic relevance and good mutual channel conditions. When device \(j\) sends a general message \(o_j\), the corresponding message received by device \(i\) via \(C\) is represented as: \[\label{eq:rx95message} y_c^{ij} = \gls{not:comm-system} \left\{ \gls{not:observation}_j \right\}\tag{1}\] where we use the notation \(\{\cdot\}\) to denote a system, rather than a function.
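Since the communication system is later instantiated as an AWGN channel (cf. 2.1), the abstract system \(C\{\cdot\}\) can be sketched as additive Gaussian noise whose power follows from the link SNR; the vector size and SNR value below are illustrative assumptions:

```python
import numpy as np

def awgn_channel(signal, snr_db, rng):
    """Pass a real-valued feature vector through an AWGN channel.

    The noise power is set so that (signal power / noise power) equals
    the given SNR, mimicking the abstract system C{.} in the text.
    """
    signal_power = np.mean(signal ** 2)
    snr_linear = 10.0 ** (snr_db / 10.0)
    noise_power = signal_power / snr_linear
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
o_j = rng.standard_normal(256)                 # representation sent by device j
y_c = awgn_channel(o_j, snr_db=10.0, rng=rng)  # corrupted version received by i
```

At high SNR the received features approach the transmitted ones, while at low SNR the representation received by the peers is heavily corrupted.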

The devices need to aggregate the information received from their created group. This is performed by the feature combiner, \(\gls{not:combiner-model}^i\), at device \(i\). Denoting by \(\mathbf{y}_c^i\) the aggregated information received by device \(i\) from the other devices in the group, the output of the feature combiner is \[\label{eq:feat95comb} y^i_g = \gls{not:combiner-model}^i \left( \gls{not:encoder-model} ( \hat{x}_i ) , \mathbf{y}_c^i \right) \, \text{.}\tag{2}\]

Finally, the combined information is fed to the decision model to provide the inference result at device \(i\): \[y_d^i = \gls{not:decoder-model} (y^i_g)\]. This whole procedure is illustrated in 1.

Figure 1: System model for the proposed collaborative inference problem: devices collect incomplete/corrupted data.

The ultimate objective in this scenario is to enable collaborative inference, enhancing overall accuracy through the exchange of intermediate information via wireless communication. To achieve this, devices can utilize the communication system not only to transmit their feature representations, but also to discover other devices that possess information pertinent to improving their own inference capabilities. This process requires appropriately weighting each device’s contribution within the feature combiner.

2.1 Communication System↩︎

As in [6], we consider sidelink communication, which can occur in different ways:

  • Unicast: One-to-one communication. Data is transmitted to a single device.

  • Multicast: One-to-many communication. Data is transmitted to a dedicated set of devices in the area.

  • Broadcast: One-to-all communication. Data is transmitted to all devices in the broadcast area.

Unicast and multicast communication require a network connection, since they use the uplink to request communication, i.e., to request to join a group. Semantic queries are exchanged via multicast transmission, while unicast transmission is used to exchange intermediate observations.

Differently from [6], the communication system \(C\) is modeled as an AWGN channel, characterized by the ratio between the signal and the noise power, i.e., the SNR. The SNR of the semantic query transmission from device \(i\) to \(j\) is denoted by \(\gamma^{q}_{ij}\), while the SNR of the respective data transmission from \(j\) to \(i\) is given by \(\gamma^{d}_{ji}\). We assume that the devices exchange pilots in order to obtain a sidelink channel estimate, and that the latter is perfect. For the sake of generality, we assume that the semantic query and data transmission channels are different, so that \(\gamma^{d}_{ij} \neq \gamma^{q}_{ij}\). This covers mobility scenarios.

2.2 Key Performance Indicators↩︎

Since we consider a classification task, we adopt accuracy as the KPI at the application layer, i.e., the goal of communication. This is typical of the semantic and goal-oriented approach, which does not solely focus on communication-related KPIs.

Figure 2: Comparison of the different methods and scenarios in terms of accuracy for different query sizes. a — Comparison of performance when intermediate data transmission is affected by channel errors. b — Comparison of performance when intermediate data and query transmissions are affected by channel errors.

However, the cost in terms of communication and computation is also a fundamental metric to be taken into account. Therefore, as a second KPI, we consider the average resource usage in terms of sidelink connections. The latter is computed as the average number of sidelink transmissions resulting from the optimized device grouping (i.e., the communication graph).

3 Link-aware semantic matching-based grouping↩︎

Given the above task, this work leverages an attention-based mechanism similar to [3] to identify: i) whether a device needs extra information for inference due to corrupted local data, and ii) the set of devices to collaborate with, when the local data quality is poor. Adapting the framework to our system, device \(i\) compresses its observation, obtaining an intermediate representation \(\gls{not:observation}_i = \gls{not:encoder-model} \, (\hat{x}_i)\). Then, it generates: i) a low-dimensional query vector \(\gls{not:vec-query}_i\) and ii) a key vector \(\gls{not:vec-key}_i\): \[\begin{gather} \gls{not:vec-query}_i = \gls{not:generator-query} (o_i ; \theta_q ) \, \text{,} \\ \gls{not:vec-key}_i = \gls{not:generator-key} (o_i ; \theta_k) \, \text{,} \end{gather}\] where \(\gls{not:generator-key}\) and \(\gls{not:generator-query}\) are two neural networks parametrized by \(\theta_k\) and \(\theta_q\), respectively. The query is transmitted to all other devices, while the key is kept local. The query received by device \(j\) from device \(i\) is \(\hat{ \gls{not:vec-query}}^j_i=C\{\mu_i\}\).

Every device receives the queries of all others (multicast) and uses its key to compute a matching score through scaled general attention [7]. This matching score represents the relevance of one device’s information to another, so that the devices can exchange data, i.e., their intermediate representations (unicast), and weight each contribution accordingly. Differently from previous works, we propose to include the wireless link information in this step, so that the matching score also weights its effect. We denote by \(m_{i j}\) the matching score computed by device \(j\) upon receiving the query from device \(i\), which reads as \[\label{eq:general-attention} m_{ij} = \frac{{\gls{not:vec-key}_j}^{\intercal} \gls{not:attention-weights}^j_i \hat{\gls{not:vec-query}}^j_i}{\sqrt{K}} \, \text{,}\tag{3}\] where, originally, \(\gls{not:attention-weights} \in \mathbb{R}^{Q \times K}\) is a trainable parameter without any dependencies, whose dimensions match the query size, \(Q\), and the key size, \(K\). Instead, we propose to make \(\gls{not:attention-weights}\) dependent on the link information as: \[\label{eq:weight-generator} \gls{not:attention-weights}^j_i = \gls{not:generator-weight} (\gamma^{d}_{ji}, \gamma^{q}_{ij}) \, \text{,}\tag{4}\] where \(\gamma^{d}_{ji}\) and \(\gamma^{q}_{ij}\) represent, respectively, the SNR of the data transmission from device \(j\) to \(i\) and of the query transmission from \(i\) to \(j\).
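A minimal sketch of the link-aware matching score follows. The small MLP standing in for the weight generator \(\gls{not:generator-weight}\) is a hypothetical architecture (the text does not specify one), and its parameters are random here rather than learned end-to-end:

```python
import numpy as np

Q, K = 8, 8  # query and key sizes (illustrative)
rng = np.random.default_rng(1)

# Hypothetical weight generator: a tiny MLP mapping the two link SNRs (in dB)
# to a K x Q attention weight matrix. In the proposed method these parameters
# are trained end-to-end; here they are random for illustration.
W1 = 0.1 * rng.standard_normal((32, 2))
W2 = 0.1 * rng.standard_normal((K * Q, 32))

def weight_generator(snr_data_db, snr_query_db):
    h = np.tanh(W1 @ np.array([snr_data_db, snr_query_db]))
    return (W2 @ h).reshape(K, Q)

def matching_score(key_j, query_hat, snr_data_db, snr_query_db):
    """Scaled general attention with link-dependent weights."""
    W = weight_generator(snr_data_db, snr_query_db)
    return float(key_j @ W @ query_hat) / np.sqrt(K)

key_j = rng.standard_normal(K)      # local key of device j
query_hat = rng.standard_normal(Q)  # (possibly noisy) query received from device i
m_ij = matching_score(key_j, query_hat, snr_data_db=5.0, snr_query_db=12.0)
```

Because the score depends on the two SNRs, a semantically relevant peer reached through a bad link can still receive a low score, which is the intended joint behavior.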

All the matching scores are used to construct a matching matrix \(\mathbf{M}\) by applying a row-wise softmax, with elements \(\bar{m}_{ij}\). The latter is used to construct the communication graph, as its values \(\bar{m}_{i j}\) represent how relevant the information of device \(j\) is for device \(i\), recalling that this relevance pertains to both the semantics and the wireless channel conditions. Once the groups are created, the devices share the actual data (or, their intermediate representations), which might have high dimensionality compared to the queries. To avoid high communication overhead, \(\mathbf{M}\) can be pruned with threshold \(\rho\), i.e., \(\bar{m}_{ij}^\rho=\bar{m}_{ij}\cdot\mathbf{1}\{\bar{m}_{ij}\geq\rho\}\), where \(\mathbf{1}\{\cdot\}\) denotes the indicator function. \(\mathbf{M}\) is also used to combine features (cf. 2) according to the following weighted average: \[y^i_g = \sum_{j=1}^{\gls{not:nUE}} \bar{m}_{i j}^\rho y_c^{ij} \, \text{,}\] where \(y_c^{ij}\) is the received version of the intermediate data, as per the definition in 1.
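The construction of the pruned matching matrix and the weighted feature aggregation can be sketched as follows, with random scores and illustrative dimensions in place of the learned ones:

```python
import numpy as np

def row_softmax(M):
    """Numerically stable row-wise softmax."""
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(2)
n_dev, feat_dim = 4, 16
scores = rng.standard_normal((n_dev, n_dev))  # raw matching scores m_ij
M_bar = row_softmax(scores)                   # matching matrix (rows sum to 1)

rho = 0.1                                     # pruning threshold
M_pruned = M_bar * (M_bar >= rho)             # drop weak links -> fewer sidelink transmissions

# y_c[i, j] is the (channel-corrupted) representation device i received from j;
# each device aggregates using its own row of the pruned matching matrix.
y_c = rng.standard_normal((n_dev, n_dev, feat_dim))
y_g = np.einsum('ij,ijf->if', M_pruned, y_c)  # weighted average per device
```

Entries below \(\rho\) produce no transmission at all, which is how the threshold trades communication cost against accuracy in the experiments.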

We can summarize the procedure as follows:

  1. All devices generate a key \(\gls{not:vec-key}_i\) and a query \(\gls{not:vec-query}_i\) based on their intermediate representation.

  2. A device \(i\) transmits its query to all other devices.

  3. When in possession of the received query \(\hat{ \gls{not:vec-query}}^j_i\), a device \(j\) calculates the matching score \(m_{ij}\) taking into consideration the wireless link information.

  4. Device \(j\) transmits its intermediate representation to \(i\) if \(\bar{m}_{ij} \geq \rho\).

  5. Device \(i\) aggregates the received data according to the weights \(\bar{m}_{ij}\).

In this work, we consider image classification as the application. As such, training is performed by computing the cross-entropy loss between the true label and the predicted label, \(y_d^i = \gls{not:decoder-model} (y^i_g)\). It is important to highlight that, differently from [3], only the query generator \(\gls{not:generator-query}\), the key generator \(\gls{not:generator-key}\), and the attention weight generator \(\gls{not:generator-weight}\) are learned. The encoder model \(\gls{not:encoder-model}\) and the decoder model \(\gls{not:decoder-model}\) are assumed to be pre-trained and their parameters frozen, so that only the modules needed for the communication are trained. As a consequence, the pre-trained encoder and decoder models are shared across all devices. Note that decentralized training is also possible, resulting in different models for each device; however, this increases the computational cost.

4 Numerical results↩︎

4.1 Simulation setting and parameters↩︎

Image classification is performed on the Imagenette dataset [8]. The pre-trained model is the MobileNetV3-Small [9], initialized with its default weights from training on the ImageNet dataset and then fine-tuned to the Imagenette dataset. The partial data observability (i.e., the local data corruption) is modeled by applying a white patch in a random position of the image, with the ratio between the white patch size and the image size being \(0.4\). In other words, \(40\)% of the image is locally missing at the device, if the latter belongs to the set of devices with corrupted data.
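A sketch of this corruption model, assuming the \(0.4\) ratio refers to the patch area (consistent with \(40\)% of the image being locally missing) and a square patch:

```python
import numpy as np

def apply_white_patch(image, patch_ratio, rng):
    """Corrupt an H x W x C image (values in [0, 1]) by overwriting a square
    white patch, covering patch_ratio of the image area, at a random position."""
    h, w = image.shape[:2]
    side_h = int(round(h * np.sqrt(patch_ratio)))
    side_w = int(round(w * np.sqrt(patch_ratio)))
    top = rng.integers(0, h - side_h + 1)
    left = rng.integers(0, w - side_w + 1)
    corrupted = image.copy()
    corrupted[top:top + side_h, left:left + side_w] = 1.0  # white pixels
    return corrupted

rng = np.random.default_rng(3)
img = rng.random((224, 224, 3))  # stand-in for a normalized Imagenette image
img_corrupted = apply_white_patch(img, patch_ratio=0.4, rng=rng)
```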

We consider two wireless condition scenarios:

  • Uniform scenario: The snrs are sampled from a uniform distribution in the interval \(\left[ \gamma_{\mathrm{min}}, \gamma_{\mathrm{max}} \right]\).

  • Extreme scenario: The snrs are sampled from a discrete uniform distribution with values \(\left\{ \gamma_{\mathrm{min}}, \gamma_{\mathrm{max}} \right\}\).
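The two scenarios can be sketched as follows; the SNR bounds of \(-5\) and \(20\) dB are illustrative placeholders, not the values used in the experiments:

```python
import numpy as np

def sample_snrs(n_links, scenario, gamma_min_db, gamma_max_db, rng):
    """Draw per-link SNRs (in dB) for the two wireless condition scenarios."""
    if scenario == "uniform":
        # continuous uniform over [gamma_min, gamma_max]
        return rng.uniform(gamma_min_db, gamma_max_db, size=n_links)
    if scenario == "extreme":
        # each link is either very bad or very good, equiprobably
        return rng.choice([gamma_min_db, gamma_max_db], size=n_links)
    raise ValueError(f"unknown scenario: {scenario}")

rng = np.random.default_rng(4)
snrs_uniform = sample_snrs(100, "uniform", -5.0, 20.0, rng)
snrs_extreme = sample_snrs(100, "extreme", -5.0, 20.0, rng)
```

The extreme scenario is the harsher test for grouping, since a link is either excellent or nearly unusable, with nothing in between.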

4.2 Baseline Solutions↩︎

The semantic grouping solution, with and without link information (Proposed and Semantic-aware in the legend, respectively), is compared with three other benchmark solutions:

  • Local inference: only the local observation is used for inference, possibly on corrupted data. This represents the non-collaborative case.

  • Full observation: Inference is performed with the true observation instead of the partial one. This represents a performance upper bound.

  • Top-3 Channel: Each device selects the three devices with the best sidelink channels and receives their information, which is averaged with equal weights. This baseline is not based on the semantic data relevance.

4.3 Channel effect on query and data transmission↩︎

First, we analyze the effect of considering the link information in the method, comparing the semantic grouping solution trained for different query sizes with the other baselines in terms of accuracy. These results are shown in 2: in 2 (a), the query transmission is assumed reliable and only the intermediate data transmission is affected by channel noise, whereas in 2 (b) the query transmission is also subject to noise.

Figure 3: Effect of the communication pruning threshold. a — Effect on the average number of sidelink connections per device. b — Effect on the accuracy.

From 2 (a), we notice that adding the wireless link information improves performance in both scenarios, with collaboration outperforming standalone inference on locally corrupted data, as well as grouping that considers only the link quality to select sources. It is also important to highlight that the query is of small size (in bits) with respect to the data, with performance plateauing at a query size of \(64\) for the hardest case (extreme scenario without channel information). However, 2 (b) shows that the effect of the channel on the query transmission makes the task much harder, as a larger query is needed to surpass the performance of the local-only solution. It also shows the importance of considering link information, as the link-unaware solution is unable to outperform local inference in the extreme scenario.

These results suggest that the proposed solution effectively weights both the data relevance and the channel conditions, improving the semantic matching method. They also suggest that the query needs a more reliable transmission scheme than the data (or intermediate representation) itself. We can conclude that semantic representation helps communication robustness, but device grouping needs reliable query exchange to achieve acceptable performance. Nevertheless, given the small size of the query, an increased communication reliability effort does not impact system cost as much as data transmission would. We also note that grouping based only on the channel does not perform well, since it fails to filter the data based on its relevance to a device.

4.4 Communication pruning effect↩︎

We now analyze the effect of the communication pruning, which reduces communication by transmitting information only if the matching score is above a certain threshold. These results are shown in 3, assuming a scenario where noise affects only the data transmission, as in 2 (a), and with a query size of \(1024\). In 3 (a), we plot the average number of sidelink connections per inference task as a function of the pruning threshold \(\rho\), whereas 3 (b) shows the corresponding accuracy, also as a function of \(\rho\).

3 (a) shows that the lowest (positive) threshold already provides a significant communication reduction. This is because some of the elements of the matching matrix are very close or equal to zero. Naturally, this does not reduce accuracy, as shown in 3 (b), because only lowly weighted collaborations are pruned. 3 (a) also shows that pruning reduces communication further when the channel information is included (our semantic and link-aware approach), as evidenced by the communication reduction for low pruning thresholds close to zero. This implies that, when the link quality information is not available, the solution relies on more sources to counter the unknown channel conditions, whereas the proposed improvement filters the sources by link quality, thus producing more low-valued matching scores. The extreme scenario calls for an even stronger source selection, as it experiences fewer connections for the same threshold. This suggests that the proposed solution enables a better semantic matching by considering the channel conditions.

Comparing 3 (a) and 3 (b), we can conclude that it is possible to reduce communication effectively without affecting accuracy. However, if communication is reduced too aggressively, performance can degrade even below that of local inference on corrupted data, making collaboration useless.

5 Conclusions and Perspectives↩︎

This work presents a step toward robust and efficient collaborative intelligence at the wireless edge by investigating how communication-aware design can enhance inference reliability in distributed AI systems. Instead of handling semantic relevance and channel conditions separately, we unify them through a joint decision-making framework that adapts dynamically to both the task relevance and the network environment, by adding the link information to the general attention weights.

Our results show that collaborative grouping strategies informed by both data relevance and link quality outperform traditional disjoint schemes, which often neglect the interplay between inference goals and physical-layer constraints. This synergy proves especially beneficial under harsh wireless conditions, where smart source selection and communication pruning can sustain high inference accuracy with minimal overhead. The proposed joint solution also further reduces the number of sidelink connections compared with the semantic-aware solution, since it considers the link conditions when matching devices.

Looking ahead, future developments could explore query-conditioned feature encoding, enabling devices to tailor their outgoing messages based on the specific intent of incoming queries. Such query-aware compression could substantially reduce the bandwidth footprint of collaborative inference without compromising performance.

References↩︎

[1]
E. C. Strinati et al., “Goal-oriented and semantic communication in 6G AI-native networks: The 6G-GOALS approach,” in 2024 Joint European Conference on Networks and Communications & 6G Summit (EuCNC/6G Summit), 2024, pp. 1–6.
[2]
Y. Han, H. Zhang, H. Li, Y. Jin, C. Lang, and Y. Li, “Collaborative perception in autonomous driving: Methods, datasets, and challenges,” IEEE Intelligent Transportation Systems Magazine, vol. 15, no. 6, pp. 131–151, 2023.
[3]
Y.-C. Liu, J. Tian, N. Glaser, and Z. Kira, “When2com: Multi-agent perception via communication graph grouping,” in Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, 2020, pp. 4106–4115.
[4]
K. Huang, Q. Lan, Z. Liu, and L. Yang, “Semantic data sourcing for 6G edge intelligence,” IEEE Communications Magazine, vol. 61, no. 12, pp. 70–76, 2023.
[5]
A. E. Kalør, P. Popovski, and K. Huang, “Random access protocols for correlated IoT traffic activated by semantic queries,” in 2023 21st International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt). IEEE, 2023, pp. 643–650.
[6]
M. P. Mota, M. Merluzzi, and E. C. Strinati, “Collaborative edge inference via semantic grouping under wireless channel constraints,” accepted at European Signal Processing Conference (EUSIPCO) 2025, 2025.
[7]
M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[8]
J. Howard, “Imagenette: A smaller subset of 10 easily classified classes from imagenet,” March 2019. [Online]. Available: https://github.com/fastai/imagenette.
[9]
A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan et al., “Searching for mobilenetv3,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1314–1324.

  1. This work has been supported by the SNS JU project 6G-GOALS under the EU’s Horizon program Grant Agreement No. 101139232↩︎