A Location-Aware Hybrid Deep Learning Framework for Dynamic Near-Far Field Channel Estimation in Low-Altitude UAV Communications

Wenli Yuan, Kan Yu, Xiaowu Liu, Kaixuan Li, Qixun Zhang, and Zhiyong Feng


Abstract

In low-altitude UAV communications, accurate channel estimation remains challenging due to the dynamic nature of air-to-ground links, exacerbated by high node mobility and the use of large-scale antenna arrays, which introduce hybrid near- and far-field propagation conditions. While conventional estimation methods rely on far-field assumptions, they fail to capture the intricate channel variations in near-field scenarios and overlook valuable geometric priors such as real-time transceiver positions. To overcome these limitations, this paper introduces a unified channel estimation framework based on a location-aware hybrid deep learning architecture. The proposed model synergistically combines convolutional neural networks (CNNs) for spatial feature extraction, bidirectional long short-term memory (BiLSTM) networks for modeling temporal evolution, and a multi-head self-attention mechanism to enhance focus on discriminative channel components. Furthermore, real-time transmitter and receiver locations are embedded as geometric priors, improving sensitivity to distance under near-field spherical wavefronts and boosting model generalization. Extensive simulations validate the effectiveness of the proposed approach, showing that it outperforms existing benchmarks by a significant margin—achieving at least a 30.25% reduction in normalized mean square error (NMSE) on average.

Channel estimation; Near-field; Deep learning; Long short-term memory; Multi-head self-attention

1 Introduction↩︎

Unmanned aerial vehicle (UAV) communications have demonstrated significant potential in low-altitude economic applications. Their flexible deployment, rapid coverage establishment, and high-altitude line-of-sight (LoS) capabilities enable reliable communication links in scenarios where conventional infrastructure is unavailable or disrupted, such as in emergency response and remote area coverage [1], [2]. However, the radio propagation environment in low-altitude settings is highly complex, presenting substantial challenges for accurate channel estimation in UAV-assisted communication systems.

One major challenge stems from the high mobility of both UAVs and ground users, which causes rapid temporal variations in air-to-ground (A2G) channels [3]. Compounding this issue, the ongoing evolution from 5G massive multiple-input multiple-output (MIMO) to 6G extremely large-scale MIMO (XL-MIMO) systems has led to the deployment of large antenna arrays to enhance spectral efficiency and coverage. This expansion in array aperture results in the division of the electromagnetic field into distinct near-field and far-field regions, with the boundary defined by the Rayleigh distance [4], [5]. Crucially, the channel characteristics in these two regions differ fundamentally: far-field channels depend primarily on angle parameters, whereas near-field channels are influenced by both angle and distance [6]. In low-altitude UAV communications, node mobility causes continuous variation in communication distances, leading to dynamic transitions between near-field and far-field propagations. Consequently, conventional channel estimation methods, which typically rely on far-field assumptions, often fail to provide accurate channel state information (CSI) in such heterogeneous and time-varying environments.

Traditional pilot-based approaches, including least squares (LS) and minimum mean square error (MMSE), perform adequately in static settings. While LS offers low computational complexity, it suffers from poor performance in high-noise and dynamic scenarios. MMSE achieves better accuracy but requires prior statistical knowledge of the channel and entails higher computational cost. Both methods exhibit significant performance degradation under rapidly changing channel conditions [7]. Deep learning (DL) techniques have emerged as a promising alternative, capable of learning complex mappings from received signals to channel responses without explicit channel modeling or strong prior assumptions [8]. For instance, CNNs can effectively extract spatial features from received signals to capture local correlations among channel elements [9]. Long short-term memory (LSTM) networks leverage temporal dependencies in sequential data to track channel variations over time [10], while attention mechanisms help highlight semantically critical features for improved estimation performance [11].

Nevertheless, most existing DL-based channel estimation methods focus predominantly on far-field scenarios, with limited consideration for the unique spatio-temporal evolution of channels in near-field environments. Moreover, they seldom incorporate geometric prior information—such as real-time transmitter and receiver positions—which could provide valuable inductive bias for distinguishing between near-field and far-field propagation modes. As a result, current approaches struggle to meet the accuracy and robustness requirements for channel estimation in practical UAV relay systems operating under dynamically near-far field propagation conditions.

To address these limitations, this paper proposes a unified channel estimation framework suitable for both far-field and near-field regimes. The method integrates deep feature learning with location-assisted information to effectively capture the joint spatio-temporal evolution of the channel. The main contributions of this work are summarized as follows:

  • Unified channel estimation framework: We propose a unified channel estimation method for near- and far-field propagation which combines CNN and multi-head self-attention mechanism to capture local and global features, and employs BiLSTM to model bidirectional temporal dynamics, thereby enhancing modeling capability under multipath and dynamic channels.

  • Position-aided prior fusion: We incorporate transceiver position information as geometric priors into the deep network, building a position-aided branch that is jointly modeled with learned features to enhance near-field distance sensitivity and bolster the robustness of channel estimation.

  • Experimental validation in dynamic near-far field: Rigorous simulations against benchmark schemes demonstrate superior channel estimation accuracy and robustness. In particular, the framework achieves an average NMSE reduction of at least 30.25% in near-far field scenarios under an SNR of 15 dB.

The remainder of this paper is organized as follows. Section 2 reviews the latest developments in far-field and near-field channel estimation methods. In Section 3, the system model is presented, and the NMSE minimization problem is defined in channel estimation. Section 4 provides a detailed introduction to the proposed unified deep learning-based channel estimation framework. Section 5 offers experimental evaluations of the proposed method. Finally, conclusions and directions for future work are provided in Section 6.

2 Related Works↩︎

Channel estimation plays a pivotal role in wireless communication systems, directly impacting the signal transmission quality and overall performance. Accurate estimation enhances interference suppression and signal recovery, yet growing environmental complexity poses severe challenges to conventional methods. Recent advances in DL offer promising alternatives, transforming channel estimation into a data-driven learning task. In this section, we review state-of-the-art techniques, classified into methodologies based on far-field, near-field, and DL, with key contributions summarized in Table 1.

Table 1: Representative Studies on Classical Channel Estimation Methods
Ref. Techniques Intelligence Channel type Residual Attention Location information NMSE/MSE Mobility
[12] MMSE \(\times\) Far-field \(\times\) \(\times\) \(\times\) \(✔\) \(✔\)
[13] Joint SBL and KF \(\times\) Near-field \(\times\) \(\times\) \(\times\) \(✔\) \(✔\)
[14] Neural Network \(✔\) Far-field and Near-field \(\times\) \(\times\) \(\times\) \(✔\) \(\times\)
[15] DNN and Stacked BiLSTM \(✔\) Far-field \(\times\) \(\times\) \(\times\) \(✔\) \(✔\)
[10] CNN and LSTM \(✔\) Far-field \(\times\) \(\times\) \(\times\) \(\times\) \(✔\)
[16] CNN and BiLSTM \(✔\) Far-field \(\times\) \(\times\) \(\times\) \(\times\) \(\times\)
[17] CNN and BiLSTM \(✔\) Far-field \(\times\) \(\times\) \(\times\) \(✔\) \(✔\)
[18] Deep Residual Attention Network \(✔\) Far-field \(✔\) SA and CA \(\times\) \(✔\) \(\times\)
[19] Attention Residual Network \(✔\) Far-field \(✔\) CA \(\times\) \(✔\) \(✔\)
[20] CNN and Transformer \(✔\) Far-field \(✔\) Multi-Head Self-Attention \(\times\) \(✔\) \(✔\)
[21] Parallel Attention-Based Transformer \(✔\) Far-field \(✔\) Multi-Head Self-Attention \(\times\) \(✔\) \(✔\)
[22] RACNN \(✔\) Far-field and Near-field \(✔\) Self-Attention \(\times\) \(✔\) \(\times\)
Our work RACNN and BiLSTM \(✔\) Far-field and Near-field \(✔\) Multi-Head Self-Attention \(✔\) \(✔\) \(✔\)

2.1 Far-Field Channel Estimation Technologies↩︎

Conventional far-field channel estimation techniques, such as LS and MMSE [7], have been widely adopted. [23] and [24] both exploited the sparsity of the shared base station (BS)-to-reconfigurable intelligent surface (RIS) channel and respectively proposed methods based on compressive sensing (CS) and low-rank matrix completion (LRMC) to jointly estimate the cascaded BS-RIS-user channel. Since the RIS consists of passive reflecting elements and cannot actively process signals, channel estimation must be carried out by the BS, which means that the BS-RIS and RIS-user channels cannot be directly estimated independently. Therefore, Zhou et al. constructed a reciprocity link within a full-duplex BS and employed parallel factor (PARAFAC) decomposition and atomic norm minimization (ANM) at the BS to effectively decouple and estimate the BS-RIS and RIS-user channels, respectively [25]. Rekik et al. exploited the orthogonality property of the orthogonal frequency division multiplexing (OFDM) system to independently estimate the covariance matrix and noise subspace for each subcarrier, and then minimized a global cost function in parallel to estimate channel coefficients efficiently [26]. In movable antenna (MA) systems, Xiao et al. proposed a CS framework to recover the multi-path components (MPC) by exploiting the sparsity in the angular domain [27].

A key limitation of the above-mentioned far-field studies is the neglect of time-varying channel characteristics under mobility. Xu et al. addressed Doppler effects and doubly selective fading in high-mobility RIS systems using MMSE-based interpolation and reflection optimization [12], [28]. Byun et al. decomposed UAV-RIS channels into slow-varying angles and fast-varying gains, estimated at different timescales [3].

2.2 Near-Field Channel Estimation Technologies↩︎

The emergence of XL-MIMO and high-frequency communications has brought near-field propagation to the forefront, where traditional plane-wave assumptions break down [4]. Recent studies address both channel estimation and user localization under spherical wavefronts. For instance, Yang et al. and Lei et al. systematically analyzed near-field characteristics—spherical waves, finite-depth beam focusing, and spatial non-stationarity—and reviewed corresponding localization and estimation algorithms [6], [29].

In [30], Lu et al. proposed a hybrid LoS and non-line of sight (NLoS) near-field channel estimation model, separating the geometric free-space LoS path from NLoS components and introducing the MIMO advanced Rayleigh distance to define model boundaries. Cui et al. introduced polar-domain sparse representations and developed both on-grid and off-grid recovery algorithms applicable to near- and far-field scenarios [4]. Although the polar-domain channel representation method can characterize the spherical wave properties, it requires sampling in both the angle and distance domains, leading to high storage overhead and high dictionary coherence. To address these issues, Zhang et al. proposed a distance-parameterized angular-domain model with a joint dictionary learning and orthogonal matching pursuit (OMP) algorithm. In this approach, the user distance is treated as an unknown parameter in the dictionary, so that the number of dictionary columns depends only on the angular resolution, thereby avoiding the storage and computational burden [31].

For channel estimation in wideband millimeter-wave XL-RIS systems, Yang et al. analyzed the near-field beam squint effect and designed a wideband spherical-domain dictionary with angle-distance dimension sampling to minimize the atomic coherence [32]. To address the issue that both CS and traditional Bayesian learning (BL) methods struggle to simultaneously capture sub-array-specific sparsity and sparsity shared across sub-arrays, Pisharody et al. modeled sub-array sparsity using Gaussian mixture priors and developed both centralized and decentralized variational BL algorithms [33]. Jiang et al. proposed a neural network-aided joint optimization approach that simultaneously optimizes the beamformer and the localization function, effectively integrating data capturing environmental characteristics into the system design and further exploiting range information embedded in frequency selectivity to improve estimation accuracy [14].

Few near-field studies adequately address mobility-induced time variation. Zhu et al. proposed an adaptive joint sparse Bayesian Kalman filter that leverages temporal correlation in time-varying channels [13]. In [34], Haghshenas et al. developed a parametric maximum likelihood framework with adaptive RIS configuration for channel estimation and user tracking under mobility and Rician fading.

2.3 DL-Based Channel Estimation Technologies↩︎

DL excels at learning nonlinear mappings and implicitly capturing structural dependencies, making it well-suited for channel estimation. Liu et al. deeply integrated physical channel correlations with DL architectures, paving the way for numerous subsequent data-driven estimators [7]. In [35], Mei et al. provided theoretical performance analysis for DL-based estimators, addressing a gap in earlier simulation-only validations. In [36] and [37], Gizzini et al. surveyed DL methods for high-mobility and frequency-selective channels and proposed a CNN-based weighted interpolation estimator.

Hybrid architectures are widely adopted to capture both spatial and temporal features. CNN-LSTM models combine spatial feature extraction with temporal dependency modeling [10], [38], [39]. BiLSTM variants further improve temporal modeling by incorporating both forward and backward contexts [16], [17], [40]. In [41], Zhou et al. combined CNN and convolutional LSTM for spatiotemporal channel prediction. In [42], Yuan et al. integrated CNN with autoregressive and recurrent models for multi-step forecasting.

Deep neural network (DNN)–based approaches have also evolved. For example, in [43], Zheng et al. designed an adaptive online DNN trained solely on pilot signals, eliminating the need for true channel labels. In [44], Zhang et al. introduced a tensor-train DNN to reduce parameter size and allow recursive estimation by incorporating historical information. In [45], Guo et al. used DNNs to predict key channel parameters (such as gain, delay, and Doppler shift) rather than the full channel matrix. In [46], Yang et al. designed a DNN for doubly selective fading channels, incorporating pilots and historical CSI for tracking. In [47], Gao et al. combined sparse Bayesian learning with DNNs to mitigate power leakage and beam squint in mmWave MIMO systems. In [15], Yu et al. introduced a 3D geometric channel model for IRS-assisted UAV systems, combining DNN-based pre-estimation with BiLSTM-based tracking.

Attention mechanisms and residual connections further enhance the performance of DL-based channel estimators [11], [19]. Residual connections alleviate gradient vanishing and enable deeper networks, while attention mechanisms explicitly model inter-feature dependencies. Specifically, channel attention (CA) and spatial attention (SA) reweight informative channels and locations, respectively. Transformer-based self-attention captures global and long-range dependencies through query-key-value interactions [11]. As a result, several studies have successfully integrated these concepts into channel estimation frameworks. In [21], Guo et al. reformulated the problem as a parallel super-resolution task using multi-head attention. In [20], Chen et al. combined CNN and Transformer to capture long-range and nonlinear temporal dynamics. In [48], Sun et al. designed a lightweight Transformer for time-series refinement of channel estimates. In [49], Ju et al. used multi-head self-attention to learn spatio-temporal correlations and combat channel aging. In [22], Lam et al. proposed a residual attention CNN framework for hybrid-field channels. Gao et al. and Zheng et al. integrated channel and spatial attention with residual learning for noise suppression and high-resolution reconstruction [18], [19].

Unlike previous studies focused on static or single-scenario estimation, this work tackles time-varying channels in dynamic dual scenarios by incorporating real-time transceiver positions as geometric priors and jointly modeling spatio-temporal features within a unified learning framework.

Notations: In this paper, \((\cdot)^H\) and \((\cdot)^T\) denote the Hermitian (conjugate transpose) and transpose operations, respectively.

Figure 1: Network model

3 Network Model and Preliminaries↩︎

Consider an OFDM time division duplex (TDD) communication scenario, where the BS is unable to provide effective communication due to distance and obstructions. A UAV serves as a relay, equipped with an \(N\)-element uniform linear array (ULA) and \(N_{\mathrm{RF}}\) RF chains, providing reliable communication to \(K\) ground user equipments (UEs). Due to the mobility of the UAV and the users, as well as environmental factors, both LoS and NLoS links exist in the communication channel. The UAV and UEs are assumed to move with velocities \(v_U\) and \(v_k\), respectively. As shown in Fig. 1, in practical applications, the UAV’s flight altitude together with its position relative to the ground UE determines the propagation region (near field or far field) [4]. Therefore, the relative distance between the UAV and the UE plays a significant role in the communication channel.

3.1 Signal Model↩︎

For uplink channel estimation, all users employ mutually orthogonal pilot sequences. This orthogonality allows the channel estimation for each user to be performed independently. We focus on an arbitrary user, who is assigned \(M\) subcarriers with a pilot length of \(Q\). In the \(t\)-th time slot, the UE transmits a known orthogonal pilot sequence, denoted by \(s_m(t)\), on the \(m\)-th subcarrier. The corresponding received pilot signal at the UAV, denoted by \(\mathbf{y}_m(t) \in \mathbb{C}^{N_{\mathrm{RF}} \times 1}\), is then given by

\[\mathbf{y}_m(t) = \mathbf{W}_{\mathrm{RF}}^{H}(t) \mathbf{h}_m s_m(t) + \mathbf{W}_{\mathrm{RF}}^{H}(t) \mathbf{n}_m(t),\] where \(\mathbf{W}_{\mathrm{RF}}^{H}(t)\!\in\!\mathbb{C}^{N_{\mathrm{RF}}\times N}\) denotes the UAV RF combining matrix, \(\mathbf{h}_m\!\in\!\mathbb{C}^{N\times 1}\) is the channel at the \(m\)-th subcarrier, and \(\mathbf{n}_m(t)\!\sim\!\mathcal{CN}\!\left(\mathbf{0},\sigma^2 \mathbf{I}_N\right)\) is additive white Gaussian noise (AWGN). By concatenating all the received pilots at the \(m\)-th subcarrier (i.e., \(\mathbf{y}_m =\big[\mathbf{y}_m^{T}(1),\,\mathbf{y}_m^{T}(2),\,\ldots,\,\mathbf{y}_m^{T}(Q)\big]^{T} \in \mathbb{C}^{Q N_{\mathrm{RF}}\times 1}\)) and normalizing the pilot as \(s_m(t)=1\) without loss of generality, we obtain \[\mathbf{y}_m = \mathbf{W}_{\mathrm{RF}}^{H} \mathbf{h}_m + \mathbf{n}_m,\] where \(\mathbf{W}_{\mathrm{RF}} = \left[ \mathbf{W}_{\mathrm{RF}}(1), \ldots, \mathbf{W}_{\mathrm{RF}}(Q) \right] \in \mathbb{C}^{N \times Q N_{\mathrm{RF}}}\), and \(\mathbf{n}_m = \big[ \mathbf{n}_m^{T}(1), \ldots, \mathbf{n}_m^{T}(Q) \big]^{T} \in \mathbb{C}^{Q N_{\mathrm{RF}} \times 1}\). Furthermore, the pilots received across all subcarriers (i.e., \(\mathbf{Y} = \left[ \mathbf{y}_1, \ldots, \mathbf{y}_M \right] \in \mathbb{C}^{Q N_{\mathrm{RF}} \times M}\)) can be expressed as \[\mathbf{Y} = \mathbf{W}_{\mathrm{RF}}^{H} \mathbf{H} \mathbf{S} + \mathbf{N},\] where \(\mathbf{H} = \left[ \mathbf{h}_1, \ldots, \mathbf{h}_M \right] \in \mathbb{C}^{N \times M}\), \(\mathbf{S} = \left[ \mathbf{s}_1, \ldots, \mathbf{s}_M \right] \in \mathbb{C}^{N_{\mathrm{RF}} \times M}\), and \(\mathbf{N} = \left[ \mathbf{n}_1, \ldots, \mathbf{n}_M \right] \in \mathbb{C}^{Q N_{\mathrm{RF}} \times M}\).
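As an illustration of the pilot model above, the following NumPy sketch assembles \(\mathbf{Y}\) for a random analog combiner and channel. All dimensions, the unit-modulus combiner, and the unit pilots \(s_m(t)=1\) are assumed for illustration only, not taken from the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumed): antennas, RF chains, pilot length, subcarriers
N, N_RF, Q, M = 64, 4, 8, 16
sigma2 = 0.01  # noise variance

# Unit-modulus analog combiner, stacked over the Q pilot slots: (N, Q*N_RF)
W_RF = np.exp(1j * 2 * np.pi * rng.random((N, Q * N_RF))) / np.sqrt(N)

# Random channel H = [h_1, ..., h_M], one column per subcarrier
H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)

# Combined AWGN term; pilots are normalized to s_m(t) = 1 here
noise = np.sqrt(sigma2 / 2) * (rng.standard_normal((Q * N_RF, M))
                               + 1j * rng.standard_normal((Q * N_RF, M)))

# Received pilot matrix Y = W_RF^H H + N, of size (Q*N_RF, M)
Y = W_RF.conj().T @ H + noise
print(Y.shape)  # (32, 16)
```

With these shapes, each column of `Y` stacks the \(Q\) combined observations of one subcarrier, matching the per-subcarrier concatenation above.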

3.2 Channel Model↩︎

Conventional communication models predominantly rely on the far-field channel assumption, which approximates radiated electromagnetic waves as planar wavefronts [4]. However, this assumption becomes increasingly invalid with the growing number of antenna elements and reduced communication distances, where near-field effects are non-negligible [31]. The boundary between these regimes is defined by the Rayleigh distance, \(d_F = 2D^2/\lambda\), where \(D\) is the antenna array aperture and \(\lambda\) is the wavelength [4]. In UAV communications, the link distance may fall on either side of \(d_F\), leading to dynamic shifts in channel propagation characteristics. These far-field and near-field channels are fundamentally distinct, primarily in their dependencies on distance and angle, thus necessitating separate modeling approaches.
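As a quick numerical check of this boundary, the following sketch evaluates \(d_F = 2D^2/\lambda\); the 28 GHz carrier and 64-element half-wavelength ULA are assumed values for illustration, not the paper's configuration.

```python
def rayleigh_distance(D, lam):
    """Rayleigh distance d_F = 2 D^2 / lambda separating near- and far-field regions."""
    return 2 * D ** 2 / lam

# Example: 64-element half-wavelength ULA at 28 GHz (illustrative parameters)
c = 3e8                  # speed of light (m/s)
f = 28e9                 # carrier frequency (Hz)
lam = c / f              # wavelength, about 1.07 cm
N = 64
D = (N - 1) * lam / 2    # array aperture for half-wavelength spacing
d_F = rayleigh_distance(D, lam)
print(round(d_F, 2))     # 21.26 -- links shorter than ~21 m are near-field here
```

Even at this modest aperture the near-field region extends tens of meters, which is exactly the range of a low-altitude UAV link.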

(1) Far-field Channel Model↩︎

When the UAV-UE separation exceeds the Rayleigh distance \(d_F\), the channel operates in the far-field regime, characterized by a dependence on the angle of arrival (AoA) and a phase that varies linearly across the array elements. The corresponding channel at the \(m\)-th subcarrier can be modeled as [31] \[\mathbf{h}_m = \sqrt{\frac{N}{L}} \sum_{\ell=1}^{L} \alpha_{\ell} e^{-j \frac{2 \pi f_m r_{\ell}}{c}} \textrm{a}(\theta_{\ell}),\] where \(L\) is the number of effective propagation paths, \(f_m\) is the frequency of the \(m\)-th subcarrier, \(c\) is the speed of light, \(\alpha_{\ell}\) is the complex gain of the \(\ell\)-th propagation path, \(r_{\ell}\) denotes the propagation distance of the \(\ell\)-th path to the center of the UAV array, \(\theta_{\ell}\) is the physical angle of the \(\ell\)-th path, and \(\textrm{a}(\theta_{\ell}) \in \mathbb{C}^{N \times 1}\) represents the array response to the angle \(\theta_{\ell}\). For a ULA, the far-field array response is represented as \[\textrm{a}(\theta_{\ell}) = \frac{1}{\sqrt{N}} \left[ 1, e^{j \frac{2 \pi d}{\lambda} \sin \theta_{\ell}}, \dots, e^{j \frac{2 \pi d}{\lambda} (N-1) \sin \theta_{\ell}} \right]^H,\] where \(d=\lambda/2\) is the element spacing of the antenna array.
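A minimal NumPy sketch of this far-field steering vector, assuming half-wavelength spacing as in the model above (angle and carrier are illustrative values):

```python
import numpy as np

def ula_farfield_response(N, theta, lam):
    """Far-field ULA response a(theta) with d = lambda/2 spacing.
    Follows the paper's convention: the bracketed vector is conjugated (^H)."""
    d = lam / 2
    n = np.arange(N)
    phase = 2 * np.pi * d / lam * n * np.sin(theta)
    return np.exp(1j * phase).conj() / np.sqrt(N)

a = ula_farfield_response(64, np.deg2rad(30), lam=3e8 / 28e9)
print(a.shape, round(np.linalg.norm(a), 6))  # (64,) 1.0 -- unit-norm steering vector
```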

(2) Near-field Channel Model↩︎

When the UAV-UE distance falls below the Rayleigh distance \(d_F\), the incident wave exhibits a spherical wavefront [14]. Consequently, both the amplitude and phase of the received signal exhibit significant variations across the antenna array due to the differing propagation distances. This necessitates an array response model that is jointly dependent on both the AoA and the distance. Thus, the near-field channel for the \(m\)-th subcarrier is modeled as [13] \[\mathbf{h}_m = \sqrt{\frac{N}{L}} \sum_{\ell=1}^{L} \alpha_{\ell} e^{-j \frac{2 \pi f_m r_{\ell}}{c}} \textrm{a}(\theta_{\ell}, r_{\ell}),\] where \(\textrm{a}(\theta_{\ell}, r_{\ell}) \in \mathbb{C}^{N \times 1}\) is the near-field array response vector for the \({\ell}\)-th path, capturing the combined effects of distance and angle. For a ULA, the corresponding near-field array response vector is expressed as [13] \[\textrm{a}(\theta_{\ell}, r_{\ell}) = \frac{1}{\sqrt{N}} \left[ e^{j \frac{2 \pi}{\lambda} (r_{\ell}^{(0)} - r_{\ell})}, \ldots, e^{j \frac{2 \pi}{\lambda} (r_{\ell}^{(N-1)} - r_{\ell})} \right]^H,\] where \(r_{\ell}^{(n)}\) is the distance between the UE and the \(n\)-th antenna, and \(r_{\ell}\) denotes the distance between the center of the UAV array and the UE or scatterer of the \(\ell\)-th path. The per-antenna distance is given by \[r_{\ell}^{(n)} = \sqrt{r_{\ell}^2 + \delta_n^2 d^2 - 2 r_{\ell} \delta_n d \sin\theta_{\ell}},\] where \(\delta_n = \frac{2n - N - 1}{2}\).
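A matching NumPy sketch of the near-field response, using the per-antenna distance formula above with half-wavelength spacing (the 5 m range and 30° angle are illustrative):

```python
import numpy as np

def ula_nearfield_response(N, theta, r, lam):
    """Near-field ULA response a(theta, r). Per-antenna distances follow
    r^(n) = sqrt(r^2 + delta_n^2 d^2 - 2 r delta_n d sin(theta)),
    with delta_n = (2n - N - 1)/2 and d = lambda/2 spacing."""
    d = lam / 2
    delta = (2 * np.arange(1, N + 1) - N - 1) / 2
    r_n = np.sqrt(r ** 2 + (delta * d) ** 2 - 2 * r * delta * d * np.sin(theta))
    # The bracketed vector is conjugated (^H), as in the paper's convention
    return np.exp(-1j * 2 * np.pi / lam * (r_n - r)) / np.sqrt(N)

lam = 3e8 / 28e9
a_nf = ula_nearfield_response(64, np.deg2rad(30), r=5.0, lam=lam)  # 5 m: near-field link
print(a_nf.shape, round(np.linalg.norm(a_nf), 6))  # (64,) 1.0
```

As \(r\) grows large, the phase profile of this vector approaches the linear far-field profile, which is the transition the proposed framework must track.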

3.3 Performance Metrics of Channel Estimation↩︎

To quantitatively assess the quality of the proposed channel estimation framework, it is essential to define appropriate performance metrics. The mean squared error (MSE) directly measures the average squared discrepancy between the estimated and the true channel coefficients, providing a fundamental measure of estimation accuracy. To enable a fair comparison across different scenarios and channel conditions, this error is commonly normalized, leading to the normalized mean squared error (NMSE).

1) MSE Loss Function

During training, the model parameters are optimized by minimizing the MSE between the channel estimates and the ground truth. To this end, the MSE loss function is expressed as follows [39] \[\label{eq:MSE} \mathcal{L}_{\mathrm{MSE}} = \|\mathbf{\hat{H}}_k - \mathbf{H}_k\|^2,\tag{1}\] where \(\mathbf{\hat{H}}_k\) represents the estimated channel, \(\mathbf{H}_k\) is the true channel, and \(\| \cdot \|^2\) denotes the squared L2 norm.

2) NMSE

The NMSE serves as a key performance metric for the channel estimation algorithm. A lower NMSE corresponds to higher estimation accuracy. This metric is adopted and defined as follows [15] \[\mathrm{NMSE} = \frac{\mathbb{E}\left( \| \mathbf{\hat{H}}_k - \mathbf{H}_k \|^2 \right)}{\mathbb{E}\left( \| \mathbf{H}_k \|^2 \right)}.\]
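Both metrics can be computed directly from an estimate and the ground truth; a small NumPy sketch with two sanity checks (the channel values are random placeholders):

```python
import numpy as np

def nmse(H_hat, H):
    """NMSE = E(||H_hat - H||^2) / E(||H||^2), here as an empirical ratio."""
    err = np.sum(np.abs(H_hat - H) ** 2)
    ref = np.sum(np.abs(H) ** 2)
    return err / ref

def nmse_db(H_hat, H):
    """NMSE in decibels, as commonly plotted against SNR."""
    return 10 * np.log10(nmse(H_hat, H))

rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8)) + 1j * rng.standard_normal((4, 8))
print(nmse(H, H))      # 0.0 -- perfect estimate
print(nmse(2 * H, H))  # 1.0 -- error power equals channel power (0 dB)
```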

3.4 Problem Formulation↩︎

Consider a UAV-assisted communication system operating in a low-altitude environment, where the continuously varying distance between the UAV and the users causes dynamic switching between near-field and far-field channels. The goal is to estimate the near-field and far-field channels \(\mathbf{H}\) from the observable received signal \(\mathbf{Y}\). For the \(t\)-th time slot, the propagation channel between the \(k\)-th UE and the UAV is denoted as \(\mathbf{H}_k(t)\), and the corresponding estimate is \(\hat{\mathbf{H}}_k(t)\). The objective of network training is to minimize the MSE between the estimated channel and the true channel. Therefore, the pilot-aided channel estimation problem is formulated as \[\label{eq:formulated32problem} \begin{align} \min\;& \mathbb{E}\!\left[\big\| \mathbf{H}_k(t)-\hat{\mathbf{H}}_k(t)\big\|_{F}^{2}\right], \\ \text{s.t.}\;& \operatorname{tr}\!\left(\mathbf{s}_k \mathbf{s}_k^{H}\right)=E_s, \end{align}\tag{2}\] where \(\|\cdot\|_F^2\) denotes the squared Frobenius norm, \(E_s = Q T_s P\) is the energy constraint, \(T_s\) is the duration of a single time slot, and \(P\) is the transmission power. Conventional model-based estimators often struggle to capture the complex interplay between spatial variations, which are induced by large antenna arrays and varying propagation distances, and temporal dynamics resulting from the mobility of both the UAV and the UEs. To effectively address Problem 2, in the following section, we propose a DL-based solution that jointly exploits spatial and temporal channel structures.

4 Proposed DL-Based Channel Estimation↩︎

To address the challenge of channel estimation in dynamic propagation environments, in this section, we develop a novel DL method. The core architecture integrates three key components: CNNs for spatial feature extraction, a BiLSTM for capturing temporal evolution, and a residual attention mechanism for improved feature representation. This unified channel estimation framework, as illustrated in Fig. 2, facilitates the adaptive learning of complex channel characteristics across both far-field and near-field conditions.

Figure 2: The framework of the proposed channel estimation algorithm

4.1 Data Preprocessing↩︎

To formulate the channel estimation problem in vector form, we consider the signal matrix \(\mathbf{Y}_k(t) \in \mathbb{C}^{N \times Q}\) received from the \(k\)-th UE in the \(t\)-th time slot. This matrix is vectorized into \(\mathbf{y}_k(t) = \text{vec}(\mathbf{Y}_k(t)) \in \mathbb{C}^{N Q \times 1}\), and then split into real and imaginary components to form the real-valued input vector \(\mathcal{Y}_k(t) = [\Re(\mathbf{y}_k(t))^T, \Im(\mathbf{y}_k(t))^T]^T \in \mathbb{R}^{2N Q}\), which constitutes the "input image" \(\mathbf{X}\) of the DL model. Similarly, the target channel matrix \(\mathbf{H}_k(t)\) is vectorized as \(\mathbf{h}_k(t)\), which serves as the regression target. A key aspect of our framework is that it does not require explicit prior knowledge of the near-field or far-field region. Instead, it learns to discern the underlying channel response patterns directly from the received signals, augmented by the real-time positional information of the UAV and the UE to enhance estimation accuracy. Specifically, given the known UAV position \(\mathbf{r}_U(t) = (x_U(t), y_U(t), z_U(t))\) and UE position \(\mathbf{r}_k(t) = (x_k(t), y_k(t), z_k(t))\) at the \(t\)-th time slot, the concatenated position vector \(\mathbf{r} = [\mathbf{r}_U(t), \mathbf{r}_k(t)]\) is first normalized via Layer Normalization to stabilize training. It is then non-linearly projected by a fully-connected layer into a compact embedding, given by \[R=\sigma\!\left(\mathbf{W}\cdot\left(\gamma\,\frac{\mathbf{r}-\mu_{L}}{\sqrt{\sigma_{L}^{2}+\epsilon}}+\beta\right)+b\right),\]

where \(\sigma(\cdot)\) is the ReLU activation function defined by \(\sigma(x) = \max(0, x)\), \(R\) denotes the processed position feature, \(\mathbf{W}\) and \(b\) are the weight matrix and bias term of the fully connected layer, and \(\gamma\), \(\beta\), and \(\epsilon\) are the learnable scale and shift parameters and a small constant for numerical stability, respectively. Assuming that the UAV and UE move with velocities \(v_U\) and \(v_k\) along fixed directions, their positions at each time step are updated as

\[\begin{align} \mathbf{r}_U(t) &= \mathbf{r}_U(t-1) + v_U \cdot \Delta t \cdot \mathbf{d}_U, \\ \mathbf{r}_k(t) &= \mathbf{r}_k(t-1) + v_k \cdot \Delta t \cdot \mathbf{d}_k, \end{align}\] where \(\mathbf{d}_U\) and \(\mathbf{d}_k\) denote the unit movement direction vectors of the UAV and the UE, respectively, and \(\Delta t\) is the time interval.
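The position branch described above can be sketched in NumPy as follows: a constant-velocity update, then Layer Normalization over the concatenated 6-D position vector, followed by a fully-connected layer with ReLU. The 32-dimensional embedding size, the placeholder weights, and all positions and velocities are assumptions for illustration; in the actual network these parameters are learned.

```python
import numpy as np

def update_position(r_prev, v, d_unit, dt):
    """Constant-velocity update r(t) = r(t-1) + v * dt * d."""
    return np.asarray(r_prev) + v * dt * np.asarray(d_unit)

def position_embedding(r, W, b, gamma=1.0, beta=0.0, eps=1e-5):
    """LayerNorm over the position vector, then ReLU(W x + b).
    W, b, gamma, beta would be learned; here they are placeholders."""
    r_norm = gamma * (r - r.mean()) / np.sqrt(r.var() + eps) + beta
    return np.maximum(0.0, W @ r_norm + b)

rng = np.random.default_rng(1)
r_U = update_position([0.0, 0.0, 100.0], v=10.0, d_unit=[1.0, 0.0, 0.0], dt=0.1)
r_k = update_position([50.0, 20.0, 0.0], v=1.5, d_unit=[0.0, 1.0, 0.0], dt=0.1)
r = np.concatenate([r_U, r_k])            # concatenated [r_U(t), r_k(t)]
W = rng.standard_normal((32, 6)) * 0.1    # 32-D embedding (assumed size)
R = position_embedding(r, W, b=np.zeros(32))
print(R.shape)  # (32,)
```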

4.2 Neural Network Architecture↩︎

The proposed network model is architected around four interconnected components that function cohesively. (1) A CNN extracts local spatial features from the channel observations; its learning process is stabilized by residual connections and its representational capacity enhanced by a Transformer attention mechanism, which directs the model’s focus to critical regions. (2) The temporal dynamics inherent in the historical CSI, driven by the relative movement between the UAV and the UE, are captured by a BiLSTM network, thereby modeling the channel’s evolution over time. (3) The real-time positional data of the UAV and the UE are embedded into the network as geometric priors, offering essential structural information about the propagation environment. (4) The high-level temporal features from the BiLSTM and the embedded geometric prior are concatenated and fused through a fully connected regression layer to produce the final estimate of the channel matrix for the current time step.

4.2.1 Convolutional-Attention Feature Extraction Module↩︎

This module consists of convolutional layers (conv2D), residual attention blocks (RA), ReLU activation functions, and batch normalization (BN). We denote the input to the CNN module by \(\mathbf{X} \in \mathbb{R}^{H\times W\times C_{\text{in}}}\), where \(H\), \(W\), and \(C_{\text{in}}\) denote the image height, image width, and the number of input channels, respectively. In the convolution operation, the data is convolved with the kernel \(\mathrm{w}\), which slides over the input image. The stride is set to \(1\), and padding is applied so that the convolution keeps the spatial dimension of the output feature map consistent with the input. We use several conv2D layers, and the corresponding mathematical expression is given by \[\label{eq:Conv} \mathbf{X}'_{h,w,c} = \sum_{i=1}^{r} \sum_{j=1}^{r} \sum_{c'=1}^{C_{\text{in}}} \mathbf{X}_{h+i-1, w+j-1, c'} \cdot \mathrm{w}_{i,j,c',c} + b_c,\tag{3}\] where \(\mathbf{X}'_{h,w,c}\) denotes the output of the convolution at position \((h,w)\) and output channel \(c\) of the feature map \(\mathbf{X}'\), \(c'\) is the channel index of the input feature map, and \(i\) and \(j\) index the sliding positions of the convolution kernel over the data. \(\mathrm{w} \in \mathbb{R}^{r \times r \times C_{\text{in}} \times C_{\text{out}}}\) is the convolution kernel, with \(r \times r\) representing the kernel size and \(C_{\text{out}}\) the number of output channels. In our implementation, the first convolution takes \(C_0 = 2\) input channels, and the number of output channels is set to \(C_1 = 64\). In the subsequent layers, the numbers of input and output channels are fixed as \(C_{\text{in}} = C_{\text{out}} = 64\). BN is applied after the convolutional layer for normalization. First, the feature map \(\mathbf{X}'\) of the convolutional layer needs to be normalized, denoted by \(\hat{\mathbf{X}}'\), based on the conclusion of [22]

\[\label{eq:norm} \hat{\mathbf{X}}' = \frac{\mathbf{X}' - \mu_B}{\sqrt{\sigma_B^2 + \epsilon'}},\tag{4}\] where \(\mu_B\) and \(\sigma_B^2\) are the mean and variance of the current batch data, and \(\epsilon'\) is a small constant to prevent division by zero. Next, to restore the network’s expressive capacity, scaling and shifting are typically applied to obtain the final BN output

\[\label{eq:BNout} \mathbf{X}_{\rm{BN}} = \gamma \hat{\mathbf{X}}' + \beta,\tag{5}\] where \(\gamma\) and \(\beta\) denote learnable scale and shift parameters, respectively. The outputs of some Conv2D layers are passed through the ReLU activation function to introduce non-linearity, which also helps reduce overfitting during training. The output after the series of convolution, BN, and ReLU operations is denoted \(\mathbf{X}_1\), which serves as the input to the subsequent residual attention block. The residual connection mitigates the gradient vanishing problem in deep networks while preserving the integrity of the original information, thus stabilizing the training process. The attention mechanism enables the network to adaptively focus on the most important regions of the input feature map, thereby highlighting the features most valuable for channel estimation. Based on the Transformer multi-head attention architecture, we design a mechanism that allocates attention weights (valued between 0 and 1) through query-key-value dot-product calculations. These weights dynamically determine which elements of the input sequence each output position should attend to, thereby establishing global dependencies and enhancing the model’s ability to represent complex channel environments. Specifically, the input feature matrix \(\mathbf{X}_1\) passes through three independent linear projection layers to generate the query matrix \(\mathbf{Q}\), key matrix \(\mathbf{K}\), and value matrix \(\mathbf{V}\). With \(N_h\) attention heads, the following projections are computed for each head \(i\) [21] \[\label{eq:matrix} \mathbf{Q}_i = \mathbf{X}_1 \mathbf{W}_i^q, \quad \mathbf{K}_i = \mathbf{X}_1 \mathbf{W}_i^k,~ \text{and}~\mathbf{V}_i = \mathbf{X}_1 \mathbf{W}_i^v,\tag{6}\] where \(\mathbf{Q}_i\), \(\mathbf{K}_i\), and \(\mathbf{V}_i\) are the \(i\)-th head’s query, key, and value matrices, obtained with learnable projection matrices \(\mathbf{W}_i^q\), \(\mathbf{W}_i^k\), and \(\mathbf{W}_i^v\), and \(i \in \{1, 2, \dots, N_h\}\).
The attention weights are computed by comparing the query and key vectors, whose similarity determines the extent to which one input element attends to another. Multi-head attention leverages the query \(\mathbf{Q}_i\) and key \(\mathbf{K}_i\) to capture the spatiotemporal dependencies between elements in the channel sequence. The scaled dot-product attention weights for each head \(i\) are computed as \[\label{eq:Attention} \mathbf{A}_i = \text{softmax}\left( \frac{\mathbf{Q}_i \mathbf{K}_i^T}{\sqrt{d_k}} \right),\tag{7}\] where \(d_k\) is the dimension of the key vectors, and \(\sqrt{d_k}\) is a scaling factor that prevents the dot products from becoming too large, which would drive the softmax into regions of vanishing gradient. Each head then performs a weighted sum of its value matrix \(\mathbf{V}_i\) using the computed attention weights \(\mathbf{A}_i\), so the output of head \(i\) is \(\text{head}_i = \mathbf{A}_i \mathbf{V}_i\). Multi-head attention enables the model to simultaneously focus on different subspaces of the input feature representations and process them in parallel. The outputs of all heads are concatenated and then mapped through a linear transformation to obtain the final output. That is, \[\label{eq:Concat} \mathbf{X}_{\text{att}} = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_{N_h}) \mathbf{W}^O,\tag{8}\] where \(\text{Concat}(\cdot)\) performs concatenation, and \(\mathbf{W}^O\) is a weight matrix that projects the concatenated heads into a unified representation. This output is then additively combined with the input via a residual connection, followed by layer normalization. This alleviates the problems of gradient vanishing/exploding and enhances training stability, producing the final normalized output, namely

\[\label{eq:res} \mathbf{X}_{\text{out}} = \text{LayerNorm}(\mathbf{X}_1 + \mathbf{X}_{\text{att}}).\tag{9}\] For temporal modeling, the spatial features are vectorized and fed into the BiLSTM in time-major order to learn dependencies across time steps [39].
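The attention computation in Eqs. (6)–(9) can be sketched in NumPy as follows. This is a minimal illustration, not the exact implementation: slicing a single set of projection matrices into per-head blocks (an equivalent formulation of separate \(\mathbf{W}_i^q\), \(\mathbf{W}_i^k\), \(\mathbf{W}_i^v\)) and the omission of the learnable LayerNorm affine parameters are simplifying assumptions.

```python
import numpy as np

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """Scaled dot-product multi-head attention with residual connection and
    LayerNorm, following Eqs. (6)-(9). X has shape (T, d_model)."""
    T, d_model = X.shape
    d_k = d_model // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # Eq. (6): linear projections
    heads = []
    for i in range(num_heads):
        sl = slice(i * d_k, (i + 1) * d_k)        # per-head subspace (assumption)
        Qi, Ki, Vi = Q[:, sl], K[:, sl], V[:, sl]
        scores = Qi @ Ki.T / np.sqrt(d_k)         # scaled dot product
        A = np.exp(scores - scores.max(axis=-1, keepdims=True))
        A /= A.sum(axis=-1, keepdims=True)        # Eq. (7): softmax weights in [0, 1]
        heads.append(A @ Vi)                      # head_i = A_i V_i
    X_att = np.concatenate(heads, axis=-1) @ Wo   # Eq. (8): concat + output projection
    Y = X + X_att                                 # residual connection
    mu, var = Y.mean(-1, keepdims=True), Y.var(-1, keepdims=True)
    return (Y - mu) / np.sqrt(var + 1e-6)         # Eq. (9): LayerNorm (no affine)
```

Each row of the returned matrix is normalized to zero mean, mirroring the layer-normalized output \(\mathbf{X}_{\text{out}}\).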

4.2.2 LSTM Temporal Modeling Module↩︎

LSTM is an advanced variant of the recurrent neural network (RNN) that addresses the vanishing gradient and long-term dependency issues of traditional RNNs by incorporating gating mechanisms. These gates enable LSTM to selectively retain important temporal information and suppress irrelevant data, making it effective for time-series modeling. In our channel estimation framework, LSTM acts as the second module, receiving spatial features extracted by the CNN module for temporal modeling. The LSTM unit comprises three key gates: the forget gate, input gate, and output gate, as well as a cell state for long-term memory. At each time step, the gates determine which information to keep, update, or output based on the current input and the previous hidden state, ensuring efficient temporal dependency learning. The first is the forget gate, which combines the current input \(x_t\) with the previous hidden state \(h_{t-1}\) and controls how much of the previous time step’s memory state \(c(t\!-\!1)\) is retained at the current time step \[\label{eq:f} f_{g,t} = \sigma(\mathbf{W}_f x_t + \mathbf{W}_f' h_{t-1} + b_f).\tag{10}\] The second is the input gate, which determines the degree to which new information is incorporated at the current time step \[\label{eq:i} i_{g,t} = \sigma(\mathbf{W}_i x_t + \mathbf{W}_i' h_{t-1} + b_i).\tag{11}\] The third is the candidate cell input state, implemented using the \(\text{tanh}\) function \[\label{eq:c} c_{g,t} = \text{tanh}(\mathbf{W}_c x_t + \mathbf{W}_c' h_{t-1} + b_c).\tag{12}\] The last is the output gate \[\label{eq:o} o_{g,t} = \sigma(\mathbf{W}_o x_t + \mathbf{W}_o' h_{t-1} + b_o),\tag{13}\] where \(\mathbf{W}_f\), \(\mathbf{W}_i\), \(\mathbf{W}_c\), and \(\mathbf{W}_o\) denote the input weights for each gate at the current time step, \(\mathbf{W}'_f\), \(\mathbf{W}'_i\), \(\mathbf{W}'_c\), and \(\mathbf{W}'_o\) denote the historical state weights, and \(b_f\), \(b_i\), \({b}_c\), and \(b_o\) are the bias terms.
\(f_{g,t}\), \(i_{g,t}\), and \(o_{g,t}\) are control vectors, and \(\sigma(\cdot)\) is the gate activation function, typically chosen as the sigmoid. Apart from the candidate cell input state \(c_{g,t}\) in the hidden layer, the architecture maintains two additional cell states: the previous cell state \(c(t\!-\!1)\) received by the current LSTM unit and the current cell state \(c(t)\) passed to the next LSTM unit. The cell state at the current time step is updated as \[\label{eq:C95t} c(t) = f_{g,t} \odot c(t-1) + i_{g,t} \odot c_{g,t}.\tag{14}\] Finally, the output layer is computed as \[\label{eq:output} h_t = o_{g,t} \odot \text{tanh}\!\big(c(t)\big),\tag{15}\] where \(\odot\) denotes the Hadamard product. BiLSTM overcomes the limitation of LSTM of capturing only past information by simultaneously integrating forward and backward information flows, thereby modeling both historical context and future tendencies. Specifically, a BiLSTM consists of two LSTM networks that process the input sequence in the forward and reverse directions, respectively, enabling a more comprehensive capture of temporal dependencies. Hence, the forward output \(\overrightarrow{h_t}\) and backward output \(\overleftarrow{h_t}\) of each LSTM cell are calculated from the corresponding directional inputs via the output layer function in Eq. 15 . The output of the BiLSTM network at each time step can be represented as \[\label{eq:Concat4041} h_t = \text{Concat}(\overrightarrow{h_t}, \overleftarrow{h_t}).\tag{16}\] To fully leverage the temporal features and location information, the temporal features output by the BiLSTM are concatenated along the feature dimension with the location features obtained from a multilayer perceptron (MLP) mapping, yielding \[\label{eq:R} h'_t = \text{Concat}(h_t, R).\tag{17}\]
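The gating equations (10)–(16) can be sketched as a minimal NumPy implementation. The parameter shapes, the random initialization, and the dictionary-based parameter layout are illustrative assumptions, not the paper’s configuration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, Wr, b):
    """One LSTM step following Eqs. (10)-(15). W, Wr, b hold the input weights,
    recurrent weights, and biases of the forget/input/cell/output gates."""
    f = sigmoid(W['f'] @ x_t + Wr['f'] @ h_prev + b['f'])   # forget gate, Eq. (10)
    i = sigmoid(W['i'] @ x_t + Wr['i'] @ h_prev + b['i'])   # input gate, Eq. (11)
    g = np.tanh(W['c'] @ x_t + Wr['c'] @ h_prev + b['c'])   # candidate cell, Eq. (12)
    o = sigmoid(W['o'] @ x_t + Wr['o'] @ h_prev + b['o'])   # output gate, Eq. (13)
    c_t = f * c_prev + i * g                                # cell update, Eq. (14)
    h_t = o * np.tanh(c_t)                                  # hidden output, Eq. (15)
    return h_t, c_t

def bilstm(xs, params_fwd, params_bwd, d_h):
    """Run forward and backward passes and concatenate per time step, Eq. (16)."""
    T = len(xs)
    h, c = np.zeros(d_h), np.zeros(d_h)
    fwd = []
    for t in range(T):                       # forward direction
        h, c = lstm_cell(xs[t], h, c, *params_fwd)
        fwd.append(h)
    h, c = np.zeros(d_h), np.zeros(d_h)
    bwd = [None] * T
    for t in reversed(range(T)):             # backward direction
        h, c = lstm_cell(xs[t], h, c, *params_bwd)
        bwd[t] = h
    return [np.concatenate([f, bk]) for f, bk in zip(fwd, bwd)]
```

Each output vector has dimension \(2 d_h\), reflecting the concatenation of \(\overrightarrow{h_t}\) and \(\overleftarrow{h_t}\).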

Subsequently, to align the network’s output with the target channel dimensions, the concatenated feature is processed by a fully-connected layer that produces a real-valued vector of appropriate size. This vector is interpreted as containing the concatenated real and imaginary parts of the channel coefficients, thereby yielding the complex-valued estimate \(\hat{\mathbf{h}}_k(t)\). The final channel matrix \(\hat{\mathbf{H}}_k(t)\) is obtained by applying the requisite reshape operation to \(\hat{\mathbf{h}}_k(t)\), denoted by \(\hat{\mathbf{H}}_k(t) = \text{reshape}\left( \hat{\mathbf{h}}_k(t) \right)\). The detailed steps of the proposed channel estimation algorithm are outlined in Algorithm 3.

Figure 3: Residual Attention-based CNN-BiLSTM Channel Estimation

Model training:

The proposed channel estimation model is trained under a supervised learning paradigm. The gradients of the loss function with respect to the network parameters are computed via backpropagation. For parameter optimization, we employ the Adam optimizer, which utilizes adaptive estimates of first- and second-order moments to facilitate stable and efficient convergence toward a high-quality solution.
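A single Adam parameter update, as used for optimization here, can be sketched as follows. The hyperparameter values (\(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\)) are the commonly used defaults and are an assumption; the paper only specifies the learning rate \(\eta = 0.001\).

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: adaptive first/second-moment estimates with bias correction."""
    m = b1 * m + (1 - b1) * grad           # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2      # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```

Iterating this step over minibatch gradients of the NMSE-style loss drives the network parameters toward a low-loss solution.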

4.3 Complexity Analysis↩︎

A computational complexity analysis of the proposed model reveals that the primary costs originate from the CNN, multi-head self-attention, BiLSTM, and fully-connected layers. The initial stage involves processing the input through convolutional layers to project it into a higher-dimensional feature space. The complexity of this convolutional stage is \(\mathcal{O}(L_{\text{conv}} \cdot B \cdot H \cdot W \cdot C_{\text{in}} \cdot C_{\text{out}} \cdot k^2)\), where \(B\) is the batch size, \(L_{\text{conv}}\) is the number of convolutional layers, \(H\) and \(W\) denote the spatial dimensions of the feature maps, \(C_{\text{in}}\) and \(C_{\text{out}}\) are the input and output channels, and \(k\) represents the kernel size. The self-attention mechanism in the Transformer module is computationally intensive, with a complexity of \(\mathcal{O}(B \cdot T^2 \cdot d_{\text{model}})\), where \(T\) is the sequence length and \(d_{\text{model}}\) is the model dimension. The BiLSTM’s complexity is linear with respect to the sequence length \(T\) and quadratic with respect to the hidden state dimension \(d_h\), yielding \(\mathcal{O}(B \cdot T \cdot d_h^2)\). Finally, the total complexity of the fully-connected layers is given by \(\sum_{i=1}^{L_{\text{dense}}} \mathcal{O}(B \cdot d_{\text{in}}^{(i)} \cdot d_{\text{out}}^{(i)})\), where \(L_{\text{dense}}\) is the number of layers, and \(d_{\text{in}}^{(i)}\) and \(d_{\text{out}}^{(i)}\) denote the input and output dimensions of the \(i\)-th layer, respectively.
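The four dominant cost terms above can be tallied numerically. The helper below mirrors the big-O expressions (constant factors are omitted, as in the analysis); the dimension values used in the test are illustrative assumptions, not the paper’s exact configuration.

```python
def model_complexity(B, T, H, W, k, C_in, C_out, L_conv, d_model, d_h, dense_dims):
    """Dominant multiply counts per forward pass, mirroring the four
    big-O terms of the complexity analysis (constants omitted)."""
    conv = L_conv * B * H * W * C_in * C_out * k ** 2          # convolutional stage
    attn = B * T ** 2 * d_model                                # self-attention
    bilstm = B * T * d_h ** 2                                  # BiLSTM recurrence
    dense = sum(B * d_in * d_out for d_in, d_out in dense_dims)  # FC layers
    return {'conv': conv, 'attention': attn, 'bilstm': bilstm, 'dense': dense}
```

For example, with illustrative values \(B=1\), \(T=8\), \(H=W=16\), \(k=3\), \(C_{\text{in}}=C_{\text{out}}=64\), \(L_{\text{conv}}=3\), \(d_{\text{model}}=d_h=64\), and one dense layer of size \(128\times 64\), the convolutional stage dominates by several orders of magnitude.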

5 Simulation Results↩︎

In this section, we present numerical simulations to evaluate the performance of the proposed channel estimation scheme. The simulation encompasses both near-field and far-field scenarios, with the UAV and UE following predefined trajectories. The dataset is randomly split into an 80% training set and a 20% validation set. Key simulation parameters are listed in Table 2.

Table 2: Simulation parameters and values

| Symbol | Meaning | Value |
| --- | --- | --- |
| \(N\) | number of antennas | 16 |
| \(f\) | carrier frequency | 3.5 GHz |
| \(\lambda\) | wavelength | 0.0857 m |
| \(d_F\) | Rayleigh distance | 9.6 m |
| \(Q\) | pilot length | 8 [14] |
| \(v_U\) | speed of the UAV | 5 m/s |
| \(v_k\) | speed of the UE | 5 m/s |
| \(L_f\) | number of far-field paths | 3 |
| \(L_n\) | number of near-field paths | 3 |
| \(\eta\) | learning rate | 0.001 |
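As a consistency check on Table 2, the Rayleigh distance follows directly from the listed parameters. The sketch below assumes a uniform linear array with half-wavelength element spacing and aperture \(D = (N-1)\lambda/2\), which reproduces \(d_F \approx 9.6\) m; the spacing convention is an assumption, since the array geometry is specified elsewhere in the paper.

```python
# Consistency check of Table 2 (half-wavelength ULA spacing is assumed).
c = 3e8                    # speed of light, m/s
f = 3.5e9                  # carrier frequency, Hz
N = 16                     # number of antennas
lam = c / f                # wavelength, ~0.0857 m
D = (N - 1) * lam / 2      # array aperture for a half-wavelength ULA, m
d_F = 2 * D ** 2 / lam     # Rayleigh (Fraunhofer) distance, ~9.6 m
```

At ranges below \(d_F\), the spherical-wavefront (near-field) model applies; beyond it, the planar-wavefront (far-field) approximation holds.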

5.1 Baseline frameworks↩︎

To benchmark the performance of our proposed convolutional attention-based BiLSTM network, we conduct a comprehensive comparison with several established DL channel estimators. To this end, the following baseline methods are selected for performance evaluation:

  • CNN [9]: A classic CNN that extracts spatial features from channel data through convolution operations, enabling the model to capture spatial information from the input signal.

  • RACNN [22]: A CNN that combines residual connections with the Transformer self-attention mechanism. It focuses on key regions within the feature map, enhancing the model’s ability to attend to important information.

  • CNN-LSTM [10]: This model combines the spatial feature extraction capability of CNN with the temporal modeling strength of LSTM. It first extracts spatial features from the input signal using CNN, then captures temporal dependencies in the signal using LSTM.

  • DNN-BiLSTM [15]: This model leverages the strengths of DNN and BiLSTM. DNN performs deep learning of nonlinear features through multiple fully connected layers, while BiLSTM captures both forward and backward temporal dependencies of the signal.

  • CBAM-BiLSTM [18]: A model combining the CBAM attention mechanism with BiLSTM. CBAM is a lightweight attention mechanism that improves feature representation through spatial attention (SA) and channel attention (CA), which are then used in conjunction with BiLSTM to model temporal dependencies.

  • RACNN-BiLSTM: The proposed fusion architecture of convolutional attention and BiLSTM, which incorporates position information as auxiliary input. The residual CNN integrates the Transformer multi-head self-attention mechanism to enhance spatial feature representation, and BiLSTM is employed to model the temporal evolution of the channel.

5.2 Performance evaluation↩︎

Fig. 4 illustrates the variation of training loss and testing loss with respect to the number of training epochs during the training process of the proposed RACNN-BiLSTM method. As shown in the figure, both the training loss and test loss decrease rapidly in the early stages, indicating that the model successfully learns effective features in the initial phases. After approximately 25 epochs, the training loss stabilizes and remains at a relatively low level. Simultaneously, the test loss follows a similar trend, gradually converging and remaining consistent with the training loss, suggesting that the model does not experience overfitting and is able to maintain strong performance on the test set. Overall, the model demonstrates good convergence behavior, achieving low loss values in a relatively short training period, which highlights the effectiveness and stability of the proposed method.

Figure 4: RACNN-BiLSTM training vs. testing Loss

To evaluate the estimation performance of the proposed method, Fig. 5 shows the NMSE performance curves of different algorithms under varying SNR conditions for both far-field and near-field scenarios. Under low SNR conditions, strong noise interference and low signal quality degrade the performance of all algorithms. As the SNR increases, the NMSE of all algorithms decreases, indicating that a higher SNR effectively mitigates noise interference and thereby improves channel estimation accuracy. The proposed RACNN-BiLSTM algorithm consistently exhibits superior performance in both scenarios, achieving an average NMSE reduction of at least 30.25% in near-field and far-field scenarios at an SNR of 15 dB. In contrast, while RACNN also employs a residual attention mechanism, it lacks temporal modeling capability. The performance of CBAM-BiLSTM is inferior to that of RACNN-BiLSTM, primarily due to the differences in the attention mechanisms used and the lack of position-aided information. CNN, CNN-LSTM, and DNN-BiLSTM, which do not incorporate attention or residual structures, rely primarily on convolutional or fully connected networks for feature extraction. These models struggle to effectively capture key information in complex multipath and spatially correlated environments. Especially in near-field channels, where the spherical wave effects and spatial variations are more intricate, models lacking attention mechanisms fail to accurately capture spatial correlations, leading to inferior performance.
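For reference, the NMSE metric reported in these figures can be computed as follows. This is a minimal sketch; the exact averaging convention over test samples is an assumption.

```python
import numpy as np

def nmse_db(H_true, H_est):
    """Normalized MSE in dB: ||H - H_hat||_F^2 / ||H||_F^2 for one channel sample."""
    err = np.linalg.norm(H_true - H_est) ** 2   # Frobenius norm of the error
    ref = np.linalg.norm(H_true) ** 2           # channel energy
    return 10 * np.log10(err / ref)
```

For instance, an estimate off by a uniform 10% amplitude error corresponds to an NMSE of \(-20\) dB.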

Fig. 6 illustrates the impact of different pilot lengths on NMSE performance in the near-field scenario with SNR of 10 dB. When the pilot length is 2, the channel estimation accuracy is limited by the scarcity of pilot information, resulting in relatively high NMSE values for all models. In this case, although the RACNN-BiLSTM model possesses strong spatiotemporal modeling capabilities, it cannot fully leverage its advantages due to the lack of sufficient pilot data. As the pilot length increases, the NMSE values of all models gradually decrease, indicating a significant improvement in channel estimation accuracy. Notably, the RACNN-BiLSTM model consistently maintains the lowest NMSE values across the higher pilot length range, demonstrating superior channel estimation performance compared to other methods.

a
b

Figure 5: NMSE vs. SNR for far-field and near-field. a — far-field: NMSE vs. SNR, b — near-field: NMSE vs. SNR

Figure 6: near-field: NMSE vs. pilot lengths
a
b
c

Figure 7: far-field: Comparison of the performance of different algorithms with respect to the number of transmit antennas, different user speeds (\(5~\text{m/s}\) and \(30~\text{m/s}\)), and the BER and NMSE performance. a — Number of transmitting antennas, b — User speeds, c — BER vs. SNR

a
b
c

Figure 8: near-field: Comparison of the performance of different algorithms with respect to the number of transmit antennas, different user speeds (\(5~\text{m/s}\) and \(30~\text{m/s}\)), and the BER and NMSE performance. a — Number of transmitting antennas, b — User speeds, c — BER vs. SNR

In Fig. 7 (a) and Fig. 8 (a), we present the NMSE performance of different algorithms under varying numbers of transmit antennas {8, 16, 32, 64, 128} in the far-field and near-field scenarios, respectively, with a fixed SNR of 10 dB. As shown in the figures, the NMSE of channel estimation gradually increases with the number of transmit antennas. This is because increasing the number of antennas requires the model to estimate more channel coefficients, leading to higher errors and increased computational complexity. However, the proposed RACNN-BiLSTM algorithm consistently achieves the lowest NMSE, demonstrating its effectiveness in channel estimation even under higher complexity.

Fig. 7 (b) and Fig. 8 (b) compare the performance of different channel estimation algorithms in the far-field and near-field scenarios, respectively, at various UE moving speeds. As the SNR increases, the NMSE of all algorithms exhibits a monotonic decrease, confirming the robustness of the models under reduced noise conditions. At the same SNR, the curve for \(5~\text{m/s}\) shows a lower NMSE, indicating that relatively slower user speeds lead to stronger correlation between historical channel states, making it easier to leverage this information to improve estimation accuracy. In contrast, the rapid temporal variation at \(30~\text{m/s}\) reduces the effectiveness of temporal information. Our proposed RACNN-BiLSTM algorithm performs the best, followed closely by CBAM-BiLSTM, highlighting the advantages of the attention mechanism and temporal modeling. CNN-LSTM and DNN-BiLSTM exhibit mid-range performance, while CNN and RACNN, which do not incorporate temporal modeling, perform the worst, underscoring the importance of temporal modeling in dynamic channel estimation.

The bit error rate (BER) performance is an important indicator for evaluating how a channel estimation method impacts the overall system performance. Fig. 7 (c) and Fig. 8 (c) compare the BER performance of different algorithms with a user speed of \(v_k = 10~\text{m/s}\) in the far-field and near-field scenarios, respectively. As the SNR increases, the BER of all models gradually decreases, demonstrating the significant impact of signal quality on model performance. Among them, the proposed RACNN-BiLSTM model exhibits the BER performance closest to that of perfect CSI, indicating its superior adaptability and accuracy in complex channel environments. RACNN-BiLSTM achieves an average performance gain of 5.0 dB over CNN and 1.7 dB over CBAM-BiLSTM across the tested SNR range.

a
b
c

Figure 9: far-field: Comparison of the performance of different algorithms with respect to the number of historical time slots (T_history) and the number of paths. a — T_history, b — \(L_f = 4\), c — \(L_f = 8\)

a
b
c

Figure 10: near-field: Comparison of the performance of different algorithms with respect to the number of historical time slots (T_history) and the number of paths. a — T_history, b — \(L_n = 4\), c — \(L_n = 8\)

The number of historical time slots refers to the quantity of past time slots utilized by the model in channel estimation, which determines the maximum temporal dependency range that the model can capture. Fig. 9 (a) and Fig. 10 (a) illustrate the performance comparison of different algorithms for channel estimation with varying historical time slot lengths, when the SNR is set to 10 dB and the user speed is  . When the number of historical time slots increases from 2 to 6, the performance of all models improves significantly. This indicates that as the number of historical time slots increases, the models can capture more prior information about the channel state, allowing them to better identify temporal variations in the channel and improve the accuracy of channel estimation. In the range of 6 to 10 historical time slots, the rate of NMSE decrease slows down significantly, and performance improvement tends to plateau. This is because, in a dynamic environment, as time progresses, channel variations become more pronounced, and information from earlier historical time slots may no longer effectively reflect the current channel state. When the number of historical time slots exceeds 10, CNN, RACNN, and CNN-LSTM exhibit slight performance degradation, indicating that an excessive number of historical time slots may cause the model to lag in response and introduce noise, making it unable to adapt to rapid channel changes, while also lacking more effective temporal modeling capabilities. Our proposed RACNN-BiLSTM method continues to show lower NMSE as the number of historical time slots increases. When the number of historical time slots is 8, the proposed method achieves an average NMSE reduction of at least 91% and 27.6% compared to CNN and CBAM-BiLSTM, respectively, in both near-field and far-field scenarios. 
These results suggest that in dynamic channel estimation, appropriately utilizing the length of historical information strikes an optimal balance between information gain and computational complexity. However, when the number of historical time slots becomes excessive, the model may face redundancy and overfitting issues, leading to a decline in performance.

To investigate the impact of multipath richness on channel estimation performance, we conducted comparative experiments in two scenarios with different numbers of paths (4 and 8, i.e., \(L_f\) in the far field and \(L_n\) in the near field). As illustrated in Fig. 9 (b), 9 (c), 10 (b), and 10 (c), the NMSE for 4 paths is on average 1.2 dB lower than for 8 paths across the whole SNR range, indicating that when the number of paths is higher, the channel environment becomes more complex, potentially leading to higher estimation errors due to path interference and signal distortion, resulting in a higher NMSE. Among all schemes, RACNN-BiLSTM and CBAM-BiLSTM yield the lowest errors, confirming the benefit of combining attention mechanisms with BiLSTM. Moreover, the consistent superiority of RACNN-BiLSTM demonstrates that its multi-head self-attention, by modeling long-range dependencies in parallel subspaces, better captures the latent time-frequency structures of dense multipath channels than the sequential channel-spatial attention used in CBAM, leading to higher estimation accuracy and stronger robustness.

6 Conclusion↩︎

In this paper, we propose a DL channel estimation method aimed at addressing the dynamic channel estimation problem in UAV-assisted communication systems. This method integrates CNNs for spatial feature extraction, BiLSTM for temporal modeling, and introduces a residual self-attention mechanism to enhance the representation of key channel features. Furthermore, integrating the location information of both communication parties into the network facilitates the model’s ability to effectively differentiate between the characteristics of near-field and far-field environments, thereby enhancing the accuracy of channel estimation. Extensive simulations demonstrate that the proposed RACNN-BiLSTM model exhibits superior estimation accuracy under both near-field and far-field conditions, benefiting from the advantages of location information and the attention mechanism. Compared to other schemes, the location-based channel estimation method improves performance by at least 36.3% and 24.2% in far-field and near-field scenarios, respectively, enabling better adaptation to dynamic environmental changes and enhancing channel estimation accuracy.

References↩︎

[1]
N. Zhao, W. Lu, M. Sheng, Y. Chen, J. Tang, F. R. Yu, and K.-K. Wong, “Uav-assisted emergency networks in disasters,” IEEE Wireless Communications, vol. 26, no. 1, pp. 45–51, 2019.
[2]
D. Yin, X. Yang, H. Yu, S. Chen, and C. Wang, “An air-to-ground relay communication planning method for uavs swarm applications,” IEEE Transactions on Intelligent Vehicles, vol. 8, no. 4, pp. 2983–2997, 2023.
[3]
Y. Byun, H. Kim, S. Kim, and B. Shim, “Channel estimation and phase shift control for uav-carried ris communication systems,” IEEE Transactions on Vehicular Technology, vol. 72, no. 10, pp. 13 695–13 700, 2023.
[4]
M. Cui and L. Dai, “Channel estimation for extremely large-scale mimo: Far-field or near-field?” IEEE transactions on communications, vol. 70, no. 4, pp. 2663–2677, 2022.
[5]
J. Rodrı́guez-Fernández, N. González-Prelcic, K. Venugopal, and R. W. Heath, “Frequency-domain compressive channel estimation for frequency-selective hybrid millimeter wave mimo systems,” IEEE Transactions on Wireless Communications, vol. 17, no. 5, pp. 2946–2960, 2018.
[6]
S. Yang, H. Chen, W. Liu, X.-P. Zhang, and C. Yuen, “Near-field channel estimation and localization: Recent developments, cooperative integration, and future directions,” IEEE Signal Processing Magazine, vol. 42, no. 1, pp. 60–73, 2025.
[7]
Z. Liu, L. Zhang, and Z. Ding, “Overcoming the channel estimation barrier in massive mimo communication via deep learning,” IEEE Wireless Communications, vol. 27, no. 5, pp. 104–111, 2020.
[8]
Q. Hu, F. Gao, H. Zhang, S. Jin, and G. Y. Li, “Deep learning for channel estimation: Interpretation, performance, and comparison,” IEEE Transactions on Wireless Communications, vol. 20, no. 4, pp. 2398–2412, 2020.
[9]
P. Jiang, C.-K. Wen, S. Jin, and G. Y. Li, “Dual cnn-based channel estimation for mimo-ofdm systems,” IEEE Transactions on Communications, vol. 69, no. 9, pp. 5859–5872, 2021.
[10]
C. Nguyen, T. M. Hoang, and A. A. Cheema, “Channel estimation using cnn-lstm in ris-noma assisted 6g network,” IEEE Transactions on Machine Learning in Communications and Networking, vol. 1, pp. 43–60, 2023.
[11]
W. Kim, Y. Ahn, J. Kim, and B. Shim, “Towards deep learning-aided wireless channel estimation and channel state information feedback for 6g,” Journal of Communications and Networks, vol. 25, no. 1, pp. 61–75, 2023.
[12]
C. Xu, J. An, T. Bai, S. Sugiura, R. G. Maunder, Z. Wang, L.-L. Yang, and L. Hanzo, “Channel estimation for reconfigurable intelligent surface assisted high-mobility wireless systems,” IEEE Transactions on Vehicular Technology, vol. 72, no. 1, pp. 718–734, 2022.
[13]
Z. Zhu, R. Yang, C. Li, Y. Huang, and L. Yang, “Adaptive joint sparse bayesian approaches for near-field channel estimation,” IEEE Transactions on Wireless Communications, 2025.
[14]
S. Jang and C. Lee, “Neural network-aided near-field channel estimation for hybrid beamforming systems,” IEEE Transactions on Communications, vol. 72, no. 11, pp. 6768–6782, 2024.
[15]
J. Yu, X. Liu, Y. Gao, C. Zhang, and W. Zhang, “Deep learning for channel tracking in irs-assisted uav communication systems,” IEEE Transactions on Wireless Communications, vol. 21, no. 9, pp. 7711–7722, 2022.
[16]
M. H. Rahman, M. A. S. Sejan, M. A. Aziz, Y.-H. You, and H.-K. Song, “Hydnn: A hybrid deep learning framework based multiuser uplink channel estimation and signal detection for noma-ofdm system,” IEEE Access, vol. 11, pp. 66 742–66 755, 2023.
[17]
Y. Liao, Y. Hua, and Y. Cai, “Deep learning based channel estimation algorithm for fast time-varying mimo-ofdm systems,” IEEE Communications Letters, vol. 24, no. 3, pp. 572–576, 2019.
[18]
X. Zheng, Z. Liu, S. Cheng, Y. Wu, Y. Chen, and Q. Zhang, “Channel estimation for reconfigurable intelligent surface-aided millimeter-wave massive multiple-input multiple-output system with deep residual attention network,” ETRI Journal, vol. 47, no. 3, pp. 480–491, 2025.
[19]
W. Gao, W. Zhang, L. Liu, and M. Yang, “Deep residual learning with attention mechanism for ofdm channel estimation,” IEEE Wireless Communications Letters, vol. 14, no. 2, pp. 250–254, 2022.
[20]
Z. Chen, F. Gu, and R. Jiang, “Channel estimation method based on transformer in high dynamic environment,” in 2020 International Conference on Wireless Communications and Signal Processing (WCSP). IEEE, 2020, pp. 817–822.
[21]
J. Guo, G. Liu, Q. Wu, and P. Fan, “Parallel attention-based transformer for channel estimation in ris-aided 6g wireless communications,” IEEE Transactions on Vehicular Technology, 2024.
[22]
V. T. Lam, D. H. Son, T. T. T. Quynh, and L. T. Thanh, “Racnn: Residual attention convolutional neural network for near-field channel estimation in 6g wireless communications,” arXiv preprint arXiv:2503.02299, 2025.
[23]
J. Chen, Y.-C. Liang, H. V. Cheng, and W. Yu, “Channel estimation for reconfigurable intelligent surface aided multi-user mmwave mimo systems,” IEEE Transactions on Wireless Communications, vol. 22, no. 10, pp. 6853–6869, 2023.
[24]
H. Chung, S. Hong, and S. Kim, “Efficient multi-user channel estimation for ris-aided mmwave systems using shared channel subspace,” IEEE Transactions on Wireless Communications, vol. 23, no. 8, pp. 8512–8527, 2024.
[25]
G. Zhou, Z. Peng, C. Pan, and R. Schober, “Individual channel estimation for ris-aided communication systems—a general framework,” IEEE Transactions on Wireless Communications, vol. 23, no. 9, pp. 12 038–12 053, 2024.
[26]
O. Rekik, K. N. Aliyu, B. M. Tuan, K. Abed-Meraim, and N. L. Trung, “Fast subspace-based blind and semi-blind channel estimation for mimo-ofdm systems,” IEEE Transactions on Wireless Communications, vol. 23, no. 8, pp. 10 247–10 257, 2024.
[27]
Z. Xiao, S. Cao, L. Zhu, Y. Liu, B. Ning, X.-G. Xia, and R. Zhang, “Channel estimation for movable antenna communication systems: A framework based on compressed sensing,” IEEE Transactions on Wireless Communications, vol. 23, no. 9, pp. 11 814–11 830, 2024.
[28]
C. Xu, J. An, T. Bai, L. Xiang, S. Sugiura, R. G. Maunder, L.-L. Yang, and L. Hanzo, “Reconfigurable intelligent surface assisted multi-carrier wireless systems for doubly selective high-mobility ricean channels,” IEEE Transactions on Vehicular Technology, vol. 71, no. 4, pp. 4023–4041, 2022.
[29]
H. Lei, J. Zhang, Z. Wang, B. Ai, and E. Bjornson, “Near-field user localization and channel estimation for xl-mimo systems: Fundamentals, recent advances, and outlooks,” IEEE Wireless Communications, 2025.
[30]
Y. Lu and L. Dai, “Near-field channel estimation in mixed los/nlos environments for extremely large-scale mimo systems,” IEEE Transactions on Communications, vol. 71, no. 6, pp. 3694–3707, 2023.
[31]
X. Zhang, H. Zhang, and Y. C. Eldar, “Near-field sparse channel representation and estimation in 6g wireless communications,” IEEE Transactions on Communications, vol. 72, no. 1, pp. 450–464, 2023.
[32]
S. Yang, C. Xie, W. Lyu, B. Ning, Z. Zhang, and C. Yuen, “Near-field channel estimation for extremely large-scale reconfigurable intelligent surface (XL-RIS)-aided wideband mmWave systems,” IEEE Journal on Selected Areas in Communications, vol. 42, no. 6, pp. 1567–1582, 2024.
[33]
J. N. Pisharody, A. Rajoriya, and R. Budhiraja, “Near-field channel estimation for XL-MIMO systems using variational Bayesian learning,” IEEE Transactions on Wireless Communications, vol. 23, no. 9, pp. 10740–10756, 2024.
[34]
M. Haghshenas, P. Ramezani, M. Magarini, and E. Björnson, “Parametric channel estimation with short pilots in RIS-assisted near- and far-field communications,” IEEE Transactions on Wireless Communications, vol. 23, no. 8, pp. 10366–10382, 2024.
[35]
K. Mei, J. Liu, X. Zhang, N. Rajatheva, and J. Wei, “Performance analysis on machine learning-based channel estimation,” IEEE Transactions on Communications, vol. 69, no. 8, pp. 5183–5193, 2021.
[36]
A. K. Gizzini and M. Chafii, “A survey on deep learning based channel estimation in doubly dispersive environments,” IEEE Access, vol. 10, pp. 70595–70619, 2022.
[37]
A. K. Gizzini, M. Chafii, A. Nimr, R. M. Shubair, and G. Fettweis, “CNN aided weighted interpolation for channel estimation in vehicular communications,” IEEE Transactions on Vehicular Technology, vol. 70, no. 12, pp. 12796–12811, 2021.
[38]
C. Luo, J. Ji, Q. Wang, X. Chen, and P. Li, “Channel state information prediction for 5G wireless communications: A deep learning approach,” IEEE Transactions on Network Science and Engineering, vol. 7, no. 1, pp. 227–236, 2018.
[39]
L. Cai, G. Xu, Q. Zhang, Z. Song, and W. Zhang, “Deep learning based channel estimation for deep-space communications,” IEEE Transactions on Vehicular Technology, 2025.
[40]
Y. Liao, Y. Hua, X. Dai, H. Yao, and X. Yang, “ChanEstNet: A deep learning based channel estimation for high-speed scenarios,” in ICC 2019 - 2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–6.
[41]
T. Zhou, H. Zhang, B. Ai, C. Xue, and L. Liu, “Deep-learning-based spatial–temporal channel prediction for smart high-speed railway communication networks,” IEEE Transactions on Wireless Communications, vol. 21, no. 7, pp. 5333–5345, 2022.
[42]
J. Yuan, H. Q. Ngo, and M. Matthaiou, “Machine learning-based channel prediction in massive MIMO with channel aging,” IEEE Transactions on Wireless Communications, vol. 19, no. 5, pp. 2960–2973, 2020.
[43]
X. Zheng and V. K. Lau, “Simultaneous learning and inferencing of DNN-based mmWave massive MIMO channel estimation in IoT systems with unknown nonlinear distortion,” IEEE Internet of Things Journal, vol. 9, no. 1, pp. 783–799, 2021.
[44]
J. Zhang, X. Ma, J. Qi, and S. Jin, “Designing tensor-train deep neural networks for time-varying MIMO channel estimation,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 3, pp. 759–773, 2021.
[45]
L. Guo, P. Gu, J. Zou, G. Liu, and F. Shu, “DNN-based fractional Doppler channel estimation for OTFS modulation,” IEEE Transactions on Vehicular Technology, vol. 72, no. 11, pp. 15062–15067, 2023.
[46]
Y. Yang, F. Gao, X. Ma, and S. Zhang, “Deep learning-based channel estimation for doubly selective fading channels,” IEEE Access, vol. 7, pp. 36579–36589, 2019.
[47]
J. Gao, C. Zhong, G. Y. Li, J. B. Soriaga, and A. Behboodi, “Deep learning-based channel estimation for wideband hybrid mmWave massive MIMO,” IEEE Transactions on Communications, vol. 71, no. 6, pp. 3679–3693, 2023.
[48]
T. Sun, J. Lv, and T. Zhou, “A transformer-based channel estimation method for OTFS systems,” Entropy, vol. 25, no. 10, p. 1423, 2023.
[49]
H. Ju, S. Jeong, B. Lee, and B. Shim, “Transformer-based predictive channel estimation for mmWave massive MIMO systems,” in 2024 IEEE 100th Vehicular Technology Conference (VTC2024-Fall). IEEE, 2024, pp. 1–5.

  1. This work is supported by the National Natural Science Foundation of China under Grant 62301076, the Fundamental Research Funds for the Central Universities under Grant 24820232023YQTD01, the National Natural Science Foundation of China under Grants 62341101 and 62321001, the Beijing Municipal Natural Science Foundation under Grant L232003, and the National Key Research and Development Program of China under Grant 2022YFB4300403.↩︎

  2. W. Yuan is with the School of Computer Science, Qufu Normal University, Rizhao, P.R. China. E-mail: Ywl1104@163.com;↩︎

  3. K. Yu (the corresponding author) is with the Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, 100876, P.R. China. E-mail: kanyu1108@126.com;↩︎

  4. X. Liu is with the School of Computer Science, Qufu Normal University, Rizhao, P.R. China. E-mail: liuxw@qfnu.edu.cn;↩︎

  5. Q. Zhang is with the Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, 100876, P.R. China. E-mail: zhangqixun@bupt.edu.cn;↩︎

  6. K. Li is with the School of Computer Science, Qufu Normal University, Rizhao, P.R. China. E-mail: lkx0311@126.com;↩︎

  7. Z. Feng is with the Key Laboratory of Universal Wireless Communications, Ministry of Education, Beijing University of Posts and Telecommunications, Beijing, 100876, P.R. China. E-mail: fengzy@bupt.edu.cn.↩︎