Speech-based automatic estimation of depression levels is essential for enabling early detection and timely intervention, particularly in resource-constrained mental health settings. In recent years, deep learning has demonstrated impressive success across various domains, including affective computing and mental health assessment. Most existing approaches rely on RNN-based architectures (such as LSTM and GRU) to model temporal information for depression estimation. However, the extracted features often emphasize only a few adjacent speech segments, limiting their ability to capture long-range dependencies. To overcome this limitation, we introduce a memory-based feature augmentation method that enhances the representational capacity of GRU-extracted features. Rather than indiscriminately incorporating historical data, our memory bank is designed to selectively integrate two types of components in order to reduce redundancy and irrelevance: (1) historical temporal features that closely resemble the current GRU output, offering complementary contextual information; and (2) dynamic memory features identified based on feature variability, which capture behavioral and emotional fluctuations indicative of depressive symptoms. To effectively fuse the memory-augmented features with GRU outputs, we further design a Hierarchical Attention Fusion (HAF) module. Our method is evaluated on the widely used DAIC-WOZ and E-DAIC datasets, achieving state-of-the-art performance.
Non-Hermitian (NH) topology has been extensively explored in wave and matter systems, typically relying on the routing of complex, non-reciprocal couplings in physical space. This work demonstrates the experimental realization of programmable NH topological phases within decentralized multi-robot networks. By digitally programming non-reciprocal interaction rules and establishing real-time state exchange among active robots, we observe emergent topological zero modes (TZMs) and NH skin effects in synthetic lattices spanning one to three dimensions. Dynamically tailoring non-reciprocal parameters enables the precise morphing of TZMs between localized and delocalized states, establishing a versatile framework for topological mode engineering across dimensionalities. This platform establishes multi-robot networks as highly reconfigurable systems for exploring non-equilibrium topological physics, while paving the way for topologically protected, robust collective behaviors in active matter.
Automatic speech recognition systems have been shown to underperform when transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms without becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.
Skin cancer is among the most prevalent malignancies worldwide, and its early detection is essential for improving patient survival and reducing treatment costs. Conventional dermoscopic and visual imaging techniques are primarily limited to the visible spectrum and often fail to capture subtle spectral signatures associated with early-stage malignancies. This study proposes an innovative framework that integrates a multispectral metasurface for imaging with a hybrid deep learning architecture based on Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The designed metasurface enables noninvasive acquisition of rich spectral information highly sensitive to tissue alterations, while the hybrid CNN-ViT model simultaneously extracts local and global features to robustly classify skin lesions. Simulation-based evaluations demonstrate that the proposed method achieves approximately 98% accuracy, 95% sensitivity, and 99% specificity, surpassing conventional RGB-based and single-architecture approaches. Qualitative analyses using attention maps reveal that the model focuses on clinically relevant lesion regions, improving interpretability. Overall, the results indicate that combining metasurface-based multispectral imaging with hybrid deep learning can introduce a new generation of diagnostic tools in dermatology and pave the way for portable, fast, and highly accurate clinical systems.
The integration of green hydrogen in local energy markets is often analyzed from a technical flexibility perspective, while the effect of market design rules remains less explored. This paper proposes a coordinated local electricity-hydrogen market framework in which hydrogen participation is regulated by explicit renewable access mechanisms. A mixed-integer linear programming model is developed to co-optimize electricity trading, battery operation, wind allocation, and hydrogen production under centralized coordination. Six regulatory cases are examined, covering hydrogen supply options and access to local wind. Results are obtained for representative seasonal weeks for a Norwegian energy community. The electrolyzer, when connected as a rigid load, increases grid dependence, but also improves system cost when price-based participation is activated. Direct renewable access reduces grid imports, enhances wind allocation, and introduces competition with households for energy distribution and system cost optimization. Furthermore, the findings show that (i) hydrogen integration in local energy systems is essentially a market design problem and (ii) renewable access rules critically determine system behaviour, flexibility interactions, and seasonal performance.
Carrier-phase (CP) ranging is a key enabler of high-precision positioning in modern wireless systems. In multi-frequency OFDM-based sensing, phase observations across subcarriers provide information about the underlying propagation geometry. However, in realistic industrial and urban environments, these observations exhibit non-Gaussian and asymmetric characteristics due to deterministic multipath components, violating standard circular statistical assumptions. In this work, we analyze CP-based ranging as an estimation problem over circular phase observations. We show that conventional model-based estimators, such as circular averaging under von Mises assumptions, become biased under 3GPP-compliant propagation conditions. Using a QuaDRiGa-based simulation framework, we evaluate empirical phase distributions in Industrial Factory (InF) and Urban Microcell (UMi) scenarios and quantify their deviation from classical statistical models. To address these limitations, we propose a learning-based estimator that operates directly on empirical phase distributions without assuming a predefined statistical model. Experimental results show improved accuracy compared to classical estimators, particularly under multipath conditions.
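The circular-averaging estimator mentioned in the abstract above can be illustrated with a minimal sketch. The function name `circular_mean` and the toy phase values are assumptions made for illustration and do not come from the paper; the second data set simply adds an off-centre cluster to mimic a deterministic multipath component.

```python
import cmath
import math

def circular_mean(phases):
    """Circular mean: the argument of the sum of unit phasors exp(j*phase)."""
    resultant = sum(cmath.exp(1j * p) for p in phases)
    return cmath.phase(resultant)

# Symmetric scatter around a true phase of 1.0 rad: the estimate is unbiased.
symmetric = [0.9, 1.0, 1.1]

# Hypothetical asymmetric case: an extra cluster at 2.0 rad stands in for a
# deterministic multipath component and pulls the estimate away from 1.0 rad.
asymmetric = [0.9, 1.0, 1.1, 2.0, 2.0]

estimate_symmetric = circular_mean(symmetric)
estimate_asymmetric = circular_mean(asymmetric)
bias = estimate_asymmetric - 1.0  # positive: estimate is dragged toward 2.0 rad
```

Averaging phasors rather than raw angles keeps the estimate well defined across the wrap at ±π, but, as the toy data suggests, it offers no protection against asymmetric contamination.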
A single-radio-frequency (RF) movable array is investigated, in which all movable elements are driven by a single RF chain with equal amplitude and equal phase. The achievable beamforming gain enabled by antenna placement is analyzed. Linear beamforming gain scaling with the number of antennas is shown to be achievable in single-path channels, while coherent-combining conditions and aperture requirements are established for multipath channels. For multiuser transmission, the optimal max-min power allocation is derived in closed form, based on which an element-wise coordinate-search algorithm is developed for antenna placement design. Numerical results validate the analysis and reveal a fundamental tradeoff: beamforming gains can be achieved through antenna placement alone, but only at the expense of increased aperture resources.
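A small numerical sketch may help make the single-path result above concrete. It assumes a far-field plane-wave model with per-element response exp(j·2π·x·sin θ/λ) and equal-amplitude, equal-phase excitation; this model, the helper names, and the chosen angle are illustrative assumptions, not the paper's formulation.

```python
import cmath
import math

def array_gain(positions, theta, wavelength=1.0):
    """Gain |sum_n exp(j*2*pi*x_n*sin(theta)/lambda)|^2 / N with equal weights."""
    n = len(positions)
    combined = sum(
        cmath.exp(2j * math.pi * x * math.sin(theta) / wavelength)
        for x in positions
    )
    return abs(combined) ** 2 / n

def coherent_positions(n, theta, wavelength=1.0):
    """Place elements so that every path phase is a multiple of 2*pi."""
    spacing = wavelength / abs(math.sin(theta))
    return [k * spacing for k in range(n)]

theta = math.radians(30.0)                      # assumed path direction
n_elements = 8
placed = coherent_positions(n_elements, theta)  # spacing of about 2 wavelengths
fixed = [0.5 * k for k in range(n_elements)]    # fixed half-wavelength array

gain_placed = array_gain(placed, theta)         # close to N = 8
gain_fixed = array_gain(fixed, theta)           # close to 0 at this angle
aperture_placed = placed[-1] - placed[0]        # about 14 wavelengths
aperture_fixed = fixed[-1] - fixed[0]           # 3.5 wavelengths
```

Under these assumptions the placed array reaches a gain equal to the number of elements without any phase shifters, but only by spreading over a much larger aperture than the fixed array, which mirrors the tradeoff stated in the abstract.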
The increasing occurrence of small-signal instability, particularly sub-synchronous oscillations (SSOs), in power systems with a high penetration of inverter-based resources (IBRs) has made the planning of new IBR connections increasingly important and challenging. The impact of such connections on small-signal stability is not always straightforward, as it strongly depends on the connection location, inverter operating mode, control configuration, parametrisation, and operating conditions. This paper proposes an inverter connection screening tool (ICST) that enables efficient and accurate assessment of the impact of prospective inverter configurations on small-signal system strength. It can identify, among the candidates considered, the most suitable inverter configuration for a given connection location that avoids degrading small-signal system strength and can also enhance it. As a result, higher IBR penetration can be supported while maintaining small-signal stability. The ICST evaluates candidate inverter configurations using their admittances at critical modal frequencies, along with the system's admittance spectrum, thereby avoiding the need for analytical models. The ICST-based planning procedure, which can support system operators, asset owners, and IBR developers in decision-making across different stages of planning studies, is demonstrated using a modified IEEE 57-bus system. Comparisons with model-based studies demonstrate the accuracy of the ICST in predicting the modal impact of inverter connections and its effectiveness in selecting suitable inverter control configurations.
As renewable energy systems expand, inverter availability becomes increasingly important for grid reliability and economics, yet photovoltaic inverter repair logistics remain under-modeled. This paper presents an event-driven Monte Carlo framework for a centralized repair facility with parallel production lines, capturing the full repair cycle from administrative pre-wait and transport to health-driven repair and return-to-inventory. The model incorporates opportunistic scheduling that uses mandatory hold periods to insert additional units onto temporarily idle lines, improving throughput without added capacity. Stage durations are represented by a two-component VaR-style mixture distribution for routine and heavy-tailed delays, while a continuous health score determines repair completion. Calibrated by minimizing the one-dimensional Wasserstein distance between simulated and empirical repair-duration distributions, the model is applied to 43 field-observed repairs, reproducing the empirical bimodal structure with a Wasserstein distance of 53.3 days. Results show that 51.2% of units are accommodated through opportunistic insertion, indicating that hold periods provide a significant recoverable scheduling resource.
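The calibration idea described above (sample stage durations from a two-component mixture, then pick parameters that minimize the one-dimensional Wasserstein distance to observed durations) can be sketched as follows. The component families (exponential for routine delays, Pareto for heavy-tailed delays), all parameter values, and the function names are assumptions for illustration; the abstract does not specify them.

```python
import random

def sample_stage_duration(rng, p_tail, routine_mean, tail_scale, tail_shape):
    """Two-component mixture: heavy-tailed Pareto delay with probability p_tail,
    otherwise a routine exponential delay (both families are assumptions)."""
    if rng.random() < p_tail:
        return tail_scale * rng.paretovariate(tail_shape)
    return rng.expovariate(1.0 / routine_mean)

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two equal-size 1-D samples:
    the mean absolute difference of their sorted values."""
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def simulate(n, p_tail, seed=0):
    """Draw n durations (in days) with fixed, hypothetical component parameters."""
    rng = random.Random(seed)
    return [sample_stage_duration(rng, p_tail, 20.0, 100.0, 2.5) for _ in range(n)]

def calibrate(empirical, candidates):
    """Grid search: return the tail probability with the smallest distance."""
    return min(
        candidates,
        key=lambda p: wasserstein_1d(simulate(len(empirical), p), empirical),
    )
```

A grid search is used here only to keep the sketch short; any optimizer over the mixture parameters would fit the same pattern. Because the distance is expressed in the units of the data, a value such as the reported 53.3 days reads directly as an average displacement between the sorted simulated and observed durations.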
Although symmetricity in the converter controller is desirable for robust stability margins, a direct link between system-level asymmetricity and instability has yet to be clearly established. Converter control introduces three-phase asymmetricity through loops such as DC-link voltage control, a phase-locked loop, and a power synchronization loop. Furthermore, the inherently asymmetric topology of the two-level voltage-source converter, which converts a DC voltage into a three-phase balanced set, acts as the underlying origin of the asymmetries that propagate into the control structure. Consequently, establishing a direct relationship between system asymmetricity (rather than control asymmetricity alone) and the stability margin is essential for understanding the underlying instability mechanisms. In this work, asymmetricity is quantified using the Asymmetricity Quantification Index (AQI), derived from the sequence-domain representation of the interconnected converter-grid impedance. Within this domain, symmetricity is identified through the definition of symmetrical matrices, which serve as the benchmark against which asymmetricity is measured. A robust and generalized analysis correlates AQI with the stability margin, including both grid-following and grid-forming control structures connected to the power grid. It is found that instability arises from increased asymmetricity in the combined converter-grid system, which is dominated by asymmetric control loops and operating points. Thus, reducing asymmetricity without compromising controller functionality can improve stability margins. The analysis is validated in both control-hardware-in-the-loop and power-hardware-in-the-loop environments.
Speech foundation models often struggle in low-resource domains due to domain mismatch and data scarcity. We propose Gumbel-BEARD, a domain adaptation framework that automates Whisper encoder layer selection via an end-to-end trainable hard Gumbel-Softmax selector. It enables self-supervised adaptation with a BEST-RQ objective that dynamically adapts to target acoustic characteristics without manual tuning. Experiments on the MyST child speech corpus demonstrate efficiency and scalability: with 10 h of labeled data for fine-tuning, our method matches a fully supervised baseline trained on the complete 133 h labeled set. We establish new state-of-the-art word error rates (WERs) of 8.21% using Whisper-medium on MyST and 11.06% using Whisper-small on the OGI Spontaneous dataset. Evaluation on CORAAL further confirms robustness to adult dialectal domain shifts, with up to 6% relative WER reduction, highlighting the generalizability of our approach to diverse low-resource conditions.
The cumulative distribution transform (CDT) is a quantile-based transport representation that exactly linearizes one-dimensional translations of positive densities. We study how this structure behaves under additive perturbations and how it can be exploited for shift recovery. Under a local nondegeneracy condition, we derive a first-order expansion showing that additive noise in physical space induces a nonlocal perturbation in CDT space through the primitive of the noise, weighted by the reciprocal density. This yields an explicit description of transform-domain sensitivity and shows, in particular, that perturbations are amplified in low-density regions. When the physical-space perturbation is modeled as a centered Gaussian random field, the induced first-order CDT perturbation is again Gaussian, with an explicit covariance kernel. We then use this structure to study recovery in CDT coordinates. In the known-template setting, the transport shift is obtained by projection onto the constant mode, giving an explicit estimator together with exactness in the noiseless case and a stability bound under perturbations. In the unknown-template setting, multiple observations permit joint recovery of the shifts and a common template up to the natural constant-mode gauge, leading to a simple de-shift--and--average procedure. We also consider a signed-signal analogue based on the signed cumulative distribution transform (SCDT), where shifts are estimated numerically by feature matching and unknown templates are recovered by alternating alignment and averaging. Numerical experiments validate the perturbation analysis and illustrate effective recovery for both density-valued and signed signals.
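The known-template estimator described above can be illustrated numerically. The sketch below takes the CDT with respect to a uniform reference, so the transform is the quantile function, and estimates the shift as the average difference between observed and template quantiles (the projection onto the constant mode). The grid, the Gaussian template, the quantile levels, and all function names are assumptions chosen for illustration.

```python
import math

def quantile_transform(xs, density, levels):
    """Discrete CDT w.r.t. a uniform reference: invert the trapezoid-rule CDF
    at the given increasing quantile levels by linear interpolation."""
    cdf = [0.0]
    for i in range(1, len(xs)):
        cdf.append(cdf[-1] + 0.5 * (density[i] + density[i - 1]) * (xs[i] - xs[i - 1]))
    total = cdf[-1]
    cdf = [c / total for c in cdf]
    out, j = [], 0
    for u in levels:
        while j < len(cdf) - 2 and cdf[j + 1] < u:
            j += 1
        span = cdf[j + 1] - cdf[j]
        t = (u - cdf[j]) / span if span > 0 else 0.0
        out.append(xs[j] + t * (xs[j + 1] - xs[j]))
    return out

def estimate_shift(xs, template, observed, levels):
    """Project the CDT difference onto the constant mode (its mean)."""
    q_template = quantile_transform(xs, template, levels)
    q_observed = quantile_transform(xs, observed, levels)
    return sum(o - t for o, t in zip(q_observed, q_template)) / len(levels)

def gaussian(x, mu):
    """Unnormalized unit-variance Gaussian bump, used as a toy density."""
    return math.exp(-0.5 * (x - mu) ** 2)

xs = [-10.0 + 0.01 * i for i in range(2001)]
# Levels restricted to [0.05, 0.95] to stay away from low-density tails.
levels = [0.05 + 0.9 * k / 99 for k in range(100)]
template = [gaussian(x, 0.0) for x in xs]
observed = [gaussian(x, 1.3) for x in xs]  # template translated by 1.3
tau_hat = estimate_shift(xs, template, observed, levels)
```

In this noiseless toy case the estimate matches the true shift of 1.3 up to discretization error. Keeping the quantile levels away from 0 and 1 is a design choice motivated by the abstract's observation that perturbations are amplified where the density is low.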
This paper investigates coherent multiband orthogonal frequency division multiplexing (OFDM) sensing within an integrated sensing and communication (ISAC) framework. We consider an intra-band configuration in which two sensing subbands of equal width are allocated symmetrically within the same OFDM channel, while the central portion remains available for communication. We address the reconstruction of missing frequency-domain samples induced by the spectral gap and the suppression of the resulting grating lobes in the delay profile. To this end, we propose a low-complexity iterative reconstruction method consisting of an initial delay-domain equalization stage and an iterative apodization-based operator with data-consistency enforcement. Performance results for multi-target scenarios show that the proposed approach remains close to the full-band reference for moderate gap sizes and degrades only for larger gaps because of residual grating lobes. Compared with the compressed-sensing-based orthogonal matching pursuit (OMP) baseline, it exhibits a more favorable performance trend as the number of targets increases, especially in the practically relevant low-signal-to-noise ratio (SNR) regime, while offering a complexity scaling that is independent of the estimated number of targets.
The success of large-scale deep learning models in neuroscience is fundamentally constrained by severe data heterogeneity. Native fMRI data aggregated from diverse sources exhibit substantial variation in both spatial and temporal resolutions. Consequently, most existing frameworks rely on lengthy, rigid preprocessing pipelines that enforce uniformity across datasets. This practice introduces two critical limitations: (1) potential degradation of subject-specific anatomical information; and (2) significant computational overhead, often requiring hours of processing per subject. Here, we propose FlexiBrain, a resolution-agnostic voxel-level encoding framework for native fMRI based on Mamba-JEPA. FlexiBrain defines patch sizes in real-world physical units and employs dynamic patch resizing, thereby bypassing destructive spatial standardization while enabling direct ingestion of data in native space. We instantiate the framework using an efficient Mamba-JEPA backbone to model high-dimensional 4D fMRI signals. Across five diverse downstream neuroscience tasks, FlexiBrain consistently outperforms recent state-of-the-art methods, achieving gains of up to 12 percentage points without external data augmentation. Importantly, FlexiBrain functions as a seamless plug-in module, substantially reducing preprocessing costs and accelerating the development of robust voxel-level fMRI foundation models. Code is available at this https URL.
Evaluating generative spatial audio for First-Order Ambisonics (FOA) remains challenging due to a limited understanding of how metrics respond to changes in spatial parameters such as azimuth and elevation. We propose a framework to analyze metric sensitivity along continuous spatial trajectories, drawing on principles of sensitivity analysis in parametric sound synthesis. Using controlled FOA scenes with increasing scene complexity, we define three desiderata for metric behavior: Responsiveness, Smoothness, and Symmetry. We assess standard distribution-based and sample-based metrics, including Fréchet Audio Distance (FAD), intensity vectors, and acoustic maps. Our findings show that FAD using localization-specific embeddings and acoustic maps yield high Responsiveness and robust Smoothness and Symmetry across conditions, while intensity vectors degrade with increasing scene complexity. This is the first step towards investigating the sensitivity of metrics for generative spatial audio.
Pixel antennas enable antenna coding, a technique that can provide more degrees of freedom in wave manipulation, to enhance wireless communications. However, acquiring full channel state information (CSI) at the transmitter incurs prohibitive overhead due to the unique hardware constraints of pixel antennas. This paper thus proposes a limited-feedback multi-input multi-output (MIMO) system using pixel antennas, where the antenna coder and digital precoder are designed based on pre-defined codebooks and efficient index feedback. We first derive the optimal digital precoder under practical power constraints, which provides insights on simplifying the joint codebook construction for the antenna coder and digital precoder. We then develop a low-complexity offline codebook construction algorithm that enables subsequent codebook designs for the antenna coder and digital precoder. Simulation results demonstrate that the proposed scheme significantly outperforms unconstrained MIMO systems using conventional antennas with fixed configurations.
Large Language Models (LLMs) have rapidly emerged as tools of interest across engineering disciplines, and Process Systems Engineering (PSE) is no exception. This survey provides a systematic review of LLM applications in PSE, organizing the literature into seven categories: (1) process design and engineering, (2) molecular design and synthesis, (3) process modeling and simulation, (4) time-series forecasting, (5) optimization and scheduling, (6) process control, and (7) fault detection and diagnosis. For each category, we summarize the state of the art, identify common methodological approaches, and critically assess demonstrated capabilities versus aspirational claims. We find that LLMs show genuine promise for tasks involving natural language, including querying documentation, synthesizing unstructured knowledge, and enabling flexible human-machine interaction. However, applications requiring real-time execution, constraint satisfaction, or formal safety guarantees remain challenging. We conclude by identifying open problems and productive research directions for the PSE community.
In this paper, we consider a class of networked systems comprising an interconnected set of linear subsystems, disturbance inputs, and performance outputs. Using dissipativity theory, we first propose a model-based hierarchical control design strategy to ensure the closed-loop networked system is dissipative from its disturbance inputs to performance outputs. This involves designing local controllers for each subsystem to enforce local dissipativity guarantees, which are then exploited to co-design distributed global controllers and the interconnection topology to enforce global dissipativity guarantees while optimizing interconnection topology costs. The overall design process requires only solving a sequence of linear matrix inequality (LMI) problems, thereby retaining compositionality and decentralizability while avoiding non-convex, iterative design processes that are inefficient and centralized. This model-based hierarchical control design strategy assumes the knowledge of the subsystem dynamics, which may not hold in many real-world networked systems. Motivated by this, we also propose a data-driven hierarchical control design strategy that assumes only the availability of rich input-state-output trajectory data from the subsystems. The proposed data-driven design process assumes that the unknown disturbances affecting the subsystem dynamics are bounded by a quadratic matrix inequality (relaxing conventional bounds) and accounts for this by using the matrix S-lemma. Finally, the effectiveness of the proposed model-based and data-driven hierarchical control designs is illustrated for a networked system representing a DC microgrid, with the aim of enforcing robust (dissipative) voltage regulation and current sharing.
The Frequency Range 3 (FR3) band is attracting increasing attention due to limited lower-frequency spectrum and growing mobile communication demand. This study experimentally investigates channel characteristics in Urban Macro (UMa) scenarios at 8 GHz and 15 GHz using a large-scale MIMO platform with time-division multiplexing (TDM). Key parameters, including root mean square (RMS) delay spread (DS) and angular spread (AS), were extracted and compared with 3rd Generation Partnership Project (3GPP) TR 38.901. Results reveal clear frequency-dependent behaviors: RMS delay spread remains nearly constant under line of sight (LOS) but decreases from 8 GHz to 15 GHz in non-line of sight (NLOS), indicating reduced multipath dispersion at higher frequencies. Both azimuthal spreads (including ASA and ASD) and elevation spreads (including ESA and ESD) exhibit a corresponding decrease with increasing frequency, demonstrating a consistent trend towards more directional propagation across all angular domains. Capacity analysis indicates that the 15 GHz channel slightly outperforms 8 GHz in both LOS and NLOS scenarios due to more concentrated multipath energy and larger dominant singular values. Higher frequencies exhibit greater directionality, whereas lower frequencies provide broader multipath distributions and more stable performance, offering valuable guidance for multi-band MIMO modeling and 6G system design.
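For readers unfamiliar with the RMS delay spread reported above, the following sketch computes it from a power delay profile using the standard power-weighted definition. The function name and the toy two-path profile are illustrative assumptions; the measurement processing used in the study is not described in the abstract.

```python
import math

def rms_delay_spread(delays, powers):
    """Power-weighted standard deviation of path delays.

    delays: path delays in seconds; powers: linear (not dB) path powers.
    """
    total = sum(powers)
    mean_delay = sum(p * t for t, p in zip(delays, powers)) / total
    second_moment = sum(p * (t - mean_delay) ** 2 for t, p in zip(delays, powers)) / total
    return math.sqrt(second_moment)

# Toy profile: two equal-power paths 100 ns apart give a 50 ns RMS delay spread.
toy_spread = rms_delay_spread([0.0, 100e-9], [1.0, 1.0])
```

A profile dominated by one strong path yields a small spread, which is consistent with the reported link between concentrated multipath energy and reduced dispersion at the higher frequency.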
Learning-based speech compression has achieved promising low-bitrate performance, but many neural speech codecs still describe quantized latents with preset-rate discrete symbols or apply entropy coding only after symbol generation. Such designs decouple representation learning from probability modeling, limiting their ability to exploit the non-uniform usage and temporal dependencies of learned speech latents. In this paper, we benchmark neural speech compression from a rate--distortion perspective and further investigate entropy-constrained coding for low-bitrate speech compression. We first formulate a unified learning-based speech coding pipeline and provide a benchmark-style analysis of recent neural speech codecs, showing that explicit probability modeling remains underexplored in learned speech compression. We then propose ECC, an Entropy-Constrained Codec that combines scalar quantization with a learned entropy model. ECC integrates hyperprior-based side information, channel-wise context modeling, latent residual prediction, and lightweight temporal modeling to estimate latent likelihoods for rate estimation during training and arithmetic coding during inference. To further improve low-bitrate efficiency, ECC introduces entropy skip, which omits highly predictable residual symbols using decoder-available scale estimates without transmitting additional skip masks. Extensive experiments show that ECC achieves a favorable low-bitrate rate--distortion trade-off over conventional and neural codec baselines, reducing BD-rate by 39.9% on ViSQOL and 76.3% on PESQ on average over two widely-used test sets. Ablation and diagnostic studies further validate the effectiveness of entropy modeling. Project Page: this https URL
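The motivation for entropy-constrained coding above, namely that fixed-rate symbols ignore non-uniform usage, can be seen in a toy calculation. In this sketch, empirical symbol frequencies stand in for the learned entropy model, which is an assumption made purely for illustration; the function names and the example latents are likewise hypothetical and are not part of ECC.

```python
import math
from collections import Counter

def scalar_quantize(latents, step=1.0):
    """Uniform scalar quantization: map each latent to an integer index."""
    return [round(x / step) for x in latents]

def empirical_entropy_bits(symbols):
    """Average bits per symbol under the empirical symbol distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def fixed_rate_bits(symbols):
    """Bits per symbol when every distinct symbol gets an equal-length code."""
    alphabet = len(set(symbols))
    return math.ceil(math.log2(alphabet)) if alphabet > 1 else 0

# Hypothetical latents that cluster near zero, as predictable residuals would.
latents = [0.1, -0.2, 0.05, 0.3, -0.1, 0.2, 1.1, -0.9]
symbols = scalar_quantize(latents)
entropy_rate = empirical_entropy_bits(symbols)
preset_rate = fixed_rate_bits(symbols)
```

When most symbols fall into one bin, the entropy estimate drops well below the preset rate, and an arithmetic coder driven by an accurate probability model can approach that lower figure.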
The rapid adoption of electric vehicles (EVs) can cause severe voltage drops and line current overloads in distribution networks, creating an urgent need for scalable expansion planning methods. This paper proposes a computationally efficient violation-informed spatio-temporal adaptive targeting (STAT) framework for EV-driven distribution system expansion planning. The framework first identifies potential voltage and current violations through a violation analysis model, and then mitigates them through a joint optimal expansion planning model that co-optimizes investment decisions for line reconductoring, shunt capacitors, and battery energy storage systems. To reduce computational burden, the proposed STAT-temporal criticality assessment (STAT-TCA) method extracts primitive stress events from annual operating data, derives an initial set of candidate planning horizons from signature-consistent segments, and selects a final transferable critical horizon set through cross-horizon validation based on optimization feasibility and cost. Meanwhile, the proposed STAT-adaptive spatial targeting (STAT-AST) method constructs device-specific spatial features for BESS and SC siting to retain compact yet high-impact candidate bus sets. Case studies on 33-bus and 240-bus distribution systems demonstrate that the proposed STAT framework can substantially reduce the temporal and spatial planning dimensions while preserving planning fidelity. Full-year validation further confirms that the resulting investment plans can eliminate EV-induced voltage and thermal violations while maintaining feasible BESS operations.
Power leaking directly from transmitting into receiving radio-frequency chains is a key challenge in the realization of monostatic sensing applications with multi-antenna communication front-ends, to which a promising solution is digitally precoding transmitted signals for improved leakage suppression. While digital transmit precodings perform well in theory, real-world deployments typically exhibit severely degraded leakage suppression. This work investigates quantization noise as a primary factor limiting the performance of such precoding schemes. A closed-form solution predicting the impact of quantization noise on the performance of arbitrary digital joint leakage estimation and leakage suppression precodings is derived, numerically analyzed, and validated in a hardware testbed.
Distilling a large speech foundation model (SFM) into an efficient student model has been successfully applied to low-resource environments. Although distillation reduces inference latency, it requires additional training of the student model, and the training efficiency of SFM distillation remains underexplored. In this work, we explore training acceleration of SFM distillation to speed up model deployment. We examine the potential of stacking, in which the model depth is progressively increased during training until the target model depth is reached. While existing stacking methods improve training speed, they suffer from performance degradation. To address this limitation, we propose interleaved stacking, a novel stacking method that consistently preserves layer position throughout the stacking process. This property is particularly critical in SFMs, in which each layer encodes distinct layer-specific knowledge. We validate the effectiveness of the proposed method on SUPERB.
Multi-talker conversational automatic speech recognition data are often used to train speaker diarization models. Because such data prioritize semantic continuity, pauses and boundary margins are included within speech segments, resulting in loose annotations. Models trained on such data tend to internalize mechanisms that reproduce this looseness, although tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions using loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens labels and updates both models for more progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.
Uncertainty in standalone microgrid operation usually originates from mismatches between power references and forecasts. These deviations are compensated by grid-forming controlled units, which distribute the required power contribution based on their droop gains. To introduce an additional degree of flexibility, it is possible to treat droop gains as decision variables to redistribute active-power contributions according to system-level objectives. However, directly applying updated droop gain references from a supervisory layer to the primary controllers can introduce power and frequency transients. This paper investigates transition mechanisms for applying scheduled active-power droop gain changes during operation. Hard switching, rate-limited transition, first-order IIR low-pass filtering, and cubic as well as quintic S-curve transitions are compared experimentally on two parallel 15 kW grid-forming inverter units. The results show that shaping the droop gain trajectory significantly reduces transient deviations compared to hard switching. In the considered case study, the S-curve transitions provide the strongest transient mitigation, reducing the active-power overshoot from 632.7 W to approximately 115 W and limiting the frequency overshoot to about 0.003 Hz.
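The transition shapes compared above can be sketched in a few lines. The abstract does not give the exact polynomial forms, so the standard cubic and quintic smoothstep polynomials are assumed here, and the droop gain values, durations, and function names are hypothetical.

```python
def linear_ramp(u):
    """Rate-limited transition: constant slope over the transition window."""
    return u

def cubic_s(u):
    """Cubic S-curve 3u^2 - 2u^3: zero slope at both ends."""
    return u * u * (3.0 - 2.0 * u)

def quintic_s(u):
    """Quintic S-curve 6u^5 - 15u^4 + 10u^3: zero slope and curvature at both ends."""
    return u ** 3 * (u * (6.0 * u - 15.0) + 10.0)

def droop_trajectory(k_old, k_new, duration, t, shape):
    """Droop gain at time t for a transition starting at t = 0."""
    u = min(max(t / duration, 0.0), 1.0)  # normalized, clamped time
    return k_old + (k_new - k_old) * shape(u)

def iir_lowpass(k_old, k_new, alpha, steps):
    """First-order IIR filter response to a step in the droop gain reference."""
    k, out = k_old, []
    for _ in range(steps):
        k += alpha * (k_new - k)
        out.append(k)
    return out

# Hypothetical gain change from 0.02 to 0.05 over 1 s, sampled every 0.1 s.
times = [0.1 * i for i in range(11)]
cubic_path = [droop_trajectory(0.02, 0.05, 1.0, t, cubic_s) for t in times]
quintic_path = [droop_trajectory(0.02, 0.05, 1.0, t, quintic_s) for t in times]
```

Hard switching corresponds to jumping straight to the new gain. The S-curves start and end with zero slope, so the primary controller sees no abrupt change in the rate of the gain, which is one plausible reading of why they gave the strongest transient mitigation in the reported experiments.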
Due to the high power consumption and hardware costs of fully digital arrays, hybrid beamformers are often considered a more economical alternative. Furthermore, high-resolution analog-to-digital converters (ADCs) can also have prohibitive power consumption, which leads to lower-resolution converters being considered for radio frequency (RF) front-end design. The finite quantization resolution as well as the nonlinearities caused by the power amplifiers (PAs) and low noise amplifiers (LNAs) can have a substantial impact on system performance. While widely studied for communications, the impact of hardware impairments on sensing performance is considerably less explored. In this work, we study the interplay between hybrid beamforming architectures, hardware impairments, and sensing and communications performance. Additionally, we define the concept of double-isotropy for pilot-combiner pairs, formalizing the notion of a perfectly energy-fair beam sweep. The multiple start (MS) space alternating generalized expectation maximization (SAGE) algorithm is also introduced, aimed at addressing the optimization issues arising from parametric channel estimation (PCE) in hybrid beamformed systems. We then provide a set of numerical results assessing the impacts of beamformer architecture and ADC resolution on PCE, sensing, and communications performance. The results show that medium-resolution ADCs lead to the most power-efficient configurations, with the best tradeoff between power consumption and performance for the majority of beamforming architectures. Additionally, fully digital beamforming architectures with high-resolution converters can often be replaced by a hybrid beamformer setup with medium-resolution converters without significant performance loss, at a lower power consumption and overall hardware cost.
Multi-channel mixed-SNR training improves out-of-distribution (OOD) generalisation of deep learning channel estimators for IEEE 802.11p vehicular communications, yet the internal mechanism responsible for this remains unexplained. This work presents REACH (Relevance-based Explanation and Architectural Compression for cHannel estimators), a gradient-based interpretability framework that operates at two levels. Input-level attribution identifies a subset of time-frequency features consistently relevant across all evaluated channel conditions, enabling input dimensionality reduction with minimal performance loss. Filter-level attribution reveals a near-universal internal representation, providing a representational account of the observed OOD generalisation. Guided by the resulting filter taxonomy, relevance-guided architecture compression substantially reduces both the number of parameters and the number of floating-point operations (FLOPs) with sub-1 dB normalised mean square error (NMSE) degradation, and OOD generalisation degrades more slowly than within-distribution accuracy under increasing compression.
This paper investigates an analytical Koopman-based nonlinear model predictive control (K-NMPC) approach for tracking control of virtually coupled train systems. A nonlinear train movement model incorporating train dynamics, speed and control input limits, passenger comfort constraints, and collision avoidance is systematically lifted into a finite-dimensional Koopman space through closed-form observable functions. After freezing the affine parameter-varying lifted predictor along the shifted predicted trajectory, the online optimal control problem reduces to a quadratic program that can be solved efficiently. The proposed K-NMPC is benchmarked against a time-discrete NMPC scheme, demonstrating comparable control performance with significantly reduced online computation time and strong potential for real-time implementation in practical virtually coupled train control systems.
We investigate the impact of power amplifier (PA) nonlinearities on the sensing performance of affine filter bank modulation (AFBM). While AFBM offers several advantageous properties for integrated sensing and communications (ISAC) - including reduced out-of-band emission (OOBE), low peak-to-average power ratio (PAPR), and natural robustness to doubly-dispersive (DD) channel effects - mitigating waveform distortion typically requires highly linear PAs. This creates a fundamental contradiction with ISAC applications, which demand high transmit power for reliable sensing. Our analytical results reveal that the structure of the effective AFBM modulation matrix dictates how distortion propagates within the ambiguity function (AF). Furthermore, simulations demonstrate that both the AF and the overall sensing performance of AFBM remain remarkably insensitive to such nonlinearities. These findings highlight the robustness of AFBM, making it a highly viable candidate for practical ISAC deployments constrained by hardware impairments.
This manuscript addresses a hierarchical control system designed to suppress traffic congestion. The lower-layered controllers, implemented in each controlled vehicle, monitor microscopic vehicle behaviors and assist human drivers to ensure sufficient spacing for following vehicles. This spacing logic is designed based on the Control Barrier Function. Meanwhile, the upper-layered controller monitors the macroscopic traffic flow and activates the necessary lower-layered controllers, using a data-driven approach for the activation logic design. Furthermore, the effectiveness of the proposed control system is evaluated in a traffic flow simulation environment constructed using real-world traffic data.
To meet the demands of 6G wireless systems operating in high-mobility scenarios, this paper presents a design of a random multiplexing (RM) communication system that is both storage-efficient and highly reliable. In principle, RM with cross-domain memory approximate message passing (CD-MAMP) can achieve replica maximum a posteriori (MAP)-optimal performance by constructing a fully dense equivalent channel matrix. However, its practical implementation is hindered by the large storage overhead of conventional interleavers and by performance degradation in severely ill-conditioned channels, which existing related work (focusing on interleaving and transform designs) fails to address simultaneously. To overcome these issues, we develop a storage-efficient and highly reliable system that integrates RM with CD-MAMP, referred to as RM-MAMP. Specifically, we propose a Logistic chaotic mapping interleaver with a quantitative parameter-selection criterion, and a dual-stage high-order permutation polynomial interleaver, both of which achieve nearly identical bit-error-rate (BER) as fully random interleavers while reducing the interleaver storage from O(N) to O(1) and significantly lowering interleaver signaling overhead. We further propose a highly reliable interleaved transform framework, comprising an interleaved phase perturbation transform and a multi-layer interleaved coupled transform, to enhance the incoherence and diversity of the equivalent channel matrix. Simulation results show that the proposed storage-efficient interleavers maintain BER performance comparable to fully random interleavers, while the highly reliable transforms provide over 4 dB gain in severely time-varying channels, confirming the dual benefits of reduced storage overhead and improved robustness for the enhanced RM-MAMP system.
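To illustrate how a chaotic-map interleaver can cut storage from O(N) to O(1), the sketch below regenerates a length-N permutation from two stored scalars. It uses a common textbook construction (rank-ordering a logistic-map sequence). The construction itself, the parameter values `x0` and `r`, and the `burn_in` length are all assumptions for illustration. The abstract does not specify the proposed interleaver or its parameter-selection criterion.

```python
def logistic_interleaver(n, x0=0.37, r=3.99, burn_in=100):
    """Build a permutation of range(n) from a logistic-map sequence.

    Only (x0, r) need to be stored or signaled; the permutation itself
    is regenerated on demand at both transmitter and receiver.
    """
    x = x0
    for _ in range(burn_in):          # discard the initial transient
        x = r * x * (1.0 - x)
    sequence = []
    for _ in range(n):
        x = r * x * (1.0 - x)         # logistic map x <- r x (1 - x)
        sequence.append(x)
    # Rank-ordering the chaotic samples yields the interleaving pattern.
    return sorted(range(n), key=sequence.__getitem__)

def interleave(data, perm):
    """Output position i takes the input symbol at index perm[i]."""
    return [data[p] for p in perm]

def deinterleave(data, perm):
    """Invert interleave() for the same permutation."""
    out = [None] * len(perm)
    for i, p in enumerate(perm):
        out[p] = data[i]
    return out
```

Because the map is deterministic, a receiver that knows the same `(x0, r)` pair reproduces the identical permutation without it ever being transmitted or stored in full.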
We consider a multi-receiver status update system in which a transmitter monitors a finite-state semi-Markov source and decides whether to stay idle, unicast an update, or broadcast a common update. We formulate a risk-aware scheduling problem that minimizes the long-term average sum of the average Age of Incorrect Information (AoII), average risk ratio, and transmission cost. The risk state is defined by whether the AoII exceeds a prescribed threshold. We solve the problem using model-based and model-free policies and compare them with two baselines. Numerical results show that the proposed policies outperform the baselines, exploit both unicast and broadcast transmissions, and capture the effect of the dwell-time law on scheduling performance.
CSI-based localization with spatially distributed antenna arrays exposes a basic resource trade-off. Each array can provide a rich view of the channel, but forwarding observations from all arrays to a fusion center is wasteful when only a few carry useful information, and the shared uplink supports only a limited number of simultaneous transmissions. We let each array decide locally whether its current observation is worth reporting, subject to a budget on the average number of active transmitters. We refer to this abstraction as Edge-Triggered Distributed Inference (ETDI). It captures a broader class of task-oriented communication problems where resource-constrained devices share an access channel for a common inference task. We instantiate ETDI for CSI-based localization, a common scenario in vehicular IoT networks. Spatially distributed remote antenna arrays (RAAs) encode local channel state information (CSI) from user equipment (UE) transmissions into latent features, and the fusion center estimates the UE position from the subset of reported features. We propose NARRAS, a decentralized reporting policy in which each RAA combines a recurrent summary of its recent observations with a memory of the last latent it transmitted. Training controls an explicit activity budget through differentiable activity penalties and validation-calibrated deterministic thresholds, and uses channel-chart regularization to shape the latent geometry. Experiments show that, at comparable uplink activity, NARRAS improves localization accuracy over learned and heuristic sparse-reporting strategies, while dense full-report models remain useful budget-free references. In low-activity regimes, chart regularization further reduces high-percentile localization errors, suggesting that geometry-aware latent representations are more robust under sparse reporting.
High-voltage direct current (HVDC) transmission systems based on modular multilevel converters (MMCs) have become a key topology in modern power systems. The dynamics of MMCs exhibit strong multivariable coupling, constraints, and uncertainties, motivating the use of model predictive control (MPC) to enhance current regulation performance. However, MPC tuning is nontrivial and does not inherently guarantee stability or robustness, particularly in the presence of model uncertainties. This paper proposes an MPC tuning method that ensures robust performance under bounded model uncertainties. This method solves a convex linear optimization problem to compute the optimal weighting matrices Q, R, and P, ensuring optimality and reproducibility. As a result, robustness is enhanced without increasing the online computation burden. The effectiveness of the method is validated through testing on a real-time digital simulator (RTDS) model of a point-to-point HVDC system. Results demonstrate improved performance compared to conventional LQR-based MPC tuning.
Epilepsy affects over 50 million individuals globally, underscoring the need for automated seizure detection systems that can alleviate clinicians' workload and enhance the accuracy of patient seizure diaries. In wearable EEG applications, however, reliable detection remains challenging due to the limited spatial resolution of low-density electrode configurations, reduced signal-to-noise ratios, and the scarcity of diverse, publicly available training datasets. This study investigates the efficacy of hybrid deep learning architectures for automated seizure detection using a simulated behind-the-ear montage derived from the Temple University Seizure Corpus (TUSZ, v2.0.3). We conduct a systematic comparison of several CNN-RNN models, including LSTM- and GRU-based variants, across multiple EEG montages to evaluate their capacity to compensate for the loss of spatial information inherent to reduced electrode configurations. The proposed CNN-Merged model, which integrates temporal and spectral feature representations, demonstrates superior performance, achieving a ROC AUC of 85.89% and a balanced accuracy of 79.11% on the held-out test set. Furthermore, the model exhibits strong robustness across different reference montages, effectively bridging the performance gap between conventional full-scalp recordings and resource-constrained wearable systems. These findings substantiate the potential of hybrid deep learning models as a promising avenue toward robust, patient-independent seizure detection in low-density EEG applications.
This paper presents a collaborative adaptive formation control framework for autonomous vehicles (AVs) that explicitly handles system uncertainties, input saturation, and communication delays. To overcome the inherent physical torque limits of steering and braking actuators, an input saturation compensation mechanism is introduced to render nonlinearities tractable and improve control reliability. Additionally, a delay-compensating auxiliary system is designed to mitigate the effects of communication delays and reduce tracking errors. Our framework incorporates a dynamic-threshold event-triggered control (ETC) strategy to optimize resource usage. Furthermore, uncertainty observers and symmetric barrier Lyapunov functions are developed to ensure robust and safe formation maneuvers. Finally, the effectiveness of the proposed approach is validated through numerical simulations of vehicle formations, complemented by a 3D visualization video demonstrating the dynamic fleet reconfiguration process.
We propose a zonotopic framework for synthesizing a single robust state feedback controller that is certified to stabilize every plant inside a matrix zonotope, describing linearly varying parameters or parametric uncertainty. Common robust design strategies rely on checking many vertex models or on complex gain-scheduling, leading to high offline computation and implementation complexity. Our approach finds a single gain that is provably valid across the entire parameter domain, which is simpler to implement and can reduce conservatism by exploiting the structure of the zonotope. We formulate the robust synthesis as a single convex program tailored to the zonotope representation and incorporate practical performance requirements (actuator constraints, decay rate, disturbance attenuation) into the same synthesis stage. In numerical experiments on a representative 4-state example, we compare against common-vertex gain, $H_{\infty}$, and $\mu$-synthesis baselines. Our controller provides larger stability coverage across the parameter domain, attains transient performance and control effort comparable to those of more complex designs, and significantly reduces the number and scale of offline synthesis problems required.
Accurate state of charge (SOC) estimation of lithium iron phosphate (LFP) batteries remains challenging because of their flat open-circuit-voltage (OCV)-SOC characteristics, temperature-dependent dynamics, and sensitivity to initialization errors. Here, we propose a physics-guided residual Kalman learning (PRKL) framework for electrochemical-model-based SOC estimation. PRKL combines a control-oriented single-particle-model-based extended Kalman filter (EKF), which provides recursive physical state propagation, with a gated recurrent unit (GRU) residual learner that compensates structured EKF errors using electrochemical states and measurement features. The framework is evaluated on a public graphite/LFP dataset covering three dynamic drive cycles, eight temperatures from -10 to 50 degrees C, and initialization offsets up to 20 percent. Using dynamic stress test (DST) and federal urban driving schedule (FUDS) cycles for training and the supplemental federal test procedure (US06) cycle for cross-profile testing within the same cell dataset, PRKL achieves a global average root mean square error (RMSE) of 1.19 percent, corresponding to a 77 percent reduction relative to the physics-only EKF. These results show that electrochemical state information can guide residual learning and improve recursive SOC estimation for LFP batteries. The present validation supports cross-profile robustness within the studied dataset and provides a basis for future cross-cell, ageing-aware, and embedded-platform validation.
Recently, movable antenna (MA) has attracted wide attention in wireless communications due to its potential in enhancing wireless communication performance via local movement within a confined region. However, antenna position optimization (APO) has emerged as a major challenge for MAs, due to the lack of a tractable, analytical, and accurate channel model in terms of antenna positions. Although existing works have developed various algorithms for APO, most of them are based on simplified theoretical channel models, which limit their generality. To address this challenge, in this article, we present more general and effective APO algorithms for different purposes, categorized as continuous APO and discrete APO, respectively. Continuous APO is mainly applied for flexible array signal processing to boost large-scale communication performance, while discrete APO is applied for small-scale multi-path channel reshaping. Specifically, the discrete APO discretizes the antenna movement region into multiple sampling points and employs discrete algorithms to determine the optimal MA positions based on the point-wise channel state information (CSI), without the need for an analytical channel model. To reduce the overhead for CSI acquisition, we also present more efficient learning-based APO algorithms that operate without requiring full point-wise CSI. Finally, we compare the application scenarios of the proposed algorithms and validate their effectiveness with numerical results.
In this paper, we study the power allocation for an integrated sensing and communication (ISAC) system which tracks a mobile target. We first model the problem as a Markov decision process, and then tackle it with a soft actor-critic (SAC) based deep reinforcement learning (DRL) approach. We also incorporate a Dirichlet policy, which naturally produces normalized continuous actions under random target motion. To exploit different features of sensing and communication operations, we carefully design a reward function such that the system can dynamically control power allocation to conserve resources. The simulation results demonstrate that the proposed scheme enhances tracking performance compared to other baselines while sustaining communication performance.
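The reason a Dirichlet policy yields normalized actions can be seen in a small sampling sketch. The concentration values, the total power budget, and the function names below are hypothetical. In an actual SAC agent the concentrations would be produced by the policy network, which is not shown here.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def allocate_power(alphas, p_total, rng):
    """Map a Dirichlet action (fractions on the simplex) to per-task powers."""
    fractions = sample_dirichlet(alphas, rng)
    return [p_total * f for f in fractions]

# Hypothetical example: split a 10 W budget between sensing and communication.
rng = random.Random(0)
powers = allocate_power([2.0, 5.0], p_total=10.0, rng=rng)
```

Every sample is non-negative and sums to the budget by construction, so no clipping or renormalization of the action is needed after sampling.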
Peer-to-peer (P2P) energy trading requires network-aware coordination because transactions are physically realized through distribution networks. However, sensitivity-based coordination causes a confidentiality-verifiability tradeoff, as network sensitivities may reveal vulnerable components while undisclosed sensitivities prevent participants from verifying utility-provided transaction guides. This paper proposes a zero-knowledge-proof-based method for verifying the computational integrity of network-constrained transaction guides with respect to committed private network data, without exposing network-sensitivity information. The guide defines admissible injection and withdrawal volumes derived from sign-decomposed sensitivity matrices while satisfying balance, voltage, line-flow, and optimality conditions. These conditions are encoded in an arithmetic circuit, represented as R1CS constraints and a quadratic arithmetic program, and verified using a bilinear pairing. Blockchain commitments bind the approved circuit, public inputs, statement identifiers, proof, and verification result for tamper-evident auditability. The proposed proof certifies correct guide computation from committed network data; the authenticity of the committed network data is handled through an explicit registration and attestation assumption. Case studies on a modified IEEE 33-bus system show satisfaction of network constraints after clearing, rejection of public-input and witness-inconsistency attacks, and practical on-chain overhead, with an 806-byte proof.
Compliant element systems with ultra-large deformation display rich nonlinear dynamics and pose challenging control problems, which, when solved, could enable enhancements in several mechatronics applications, such as soft robotics, MEMS, and biomedical applications. This paper considers post-buckled dynamic analysis of an inverted ultra-flexible pendulum actuated by a rotary hub. We first derive a complete set of equations capturing the dynamics of the system, essential for control development, using the assumed modes method framework and a constrained Lagrangian formulation that accounts for ultra-large deformations. In the perfect inverted configuration with zero hub angle, the buckled beam would display two symmetric stable equilibria and one unstable equilibrium. However, as the hub angle changes on either side, the equilibrium positions shift, and eventually two of them vanish, leaving only one stable equilibrium. We use the dynamic equations to characterize this phenomenon, demonstrating the continuous state dependence of multiple equilibria. Furthermore, experimental counterparts of the equilibrium results are obtained and discussed, and simulation results capture the nonlinear dynamics of this system. Overall, the work establishes a solid mathematical foundation with a control-amenable model for future ultra-compliant mechatronic systems.
A privacy-compliant indoor localization approach utilizing a 3-D near-field (NF) passive radar imaging technique is presented. This technique leverages ubiquitously radiated electromagnetic fields for imaging, with passive tags introduced to enhance the strength of scattering fields, thereby enabling precise localization at the imaging level. The method also supports localization in non-ideal imaging scenarios, such as with limited bandwidth or in highly reflective environments. Owing to their geometrical properties, the simple and low-cost passive tags enable intuitive differentiation between individuals or objects. Associated privacy protection mechanisms are discussed, where the frequency-varying properties of the passive tags provide additional flexibility and potential applications under privacy and ethical considerations. Several forms of passive tags are presented, and both simulation and experimental results validate the effectiveness of the proposed passive tag designs.
Optimization-based feedrate planning offers the potential to significantly increase machining productivity, but its industrial adoption has been limited by high computational cost and extensive tuning effort. This paper proposes a lexicographic feedrate optimization principle that adaptively balances finishing time and motion smoothness in a tuning-free manner. To further improve computational efficiency, the optimization scheme is extended by a sparsity-exploiting formulation combined with a sequential windowing strategy, enabling real-time capable execution. In addition, a unified toolpath parameterization scheme is incorporated to synchronously handle tool position and orientation within the optimization framework. For a five-axis freeform test contour, the proposed method takes 14 s on an Intel i5-3470 CPU to optimize feedrate profiles for long toolpaths with 100,000 constraint checkpoints, and 52 s on a high-performance AMD 9950X CPU to handle one million checkpoints. Compared to an industrial CNC kernel, the resulting finishing time is reduced by more than 15 %.
Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300\,bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50$\rightarrow$2.08\,Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17\,Hz with intermediate-layer representation alignment.
Speech imagery is attractive as a brain-computer interface paradigm for communication because it is endogenous and intrinsically linguistic. Yet despite growing interest, its dominant scalp-EEG spatiotemporal characteristics remain poorly characterized. Here, we asked how speech imagery appears in scalp EEG and compared it against finger motor imagery. Using a within-subject dataset containing speech imagery, finger motor imagery, and no-task trials recorded under the same trial structure, we analyzed band-power dynamics across channels and time. Finger motor imagery showed the expected contralateral mu/alpha and low-beta desynchronization over sensorimotor areas, whereas speech imagery showed a weaker, more distributed alpha-dominant increase. After normalization to each condition's own post-trial interval, the speech-related alpha increase changed only modestly after cue onset, indicating that much of the speech-versus-no-task difference was already present during the instruction period. A classifier discriminating imagery from no-task reached mean balanced accuracies of 0.563 $\pm$ 0.072 for speech imagery and 0.718 $\pm$ 0.127 for motor imagery, with a stronger alpha/beta dependence for motor imagery than for speech imagery. Together, these results provide a clearer group-level characterization of speech imagery in scalp EEG and indicate that its dominant spatiotemporal pattern differs from that of finger motor imagery and is more consistent with substantial non-articulatory task-related contributions than with a clear articulatory-motor analogue.
Future 6G heterogeneous wireless networks (HWNs) are expected to support multiple radio access technologies (RATs), dynamic wireless environments, and applications with diverse quality-of-service (QoS) requirements. In such environments, network selection (NS) cannot rely only on instantaneous radio measurements or static ranking rules. Instead, access decisions must account for the evolving wireless state, service intent, packet-level QoS behavior, and candidate-RAT dynamics. This paper proposes a large language model (LLM)-based digital twin (DT) framework for stable, application-aware RAT selection under candidate-set evolution. The main idea is to shift NS from an instantaneous decision-matrix operation to a decision process over an evolving wireless DT state. The constructed DT combines site-specific geometry, Sionna RT-based propagation descriptors, ns-3 packet-level QoS emulation, service context, candidate-RAT information, and decision memory. Rather than acting as a general-purpose controller for 6G networks, the LLM is used for DT-grounded decision intelligence in this specific NS task. On top of this DT, a unified intent agent translates user and service requirements into structured decision priorities for two complementary NS branches: an LLM-assisted multi-attribute decision-making branch (MADM--LLM--NS) and a direct LLM-based ranking branch (LLM--NS). To improve decision stability, the framework further introduces history-aware adaptive normalization (HAAN) and DT-memory-driven retrieval-augmented in-context learning (RA--ICL). Numerical results show that the proposed framework reduces rank reversals and unnecessary handover events, while improving service-aware QoS satisfaction compared with representative MADM-based NS baselines.
This paper presents a theoretical framework for multi-band localization for a single-path single-input multiple-output (SIMO) system. We derive closed-form Cramér-Rao bounds (CRBs) for angle-of-arrival (AoA) and distance for uniform linear arrays (ULAs), and an intermediate matrix-form formulation for arbitrary array shapes. We also develop benchmark single- and multi-band maximum-likelihood (ML) estimators for joint AoA-distance estimation, leveraging a structured Levenberg-Marquardt (LM) refinement procedure. A key contribution is an analytical characterization of the threshold SNR (TSNR) for the proposed estimators. This is the SNR threshold at which the estimator transitions from "off the chart" to CRB-approaching performance, for both AoA and distance estimation. Numerical simulations confirm that the proposed single- and multi-band estimators achieve the CRB at SNRs above the predicted TSNR, and that multi-band processing simultaneously improves estimation accuracy and reduces SNR requirements. The resulting framework provides a rigorous foundation for next-generation multi-band localization and can be readily extended to elevation estimation, distributed arrays, and multi-path environments.
Conventional beamforming techniques primarily steer energy along desired directions or focus it at specific locations. These techniques become fragile when facing frequent blockage and highly dynamic propagation environments. In this article, we present caustic beamforming as a new paradigm for wireless beam control. First, we classify representative caustic beams according to their underlying mathematical origins and present three unique properties, namely self-bending, self-healing, and near-field non-diffracting. Building on these propagation properties, we then propose several application scenarios in sixth-generation (6G) networks. We undertake two case studies focused on physical layer security and service stability that highlight the capability of caustic beams to bypass potential eavesdroppers, deliver more uniform coverage, and sustain blockage-resilient links. We further discuss the enabling hardware architectures that facilitate practical deployments, and finally outline key open challenges regarding caustic beams that require further research.
This note is a tutorial on the deterministic version of the Kalman filter (state estimator), which is formulated as finding the state trajectory consistent with the system's equations with the minimal amount of $L^2$ process and measurement uncertainty. As stated, this is an input signal design problem with linear dynamics and an objective that is affine-quadratic in the state and inputs. The first step is to convert this problem to one with a purely quadratic objective by embedding in a larger system using ``homogeneous coordinates''. This converts the problem to a purely quadratic (i.e. an LQR) problem, but with non-standard initial or final state constraints. This latter problem can then be solved using a version of the matrix Differential Riccati Equation (DRE) for the larger LQR problem. The second step is a partitioning of this larger problem, which then yields the optimal dynamic observer and the DRE of the traditional Kalman filter. For comparison, the solution of the traditional LQ-tracking (Servomechanism) problem is also treated using a similar construction.
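As a sketch of the problem statement above, under assumed notation (the matrices $A$, $B$, $C$, the process disturbance $w$, the measurement error $v$, the measured output $y$, and the horizon $[0,T]$ are not named in the abstract, and any weighting or initial-state penalty is omitted), the deterministic estimation problem can be written as

$$\min_{x,\,w,\,v}\ \int_0^T \left( \|w(t)\|^2 + \|v(t)\|^2 \right) dt \quad \text{subject to} \quad \dot{x}(t) = A x(t) + B w(t), \qquad y(t) = C x(t) + v(t).$$

Eliminating $v = y - Cx$ gives the integrand $\|w\|^2 + x^{\top} C^{\top} C\, x - 2\, y^{\top} C x + y^{\top} y$, which is affine-quadratic in the state $x$ and the input $w$ because the known signal $y$ contributes the linear term $-2\, y^{\top} C x$ and the constant $y^{\top} y$.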
STFT-based speech enhancement typically adopts overlapping analysis frames. While overlap is essential for stable STFT processing, it makes adjacent frames highly correlated, causing redundant computation in lightweight models. We propose Half-frame-rate Adaptive Learnable Operator (HALO), a causal plug-in module that halves the internal frame rate without altering the STFT procedure. Broadly applicable to many lightweight models, HALO applies adaptive rate reduction before the backbone and restoration afterward, reconstructing the full-rate spectrum on the original STFT grid. Both reduction and restoration are implemented with lightweight dynamic convolutions. By halving the processed frame rate, HALO reduces backbone compute cost with no added algorithmic latency, freeing budget for channel widening. Experiments on the DNS3 dataset show consistent gains across diverse lightweight models under matched complexity, demonstrating the effectiveness of reducing overlap-induced redundancy.
We develop a large-signal stability analysis for a sampled-data, optimization-based secondary controller for inverter-interfaced distributed energy resources in virtual power plants.
Efficient thermal management is critical for the reliability and performance of power electronics systems in automotive applications. This work presents a computationally efficient modeling approach for transient thermal simulation of power electronic systems, with a focus on inverter modules using multiple MOSFETs mounted on a printed circuit board assembly (PCBA). The case study considers an inverter module for a three-phase system, comprising six MOSFETs arranged as high-side and low-side pairs and mounted on a PCBA that is attached to a heat sink. Computational fluid dynamics (CFD) simulations in Ansys Icepak are performed considering different heat transfer mechanisms, including natural convection, forced convection at constant velocity, and forced convection with varying flow velocity. A transient thermal model is developed using the Lumped Parameter Linear Superposition (LPLSP) method, a hybrid approach that combines lumped parameter modeling with the principle of linear superposition to capture transient thermal behavior efficiently. Component temperatures from the simulations are compared with temperatures from the LPLSP model and from a linear time-invariant (LTI) based reduced order model (ROM) developed for this system. The LPLSP model reproduces a wide range of use cases with an error of less than 5 %. This method enables rapid thermal performance evaluation of power electronics systems that have very fast transients in component-level power dissipation and variations in ambient conditions, making it particularly well-suited for early-stage design iterations and long-duration mission profile simulations. The approach offers a practical path to reducing development cycles for automotive power electronics design.
The joint rate-distortion framework of Stavrou and Kountouris (IEEE Transactions on Communications 2023) characterises dual-fidelity tradeoffs for semantic communication on stochastic semantic sources. Many task-oriented communication systems instead use designed sources, where the semantic object is a deterministic oracle allocation $\phi^*(t)$ rather than a stochastic quantity given by nature. We isolate the subclass of designed sources under smooth concave utility with assumptions A1, A2 and Euclidean allocation codomain, and restrict the encoder class to deterministic common-category mappings. Within this subclass the SK exponential-tilting decoder and generalised Blahut--Arimoto iteration specialise to conditional-mean decoding and Lloyd--Max stationarity on $\phi^*(t)$. When the second fidelity is a monotone single-letter distortion, the joint problem stays inside the SK admissible class; the common-category SK rate is lower-bounded by the max of the corresponding Shannon rate-distortion functions, with equality only when the common-category reconstruction is compatible and RDF-optimal. When the second fidelity is aggregate verification, the joint problem leaves the SK single-letter class and admits a constrained-design feasibility band $R_{\min}(\varepsilon^*) \leq R \leq R_{\max}(\beta^*)$ of width $\log_2(K_{\max}/K_{\min})$ bits in partition cardinality. The reduction and the band are scope statements on the SK apparatus, not modifications to it. A smart-grid economic-dispatch example with a non-technical-loss-detection contrast illustrates the band.
Sustained driving automation systems are envisioned as the foundation for driverless mobility services. However, both researchers and practitioners acknowledge that current driving automation systems are not yet able to handle all traffic situations that a human driver can handle. To bridge this gap and enable mobility services without an in-vehicle human driver or fallback, remote operation (or teleoperation) is increasingly discussed. Recently, the first legal steps have been taken to enable some forms of remote operation on public roads. Remote operation encompasses a broad spectrum of methods to support a driving automation system, ranging from remote assistance, which includes providing information or releasing a maneuver, to remote driving, which includes driving the vehicle from a remote location. As such, safe implementation of remote operation in public road traffic requires collaboration among multiple academic disciplines (e.g., engineering, psychology, informatics, and law) and stakeholders (e.g., remote operation service providers, remote operators, vehicle manufacturers, and regulatory authorities). At the same time, the interdisciplinary discourse is often challenging due to differing expectations and language. To build common ground, this article traces terminology back to the original differences in information processing on both the human and the vehicle side. This framework aims to help further discourse by directly specifying what is needed to engage a diverse audience including researchers and stakeholders of different backgrounds and interests. Recently discussed forms of teleoperation are integrated into this framework.
We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global geometry. Under restricted secant inequality (RSI), a constant step-size yields linear contraction to an explicit neighborhood determined by gradient noise, quantization distortion, and network connectivity, while a diminishing step-size achieves O(1/k) convergence without shared-minimizer assumptions. Under Polyak-Lojasiewicz (PL) inequality, we obtain linear-to-neighborhood convergence in the same stochastic quantized setting. Our results match the best-known centralized stochastic rates in oracle complexity, and are supported by experiments demonstrating the predicted tradeoffs between quantization level, step-size choice, and graph structure.
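The random (unbiased) quantization assumed in the q-PDGD abstract above can be illustrated with a minimal sketch. The uniform grid, the function name `stochastic_quantize`, and the `step` parameter are assumptions made for illustration; the abstract does not specify the quantizer. The only property shown is unbiasedness: rounding up with probability equal to the fractional part makes the expected quantized value equal to the input.

```python
import math
import random


def stochastic_quantize(x, step, rng):
    """Round x to a multiple of `step`, up or down at random, so that E[q] = x.

    Hypothetical helper: a uniform-grid stochastic rounding rule, used here
    only as one example of an unbiased quantizer.
    """
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower  # fractional part, in [0, 1)
    # Round up with probability `frac`, down otherwise.
    level = lower + (1 if rng.random() < frac else 0)
    return level * step


# Empirical check of unbiasedness on a single value.
rng = random.Random(0)
samples = [stochastic_quantize(0.3, 1.0, rng) for _ in range(20000)]
mean_q = sum(samples) / len(samples)
```

Each sample is either 0.0 or 1.0, so a single transmission is coarse, but the sample mean stays close to 0.3. A coarser `step` lowers the bit cost and raises the quantization distortion, which is one of the terms the abstract says determines the convergence neighborhood.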
Spoken language, whether produced by humans or large language models (LLMs), unfolds over time with varying semantic content. However, we still lack simple, interpretable time-series features that capture how generic versus specific content is distributed over time, and that can be used to compare human and AI-generated speech. We introduce a semantic-timescale analysis pipeline that turns word-level transcripts with timestamps into semantic time-series. For each spoken narrative, we compute (i) semantic specificity using WordNet-based word depth and (ii) contextual similarity using SBERT embeddings, and we quantify their temporal dependence using autocorrelation-window measures (ACW-0 and related metrics). We then compare original speech to multiple shuffled controls that selectively disrupt lexical identity, temporal order, and word duration. Across human-read autobiographical narratives, TTS readings, and LLM-generated texts rendered with TTS, we find that segments with longer ACW-0 in the semantic time-series tend to contain more generic vocabulary, whereas segments with shorter ACW-0 are enriched in more specific words. These associations are strongly attenuated or abolished when word order and timing are randomized, indicating that ACW-based measures capture non-trivial temporal organization of semantic content beyond static lexical distributions. Our results suggest that ACW-based semantic timescales are a useful family of features for analyzing and comparing the temporal structure of human and AI-generated speech.
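As a rough illustration of the autocorrelation-window measure named in the abstract above, the sketch below computes ACW-0 under the common definition "first lag at which the autocorrelation function reaches zero or below". That definition, the evenly sampled input, and the function names are assumptions; the abstract does not state how the authors resample word-level values or which estimator they use.

```python
import math


def autocorrelation(series, lag):
    """Biased sample autocorrelation of an evenly sampled series at one lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean) for i in range(n - lag))
    return cov / var


def acw0(series):
    """First lag whose autocorrelation is <= 0, or None if there is no such lag."""
    mean = sum(series) / len(series)
    if all(v == mean for v in series):
        return None  # autocorrelation is undefined for a constant series
    for lag in range(1, len(series)):
        if autocorrelation(series, lag) <= 0.0:
            return lag
    return None


# Synthetic stand-ins for a semantic time series: slow versus fast variation.
slow = [math.sin(2 * math.pi * i / 40) for i in range(200)]
fast = [math.sin(2 * math.pi * i / 8) for i in range(200)]
slow_window, fast_window = acw0(slow), acw0(fast)
```

The slowly varying series yields the longer window, which matches the intuition behind reading ACW-0 as a timescale.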
Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.
Large Audio-Language Models (LALMs) excel at audio understanding but expose little about where in an audio signal they attend. We introduce instruction-based vector steering, which constructs a steering vector by contrasting activations from differently instructed prompts while keeping the audio fixed. Through a systematic probe of LALM attention, we find that - unlike standard prompting or audio-based steering - this intervention significantly redistributes the temporal attention allocated to audio tokens, concentrating it on acoustically relevant regions. We then show that this attention shift is behaviorally meaningful: in a controlled three-event setting, reading out the temporal position of maximal steering-induced attention change recovers the location of a queried sound event without any training, attaining 60.87% and 68.72% overlap with ground-truth intervals on Qwen2-Audio and Audio Flamingo 3, far above direct prompting (31.84%, 46.75%) and random baselines (27.74%). Our results characterize a mechanistic property of instruction-based steering in LALMs and provide a training-free probe for the latent temporal structure these models encode.
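The steering-vector construction described in the abstract above (contrasting activations from differently instructed prompts with the audio held fixed) can be sketched as follows. The mean-difference form, the additive injection with a scale `alpha`, and all function names are assumptions for illustration; the abstract does not give the exact formula, the layer, or the scale used.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(acts_instructed, acts_baseline):
    """Difference of mean activations between two instruction conditions."""
    target = mean_vector(acts_instructed)
    base = mean_vector(acts_baseline)
    return [t - b for t, b in zip(target, base)]


def apply_steering(hidden, vector, alpha):
    """Add the scaled steering vector to one hidden state."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy 2-dimensional activations standing in for real hidden states.
acts_instructed = [[1.0, 2.0], [3.0, 4.0]]
acts_baseline = [[0.0, 1.0], [2.0, 1.0]]
vec = steering_vector(acts_instructed, acts_baseline)
steered = apply_steering([0.5, 0.5], vec, alpha=2.0)
```

In this toy case `vec` is `[1.0, 2.0]` and `steered` is `[2.5, 4.5]`. In a real model the same arithmetic would be applied to hidden states at a chosen layer, which is a design decision this sketch leaves open.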
In this paper, we study Mahalanobis-guided latent out-of-distribution (OOD) detection for test-time RL controller switching in nonlinear time-varying systems. RL controllers can quickly control high-dimensional systems within the training distribution, but their performance can degrade when time-varying dynamics produce unseen observations. We consider a combined ES--DRL controller, where RL provides fast in-distribution actions and bounded extremum seeking (ES) provides robust model-independent control under OOD operation. The key challenge is deciding when to switch. We train a variational autoencoder (VAE) on in-distribution beam-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time. This OOD decision sets a binary switch that selects either the RL controller or the ES controller. We evaluate the approach in safety-critical particle accelerator control. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller.
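The switching rule in the abstract above (Mahalanobis distance in the VAE latent space driving a binary RL/ES switch) can be sketched in a few lines. The latent vectors here are toy values rather than VAE outputs, and the threshold, the function names, and the use of a single Gaussian fitted to in-distribution latents are assumptions for illustration.

```python
import math


def fit_gaussian(latents):
    """Sample mean and covariance of in-distribution latent vectors."""
    n, d = len(latents), len(latents[0])
    mean = [sum(z[k] for z in latents) / n for k in range(d)]
    cov = [[sum((z[i] - mean[i]) * (z[j] - mean[j]) for z in latents) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes `a` is non-singular; a singular covariance would need
    regularisation, which is omitted here.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def mahalanobis(z, mean, cov):
    """sqrt((z - mean)^T cov^{-1} (z - mean)), without forming the inverse."""
    diff = [a - b for a, b in zip(z, mean)]
    return math.sqrt(sum(d * y for d, y in zip(diff, solve(cov, diff))))


def select_controller(z, mean, cov, threshold):
    """Binary switch: ES when the latent looks OOD, RL otherwise."""
    return "ES" if mahalanobis(z, mean, cov) > threshold else "RL"


train_latents = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
mean, cov = fit_gaussian(train_latents)
in_dist = select_controller([0.5, 0.5], mean, cov, threshold=3.0)
out_dist = select_controller([5.0, 5.0], mean, cov, threshold=3.0)
```

The threshold of 3.0 is arbitrary. In practice it would be chosen from the distribution of distances on held-out in-distribution data.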
Indoor localization from wireless measurements remains challenging in large-scale deployments due to substantial variation in building geometry, the set of detectable access points (APs), and the heterogeneity of received signals. Existing learning-based methods often perform well only in limited settings and degrade under environmental shifts, making robust anchor-free localization across diverse indoor environments notoriously difficult. In this paper, we present OmniLoc, an environment-interactive foundation model for anchor-free user equipment localization across diverse indoor environments. To the best of our knowledge, OmniLoc is the first foundation-model-based approach built directly on wireless measurements for this task. OmniLoc is built on three key designs. First, a unified input tokenization module converts heterogeneous wireless measurements into a common representation that is more amenable to learning. Second, a geometry-aware Transformer performs AP-aware feature extraction by emphasizing dominant APs while aggregating complementary evidence from supporting APs. Third, a geometry-aware location estimation module conditions regression on geometric embeddings to produce geometrically consistent location predictions. We evaluate OmniLoc on both a large-scale in-house dataset and a public benchmark dataset. Results show that OmniLoc significantly outperforms existing methods, consistently improves existing backbones when its design components are integrated, and demonstrates strong generalization in cross-environment evaluations.
The Circular Electron Positron Collider (CEPC) distributes a reference clock to 192 control nodes along its 100~km underground tunnel. The required synchronization precision is 30~ps (standard deviation). We present an enhanced White Rabbit (WR)-based clock synchronization system designed to meet this requirement. A noise-budget analysis of the standard WR slave loop identifies the analog actuation chain (DAC + VCXO + multiplier PLL) and restart-induced timing uncertainty as the dominant limitations. In our redesigned node, the DAC+VCXO chain is replaced by a Si5345A DSPLL clock generator with DCO-based phase control, removing the board-level analog tuning stage. GTX transceiver phase alignment and manual byte-alignment fixing reduce restart uncertainty from 88.8~ps to 12~ps peak-to-peak. For multi-node operation, we introduce a cascaded global-control architecture with a PC-side PID auto-tuned by TD3 reinforcement learning and an on-chip-temperature feed-forward calibrated to $-0.76\,\mathrm{ps}/^\circ\mathrm{C}$. The measured point-to-point synchronization precision is 3.38~ps over 1~m fiber and 3.92~ps over 50~km. In a 12-level cascade, the end-node precision reaches 6.66~ps at constant temperature and 7.30~ps under a 13$\,^\circ$C temperature swing. Synchronized-clock TIE jitter stays below 1~ps regardless of cascade depth. Restart uncertainty is 2.82~ps (std.\ dev.). A 4-level cascade operated stably for 25 hours of continuous monitoring. All measured metrics fall well within the CEPC 30~ps budget.
This paper presents a novel data-free and training-free compression approach for speech foundation models using channelwise clustering via k-means. Finer-grained mixed-sparsity pruning, obtained by varying the number of parameter clusters per layer, is also explored. Experiments conducted on the LibriSpeech dataset suggest that when operating with pruning sparsity of 50% on HuBERT-large, consistent WER reductions of 27.73%/18.61% absolute (34.37%/21.91% relative) over magnitude-based pruning were obtained on the test-clean and test-other subsets before fine-tuning and 0.19%/0.79% absolute (3.36%/4.62% relative) after fine-tuning with only 3 epochs. Similar WER reductions of 2.86%/5.02% absolute (59.21%/55.29% relative) were observed against magnitude-based pruning on Whisper-large-v3 at 10% sparsity, all with no significant WER increase relative to the uncompressed baseline.
Face recognition systems have advanced significantly through deep learning techniques, delivering high performance and robustness in complex scenarios. However, these approaches incur substantial computational overhead, limiting their in situ applicability in resource-constrained platforms such as drones, where they can address challenges including non-frontal facial imagery. Memristor-based neuromorphic systems have emerged as a compelling approach for edge AI applications, combining biologically inspired processing with efficient and scalable computation. In this work, we propose a facial recognition framework that addresses non-frontal pose variations by integrating lightweight generative adversarial network (GAN)-based pose frontalisation with memristor-based neuromorphic recognition. The experimental results on two datasets demonstrate the effectiveness of combining adversarial learning with memristive technology, achieving up to 96% identification accuracy. The proposed approach alleviates the computational bottlenecks of conventional AI and offers a scalable, efficient solution for face recognition in dynamic real-world environments.
While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field -- the fundamental physical link governing the nonlinear and ill-posed ``soft-field'' effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.
Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
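The multimodal fusion strategy in the abstract above combines text- and image-based similarity scores. A minimal sketch of one such combination is given below, assuming cosine similarity on each modality and a weighted sum with a hypothetical `weight` parameter; the abstract does not state the actual fusion rule, and the embeddings here are toy vectors rather than outputs of an MLLM, a sentence transformer, or a visual feature extractor.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_gallery(query_img, query_txt, gallery, weight):
    """Rank gallery ids by weight * visual score + (1 - weight) * text score."""
    scored = []
    for item_id, (img_emb, txt_emb) in gallery.items():
        score = (weight * cosine(query_img, img_emb)
                 + (1.0 - weight) * cosine(query_txt, txt_emb))
        scored.append((score, item_id))
    scored.sort(reverse=True)
    return [item_id for _, item_id in scored]


# Toy gallery: item "A" looks slightly closer visually, item "B" matches the
# textual description.
gallery = {
    "A": ([1.0, 0.0], [1.0, 0.0]),
    "B": ([0.9, 0.1], [0.0, 1.0]),
}
query_img, query_txt = [1.0, 0.0], [0.0, 1.0]
visual_only = rank_gallery(query_img, query_txt, gallery, weight=1.0)
fused = rank_gallery(query_img, query_txt, gallery, weight=0.5)
```

With visual scores alone, "A" ranks first. Adding the text score moves "B" to the top, the kind of correction the abstract reports when visual information is limited or noisy.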
Accurate identification of hydrodynamic derivatives is essential for control and navigation of Unmanned Surface Vehicles (USVs), but high-fidelity manoeuvring data from physical sea trials are constrained by cost and safety. Turning Circle (TC) and Zig-Zag (ZZ) trials remain fundamental to IMO and ITTC assessment procedures. This paper extends the Marine Robotics Unity Simulator (MARUS) by introducing a standardised Virtual Sea Trial framework for automated execution and data generation of TC/ZZ manoeuvres, with traceable command-actuation logging, system-identification (SI)-focused data conditioning, and automated extraction of IMO/ITTC-aligned manoeuvring metrics. A key contribution is a dedicated TC/ZZ data acquisition and post-processing pipeline, improving the repeatability and auditability of simulator-based manoeuvres while producing SI-ready datasets for hydrodynamic-derivative identification and digital-twin workflows. Another feature is explicit command-execution separation for differential-thrust steering, where inputs are recorded as ordered rudder-equivalent commands and realised actuation is logged as an execution-level proxy derived from applied thrust. Case-study results demonstrate repeatable and compliant manoeuvre behaviour. For TC tests, the normalised advance differs by approximately 3.9 percent between port and starboard sides, while the tactical diameter differs by approximately 4.6 to 4.7 percent. For ZZ tests, first and second overshoot excesses remain below 1 degree for both +/- 10 degree and +/- 20 degree manoeuvres, satisfying IMO criteria, while peak yaw rates range from approximately 4.1 to 5.8 deg/s. Overall, the framework provides a repeatable and auditable virtual sea-trial workflow for generating IMO/ITTC-aligned datasets and supporting system identification, hydrodynamic-derivative estimation, and digital-twin calibration.
Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at this https URL
The global energy landscape is undergoing a transformative shift towards renewable energy and advanced storage solutions, driven by the urgent need for sustainable and resilient power systems. Isolated offshore communities, such as islands and offshore platforms, which traditionally rely on mainland grids or diesel generators, stand to gain significantly from renewable energy integration. Promising offshore renewable technologies include wind turbines, wave and tidal energy converters, and floating photovoltaic systems, paired with a storage solution like battery energy storage systems. This paper introduces a renewable energy microgrid optimizer (REMO), a tool designed to identify the optimal sizes of renewable generation and storage resources for offshore microgrids. A key challenge in such models is accurately accounting for battery degradation costs. To address this, the REMO model integrates a deep neural network-based battery degradation (DNN-BD) module, which factors in variables like ambient temperature, charge/discharge rates, state of charge, depth of discharge and battery health. Simulations on six test regions demonstrate that the REMO-DNN-BD approach minimizes lifetime energy costs while maintaining high reliability and sustainability, making it a viable design solution for offshore microgrid systems.
Wind turbine vibration monitoring under variable speed operation requires separating nonstationary rotor-order components whose frequencies and operating intervals depend on operating state. These components can occupy local support regions in the short-time Fourier transform (STFT) plane rather than fixed spectral bands or continuous ridges. This study presents time-frequency mode decomposition (TFMD), a segmentation-based method that estimates connected STFT support regions and reconstructs one mode from each region. TFMD selects STFT coefficients with high magnitude, groups them by connected component labeling, filters small regions, expands retained support regions with mask dilation and conflict resolution, and reconstructs modes by inverse STFT. In a synthetic response with six operating states, TFMD separates the components of each state and produces low reconstruction error without specifying the number of components in advance. In a controlled wind turbine blade strain experiment, the first decomposition reconstructs nine modes whose peak frequencies lie near the nominal once per revolution frequencies and whose energies are concentrated in the corresponding operating intervals. Residual decomposition further reveals weaker harmonic structure. These results support TFMD as a practical candidate for vibration analysis under variable speed operation, while offshore field use requires validation under environmental loading and with measured operating references.
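The support-estimation steps of TFMD listed in the abstract above (select high-magnitude STFT coefficients, group them by connected component labeling, filter small regions) can be sketched on a small magnitude grid. The fixed threshold, the 4-connectivity, the `min_size` parameter, and the function name are assumptions; mask dilation, conflict resolution, and the inverse STFT are omitted, and the grid is toy data rather than an actual STFT.

```python
from collections import deque


def support_regions(mag, threshold, min_size):
    """Return connected regions of cells with magnitude >= threshold.

    Regions are 4-connected lists of (row, col) cells; regions with fewer
    than `min_size` cells are discarded.
    """
    rows, cols = len(mag), len(mag[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or mag[r][c] < threshold:
                continue
            # Breadth-first flood fill from this unvisited high-magnitude cell.
            seen[r][c] = True
            queue = deque([(r, c)])
            region = []
            while queue:
                i, j = queue.popleft()
                region.append((i, j))
                for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ni, nj = i + di, j + dj
                    if (0 <= ni < rows and 0 <= nj < cols
                            and not seen[ni][nj] and mag[ni][nj] >= threshold):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(region) >= min_size:
                regions.append(sorted(region))
    return regions


# Rows stand for frequency bins, columns for time frames (toy values).
mag = [
    [0.9, 0.8, 0.0, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0, 0.6],
    [0.0, 0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.8, 0.9],
    [0.0, 0.0, 0.0, 0.7, 0.0],
]
regions = support_regions(mag, threshold=0.5, min_size=2)
```

Two regions survive and the two isolated cells are filtered out. Each retained region would then serve as a mask for reconstructing one mode, and the number of modes follows from the data rather than being fixed in advance.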
The transition to a modern and efficient future grid relies on the seamless coordination of distributed energy resources and applications such as Demand Response (DR). While this transformation enables greater flexibility, it increases grid complexity and decentralization, requiring the effective coordination of millions of hardware assets and software agents. Realizing this vision demands advances in interoperability to ensure these heterogeneous systems can communicate without prohibitive customization costs. Semantic interoperability aims to address this by leveraging ontologies to guarantee the unambiguous interpretation of exchanged data. However, current ontologies in the commercial building and DR domains face two critical limitations. First, existing ontologies are often developed without a formal framework that reflects real-world DR requirements. Second, proposals for integrating general and DR-specific ontologies remain mostly conceptual, lacking formalization or empirical validation. In this paper, we begin to address these gaps by applying a formal ontology evaluation/development approach to define the information requirements (IRs) necessary for semantic interoperability, focusing on incentive-based DR programs for commercial buildings in the United States as a starting point. We identify the IRs associated with each stage of incentive-based DR. Using these IRs, we evaluate how well existing ontologies, specifically Brick, DELTA, EFOnt, and CIM, support the operational needs of DR participation. Our findings reveal substantial gaps between current ontologies and practical DR requirements, and we propose a roadmap of necessary extensions and integrations for these ontologies. This work ultimately aims to enhance the interoperability of today's and future smart grids, thereby facilitating scalable integration of DR systems into the grid's complex operational framework.
Wi-Fi-based human activity recognition (HAR) provides substantial convenience and has emerged as a thriving research field, yet the coarse spatial resolution inherent to Wi-Fi significantly hinders its ability to distinguish multiple subjects. By exploiting the near-field domination effect, establishing a dedicated sensing link for each subject through their personal Wi-Fi device offers a promising solution for multi-person HAR under native traffic. However, due to the subject-specific characteristics and irregular patterns of near-field signals, HAR neural network models require fine-tuning (FT) for cross-domain adaptation, which becomes particularly challenging when certain activity categories are unavailable. In this paper, we propose WiAnchor, a novel training framework for efficient cross-domain adaptation in the presence of incomplete activity categories. This framework processes Wi-Fi signals embedded with irregular time information in three steps: during pre-training, we enlarge inter-class feature margins to enhance the separability of activities; in the FT stage, we introduce an anchor matching mechanism for cross-domain adaptation, filtering subject-specific interference informed by incomplete activity categories, rather than attempting to extract complete features from them; finally, the recognition of input samples is further improved based on their feature-level similarity with anchors. We construct a comprehensive dataset to thoroughly evaluate WiAnchor, achieving over 90% cross-domain accuracy with absent activity categories.
We introduce LibriConvo, a synthetic conversational speech corpus for speaker diarization and automatic speech recognition (ASR), built by instantiating the previously proposed Speaker-Aware Simulated Conversation (SASC) framework in a dataset and benchmarking setting. The main contribution of this paper is a corpus construction pipeline and benchmark derived from that framework. To make the data more suitable for downstream ASR and diarization, conversational timing statistics are estimated from English CallHome using external voice activity detection, long pauses are compressed, LibriTTS utterances are grouped by book to improve local semantic continuity, and room impulse responses are selected with a spatial-plausibility heuristic. The resulting corpus contains 240.1 hours of audio across 1,496 dialogues involving 830 speakers, partitioned into speaker-disjoint train, validation, and test splits. We report baseline results for both diarization and ASR. On the test split, Sortformer outperforms the pyannote pipeline in diarization (11.1\% vs.~24.4\% DER). For ASR, a Fast Conformer-CTC XLarge model fine-tuned with Serialized Output Training achieves 7.29\% WER and 6.97\% cpWER, outperforming zero-shot Whisper-large-v3. These results position LibriConvo as a practical benchmark for studying synthetic conversational speech and for evaluating multi-speaker speech processing systems.
Photoplethysmography (PPG) is widely used as a non-invasive and accessible modality for continuous health monitoring. However, despite being a peripheral hemodynamic signal intrinsically coupled with systemic circulation, existing research has largely confined its scope to a narrow range of cardiovascular tasks, leaving a fundamental question underexplored: to what extent can PPG support holistic health profiling beyond traditional cardiovascular applications? To answer this question, we present AnyPPG, a foundation model-based framework designed to reveal the broader health-profiling potential of PPG. To ensure reliable performance for this investigation, AnyPPG is pretrained with ECG guidance on the most diverse PPG corpus with synchronized ECG to date, comprising over 100,000 hours of recordings from six large-scale data sources. This pretraining yields robust and physiologically grounded PPG representations that provide a reliable basis for subsequent analysis. Building upon this pretrained model, we conduct a systematic investigation into the association between PPG and holistic health through, to our knowledge, the first PPG-based phenome-wide disease detection study, spanning 1,468 disease phenotypes in more than 15,000 subjects. Our evaluation demonstrates the effectiveness of AnyPPG: across eight clinical and wearable datasets covering 15 downstream tasks, it achieves the best performance in 13 tasks. More importantly, in the phenome-wide analysis, AnyPPG exhibits meaningful discriminative capability (AUC $\ge$ 0.70) for 307 phenotypes across 16 distinct phecode chapters, including 230 non-circulatory conditions such as dementia and chronic kidney disease, many of which have rarely been explored using PPG. Collectively, these findings indicate that easily acquired PPG signals encode rich health-related information extending well beyond conventional cardiovascular assessment.
The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean $R^2 = 0.99 \pm 0.01$). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.
This paper tackles the problem of estimating the relative position, orientation, and velocity between a UAV and a planar platform undergoing arbitrary 3D motion during approach and landing. The estimation relies on measurements from Inertial Measurement Units (IMUs) mounted on both systems, assuming there is a suitable communication channel to exchange data, together with visual information provided by an onboard monocular camera, from which the bearing (line-of-sight direction) to the platform's center and the normal vector of its planar surface are extracted. We propose a cascade observer with a complementary filter on $\mathbf{SO}(3)$ to reconstruct the relative attitude, followed by a linear Riccati observer for relative position and velocity estimation. Convergence of both observers is established under persistently exciting conditions, and the cascade is shown to be almost globally asymptotically and locally exponentially stable. We further extend the design to the case where the platform's rotation is restricted to its normal axis and show that its measured linear acceleration can be exploited to recover the remaining unobservable rotation angle. A sufficient condition for local exponential convergence in this setting is provided. The proposed observers are validated through extensive simulations.
The problem of maintaining the output of a positive time-invariant single-input single-output system within a predefined corridor of values is treated. For third-order plants possessing a certain structure, it is proven that the problem is always solvable under stationary conditions by means of pulse-modulated feedback. The obtained result is utilized to assess the feasibility of patient-specific pharmacokinetic-pharmacodynamic models with respect to patient safety. A population of Wiener models capturing the dynamics of a neuromuscular blockade agent is studied to investigate whether or not they can be driven into the desired output corridor by clinically acceptable sequential drug doses (boluses). It is demonstrated that low values of a parameter in the nonlinear pharmacodynamic part lie behind the detected model infeasibility.
Gyroscopic interconnections enable redistribution of energy among degrees of freedom while preserving passivity and total energy, and they play a central role in controlled Lagrangian methods and IDA-PBC. Yet their quantitative effect on transient energy exchange and subsystem performance is not well characterised. We study a conservative mechanical system with constant skew-symmetric velocity coupling. Its dynamics are integrable and evolve on invariant two-tori, whose projections onto subsystem phase planes provide a geometric description of energy exchange. When the ratio of normal-mode frequencies is rational, these projections become closed resonant Lissajous curves, enabling structured analysis of subsystem trajectories. To quantify subsystem behaviour, we introduce the inscribed-radius metric: the radius of the largest origin-centred circle contained in a projected trajectory. This gives a lower bound on attainable subsystem energy and acts as an internal performance measure. We derive resonance conditions and develop an efficient method to compute or certify the inscribed radius without time-domain simulation. Our results show that low-order resonances can strongly restrict energy depletion through phase-locking, whereas high-order resonances recover conservative bounds. These insights lead to an explicit interconnection-shaping design framework for both energy absorption and containment control strategies, while taking responsiveness into account.
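To make the inscribed-radius metric concrete, the sketch below estimates it by brute-force sampling, which is exactly what the abstract's method avoids; it is shown only to illustrate the definition. The projected trajectory is a hypothetical stand-in (the sum of two rotating phasors with amplitudes `a1`, `a2` and frequencies `w1`, `w2`), not the system studied in the paper, and the metric is read here as the minimum distance from the origin to the closed curve.

```python
import math


def inscribed_radius(a1, a2, w1, w2, phase=0.0,
                     period=2.0 * math.pi, samples=20000):
    """Minimum distance from the origin to a sampled closed curve.

    Hypothetical curve (an assumption, not the paper's model):
        x(t) = a1*cos(w1*t) + a2*cos(w2*t + phase)
        y(t) = a1*sin(w1*t) + a2*sin(w2*t + phase)
    """
    best = math.inf
    for k in range(samples):
        t = period * k / samples
        x = a1 * math.cos(w1 * t) + a2 * math.cos(w2 * t + phase)
        y = a1 * math.sin(w1 * t) + a2 * math.sin(w2 * t + phase)
        best = min(best, math.hypot(x, y))
    return best
```

For this stand-in curve with distinct frequencies the minimum is `abs(a1 - a2)`, reached when the two phasors point in opposite directions, so the sampled estimate can be checked against a closed form.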
Identifying the parameters of nonlinear state-space models from input-output data typically requires solving a highly non-convex optimization problem, which is prone to slow convergence and suboptimal local solutions. This work improves the reliability and efficiency of the estimation process by decomposing the overall optimization problem into a sequence of tractable subproblems. Starting from a linear baseline model, nonlinear residual dynamics are first estimated using a guided residual search (GRS) and subsequently refined through multiple-shooting optimization. Experiments on two benchmarks show competitive performance with state-of-the-art black-box methods and improved convergence over naive initialization.
We propose Relativistic Adversarial Feedback (RAF), a novel training objective for GAN vocoders that improves in-domain fidelity and generalization to unseen scenarios. Although modern GAN vocoders employ advanced architectures, their training objectives often fail to promote generalizable representations. RAF addresses this problem by leveraging speech self-supervised learning models to assist discriminators in evaluating sample quality, encouraging the generator to learn richer representations. Furthermore, we utilize relativistic pairing for real and fake waveforms to improve the modeling of the training data distribution. Experiments across multiple datasets show consistent gains in both objective and subjective metrics on GAN-based vocoders. Importantly, the RAF-trained BigVGAN-base outperforms the LSGAN-trained BigVGAN in perceptual quality using only 12\% of the parameters. Comparative studies further confirm the effectiveness of RAF as a training framework for GAN vocoders.
The global transition towards renewable energy has accelerated the deployment of utility-scale wind farms, increasing the need for accurate performance and economic assessments. Although wind energy offers substantial potential for carbon emission reduction, investment decisions are highly sensitive to predicted annual energy production and economic profitability. Conventional wind farm analyses often estimate turbine power output based solely on incoming wind conditions, neglecting wake interactions between turbines. These wake effects can significantly reduce downstream turbine performance, leading to overestimation of energy yield and financial returns. This study proposes WAKE-NET, a 3D wake-aware optimization framework that integrates turbine layout optimization, turbine capacity selection, cable routing, and hub height diversification within a unified profit-driven formulation. Unlike traditional approaches that assume a uniform hub height and turbine capacity or ignore wake dynamics, the proposed framework accounts for wake-induced power losses during optimization. A benchmark wake-ignorant model is also evaluated to quantify the impact of neglecting wake interactions. Results indicate that the wake-ignorant optimization can significantly overestimate annual profits, while the use of multiple hub heights and capacities reduces wake overlap and improves spatial utilization. Overall, the findings demonstrate that wake-aware optimization coupled with hub height and capacity diversification provides more reliable energy yield prediction and economic assessment, offering valuable guidance for large-scale wind farm planning and investment.
Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods learn measurement weighting through the solver, but they still use position-only objectives. As a result, the position estimate may improve while the reported covariance remains too small, too large, or incorrectly oriented. We propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. A Weighting Generation Network (WGN) predicts per-satellite reliability weights, and a differentiable Gauss-Newton solver maps these weights to a position estimate and a Hessian-derived posterior covariance. We use proper scoring rules to supervise the East-North predictive distribution end to end. We study negative log-likelihood (NLL), the energy score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in covariance credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes; on the deep-urban scene, both the mean horizontal error and the 95th-percentile error improve. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77 m to 11.68 m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05 relative to DFGO (MAE). Case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
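The role of negative log-likelihood as a proper scoring rule for covariance credibility can be sketched as follows. This is a generic two-dimensional Gaussian NLL, not the authors' training code; the function name and the East-North tuple layout are assumptions for illustration.

```python
import math


def gaussian_nll_2d(error, cov):
    """NLL of a 2D (east, north) error under a zero-mean Gaussian.

    `cov` is a 2x2 covariance given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0.0:
        raise ValueError("covariance must be positive definite")
    e, n = error
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (d * e * e - (b + c) * e * n + a * n * n) / det
    return 0.5 * (maha + math.log(det)) + math.log(2.0 * math.pi)


# Same 3 m east error, scored under an overconfident and a wider covariance.
overconfident = gaussian_nll_2d((3.0, 0.0), ((0.1, 0.0), (0.0, 0.1)))
calibrated = gaussian_nll_2d((3.0, 0.0), ((9.0, 0.0), (0.0, 9.0)))
```

With the position error held fixed, the overconfident covariance receives a much larger NLL, which is why a position-only objective cannot distinguish the two cases while a scoring rule can.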
Understanding when linear immersions of nonlinear dynamical systems exist is important since such immersions allow us to leverage the rich tools of linear system theory to analyze nonlinear dynamics. Recently, Liu et al. (2023) showed that continuous-time dynamical systems that admit countably many but more than one omega-limit sets cannot be immersed into finite dimensional linear systems with a one-to-one and continuous mapping. In this paper, we extend these results to discrete-time dynamics and show that similar obstructions exist also in discrete time. We further consider a generalization involving alpha-limit sets. Several examples are provided to demonstrate the results.
Reachability analysis plays a central role in low-thrust spacecraft trajectory optimization by identifying which target states can be achieved under constraints on time, thrust, and propellant. Classical approaches construct reachable sets by solving many optimal control problems over grids of terminal states, requiring extensive forward simulations with fixed initial conditions. While effective, this approach is computationally expensive and becomes impractical for high-dimensional systems or strongly nonlinear dynamics, such as those encountered in cislunar environments or solar sail missions. This work introduces a dual formulation of the reachability problem. Instead of computing reachable sets directly, we determine, for fixed transfer time and boundary conditions, the maximum allowable initial mass (or, for solar sails, a scalar sail-strength parameter) that permits a successful transfer. A target is reachable if the spacecraft's initial mass does not exceed this threshold. This reformulation reduces reachability assessment to a scalar optimization problem for each target, producing a smooth scalar field that encodes equivalent feasibility information to classical reachable sets. We develop indirect maximum-initial-mass (MIM) formulations for both electric low-thrust and solar-sail dynamics and show how they can serve as efficient reachability oracles. Building on this formulation, we construct data-driven surrogate models to approximate the MIM-based reachability indicator. We investigate fully connected neural networks and demonstrate that residual networks provide the best trade-off between accuracy, training stability, and model complexity. The resulting surrogates enable rapid reachability evaluation while preserving the numerical advantages of the dual formulation, offering a practical tool for preliminary mission design and feasibility assessment.
This paper tackles the optimization of the point spread function (PSF) of unmanned aerial vehicle (UAV)-borne multiple-input multiple-output (MIMO) synthetic aperture radar (SAR) tomography systems. A swarm of UAV-borne SAR systems is deployed to image an area to obtain its height profile. To achieve a high-quality three-dimensional (3D) image of the scene, the PSF has to exhibit low sidelobes. The heavy computations required for image generation are performed on the ground. To this end, the sensor data collected by the UAV-SARs is offloaded in real time via a frequency division multiple access (FDMA) air-to-ground backhaul link. In this work, the UAV formation and the power allocated for offloading are jointly optimized to minimize the PSF sidelobe levels. For this purpose, we propose a novel solution based on the particle swarm optimization (PSO) algorithm, which meets practical sensing and communication constraints. Our simulation results demonstrate that the proposed solution can significantly improve sidelobe suppression compared to several benchmark schemes.
Self-supervised speech representation learning has made significant progress through Siamese networks, which leverage different views of the same input. However, existing methods often require frame-wise alignment between these views, overlooking the broader linguistic context invariance across different speaking styles. We introduce SiamCTC, a framework that integrates Siamese networks with Connectionist Temporal Classification (CTC) to learn speech representations without strict frame-level correspondence. By employing CTC loss to establish flexible, monotonic alignments between differing temporal realizations of the same content, SiamCTC accommodates speed perturbations and other temporal augmentations. This design relaxes frame-wise constraints while preserving temporal coherence and enhancing robustness to speaking-rate variations in downstream tasks. Our experiments demonstrate that SiamCTC leads to more adaptable speech representations, particularly at diverse speaking rates.
While Audio Language Models (ALMs) demonstrate strong semantic understanding, they struggle with complex affective interactions. Specifically, textual semantic dominance often overshadows acoustic nuances, and a lack of cognitive depth leads to generic, emotion-agnostic responses. We propose CogAudio-LLM\footnote{this https URL}, a novel cognitive affective reasoning framework. To mitigate semantic dominance, we build LIME-440K, a ``lexically-identical, multi-emotion'' dataset designed to facilitate acoustic-semantic decoupling. We introduce EIPS, a 4-step Chain-of-Thought (CoT) mechanism incorporating psychological reasoning. For inference efficiency, multi-stage training explicitly establishes EIPS via supervised fine-tuning, then distills this logic into an implicit generation process. Finally, we design DR-SAPO (Dual-Route Soft Adaptive Policy Optimization) to dynamically balance the logical rigor of the CoT with the empathetic quality of the direct response.
Channel estimation in vehicular communication is a crucial element in the advancement of intelligent transportation systems. However, the use of pilot signals in the IEEE 802.11p standard is insufficient for accurate channel estimation in high-mobility scenarios. Data pilot-aided (DPA) estimation helps address this, but suffers from demapping errors. We propose a simplified Temporal Convolutional Network-based estimator (DPA-TCN) trained on a mixed signal-to-noise ratio dataset to improve estimation performance and reduce computational complexity. Our DPA-TCN estimator achieves a bit error rate comparable to a state-of-the-art long short-term memory network with DPA and temporal averaging (LSTM-DPA-TA) while reducing the complexity of the model by approximately 65%.
Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate clinicians' multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. The two streams are fused via concatenation, gated fusion, or bidirectional cross-modal attention. Across nine experimental runs on a balanced 7,200-image dataset, all multimodal configurations outperform unimodal baselines, with gated fusion achieving the best accuracy of 96.13%.
Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize to real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three domain sampling strategies (domain-simultaneous, domain-alternating, and domain-by-domain) for multi-to-single domain generalization evaluation. We also propose an algorithm, named ``MM-IDGM'', that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.
The capture of flying MAVs (micro aerial vehicles) has garnered increasing research attention due to its intriguing challenges and promising applications. Despite recent advancements, a key limitation of existing work is that capture strategies are often relatively simple and constrained by platform performance. This paper addresses control strategies capable of capturing high-maneuverability targets. The unique challenge of achieving target capture under unstable conditions distinguishes this task from traditional pursuit-evasion and guidance problems. In this study, we transition from larger MAV platforms to a specially designed, compact capture MAV equipped with a custom launching device while maintaining high maneuverability. We explore both time-optimal planning (TOP) and reinforcement learning (RL) methods. Simulations demonstrate that TOP offers highly maneuverable and shorter trajectories, while RL excels in real-time adaptability and stability. Moreover, the RL method has been tested in real-world scenarios, successfully achieving target capture even in unstable states.
Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM) through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a benchmark based on MNIST-C, and a challenging brain MRI lesion detection task. Unlike most methods, which focus on large, hyperintense lesions at the image level, our approach succeeds in targeting small, non-hyperintense lesions and is evaluated with voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate the performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at this https URL.
We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning a dual variable, a process that is computationally expensive and provides no provable safety guarantee for a fixed dual variable, which can therefore be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the Lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM responses, proving at least 5 times more efficient against nominal and jailbreaking prompts.
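The contrast between a Lagrangian term and a rectified penalty can be sketched on scalar scores. The function names, the `threshold` convention (a response counts as safe when its cost is at or below the threshold), and the numeric values are assumptions for illustration; the abstract does not specify the exact objective used in CS-RLHF.

```python
def lagrangian_objective(reward, cost, threshold, dual):
    """Reward minus a dual-weighted constraint term (signed)."""
    return reward - dual * (cost - threshold)


def rectified_penalty_objective(reward, cost, threshold, penalty):
    """Reward minus a penalty that is active only when the constraint is violated."""
    return reward - penalty * max(0.0, cost - threshold)


# A safe response (cost below threshold) and an unsafe one, same reward.
safe_lagrangian = lagrangian_objective(1.0, -0.5, 0.0, 10.0)
safe_rectified = rectified_penalty_objective(1.0, -0.5, 0.0, 10.0)
unsafe_rectified = rectified_penalty_objective(1.0, 0.2, 0.0, 10.0)
```

In this sketch the signed Lagrangian term gives extra credit for being far below the threshold, whereas the rectified penalty leaves safe responses untouched and only penalises violations, which is the property exact-penalty arguments rely on.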
This article introduces a novel cryptographic paradigm based on nonderived polyadic algebraic structures. Traditional cryptosystems rely on binary operations within groups, rings, or fields, whose well-understood properties can be exploited in cryptanalysis. To overcome these vulnerabilities, we propose a shift to polyadic rings, which generalize classical rings by allowing operations of higher arity: an $m$-ary addition and an $n$-ary multiplication. The foundation of our approach is the construction of polyadic integers -- congruence classes of ordinary integers endowed with such $m$-ary and $n$-ary operations. A key innovation is the parameter-to-arity mapping $\Phi(a,b)=(m,n)$, which links the parameters $(a,b)$ defining a congruence class to the specific arities required for algebraic closure. This mapping is mathematically intricate: it is non-injective, non-surjective, and multivalued. This complex, non-unique relationship forms the core of the proposed cryptosystem's security. We present two concrete encryption procedures that leverage this structure by encoding plaintext within the parameters of polyadic rings and transmitting information via polyadically quantized analog signals. In one method, plaintext is linked to the additive arity $m_{i}$ and secured using the summation of such signals; in the other, it is linked to a ring parameter $a_{i}$ and secured using their multiplication. In both cases, the "quantized" nature of polyadic operations generates systems of equations that are straightforward for a legitimate recipient with the correct key but exceptionally difficult for an attacker without it. The resulting framework promises a substantial increase in cryptographic security. This work establishes the theoretical foundation for this new class of encryption schemes and highlights their potential for constructing robust, next-generation cryptographic protocols.
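The parameter-to-arity mapping can be illustrated with a small search. The sketch assumes the congruence class is $\{a + bk : k \in \mathbb{Z}\}$ and takes closure literally: a sum of $m$ elements stays in the class when $(m-1)a \equiv 0 \pmod b$, and a product of $n$ elements stays in the class when $a^n \equiv a \pmod b$. The function name, the restriction to minimal arities, and the search bound `max_arity` are assumptions made here, not the paper's construction.

```python
def minimal_arities(a, b, max_arity=64):
    """Smallest (m, n), each >= 2, for which the class {a + b*k} is closed.

    m-ary addition:       (m - 1) * a % b == 0
    n-ary multiplication: a**n % b == a % b
    Returns None in a slot when no arity up to max_arity works.
    """
    m = next((m for m in range(2, max_arity + 1)
              if ((m - 1) * a) % b == 0), None)
    n = next((n for n in range(2, max_arity + 1)
              if pow(a, n, b) == a % b), None)
    return m, n


odd_integers = minimal_arities(1, 2)   # ternary addition, binary multiplication
same_arities = minimal_arities(3, 6)   # a different class with the same arities
no_product = minimal_arities(2, 4)     # no closed multiplication is found
```

Under these assumptions two different parameter pairs map to the same arities, and some pairs admit no multiplication at all, which mirrors the non-injective and non-surjective behaviour described above.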
Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty that distinguishes online RL from standard generative modeling is the lack of direct samples from the target Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. However, it remains unclear how these objectives are formally related, or whether they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that share the same expectation. We show that existing noise-expectation and gradient-expectation methods are simply two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and it enables the principled combination of Q-value and Q-gradient information to form an effective estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.
Imperceptible text-based speech editing modifies spoken content through transcript manipulation while preserving acoustic continuity. Prior acoustic-space approaches suffer from content-style entanglement, causing unstable generation and boundary artifacts. We introduce a framework guided by the principle of "Edit Content, Preserve Acoustics". Editing is conducted in a stable semantic space, while acoustic realization is handled by a Flow Matching decoder. To ensure perceptual consistency, we propose Self-Consistency Rewards Group Relative Policy Optimization, which leverages a pre-trained Text-to-Speech model as an implicit critic, together with intelligibility and duration constraints. Experiments demonstrate consistent improvements over state-of-the-art autoregressive and non-autoregressive baselines in intelligibility, robustness, and perceptual quality.
We study the problem of monitoring model performance in dynamic environments where labeled data are limited. To this end, we propose prediction-powered risk monitoring (PPRM), a semi-supervised risk-monitoring approach based on prediction-powered inference (PPI). PPRM constructs anytime-valid lower bounds on the running risk by combining synthetic labels with a small set of true labels. Harmful shifts are detected via a threshold-based comparison with an upper bound on the nominal risk, satisfying assumption-free finite-sample guarantees on the type-I error. We demonstrate the effectiveness of PPRM through extensive experiments on image classification, large language model (LLM), and telecommunications monitoring tasks.
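The idea of combining synthetic labels with a few true labels can be sketched with the basic prediction-powered point estimate of a mean risk. This covers only the point estimate from PPI, not the anytime-valid bounds of PPRM; the function and argument names are hypothetical.

```python
def ppi_risk_estimate(synthetic_losses_unlabeled,
                      true_losses_labeled,
                      synthetic_losses_labeled):
    """Prediction-powered estimate of mean risk.

    Mean loss computed with synthetic labels on the large unlabeled set,
    corrected by the average gap between true and synthetic losses on the
    small labeled set (the "rectifier").
    """
    n = len(true_losses_labeled)
    if n == 0 or n != len(synthetic_losses_labeled):
        raise ValueError("labeled loss lists must be non-empty and aligned")
    if not synthetic_losses_unlabeled:
        raise ValueError("unlabeled loss list must be non-empty")
    base = sum(synthetic_losses_unlabeled) / len(synthetic_losses_unlabeled)
    rectifier = sum(t - s for t, s in
                    zip(true_losses_labeled, synthetic_losses_labeled)) / n
    return base + rectifier


estimate = ppi_risk_estimate([0.1, 0.2, 0.3], [0.5, 0.3], [0.4, 0.2])
```

The rectifier removes the bias of the synthetic labels on average, so the estimate stays centred on the true risk even when the synthetic labels are systematically optimistic.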
Conformal prediction (CP) offers distribution-free marginal coverage guarantees under an exchangeability assumption, but these guarantees can fail if the data distribution shifts. We analyze the use of pseudo-calibration as a tool to counter this performance loss under a bounded label-conditional covariate shift model. Using tools from domain adaptation, we derive a lower bound on target coverage in terms of the source-domain loss of the classifier and a Wasserstein measure of the shift. Using this result, we provide a method to design pseudo-calibrated sets that inflate the conformal threshold by a slack parameter to keep target coverage above a prescribed level. Finally, we propose a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels as a function of classifier uncertainty. Numerical experiments show that our bounds qualitatively track pseudo-calibration behavior and that the source-tuned scheme mitigates coverage degradation under distribution shift while maintaining nontrivial prediction set sizes.
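One plausible reading of inflating the conformal threshold by a slack parameter is to raise the nominal coverage level before taking the calibration quantile. The sketch below follows that reading; the function name and the way `slack` enters are assumptions, and the paper may define the inflation differently.

```python
import math


def conformal_threshold(scores, alpha, slack=0.0):
    """Split-conformal threshold at level 1 - alpha + slack.

    Uses the standard finite-sample rank ceil((n + 1) * level); returns
    infinity when that rank exceeds the number of calibration scores.
    """
    n = len(scores)
    level = min(1.0, 1.0 - alpha + slack)
    rank = math.ceil((n + 1) * level)
    if rank > n:
        return math.inf
    return sorted(scores)[rank - 1]


calibration = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
plain = conformal_threshold(calibration, alpha=0.25)
inflated = conformal_threshold(calibration, alpha=0.25, slack=0.1)
```

A larger slack yields a larger threshold and therefore larger prediction sets, which is the trade-off between target coverage and set size noted above.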
We study supervisory switching control for partially observed linear dynamical systems. The objective is to identify and deploy a suitable controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions, such as system stability, that are incompatible with a control setting because they preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to a control-theoretic setting. The proposed data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of state history, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, each of which identifies the matching controller in $O(N \log^2 N)$ steps while simultaneously achieving finite $L_2$-gain with respect to system disturbances.
The Prisoner's Dilemma, zero-sum games, LQR team problems, and differential games have shaped game theory in controls for decades, but the field's most pressing adversarial challenges demand a richer framework, and its name is Colonel Blotto. Strategic adversarial constraints represent a fundamental consideration in control systems, from cybersecurity defense to infrastructure protection. Colonel Blotto games, despite their direct relevance to such applications, remain underutilized in the controls community relative to other game-theoretic approaches. This article aims to close that gap for the controls community. Indeed, theoretical advances within the last two decades have spurred a resurgence of interest and enabled their applications across several domains. In this article, we introduce the Colonel Blotto framework, survey key analytical and computational results, and demonstrate how problems spanning cybersecurity, network defense, and multi-agent systems fit naturally within this structure. Three research directions are examined in depth: interdependent contest objectives that capture networked vulnerabilities, alternate winning rules that model partial rewards and structural asymmetries, and multi-agent competitive environments involving coalition formation and strategic concessions. Taken together, these directions reveal a framework that is both practically deployable and rich enough to capture the strategic complexity inherent in adversarial resource allocation.
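For readers unfamiliar with the basic game, the payoff rule of a standard Colonel Blotto contest can be sketched as follows. Winner-take-all battlefields with ties split evenly is a common convention and is assumed here; the function name and the optional `values` argument are illustrative.

```python
def blotto_payoff(alloc_a, alloc_b, values=None):
    """Return (payoff_a, payoff_b) for one play of Colonel Blotto.

    Each battlefield goes to the player who allocates more resources to it;
    ties split the battlefield's value equally (assumed convention).
    """
    if len(alloc_a) != len(alloc_b):
        raise ValueError("allocations must cover the same battlefields")
    if values is None:
        values = [1.0] * len(alloc_a)
    payoff_a = payoff_b = 0.0
    for a, b, v in zip(alloc_a, alloc_b, values):
        if a > b:
            payoff_a += v
        elif b > a:
            payoff_b += v
        else:
            payoff_a += v / 2.0
            payoff_b += v / 2.0
    return payoff_a, payoff_b


# Equal budgets of 6: concentrating on two fields beats an even split.
result = blotto_payoff([3, 3, 0], [2, 2, 2])
```

The alternate winning rules and interdependent objectives surveyed in the article amount to replacing the comparison inside the loop or coupling the per-battlefield values.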
Kolmogorov-Arnold Networks (KANs) have demonstrated an exceptional ability to learn complex functions on clean, low-dimensional data but struggle to maintain performance on noisy and imperfect real-world datasets. In contrast, conventional multi-layer perceptrons (MLPs) are far more tolerant to noise and computationally efficient. Replacing all MLP components with KANs in human activity recognition (HAR) models often degrades accuracy and computational efficiency, highlighting an open challenge: how to combine KANs' precision with MLPs' noise robustness and efficiency. To address this, we systematically explore various placements of KAN modules within deep HAR networks and propose a hybrid architecture that combines the strengths of both paradigms: it uses a KAN-based input embedding layer, retains MLP layers for intermediate feature mixing, and introduces a specialized LarctanKAN module for final activity classification. Across eight public HAR datasets, the hybrid KAN-MLP model achieves an average relative improvement in macro F1 score of 5.33\% over the pure-MLP model, significantly outperforming standalone KAN and MLP baselines. Furthermore, integrating this hybrid strategy into other state-of-the-art HAR architectures consistently boosts their performance. Our findings demonstrate that a carefully orchestrated combination of KAN, MLP, or other conventional neural components yields more robust and accurate HAR models for real-world wearable sensing environments.
nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale=`no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.