Reinforcement Learning (RL) has long been a powerful solution to various problems in communication networks. However, traditional RL models still face several limitations. Not only do they rely on large numbers of interactions with the environment, but they are also limited in modeling long-term relationships and tackling partial observability. In recent years, the Transformer model has demonstrated the ability to enhance RL models, allowing them to overcome these issues. In particular, the self-attention mechanism within the Transformer enables efficient modeling of long-range dependencies and global correlations, accelerates training, and handles heterogeneous data modalities. In this paper, we present a comprehensive survey of Transformer-based RL algorithms and their applications in communication networks. Specifically, the paper provides the mathematical background of RL and Transformer architectures, along with insights into key issues such as resource allocation, computation offloading, routing, trajectory control, and network security. We conclude the paper by discussing challenges, open issues, and notable future research directions, including Transformer-enhanced deep RL (DRL) algorithms for semantic communication and network optimization.
Oklab and its cylindrical representation Oklch are widely adopted in interpolation and design workflows as perceptually motivated color spaces, but their color difference prediction accuracy falls short of CIEDE2000. We propose Oklch+, a three-parameter extension of Oklab comprising a power transformation on the L-axis and a Naka-Rushton compression on the C-axis, with Euclidean distance computed in the resulting transformed Oklab coordinates. The Naka-Rushton function is bounded in [0,1], reflecting the saturating nature of chroma sensitivity at high colorimetric values. Evaluated on COMBVD -- 3,813 suprathreshold color difference pairs spanning six independent experimental datasets -- Oklch+ achieves STRESS = 29.09, closely matching CIEDE2000 (29.13; difference = 0.04), using only three parameters optimized against color difference data compared to approximately 17 for CIEDE2000. Cross-validation on a held-out BFD-P D65 subset (2,028 pairs) confirms generalization (STRESS = 26.14), with Oklch+ substantially outperforming Oklab (51.45) and achieving STRESS comparable to CIEDE2000 (24.12) on the held-out set. Improvement over Oklab (47.35) is confirmed across all six COMBVD sub-datasets. Because Oklch+ defines a coordinate system in which Euclidean distance approximates perceptual distance, linear interpolation in the transformed space offers substantially improved perceptual uniformity relative to Oklab. Current evaluation is limited to the sRGB-centered COMBVD dataset; validation in high-chroma regions with empirical observer-rated discrimination data remains future work.
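The STRESS figures quoted above compare predicted color differences with visually judged ones. As a minimal sketch, assuming the standardized residual sum of squares definition commonly used for color-difference formulas (the abstract does not restate it), the index can be computed as follows; the function name and the sample values are illustrative only.

```python
import math


def stress(delta_e, delta_v):
    """STRESS index between predicted (delta_e) and visual (delta_v) differences.

    Returns 0 for perfect agreement up to a scale factor; larger is worse.
    Assumes the usual definition with scaling factor
    F = sum(dE^2) / sum(dE * dV).
    """
    if len(delta_e) != len(delta_v) or not delta_e:
        raise ValueError("inputs must be non-empty and of equal length")
    sum_ee = sum(e * e for e in delta_e)
    sum_ev = sum(e * v for e, v in zip(delta_e, delta_v))
    sum_vv = sum(v * v for v in delta_v)
    f = sum_ee / sum_ev  # scale factor aligning the two sets of differences
    numerator = sum((e - f * v) ** 2 for e, v in zip(delta_e, delta_v))
    denominator = f * f * sum_vv
    return 100.0 * math.sqrt(numerator / denominator)


# Illustrative values, not taken from COMBVD.
predicted = [1.0, 2.0, 4.0]
visual = [1.0, 3.0, 3.0]
print(stress(predicted, visual))
```

Because of the scale factor F, multiplying all predicted differences by a constant leaves the index unchanged, which is why formulas with different numeric ranges can be compared directly.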
Emergency landing flight envelope analysis traditionally adopts a binary notion of safety, whereby a trajectory is safe only if state constraints are satisfied pointwise in time. In practice, ensuring a successful landing requires recognizing that aircraft operation spans a continuum in the state space from the nominal to the critical regime. Between these regimes lies a degraded regime of states outside nominal operation that may be visited only for limited durations. Safety is therefore inherently graded, in the sense that limited exposure to degraded states may be tolerated, and must be assessed using a trajectory-dependent criterion rather than a purely pointwise-in-time one. This paper develops a Hamilton-Jacobi reachability framework for analyzing emergency landing flight envelopes under this graded notion of safety. Safety is encoded through a soft constraint defined by a designer-specified continuous violation cost function that assigns zero cost in the nominal regime and larger cost to more safety-critical off-nominal states. We introduce a general class of state- and time-dependent violation cost functions and establish monotonicity and continuity properties that characterize how the flight envelope varies with the cost of off-nominal operation. These results provide a principled sensitivity analysis linking safety conservativeness to operational capability. Building on this analysis, we propose a synthesis algorithm for parameterized violation cost functions in this class. The algorithm provably converges to the least conservative parameter under which a prescribed off-nominal safety requirement is satisfied. Numerical results for a fixed-wing emergency landing scenario under propulsion failure demonstrate the sensitivity properties and validate the algorithm.
Hyperspectral imagery represents the best contemporary technology to remotely detect anomalous objects. Nevertheless, hyperspectral anomaly detection (HAD) techniques leave ground facilities/situations completely exposed. We develop the first anti-HAD (AHAD) technique, which renders key objects undetected without requiring perfect coordinate/position state information (CSI) of the detectors (e.g., reconnaissance aircraft). Our AHAD algorithm is generally applicable to defend against almost all the existing benchmark data-driven and model-driven HAD methods. AHAD is fundamentally different from conventional adversarial attacks, so novel theory is needed. We customize novel regularizers for assimilating real anomalies into the backgrounds (ARAB) and fooling the detectors with pseudo-anomalies, thereby optimizing an energy-efficient stealthy perturbation signal for AHAD. The ARAB regularization is mathematically interpretable as flattening the topology-enhanced anomaly/background structures in the feature space, hence termed Lipschitz-forcing perturbations. Considering the imperfect CSI, we further develop a robust AHAD criterion, where the uncertainty is mathematically described as matrix-shifting misalignment for statistically generating the robust perturbation. Comprehensive experiments demonstrate the effectiveness and robustness of our AHAD algorithm across diverse real-world datasets. Remarkably, our algorithm generates a single AHAD perturbation signal that can simultaneously evade almost all benchmark detectors, greatly enhancing its practicality, given that the reconnaissance detector type is usually unknown. To the best of our knowledge, this is the first formal AHAD study. As a side contribution, we propose a new quantitative performance index, ArmCBA, to evaluate the robustness of an HAD method against our AHAD signal.
Children's automatic speech recognition (ASR) remains challenging because child speech differs from adult speech and varies substantially across developmental stages. While adapter tuning provides a promising way to adapt large pretrained ASR models to children's speech, a single shared child adapter may not fully capture age-dependent variation. In this work, we present one of the first systematic studies of age-aware adapter tuning for child ASR, focusing on speech from children aged 3--12 years and older. We propose age-specialized adapters trained separately for different age groups and compare them with a unified age-conditioned FiLM adapter. With ground-truth age routing, age-specialized adapters reduce overall word error rate (WER) from 12.6% to 12.3% and macro WER from 18.4% to 17.6% relative to the standard shared child adapter baseline, while consistently improving WER for all age groups. We further show that predicted-age routing remains close to ground-truth routing, achieving 12.3% overall WER and 17.8% macro WER without ground-truth age labels at inference. In contrast, unified FiLM conditioning provides smaller gains, indicating that a single unified adapter may be insufficient to capture developmental variation in child speech.
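The gap between overall and macro WER reported above arises from how errors are pooled. The sketch below assumes that overall WER pools errors and reference words over all utterances, while macro WER is the unweighted mean of per-age-group WERs; the abstract does not spell out these definitions, and the group names and counts are invented for illustration.

```python
def overall_and_macro_wer(groups):
    """Return (overall WER, macro WER) from per-group counts.

    groups maps a group name to (word_errors, reference_words).
    Overall WER pools counts; macro WER averages per-group rates
    so that every group carries equal weight.
    """
    total_errors = sum(errors for errors, _ in groups.values())
    total_words = sum(words for _, words in groups.values())
    overall = total_errors / total_words
    macro = sum(errors / words for errors, words in groups.values()) / len(groups)
    return overall, macro


# Hypothetical counts: a small, hard group and a large, easy group.
counts = {"younger": (300, 1000), "older": (500, 9000)}
overall, macro = overall_and_macro_wer(counts)
print(f"overall={overall:.3f} macro={macro:.3f}")
```

With these made-up counts the small, high-error group barely moves the overall figure but dominates the macro figure, which is one reason to report both.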
The power consumption of the analog-to-digital converters (ADCs) and digital-to-analog converters (DACs) in fully digital massive multiple-input multiple-output (MIMO) systems motivates the adoption of low-resolution architectures. In particular, 1-bit DACs reduce the power consumption and hardware complexity at the transmitter, but introduce severe transmit-side quantization distortion. In this paper, we investigate data detection for a point-to-point massive MIMO system with 1-bit DACs at the transmitter, where the linearly precoded signal is dithered prior to quantization, and either full-resolution or 1-bit ADCs at the receiver. Assuming that the dither vector applied at the transmitter is known at the receiver, we first develop soft-estimation-based data detection methods with symbol-independent dither removal for both full-resolution and 1-bit ADCs. We then introduce a new symbol-dependent linearization of the transmitted signal at the output of the 1-bit DACs and use it to derive maximum-likelihood (ML)-based data detection methods that directly recover the data symbol vector from the received signal. For full-resolution ADCs, this leads to ML-based methods with and without dither removal. For 1-bit ADCs, we develop an approximate ML-based method that exploits the derived statistics of the received signal without dither removal. We also propose low-complexity variants of the ML-based methods to mitigate the exponential complexity growth with the number of streams. Numerical results in terms of symbol error rate highlight the critical role of the dither power and demonstrate that the proposed ML-based methods (along with their low-complexity variants) achieve significant gains over a baseline based on binary ML detection via a homotopy algorithm.
We propose the Control Algorithm Performance Evaluation (CAPE) framework, a systematic methodology for benchmarking racing controllers under our proposed learned enhanced physics model (EPM). The proposed framework enables cross-controller comparison by evaluating five closed-loop control architectures. We further compare our proposed EPM with two state-of-the-art learned vehicle dynamics models: Deep Pacejka Model (DPM) and Deep-learning Dynamics Model (DDM). Closed-loop experiments show that, across all models and controllers, the proposed EPM achieves the best average lap times. Specifically, the Adaptive NMPC with EPM achieves a time of 5.82 s, compared with 12.99 s for DPM and 8.80 s for DDM, while simultaneously producing substantially lower longitudinal and lateral tracking errors under identical controller configurations. We further evaluate all three models and five controllers using a disturbance-aware simulation framework incorporating measurement noise, process disturbances, actuator delay, and parametric uncertainty. Under a moderate global disturbance scaling factor ($\eta = 1$), results averaged across the five controllers show that, relative to DPM and DDM respectively, EPM reduces longitudinal tracking error by 29.0% and 17.2%, reduces lateral tracking error by 24.6% and 12.3%, and increases average velocity magnitude by 39.9% and 3.1%. Overall, CAPE establishes a systematic benchmark for evaluating the performance of learned vehicle dynamics models in a closed-loop control framework and demonstrates that our proposed EPM significantly improves controller robustness and performance under realistic uncertainties.
This paper presents a comprehensive 2D analytical model of a toroidal magnetic ring with a square cross-section, subjected to sinusoidal excitation. By applying Maxwell's equations in local Cartesian coordinates and utilizing a complex permeability framework, exact analytical expressions for the internal magnetic field, flux, complex impedance, and losses are derived. The model rigorously separates eddy current losses, hysteresis losses, and winding losses, explicitly accounting for the skin effect and complex permeability within the conductive core using separation of variables and hyperbolic functions. Furthermore, a parameter for apparent permeability is expressed to map the core behavior onto simplified linear material models. The derivations establish a mathematical foundation highly suitable for standardized material characterizations, such as Brockhaus and Iwatsu ring measurements, while avoiding the heavy computational cost of 2D and 3D Finite Element Analysis.
In this paper, multi-target tracking and scanning are considered in a radar system operating in the track-while-scan mode. Specifically, time allocation for radar scanning and tracking of multiple maneuvering targets under a time budget constraint is addressed, aiming to jointly optimize the performance of both tracking and scanning in a cognitive radar. We first present the details of the model for tracking and scanning and formulate the time management task as a constrained optimization problem. Subsequently, we design a constrained deep reinforcement learning (CDRL) framework to find the time allocation strategy for the problem. In the proposed CDRL framework, the parameters of the neural networks and the dual variable are learned simultaneously. The deep deterministic policy gradient (DDPG) algorithm is introduced to tackle continuous action space and its performance is compared with deep Q-learning, heuristic approaches, and an optimization-based approach. Numerical results show that the radar with the proposed CDRL framework can autonomously allocate more time to the tracking task that requires greater attention while providing time for scanning and also constraining the total time budget below the predefined threshold.
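The idea of learning the policy parameters and a dual variable simultaneously can be illustrated with a toy primal-dual iteration. This sketch replaces the neural-network policy and DDPG with plain gradient ascent on a single scalar decision `x`; the reward, the budget, and the step sizes are all invented for illustration and are not the radar model from the abstract.

```python
def primal_dual(steps=5000, lr_primal=0.05, lr_dual=0.05):
    """Toy constrained problem: maximize -(x - 2)^2 subject to x <= 1.

    x stands in for the policy parameters and lam for the dual variable.
    Both are updated in the same iteration: x ascends the Lagrangian,
    lam ascends the constraint violation and is projected onto lam >= 0.
    """
    budget = 1.0
    x, lam = 0.0, 0.0
    for _ in range(steps):
        # Gradient of L(x, lam) = -(x - 2)^2 - lam * (x - budget) w.r.t. x.
        grad_x = -2.0 * (x - 2.0) - lam
        violation = x - budget
        x, lam = x + lr_primal * grad_x, max(0.0, lam + lr_dual * violation)
    return x, lam


x_opt, lam_opt = primal_dual()
print(x_opt, lam_opt)
```

The unconstrained optimum is `x = 2`, but the growing dual variable penalizes budget violations until the iterate settles at the constraint boundary, mirroring how a learned multiplier keeps the total allocated time below the threshold.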
As grid-forming (GFM) battery energy storage systems (BESS) are increasingly deployed to enhance power system inertial response and frequency stability, incorporating their frequency support capabilities into day-ahead energy scheduling (DAES) is essential for achieving both frequency security and operational efficiency. However, accurately determining frequency metrics in grids with coexisting GFM inverters and synchronous generators requires electromagnetic transient (EMT) simulations, which are computationally prohibitive for direct embedding in grid operational optimization models. To bridge the gap between modeling accuracy and computational efficiency, a learning-assisted DAES (LA-DAES) framework is proposed in this work. By leveraging a surrogate model to represent the frequency support dynamics of GFM BESS, the proposed framework ensures frequency security with a reasonable solve time. Comparative results demonstrate that, relative to analytical frequency-constrained DAES, the proposed LA-DAES framework more accurately captures grid frequency metrics and improves the utilization of GFM BESS.
As sixth-generation (6G) wireless systems evolve toward higher frequency bands, large-scale antenna arrays, and intelligent interaction with the wireless environment, conventional fixed-position antennas (FPAs) are increasingly constrained by limited spatial degrees of freedom and insufficient hardware-level adaptability. Fluid antenna systems (FAS) provide new physical-layer flexibility by dynamically reconfiguring antenna ports, geometries, and radiation characteristics. However, existing studies have mainly focused on one- or two-dimensional apertures, leaving the spatial reconfigurability required for complex three-dimensional (3D) propagation environments insufficiently exploited. In this article, we present a 3D spherical fluid antenna system (3D SFAS) architecture for flexible spatially reconfigurable communications. By activating radiating elements in different spherical regions, 3D SFAS realizes array-level spatial reconfiguration through flexible region switching. Within the selected regions, element-level reconfiguration further adjusts the effective aperture size, array topology, and radiation characteristics. This joint framework enables flexible beamforming, concurrent multi-region transmission, blockage-adaptive aperture switching, effective-aperture reconfiguration, and high-resolution 3D aperture control. We also discuss its potential applications in space-air-ground integrated networks, high-mobility communications, integrated sensing and communication systems, and emergency communications. Numerical results demonstrate the potential of 3D SFAS to improve wireless communication performance through flexible spatial reconfiguration. Overall, 3D SFAS extends FAS design beyond 2D position switching toward comprehensive 3D spatial reconfigurability.
This paper addresses the question: How can mission effectiveness be systematically defined or approximated in the absence of customer requirements? Legacy requirements engineering frameworks presuppose customer input to define specifications but leave a gap in the process when stakeholder input is ill-defined or missing. Rapid build and development programs (such as military acquisition, space assets, infrastructure projects, etc.) often see requirement and objective evolutions throughout the proposal process, so a more adaptive method is needed. To address this gap, a structured approach is proposed that decomposes mission intent into mission context, functions, constraints, critical dimensions, effectiveness attributes, and architecture alternatives. This method conducts a mission feasibility assessment, prioritizes mission-critical dimensions using Best-Worst Scaling, and introduces a mission complexity factor to quantitatively understand the impacts of external mission difficulties, technology maturity, evidence and confidence standards, and mission utility. The resulting method provides a traceable basis for deriving Tier 1 and 2 requirements. The approach is structured to support future Unified Architecture Framework (UAF) and Systems Modeling Language (SysML) artifact integration. The proposed framework is demonstrated using a notional close air support mission example.
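Best-Worst Scaling, mentioned above for prioritizing mission-critical dimensions, is often scored with simple counts. The sketch below assumes count-based scoring, (times chosen best minus times chosen worst) divided by times shown; the abstract does not say which scoring rule the method uses, and the dimension names are hypothetical.

```python
from collections import Counter


def bws_scores(tasks):
    """Count-based Best-Worst Scaling scores in [-1, 1].

    tasks is a list of (shown_items, best_item, worst_item) judgments.
    Score = (best count - worst count) / number of times the item was shown.
    """
    shown, best, worst = Counter(), Counter(), Counter()
    for items, best_item, worst_item in tasks:
        if best_item not in items or worst_item not in items:
            raise ValueError("best and worst must be among the shown items")
        if best_item == worst_item:
            raise ValueError("best and worst must differ")
        shown.update(items)
        best[best_item] += 1
        worst[worst_item] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Hypothetical mission dimensions and judgments.
judgments = [
    (("range", "payload", "cost"), "range", "cost"),
    (("range", "payload", "speed"), "range", "speed"),
    (("payload", "cost", "speed"), "payload", "cost"),
]
for name, score in sorted(bws_scores(judgments).items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:+.2f}")
```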
Underwater vehicles are naturally modelled as rigid bodies on SE(3) subjected to added mass effects. The passivity of the Hamiltonian structure of the system can be exploited to design energy-based stabilising controllers; however, the extension of these control designs to tracking control is not trivial since the error system for the classical error formulations is not itself Hamiltonian. In this paper, we show that a novel choice of error function leads to error dynamics that are Hamiltonian. We go on to derive an energy-based tracking control for a fully coupled model of a submersible vehicle. Asymptotic convergence of the control scheme is proved and the control is demonstrated in a simulation study of the Blue Robotics BlueROV2 Heavy submersible.
Automatic Audio Captioning (AAC) seeks to generate natural language descriptions of complex acoustic scenes, bridging auditory perception and language understanding. However, word-selection indeterminacy and increasing reliance on large-scale sequence-to-sequence or LLM-based models limit practical deployment. We propose a resource-efficient AAC framework that explicitly grounds caption generation in auxiliary AudioSet semantics. Frame-level acoustic representations extracted using a ConvNeXt encoder are augmented with top-$K$ predicted AudioSet keywords, providing structured contextual cues for decoding. A compact six-layer BART-style decoder conditions on this joint acoustic-semantic representation, enabling caption generation without LLM-scale decoding. The proposed design balances semantic grounding and computational efficiency within a compact architecture. Evaluations on Clotho V2 and AudioCaps confirm competitive caption quality under practical deployment constraints.
Audio-Visual Speech Recognition (AVSR) enhances speech recognition robustness by leveraging visual cues, but real-world scenarios remain challenging due to viewpoint variation, audio distortion, and visual occlusion, which degrade modality quality and increase audio-visual asynchrony. In this paper, we propose a novel Modality-aware Multi-view Self-supervised representation framework for robust Audio-Visual Speech Recognition (M2S-AVSR). First, we introduce a multi-view representation learning encoder to learn view-invariant visual speech representations. Next, we employ a modality-aware module that explicitly models modality quality and cross-modal synchrony to perform modality-aware fusion, enabling fine-grained visual information injection during decoding. In addition, we present AISHELL8-RealScene, a public multi-scenario, multi-view conversational audio-visual dataset recorded in real-world environments, and establish a speech recognition benchmark on it. Experiments on English and Mandarin benchmarks demonstrate the effectiveness of the proposed method under challenging conditions. On LRS3, M2S-AVSR achieves up to 29.4% relative improvement under viewpoint perturbation and visual degradation settings. Our method also achieves new state-of-the-art performance on the MISP2021-AVSR test set. On AISHELL8-RealScene, it achieves the best result in outdoor scenes. The proposed method and dataset provide useful support for future research on robust speech and multimodal tasks under realistic conditions.
In power systems, alternating current optimal power flow (AC-OPF) has been a challenging problem for decades due to its nonconvexity, and fast and efficient solutions are needed more than ever because of the high penetration of large-scale renewable generation and load growth. Recently, neural networks (NNs) have gained attention for solving AC-OPF, but they are still at an early stage of applicability to real, large-scale power system operation with changing topologies. To this end, we propose a novel framework called GraphOPF that jointly considers topology adaptability, scalability, NN training time, self-supervision, and feasibility. Extensive experiments show that, compared with the baselines, the proposed framework is up to 200 times faster in NN training and up to 66 times faster in solving AC-OPF for large-scale power systems, including the real Korean power system, while achieving more than 99% feasibility.
We propose an AI agent tailored for link power management in multi-band systems. In an S+C+L-band span-level study, the agent efficiently solves various optimization objectives. In a network-wide evaluation, it delivers a 689.0 Tbps gain in total allocated traffic with merely 303 interactions per power profile on average.
Nonlinear Model Predictive Control requires solving a constrained nonlinear program (NLP) in real-time at every sampling instant, a computational bottleneck that limits deployment on resource-constrained hardware or at high sampling rates. We address this challenge for the broad class of input-affine nonlinear systems to show that the optimal control move can be approximated by a state-dependent quadratic program (QP) whose cost parameters depend on the current state and reference. We propose a single-network residual-corrector architecture: a state-dependent analytic baseline provides initial QP parameters, and the network learns only the corrections needed to match the full NLP solution; the QP is solved by a differentiable interior-point layer, guaranteeing constraint satisfaction for the first control action. The network is trained offline on data generated by an NLP solver using a hybrid loss that combines supervised imitation and KKT-residual penalties. We validate the approach on a three-link planar robotic arm with Cartesian end-effector tracking, demonstrating orders-of-magnitude speedup over the NLP solver while maintaining comparable tracking performance.
The paper considers a large class of nonlinear circuits, termed RLCM, containing all four basic circuit elements, i.e., resistors, inductors, capacitors and memristors. A companion paper [1] has introduced a mixed potential for RLCM circuits generalizing that found by Brayton and Moser for circuits without memristors. In this paper, systematic Lyapunov-like results on convergence of RLCM circuits are proved by means of the mixed potential. These hold under the basic assumption that an RLCM circuit has a complete set of variables in the flux-charge domain and they require, roughly speaking, that there is a balance, which is quantitatively estimated, between capacitors and inductors. The convergence results are robust with respect to circuit parameter variations and they include cases where the memristor circuits possess multiple stable equilibrium points, which is of importance for instance to implement content addressable memories (CAMs). The results extend to circuits possessing all four basic circuit elements previous results that pertain to circuits without memristors or memristor circuits without inductors. The main proofs are conducted by using the flux-charge analysis method (FCAM) to analyze RLCM circuits in the flux-charge domain.
Most neural speech codecs use residual vector quantization (RVQ), in which later VQs contribute less but consume the same bitrate, leading to inefficiency. We propose P2PSynCodec, an ultra-low-bitrate neural speech codec with a plain-to-pseudo synergistic vector quantizer (P2PSVQ). P2PSVQ consists of one plain VQ and multiple pseudo VQs. The plain VQ produces basic tokens by quantization, while the pseudo VQs generate auxiliary tokens by neural prediction and incur zero transmitted bitrate. Thus, speech is decoded from the plain-VQ tokens together with predicted pseudo-VQ tokens, greatly reducing bitrate. Experiments show that P2PSynCodec achieves speech reconstruction quality comparable to competing codecs at 2.0 kbps while operating at only 0.5 kbps, demonstrating high efficiency for ultra-low-bitrate speech coding.
Neural speech codecs are key to speech transmission and storage, but most use uniform quantization across frames, allocating the same bitrate regardless of content and wasting bits. We propose VoCodec, a low-bitrate streamable neural speech codec with voicing-driven quantization that assigns higher bitrate to voiced frames and lower bitrate to unvoiced frames according to perceptual sensitivity. VoCodec embeds a voicing detector in a fully causal encoder-quantizer-decoder neural coding framework, using residual scalar-vector quantization for voiced frames and simple scalar quantization for unvoiced ones. Experiments show that on the LibriTTS dataset at a 16 kHz sampling rate, VoCodec outperforms baseline neural speech codecs even at a bitrate as low as 1.1 kbps. Further experiments confirm that voicing-driven quantization reduces the bitrate by approximately 27% compared with a uniform quantization strategy.
This work conceives a unified channel estimation and beamforming framework, formulated within the principles of variational Bayesian inference. Recognizing the limitations imposed by hardware constraints, frequency-dependent propagation effects, and the structural restrictions of partially connected architectures in the Terahertz (THz) band, we formulate a dual-wideband channel model incorporating a root raised cosine (RRC) pulse shape to account for its band-limited nature. To further address the nonlinear distortions introduced by low-resolution ADCs, Bussgang decomposition is employed, enabling a tractable linearized inference process. Unlike conventional techniques, the proposed method accommodates both on-grid and off-grid angular domains, capturing spatial sparsity with improved resolution and robustness. The multi-user (MU) Bayesian Cramér-Rao lower bound is also derived to benchmark the performance of the proposed estimator. Moreover, the framework incorporates a true time delay (TTD)-based hybrid transceiver design that inherently compensates for the beam-squint effect, a frequency-dependent angular deviation that arises from the fixed-phase nature of conventional beamformers in wideband systems, thereby ensuring accurate directional alignment across all subcarriers. Extensive simulation results validate the effectiveness of the proposed variational Bayesian inference-based estimator and the TTD-enabled beamforming architecture, highlighting their robustness and performance gains in practical wideband THz systems.
The increasing penetration of single-phase loads and distributed generation exacerbates voltage unbalance (VU) in distribution grids, raising concerns about power quality and complicating network operation. However, most market-clearing models and price-based coordination frameworks do not enforce VU limits within a three-phase AC representation, so the implications for grid-code compliance, numerical scalability, and economic signals remain unclear. This paper embeds VU in a three-phase AC optimal power flow market-clearing model and benchmarks two treatments: strict VU limit enforcement and objective function penalization. Building on these insights, an Improved Hybrid Limits (IHL) formulation is proposed that preserves compliance while using a smooth unbalance proxy in the objective to guide the optimization solver. Case studies on a European low-voltage feeder show that IHL maintains feasible operating points, yields price and curtailment signals consistent with conventional hybrid formulations, and converges substantially faster and more reliably than a penalization based on the exact unbalance metric. These results support IHL as a practical and scalable mechanism for VU mitigation in market-based operation of unbalanced distribution systems.
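To make the notion of voltage unbalance concrete, the sketch below computes one common metric, the ratio of negative- to positive-sequence voltage magnitude obtained from symmetrical components. The abstract does not state which VU definition the model enforces, so this choice is an assumption, and the phasor values are illustrative.

```python
import cmath
import math

# Rotation operator a = 1 at an angle of 120 degrees.
A = cmath.rect(1.0, 2.0 * math.pi / 3.0)


def phasor(magnitude, angle_deg):
    """Build a complex phasor from a magnitude and an angle in degrees."""
    return cmath.rect(magnitude, math.radians(angle_deg))


def voltage_unbalance_factor(va, vb, vc):
    """Return |V_negative| / |V_positive| in percent for an a-b-c system.

    Assumes the positive-sequence voltage is nonzero.
    """
    v_pos = (va + A * vb + A * A * vc) / 3.0
    v_neg = (va + A * A * vb + A * vc) / 3.0
    return 100.0 * abs(v_neg) / abs(v_pos)


# Balanced set: essentially zero unbalance.
print(voltage_unbalance_factor(phasor(1.0, 0), phasor(1.0, -120), phasor(1.0, 120)))
# Phase a sagging to 0.9 p.u.: roughly 3.4 percent unbalance.
print(voltage_unbalance_factor(phasor(0.9, 0), phasor(1.0, -120), phasor(1.0, 120)))
```

Because this ratio is a nonsmooth, nonconvex function of the phase voltages, it is plausible that an exact-metric penalty is harder for a solver than a smooth proxy, which is consistent with the convergence behaviour the abstract reports.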
Speech-based Alzheimer's Disease (AD) detection is constrained by scarce pathological speech data. To address this, we propose CoSTA, a Text-to-Speech (TTS)-based data augmentation framework. Specifically, we first develop two Cognitive-State-Conditioned (CS-Cond) TTS models by adapting CosyVoice2 and F5-TTS to synthesize speech with distinct AD and Healthy Control characteristics. Furthermore, by constructing a transcript pool comprising Manual Transcripts (MT) and 36 Automatic Speech Recognition (ASR) transcripts, we investigate the impact of text sources on TTS-based augmentation. We also perform augmentation-factor analysis and test-time augmentation. Experiments on the ADReSS dataset show that CS-Cond TTS significantly improves synthetic speech utility, and ASR-driven augmentation frequently outperforms MT-driven augmentation. Finally, CoSTA yields a 4.16% gain over the baseline, achieving an audio-only accuracy of 85.83% on the ADReSS test set and outperforming prior methods.
This work presents a data-driven framework for interpretable modelling and decision support in flotation systems, integrating Gaussian Process (GP) regression with Global Sensitivity Analysis (GSA) via Sobol indices and local interpretability using SHapley Additive exPlanations (SHAP). Based on laboratory-scale experimental data, a static GP surrogate model is developed to capture how superficial air velocity, overflowing froth velocity, froth height over the lip, pulp height, bubble size, and tailings flowrate influence the measured air recovery. The trained GP enables the computation of Sobol indices to quantify the contribution of each variable and their interactions to the overall variance in air recovery. The combination of Bayesian inference and Sobol-based sensitivity metrics provides a systematic approach to identify the dominant and interacting variables governing air recovery. This study links Bayesian learning, sensitivity quantification, and explainability to provide a foundation for data-driven control and optimisation of flotation processes.
Building a lexicon from discovered word-like units is a central goal in zero-resource speech processing. But do our evaluations provide a trustworthy indication of lexicon quality? A common metric, normalized edit distance, averages the phoneme edit distances between discovered units in each cluster. We show that this metric has an inherent bias toward the quality of large clusters, inhibiting fair evaluation. Moreover, it ignores how well true classes are distributed across clusters. Based on established theory in clustering literature, we propose two metrics that address these shortcomings: a modified metric that weighs cluster size when assessing within-cluster consistency, and an inverse metric that assesses how true words are spread across clusters. Through experiments on synthetic and real-world lexicons, we demonstrate that combined, these metrics are: (1) more closely correlated with how similar a lexicon is to the ground-truth distribution, and (2) more robust to biases that skew lexicon evaluations.
6G networks will introduce unprecedented complexity, which calls for a paradigm shift in network optimization and management. Artificial intelligence (AI)-based solutions, especially those enabled by the recently developed foundation models, have been recognized as promising candidates. Foundation models are large-scale AI models with general-purpose feature extraction capabilities, and once trained on massive amounts of data, they can be adapted to solve a wide range of downstream tasks, either in a zero-shot manner or with few-shot fine-tuning. This article provides a comprehensive overview of how foundation models are reshaping physical-layer processing and wireless resource management across three progressive paradigms. First, we examine the adaptation of off-the-shelf pre-trained foundation models to various wireless tasks. Second, we explore wireless-native foundation models, built from scratch on wireless data to bridge cross-domain modality gaps and capture universal wireless-domain physical characteristics. Third, we highlight agentic foundation models, which elevate static data processing into autonomous, reasoning-driven network orchestration. Furthermore, we discuss the impact of applying foundation models to emerging 6G frontiers, including integrated sensing and communications (ISAC), new multiple-input multiple-output (MIMO) architectures, semantic communications, and system-level network autonomy. Finally, we identify critical open challenges and opportunities, charting a promising path toward fully intelligent and adaptive wireless networks.
This paper addresses the problem of attack detection in cyber-physical systems without any knowledge of the plant model or its structure. A remotely located plant transmits sensor measurements to an operator over a network that is assumed to be under attack. We consider two classes of attacks: model-free replay attacks and model-based stealthy attacks. For the latter, we derive closed-form expressions for the optimal stealthy attack policy against a $\chi^2$ detector, for both linear and nonlinear systems. We then propose a model-structure-free detector based on TimesFM, a time-series foundation model developed by Google Research, which serves as a surrogate residual generator operating in a zero-shot fashion. We show empirically that the TimesFM-based detector achieves a comparable or superior attack detection performance. The efficacy of the proposed approach is demonstrated numerically on the IEEE 14-bus power system. We also demonstrate that TimesFM predictions can serve as a substitute for corrupted measurements, a practical mitigation technique when classical redundancy assumptions fail.
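For readers unfamiliar with the $\chi^2$ detector mentioned above, a minimal sketch of a residual test follows. It assumes a diagonal residual covariance and a fixed threshold; the names `chi2_statistic` and `raises_alarm` are hypothetical, and the predictor that supplies the residual (model-based or TimesFM-based) is left outside the sketch.

```python
def chi2_statistic(residual, variances):
    """Sum of squared residuals scaled by their variances (diagonal covariance assumed)."""
    return sum(r * r / v for r, v in zip(residual, variances))


def raises_alarm(measurement, prediction, variances, threshold):
    """Flag an attack when the chi-square statistic exceeds the threshold."""
    residual = [y - y_hat for y, y_hat in zip(measurement, prediction)]
    return chi2_statistic(residual, variances) > threshold


# 5.991 is the 95% quantile of a chi-square distribution with 2 degrees of freedom.
THRESHOLD = 5.991
variances = [1.0, 4.0]
quiet = raises_alarm([1.0, -2.0], [0.0, 0.0], variances, THRESHOLD)  # statistic 2.0
loud = raises_alarm([3.0, 0.0], [0.0, 0.0], variances, THRESHOLD)    # statistic 9.0
```

A stealthy attack in this setting is one that keeps the statistic below the threshold, which is why the choice of residual generator matters.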
RTK augmentation and INS integration are widely used to improve GNSS positioning performance. However, on inland waterways, bridges and surrounding structures can degrade satellite visibility and correction availability, causing RTK augmentation loss and GNSS/INS fusion transients. Since these effects depend on the local environment and sensor configuration, nominal receiver specifications are insufficient, and deployment-specific characterization is required. This paper presents a benchmarking study of an AsteRx-i3 D Pro+ GNSS/INS receiver installed within the mobile Sensor Box developed at KU Leuven. The study combines a real-world bridge-passage case study, static benchmarking, and closed-loop path-following experiments. The static benchmarking evaluates four receiver configurations: standalone GNSS, standalone GNSS with INS integration, RTK-augmented GNSS, and RTK-augmented GNSS with INS integration. The closed-loop experiments use INS-integrated GNSS as the navigation input and compare path-following operational performance with and without RTK augmentation. Results show that correction loss during bridge passage causes reduced positioning accuracy, increased positioning uncertainty, and recovery-induced state jumps exceeding 1 m. Static benchmarking and closed-loop experiments confirm that RTK augmentation substantially improves positioning precision and uncertainty consistency, while INS integration supports short-term continuity during RTK unavailability but may introduce drift, bias, or transient uncertainty variations. By characterizing the deployment-specific receiver behavior with RTK augmentation and INS integration, this study motivates higher-level state estimation as a necessary next step toward spatially continuous and uncertainty-consistent positioning on inland waterways. The experimental data are released at: this https URL.
Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless spectrograms and channel state information (CSI). By predicting masked regions in latent space, LatentWave learns representations that are more transferable out of the box across diverse downstream tasks. The proposed architecture employs per-channel patch embeddings with stochastic channel sampling during pretraining, allowing it to process variable antenna counts and improving usability across heterogeneous wireless configurations. We evaluate LatentWave on four downstream tasks: RF signal classification, 5G NR positioning, beam prediction, and LoS/NLoS classification, comparing against a masked-modeling baseline (WavesFM) pretrained on the same data. Additionally, we show that the masking geometry introduces a task-dependent inductive bias: frequency masking strongly favors channel-related tasks such as positioning and beam prediction, while region masking better preserves discriminability for signal classification.
This paper studies expected $\mathcal{L}_2$ string stability of event-triggered vehicle platoons in which a human driver leads a chain of cooperatively controlled autonomous followers under stochastic communication delays. The leader's driving behavior propagates through the string via vehicle-to-vehicle (V2V) communication, so human-induced disturbances must not amplify along the platoon. Unlike deterministic approaches based on worst-case delay bounds, we derive string-stability conditions depending on the full delay distribution through integral inequalities. The closed-loop platoon is modeled as a stochastic hybrid system capturing vehicle dynamics, communication events, and event-triggering. This framework certifies string stability even when delays exceed deterministic admissible bounds with nonzero probability. Results are evaluated under several delay distributions using the MATLAB HyEQ simulator.
Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self-supervised learning (SSL) has yielded strong domain-specific encoders like speech or music experts, multi-domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs. We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain-aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second-stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state-of-the-art performance across probing and LLM-based evaluations.
Existing multimodal safety benchmarks focus solely on visual inputs and cannot assess Omni Large Language Models (LLMs) that process vision, audio, and text. We introduce MCBench, a benchmark with 1196 scenarios spanning four safety categories that require integrating multiple modalities for accurate safety assessment. Each unsafe scenario is paired with a minimally different safe counterpart to assess model sensitivity. Our evaluations of state-of-the-art models reveal significant challenges. Omni LLMs struggle with subtle or non-physical risks but perform better when salient visual or acoustic cues are present. Analysis of reasoning traces shows that, although models can extract modality-specific information, they often fail to integrate these cues effectively for safety judgments. Our findings reveal that current Omni LLMs lack robust cross-modal reasoning in safety-critical settings, underscoring the need for improved architectures and training strategies for multimodal safety.
Data-driven equation discovery is fundamentally an inverse problem that seeks to infer the governing differential equations of a system directly from time-series measurements. A known issue is the ill-conditioned nature of the inverse problem, which frequently produces multiple mathematical models that fit the data similarly well. One path to address this issue is by incorporating known hypotheses and constraints into the training phase beforehand. While this approach effectively reduces the search space, it still results in multiple candidate models, forcing practitioners to rely on post-hoc manual filtering based on their own domain expertise. A recent approach incorporates structural 'skeletons' inspired by characteristic curves (CCs), defining a hypothesis-driven methodology. In this methodology, practitioners define a skeleton, which is associated with a family of ordinary differential equations (ODEs), and then add their hypotheses and priors based on their domain knowledge to refine the obtained model iteratively. An important advantage of this approach is that some skeletons have demonstrable structural identifiability properties, which are useful for checking whether the skeleton is correct or should be discarded. Furthermore, this formalism enables the use of multiple equation discovery paradigms due to its modularity (such as neural networks, symbolic regression, and sparse regression). In this work, we present the Python library PyCC, which condenses these efforts into a flexible tool that allows researchers and engineers to seamlessly define their skeletons and hypotheses to discover ODEs from time-dependent data.
Selecting a clustering algorithm and its hyperparameters without labels is a common difficulty in engineering machine learning pipelines that work with unsupervised analysis of sensor, image, or process data. Clustering validation indices (CVIs) provide internal scores for ranking candidate clusterings, but most popular CVIs are built from Euclidean compactness and separation terms and so tend to favour compact, convex partitions. Their performance is known to degrade on non-convex, irregular, or variable-density data, where kernel transformations or alternative distance measures are typically used at the cost of additional tuning and computation. This paper introduces the Central Description Length (CDL) clustering validation index. CDL uses the observed within-cluster compactness, the estimated cluster centers, and the estimated cluster covariances to compute a probabilistic upper bound on the description length associated with the unobservable true cluster centers. The bound condenses intra-cluster compactness and centroid displacement into a single computable quantity and is evaluated on the partition produced by any clustering algorithm. The implementation uses only observable quantities (the data, the partition, the estimated centers, and the estimated covariances) and does not use ground-truth labels. On synthetic benchmarks with non-convex and arbitrary-shape clusters, CDL-CVI selected the reference number of clusters more often and reached higher Adjusted Rand Index (ARI) values than the conventional CVIs we tested, without an additional kernel preprocessing stage. On image benchmarks (MNIST, CIFAR-10, STL-10) clustered from frozen unsupervised embeddings, CDL-CVI returned cluster numbers close to the reference class counts across K-means, DBSCAN, and spectral clustering in the reported trials.
This paper investigates the joint resource block group (RBG) scheduling and beamforming optimization problem for weighted sum-rate (WSR) maximization in multi-cell multiple-input multiple-output (MIMO) downlink networks. While the Fast Fractional Programming (FastFP) framework provides a reliable model-driven solution, it suffers from conservative continuous beamforming updates and prohibitive computational overhead during the discrete RBG matching phase. To address these bottlenecks, we propose a joint deep unfolding framework comprising two core modules: P-Net and K-Net. For continuous beamforming, P-Net learns an adaptive relaxation factor along the analytical FastFP update direction. By strictly constraining this factor within an ascent-preserving interval, P-Net accelerates the optimization trajectory while rigorously retaining monotonic improvement and stationary-point convergence guarantees. For discrete RBG scheduling, K-Net learns a long-horizon priority policy that guides a low-complexity greedy assignment, effectively preserving the assignment quality while bypassing the high complexity of Hungarian matching. Both networks leverage analytical algorithmic priors and utilize recurrent parameter sharing, enabling flexible inference beyond the training horizon. Extensive simulations demonstrate that the proposed joint framework achieves higher WSR and faster execution times than conventional model-driven baselines, while generalizing robustly across unseen network scales, antenna configurations, and channel conditions without retraining.
We investigate whether task-vector arithmetic, successful for cross-speaker emotional intensity control in modular text-to-speech (TTS), transfers to large-scale TTS systems built on language-model backbones with in-context learning (LM-TTS). Through a systematic elimination study over four progressively narrower operands on Qwen3-TTS-12Hz-1.7B - model weights via LoRA fine-tuning, continuous codec embeddings, discrete codec tokens, and the speaker embedding (x-vector) produced by an ECAPA-TDNN encoder jointly trained with the synthesis backbone - we localize the dominant carrier of emotional prosody to the x-vector. Building on this finding, we propose a training-free method based on centroid arithmetic in x-vector space: an emotion direction $\tau = \mathbb{E}_i[x(s_i,\text{emo})] -\mathbb{E}_i[x(s_i,\text{neutral})]$ applied to an unseen target speaker as $x_{\text{new}} = x(\text{target},\text{neutral}) + \alpha\cdot\tau$. Using ESD (English) as the $\tau$ source and emoUERJ (Brazilian Portuguese) as a cross-lingual ground-truth target, we observe average gains of $+0.29$ in emotion2vec cosine over the ICL baseline on English held-out speakers and $+0.09$ on Brazilian Portuguese held-out speakers, while largely preserving identity (WavLM SECS $\gtrsim 0.88$ for the multi-speaker $\tau$ variant) and intelligibility (WER $\approx 0$ in PT-BR). These results offer initial evidence that the reported incompatibility of centroid-arithmetic style control with token-based TTS architectures may be circumvented when the arithmetic operates on the speaker embedding.
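The centroid arithmetic above can be sketched in a few lines. This is a minimal illustration of the two formulas only, assuming speaker embeddings are plain lists of floats; the helper names are hypothetical, and extracting x-vectors or feeding them back to Qwen3-TTS is not shown.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def emotion_direction(emotional, neutral):
    """tau = mean of emotional embeddings minus mean of neutral embeddings."""
    mu_emo = mean_vector(emotional)
    mu_neu = mean_vector(neutral)
    return [e - n for e, n in zip(mu_emo, mu_neu)]


def shift_speaker(target_neutral, tau, alpha):
    """x_new = x(target, neutral) + alpha * tau."""
    return [x + alpha * t for x, t in zip(target_neutral, tau)]


# Toy 2-dimensional embeddings standing in for real x-vectors.
emotional = [[1.0, 2.0], [3.0, 4.0]]
neutral = [[0.0, 1.0], [2.0, 1.0]]
tau = emotion_direction(emotional, neutral)
x_new = shift_speaker([0.5, -1.0], tau, alpha=0.5)
```

Here `alpha` plays the role of the intensity knob: `alpha = 0` returns the neutral target embedding unchanged, and larger values move it further along the emotion direction.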
nnAudio is an open-source audio feature extraction toolbox for deep learning, but its use in current environments is hindered by TorchScript incompatibilities, inverse-transform edge cases, and dependency drift. We present a targeted modernization for modern PyTorch and scientific Python. We resolve TorchScript compilation failures in STFT and iSTFT by removing dynamic state mutation and module construction from scripted code paths and tightening argument handling in inverse-related helpers. We clarify inverse-STFT behavior by restricting reliable inversion to the uniform-bin setting (freq_scale='no') and raising explicit runtime errors for unsupported frequency scales, preventing silently degraded reconstructions. We restore CFP compatibility with modern SciPy and ensure VQT reduces to CQT when gamma = 0. Regression tests cover the new STFT/iSTFT behaviors, and the updated codebase passes the full repository test suite in a modern Python environment. These improvements provide a more robust foundation for differentiable audio analysis in research and deployment.
Frontier AI governance frameworks increasingly use cumulative training compute as the primary criterion for designating high-impact models, but enforcement rests on self-reporting because no technical verification primitive for training exists. Any future international agreement on frontier AI faces the same problem at higher stakes: coordinated regulation of technologies with significant externalities has historically rested on technical verification, without which agreements are declaratory. Recent governance analyses judge zero-knowledge proofs a promising candidate but currently impractical at frontier scale [26, 4]. We argue the impracticality is paradigm-bound rather than fundamental, and propose a verification architecture for frontier dense pre-training combining a pre-committed training specification, inter-node network observations, and on-the-fly Merkle commitments of intermediate computation, verified through a zero-knowledge Virtual Machine (zkVM) with native BF16/FP32 precompiles. The proof checks the actual floating-point computation the GPU performed rather than a fixed-point approximation, and preserves model-architecture confidentiality through a private training specification. The protocol produces three proof types: a genesis proof at initialisation, in-training step proofs across the run, and ex-ante attestations enforcing policy-relevant claims as running invariants, turning the training record into a governance-enforceable artefact. We estimate a deployable proof of concept within approximately 36 months at single-digit-percent training-side overhead, against a six-to-ten-year cycle for verification-grade custom silicon. Thirteen open research and engineering problems are catalogued as a research agenda for external contribution.
The safety, security, and reliability of microelectronic systems depend on a trustworthy, secured supply chain and design flow. Globally distributed supply chains or unintentional design weaknesses leave the door open for attacks on the hardware level. These scenarios encompass counterfeiting, hardware trojans, or on-device attacks. For these, hardware reverse engineering (RE) results play a pivotal role. The ongoing publication of new RE-involved attacks motivated the development of the common RE scoring system (CRESS). The system enables a general classification of RE-involved scenarios for a common, consistent rating. In this work, the originally qualitative system is extended to a quantitative system. We performed an extensive interview study with experts in the field. The interview results allowed us to derive weights that measure the severity of different RE-involved attack categories. The weights form an equation that quantifies scenarios, resulting in the severity-indicating CRESS score. The score enables the coherent rating of novel scenarios, renders them comparable, and supports the development of effective countermeasures. To showcase the effectiveness of the quantitative CRESS Score, six selected case studies are rated qualitatively and quantitatively. The CRESS Score proves to be significantly more expressive than the industry-standard Common Vulnerability Scoring System (CVSS).
Data-driven Prognostics and Health Management (PHM) uses time-varying condition-monitoring data to diagnose system states and estimate remaining useful life in engineered assets. These tasks are central to maintenance planning, but industrial PHM data are often fragmented, partially observed, and poorly labeled, which hinders supervised learning. Foundation models offer a route toward reusable predictive systems, yet most time-series foundation models are designed for forecasting and assume long, coherent, regularly sampled sequences. To address this gap, we propose a framework for applying Tabular Foundation Models to industrial time series using in-context learning, and we evaluate them on a variety of PHM tasks. By converting raw unit-level signals into tabular rows, we show that these models perform well across multiple tasks, including prognostics and diagnostics, and are highly data efficient. We compare them directly with sequence models, transformer baselines, and gradient-boosted trees under a common evaluation protocol. The results indicate that tabular foundation models achieve the best average ranks across prognostic and diagnostic tasks. Our findings further show that PFN-based models are competitive in low-data regimes, that temporal context can be preserved in the tabular representation, and that performance depends on representative context construction under subsampling. These results demonstrate that tabular foundation models provide a practical and general interface for heterogeneous PHM problems.
Recent advancements in Large Language Models (LLMs) have shown promising results in music understanding and generation tasks. However, existing works remain confined to Western tonal traditions, offering little insight into whether current LLMs can handle structurally distinct low-resource musical traditions. We present the first systematic evaluation of LLM competence in South Asian classical music, a tradition governed by raga, tala-based melodic constraints that impose fundamentally different structural principles from Western harmony-driven music. We ground our evaluation in Hindustani classical theory and Bengali classical forms, including Rabindra and Nazrul Sangeet -- representative low-resource traditions within South Asian classical music. For music understanding evaluation, we introduce a 504-question-answer benchmark spanning raga grammar, cultural knowledge, and symbolic notation reasoning, evaluating 33 LLMs where frontier models such as Gemini 2.5 Pro achieve 85-90% accuracy, while most open-source models remain in the 23-40% range. For music generation, we design a five-level controlled prompting framework and find that even the strongest model produces stylistically faithful outputs only 40% of the time. These results reveal that structural validity and stylistic faithfulness in music generation are distinct objectives and highlight an open challenge for culturally grounded music modeling.
Pretrained spatial audio encoders are increasingly used as general-purpose representations for perceptual tasks, yet their spatial encoding capabilities remain poorly understood. We introduce the Spatial Audio Representation Learning (SARL) benchmark, a controlled framework for evaluating spatial information in pretrained audio models. SARL probes source-level factors (azimuth, elevation, distance, class) and room-level factors (RT60, volume, shape). Experiments across diverse encoders reveal three patterns: input configuration and training paradigm shape spatial encoding; source factors are consistently easier to decode than room factors; and sensitivity analysis under controlled perturbations shows heterogeneous responses to source and room variation. These results reveal systematic biases in current pretrained audio representations. SARL is released as an open-source benchmark for reproducible evaluation of spatial audio representations.
Mispronunciation Detection and Diagnosis (MDD) has gained increasing importance in computer-assisted language learning and speech technology in recent years. In this paper, we propose a method for constructing statistical graphs that enable models to learn phoneme confusion patterns represented as directed graphs. Furthermore, we introduce a language-specific strategy to capture systematic pronunciation differences across various native language (L1) backgrounds. The effectiveness of our approach is demonstrated through extensive experiments on the L2-ARCTIC benchmark, where it achieves an F1-score of 59.52%, outperforming several competitive baselines.
Sound effects (SFX) datasets and libraries often employ distinct tagging schemes, taxonomies, and metadata structures. This creates challenges for research on SFX classification and generation because incompatible taxonomies lead to siloed datasets that might require individualized approaches, result in non-comparable outcomes, and prevent data merging strategies. We propose a modular dataset relabeling framework that adopts the Universal Category System (UCS), an industry-standard hierarchical taxonomy for sound effects, as a shared structural foundation. This open-source framework enables us (i) to convert tags of existing datasets to UCS with a rule-based multi-stage pipeline and conflict resolution to achieve high automatic conversion rates, (ii) to suggest a stratified dataset split for the new labels, and (iii) to combine multiple datasets. To showcase the practical utility, we introduce the EnvSound-UCS dataset, a publicly available unified UCS-compliant dataset of environmental sounds with 58,057 sound clips from three sources: AudioSet, FSD50K, and ESC-50.
Generative models have shown impressive results in speech enhancement but often suffer from multi-step inference. We propose SB-RF, a one-step generative framework integrating Rectified Flow (RF) with Schrödinger Bridge (SB) theory. SB-RF constructs a conditional bridge between clean and noisy speech distributions via entropy-regularized optimal transport. By aligning SB trajectories with the optimal transport geodesic through the velocity-matching objective of RF, SB-RF enables high-quality enhancement with one-step generation. Experiments demonstrate that SB-RF achieves leading performance among generative methods on the VoiceBank-DEMAND benchmark. Furthermore, to fully assess performance in challenging real-world scenarios, we evaluate SB-RF on a simulated low signal-to-noise ratio test set using an expanded training dataset. Under these conditions, SB-RF exhibits strong and competitive robustness with high efficiency, validating its potential for real-world applications.
Robotic Cellular Warehousing Systems (RCWS) give rise to multi-agent pickup and delivery (MAPD) processes in which robots sequentially collect multiple stock-keeping units (SKUs) for each order. Unlike classical MAPD formulations that assume static tasks, real warehouse operations often involve dynamic order evolution, where new SKUs may be appended to an order while it is being executed. Motivated by this practical requirement, this letter formulates the Dynamic Multi-Agent Pickup and Delivery problem considering internal order evolution for the first time. Building on the token passing paradigm, we propose two event-triggered online replanning algorithms. The first, Dynamic Token Passing, performs localized replanning upon order updates through add-order decomposition and priority-based token scheduling while preserving collision-free execution. The second, Cooperative Token Passing, further enables idle robots to opportunistically assist newly added pickups, improving system-level efficiency. Simulation results in RCWS environments demonstrate that the proposed methods significantly reduce order flowtime compared with static and non-cooperative baselines.
In humanoid motion control, model predictive control (MPC) offers physically grounded prediction and constraint handling, while reinforcement learning (RL) enables robust whole-body skills through large-scale simulation. However, using MPC inside RL often requires time-consuming problem construction or excessive training overhead, making such frameworks difficult to justify in practice. This work studies efficient training-time MPC guidance for humanoid locomotion and manipulation, termed MPC-RL. We introduce a centroidal-dynamics MPC reward formulation that leverages guidance from MPC trajectories at training time. To make this practical in massively parallel RL, we develop $\pi^n$MPC, a parallel-in-horizon and construction-free batched GPU MPC solver that operates directly on time-varying dynamics to avoid high memory usage and pre-compilation. Through a variety of comparative studies and hardware validations, we find that MPC-RL achieves superior performance in locomotion and manipulation skills. The code base is available at this https URL.
Multimodal sentiment analysis (MSA) infers human affect from language, acoustic, and visual signals. Recent methods increasingly adapt large multimodal models (LMMs) via generative readout: prompting the model to emit a sentiment score as a text string. While convenient, this ties continuous regression to discrete autoregressive decoding, incurring unmeasured costs. We revisit this readout mechanism and propose a discriminative formulation built on the Thinker module of a native omni-modal LLM (Qwen2.5-Omni-7B). Instead of text decoding, we map the final-layer hidden state of the last non-padding token to a continuous score via a lightweight regression head in a single forward pass. Using 4-bit quantization and low-rank adaptation (QLoRA), the entire 7B pipeline -- including video and audio processing -- trains on a single consumer GPU (RTX 5090, 32 GB) with 10-21 GB peak memory and 1.14% trainable parameters. Through a controlled comparison fixing the backbone, data, and LoRA configuration, we isolate the impact of the readout. On CMU-MOSI and CMU-MOSEI, our discriminative readout reaches state-of-the-art accuracy without task-specific feature engineering (MOSI: MAE 0.551, Corr 0.888; MOSEI: MAE 0.506, Corr 0.790) and exhibits strong multi-seed stability. In contrast, the generative readout -- even after equivalent supervised training -- more than doubles the mean absolute error, yields unparsable or out-of-range outputs (2.8% zero-shot), and suffers from higher latency. Modality ablations reveal a text-dominant regime on CMU-MOSI. Our findings indicate that how an LMM is read out is as consequential as how it is trained, demonstrating that a discriminative readout offers a more accurate, efficient, and reliable alternative for continuous MSA.
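The discriminative readout described above, which maps the hidden state of the last non-padding token to a scalar, can be sketched as follows. This is an illustration under simplifying assumptions: hidden states are plain lists, the head is a single linear layer, and the function names are hypothetical rather than taken from the paper or from the Qwen2.5-Omni codebase.

```python
def last_non_padding_index(attention_mask):
    """Index of the last position whose mask value is 1."""
    for i in range(len(attention_mask) - 1, -1, -1):
        if attention_mask[i] == 1:
            return i
    raise ValueError("sequence contains only padding")


def regression_readout(hidden_states, attention_mask, weights, bias):
    """Linear head applied to the final-layer state of the last real token."""
    h = hidden_states[last_non_padding_index(attention_mask)]
    return sum(w * x for w, x in zip(weights, h)) + bias


# Toy sequence of three positions; the last one is right padding.
hidden_states = [[0.1, 0.2], [0.5, -0.5], [9.0, 9.0]]
attention_mask = [1, 1, 0]
score = regression_readout(hidden_states, attention_mask, weights=[2.0, 1.0], bias=0.25)
```

Because the score comes from one forward pass and one dot product, there is no string to parse, which is the property the comparison with generative readout turns on.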
This study presents a comparative analysis between the speaker embeddings of speech foundation models and human subjective perception of speaker similarity. Human listeners can judge speaker similarity on a continuous scale, discerning how similar two voices are. In contrast, speech foundation models embed speaker characteristics into numerical representations. However, a question remains: does the numerical distance between speaker embeddings in these models truly align with the similarity perceived by humans? To address this, we conduct a comprehensive investigation using more than 40 models to compare model-derived distances with human-perceived similarity scores. Furthermore, we identify which factors in model configuration contribute most to a speaker embedding that mirrors human perception. Our findings provide insights for the development of more perceptually grounded speech foundation models.
Phase-sensitive optical time-domain reflectometry ($\phi$-OTDR) is widely used in large-scale distributed acoustic sensing (DAS) because it provides distributed spatiotemporal monitoring over long sensing distances. Its field performance can still deteriorate because of polarization-induced fading (PIF), local signal degradation, and strong environmental interference. This study develops a Sagnac-assisted enhanced $\phi$-OTDR sensing architecture and a standardized benchmark framework for engineering-oriented DAS event recognition. The Sagnac interferometer provides a continuous phase response that supplements fading-prone observations in the $\phi$-OTDR channel, and heterogeneous signal alignment is achieved using a cross-correlation procedure implemented on an FPGA platform. The benchmark protocol compares conventional feature-engineering methods, probabilistic shallow classifiers, single-branch deep models, and dual-branch fusion models under consistent data partitioning, preprocessing, and metric definitions. Experiments on a 10-km sensing fiber with six representative acoustic event classes show that the dual-branch fusion model provides the most favorable trade-off among the evaluated methods, reaching 89.79\% accuracy, 89.83\% macro-F1, and a nuisance alarm rate of 5.00\% on the balanced test set. The results also show that channel grouping strongly affects dual-branch evaluation, indicating that deployment-oriented conclusions should be based on accuracy, macro-F1, nuisance alarm rate, false negative rate, and latency rather than accuracy alone. This work provides a physically motivated enhancement strategy for $\phi$-OTDR-based DAS and a reproducible benchmark protocol for future fusion-oriented sensing research. The implementation and scripts for reproducing the DAS event-recognition experiments are publicly available at this https URL.
Text-to-audio retrieval has made significant progress with shared embedding models such as CLAP and Pengi, yet they often struggle with fine-grained semantic alignment due to the inherent modality gap between text and audio. In this work, we propose FORTE, a unified framework that integrates structured logical reasoning with parameter-efficient cross-modal alignment to improve retrieval precision. Our approach first transforms queries into first-order logic and refines them via a constrained search that preserves semantic invariance while introducing discriminative attributes. The refined representation is then aligned with audio embeddings using a lightweight projection module, followed by a predicate-aware re-ranking step that enforces logical consistency at inference. Extensive experiments on AudioCaps and Clotho demonstrate consistent improvements over strong baselines, particularly in challenging fine-grained scenarios. Our results highlight the effectiveness of combining symbolic reasoning with representation learning for cross-modal retrieval.
Leadership in social groups emerges dynamically through interaction and opinion exchange. Empirical evidence indicates that individuals expressing strong opinions tend to gain influence, while sustained leadership critically depends on maintaining alignment with the surrounding social context. Motivated by these observations, we introduce a coupled dynamical model describing the simultaneous evolution of opinions and leadership in a networked population. Extending the Friedkin-Johnsen framework, we represent leadership as a time-varying susceptibility to social influence, which evolves according to a game-theoretic mechanism, consistent with social psychology evidence. Within this setting, agents strengthen their leadership by expressing decisive yet socially coherent opinions, whereas misalignment with the collective state results in a loss of influence. We analyze the coupled dynamics and establish sufficient conditions to identify which agents necessarily emerge as leaders and which act as followers in the social network.
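For readers unfamiliar with the Friedkin-Johnsen framework that this model extends, the sketch below shows one standard update step with fixed susceptibilities. The game-theoretic, time-varying susceptibility described in the abstract is not reproduced here; the weight matrix and the values of `lam` are illustrative assumptions.

```python
def fj_step(x, x0, W, lam):
    """One Friedkin-Johnsen update.

    x   : current opinions
    x0  : initial opinions (each agent's prejudice)
    W   : row-stochastic influence matrix, W[i][j] = weight agent i gives to j
    lam : susceptibility of each agent to social influence, in [0, 1]
    """
    n = len(x)
    return [
        lam[i] * sum(W[i][j] * x[j] for j in range(n)) + (1.0 - lam[i]) * x0[i]
        for i in range(n)
    ]

# Hypothetical three-agent network; agent 0 has low susceptibility,
# so it stays close to its initial opinion.
W = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
x0 = [1.0, 0.0, -1.0]
lam = [0.1, 0.9, 0.9]

x = list(x0)
for _ in range(50):
    x = fj_step(x, x0, W, lam)
print(x)
```

In this picture, a low susceptibility plays the role of leadership: the agent is influenced little by its neighbors and keeps pulling them toward its own opinion.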
Automatic Speech Recognition (ASR) has become a key technology for human--AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.
Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26\%, comparable to dedicated TTS systems such as F5-TTS (5.21\%) and CosyVoice3 (5.30\%). On singing generation, UniVoice achieves a PER of 16.22\%, outperforming the unified baseline Vevo1.5 (24.72\%).
We propose GLASS, a framework for composable acoustic style control in zero-shot autoregressive text-to-speech (TTS) that learns controls from post-generation rewards rather than style labels. In zero-shot TTS, a speaker prompt often entangles speaker identity with prosodic attributes such as speaking rate and pitch, making it difficult to change style without changing the prompt itself. GLASS instead treats each acoustic attribute as a reward-defined control direction. For each control axis, GLASS freezes the TTS backbone and trains one lightweight LoRA adapter with Group Relative Policy Optimization (GRPO), using speech-token length and mean F0 as style rewards and WER as an intelligibility anchor. Because each control is represented as a LoRA weight update, independently trained adapters can be swapped, interpolated, and composed through linear LoRA arithmetic without retraining the backbone. Experiments on speaking rate and pitch control show targeted style shifts while preserving naturalness, speaker similarity, and intelligibility, and demonstrate smooth interpolation and multi-axis composition across independently trained adapters.
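The linear LoRA arithmetic mentioned above can be sketched as follows. Each adapter contributes a low-rank update `B @ A` to a frozen weight matrix, and adapters are combined by a weighted sum. The matrices, the scaling weights, and the function names are illustrative assumptions, not GLASS's actual parameters or code.

```python
def matmul(A, B):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def compose_lora(W, adapters, weights):
    """Return W + sum_k weights[k] * (B_k @ A_k) without modifying W."""
    out = [row[:] for row in W]
    for (B, A), w in zip(adapters, weights):
        delta = matmul(B, A)
        for i in range(len(out)):
            for j in range(len(out[0])):
                out[i][j] += w * delta[i][j]
    return out

# Frozen backbone weight (hypothetical 2x2 example).
W = [[1.0, 0.0],
     [0.0, 1.0]]

# Two independently trained rank-1 adapters (hypothetical values).
rate_adapter = ([[1.0], [0.0]], [[0.0, 1.0]])   # B is 2x1, A is 1x2
pitch_adapter = ([[0.0], [1.0]], [[1.0, 0.0]])

# Half-strength rate control plus full-strength pitch control.
W_combined = compose_lora(W, [rate_adapter, pitch_adapter], [0.5, 1.0])
print(W_combined)
```

Setting a weight to zero removes that control, and intermediate weights interpolate its strength, which is what makes swapping and composition possible without retraining the backbone.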
Ambient clinical scribes increasingly combine Automatic Speech Recognition with Large Language Models to automate documentation. However, traditional metrics like Word Error Rate mask systemic safety degradation. We present a paired acoustic stress test to isolate the causal impact of noise on clinical reasoning. For the same dialogues, we inject diverse noise types while keeping the downstream model configuration frozen. Crucially, we uncover a dangerous disconnect between signal fidelity and clinical safety. Stationary ambient noise increased the Word Error Rate by a negligible 0.71 percentage points yet nearly doubled the rate of unsafe outputs. Our analysis reveals that minor acoustic perturbations can invert clinical meaning without substantially inflating error rates. Furthermore, we demonstrate a lightweight strategy that mitigates safety degradation under noisy conditions without requiring model fine-tuning.
Although artificial neural network (ANN) based speech enhancement (SE) methods demonstrate excellent performance, their high computational complexity and energy consumption hinder deployment in practical front-end processing tasks. Spiking neural networks (SNNs) have recently shown potential for reducing power consumption. However, the discrete binary activations and complex spatio-temporal dynamics of SNNs often result in information loss. The current challenge is therefore how to maintain performance while reducing computational complexity. To address this issue, this work proposes a Dual-Branch Hybrid Neural (DBHN) Network. 1) In terms of network architecture: a dual-branch network integrating an ANN and an SNN is designed, where the SNN branch reduces power consumption while the ANN branch addresses information loss; BandSplit and Time-Frequency (TF)-Mamba modules are developed to simultaneously reduce energy consumption and enhance model performance; and Spiking Feature Extraction Group (SFEG) and Information Transformation Block (ITB) components are implemented with residual connections to mitigate information loss while further refining feature representations. 2) To facilitate inter-branch information fusion: an Interaction module is designed to promote information exchange at various stages of the dual-branch network, and a TF-Cross Attention-Fusion module is designed to perform time-frequency domain fusion of dual-branch information while data-adaptively guiding the SNN branch to retain more critical information. Results show that the proposed model maintains superior performance across three public datasets while achieving an average 7.5-fold reduction in computational complexity compared to baseline models.
When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
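The "64% of the gap" figure can be checked with a short calculation. The sketch assumes the gap is measured from the fixed-fusion baseline (90.0%) to the oracle (96.6%); the abstract does not state the reference point explicitly, but this choice reproduces the reported value.

```python
def gap_recovered(system, baseline, oracle):
    """Fraction of the baseline-to-oracle gap closed by the system."""
    return (system - baseline) / (oracle - baseline)

# P@1 values quoted in the abstract; the baseline choice is an assumption.
fraction = gap_recovered(system=94.2, baseline=90.0, oracle=96.6)
print(f"{fraction:.1%}")
```

The result is about 0.636, which rounds to the reported 64%.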
The double-directional (DD) wireless channel model is important for realistic system design since it provides complete propagation information. While stochastic and deterministic channel models are widely adopted, and existing machine learning (ML) solutions mostly aim to align future channel realizations, these solutions are often limited to short time spans that may not be statistically significant. Moreover, because the number of multi-path components (MPCs) varies with spatial and temporal variation of the receiver (RX) and/or interacting objects (IOs), typical ML solutions that require fixed, predefined input and output shapes fall short. To overcome these limitations, we propose a statistics-aided ML solution that relies on selecting a fixed subset of MPCs. More specifically, we first select the top-$M$ MPCs, where $M\in\mathbb{Z}^+$ is much smaller than the total number of MPCs, and construct learnable graphs to train our proposed hybrid TimesNet-TimeFilter (TNTF) model. We then use a channel statistics-aided training method to generate future top-$M$ DD channel realizations such that the statistics calculated from these realizations closely match those of the complete time-varying DD channel realizations. We validate the proposed solution using extensive simulations on both synthetic stochastic channel model (SCM)-based and deterministic ray-tracing-based datasets, and demonstrate its effectiveness relative to state-of-the-art baselines.
Medical knowledge graphs (MKGs) infused with clinical knowledge have been increasingly used to model electronic health records (EHRs) to support interpretable predictions in the healthcare domain. However, existing MKG-based approaches are limited to capturing pairwise relations between clinical concepts (e.g., conditions, procedures, and medications), which restricts their ability to model higher-order interactions among co-occurring or semantically related concepts. In addition, most representation learning methods that leverage MKGs either collapse temporal information across visits or lack an explicit mechanism for modeling long-range temporal dependencies, which is critical for clinical tasks such as mortality prediction. To mitigate these limitations, we propose HoT-SSM, a parameter-efficient, higher-order temporal graph reasoning framework based on state space models. For each visit, HoT-SSM constructs hypergraphs by grouping semantically related clinical concepts into hyperedges using domain knowledge, thereby preserving visit-level clinical context. Further, to model the temporal dynamics while learning the representations, we introduce a novel dynamic hypergraph-based state space model that explicitly captures patients' latent state evolution over time while preserving long-range information. The learned representations are used for downstream clinical prediction and reasoning. Experiments on the MIMIC-III and MIMIC-IV datasets show significant performance improvements over the current state-of-the-art models, demonstrating the effectiveness of jointly modeling higher-order clinical interactions and long-range temporal dependencies.
Large audio language models (LALMs) are increasingly deployed in real-world applications, yet their safety alignment is still primarily evaluated on monolingual, text-based harmful prompts. This leaves their generalizability under multilingual and spoken settings, particularly code-switched speech, largely underexplored. To address this gap, we introduce SpeechJBB, an audio jailbreak dataset for benchmarking across multiple state-of-the-art LALMs. The extent of safety weaknesses is further probed by introducing an augmented setting where phonologically plausible pseudo-words are inserted around safety-critical terms to simulate localized obfuscation. Across models, code-switched harmful audio yields substantially high jailbreak success rates (JSR), with non-English monolingual and non-English code-switched pairs exhibiting the highest attack success. Pseudo-word insertion further reduces refusal rates, which demonstrates that natural-sounding obfuscation can effectively bypass safety policies.
Soft, growing vine robots extend through tip eversion, a mechanism that enables navigation through cluttered environments. However, integrating cameras and other sensors at the tip is uniquely challenging because the material forming the tip is constantly renewed as the robot grows. This continual material turnover, combined with friction between internal layers, added tip weight, and fabric constriction, complicates sensor and tool mounting. These limitations hinder the deployment of vine robots for inspection and search tasks, where rapid growth while carrying tip-mounted sensors is essential. In this work, we present a triangular roller tip mount that reduces internal resistance during growth by rolling rather than sliding against the robot body. The design was refined through iterative failure analysis, enabling, for the first time, consistent eversion on a TPU-coated ripstop nylon vine robot. To quantitatively evaluate mount performance, we introduce a custom testbed that isolates tip mounting effects by measuring tail tension during eversion. Comparative experiments across multiple mount variants, including prior designs, show that our triangular roller mount achieves the lowest tail tension and most repeatable growth performance. These results establish both a validated tip mount design and a repeatable benchmarking framework for advancing sensor and tool integration in soft growing robots. CAD for the mount and testbed is available at: this https URL.
Second-language (L2) speech recognition often requires transcriptions of pronunciations and intended meanings. Multi-task learning (MTL) is a natural approach because it assumes that shared representations benefit both outputs. However, this paper shows that this assumption does not hold across Korean and English. MTL improves meaning but degrades surface transcription, especially in English, where the degradation scales with surface-meaning divergence measured by Levenshtein edit distance. The analysis links these patterns to encoder-level entanglement, with Korean preserving distinct task representations while English produces nearly identical ones. Cross-task decoder analysis shows that the meaning dual-output decoder adapts with a unique representation, while the surface dual-output decoder remains constrained by the encoder. These findings motivate the design of MTL frameworks that mitigate encoder-level entanglement to reduce surface degradation in dual-output L2 automatic speech recognition.
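Surface-meaning divergence is measured with the Levenshtein edit metric, which can be computed with the standard dynamic program sketched below. The example operates on characters and uses a made-up sentence pair; whether the paper computes the metric over characters, phonemes, or words is not stated in the abstract.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # delete item_a
                            curr[j - 1] + 1,      # insert item_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical L2 example: what was pronounced versus what was meant.
surface = "I goed home"
meaning = "I went home"
print(levenshtein(surface, meaning))
```

Because the function only compares items for equality, passing lists of words or phonemes instead of strings gives the corresponding token-level distance.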
Zero-shot cross-lingual speech emotion recognition (SER) remains challenging due to distribution mismatches across languages and the lack of emotion annotations in the target language. Under such conditions, models trained solely on source-language data frequently suffer from degraded generalization when evaluated on unseen target languages. To address this limitation, we propose an emotion-discriminative representation learning method that integrates supervised contrastive learning and speaker adversarial learning. The contrastive learning promotes cross-lingual emotion alignment, while speaker adversarial learning suppresses speaker-related cues to encourage speaker-invariant representations. Experimental results under a zero-shot cross-lingual SER setting demonstrate that the proposed method significantly improves SER performance over conventional training strategies.
Automatic speech recognition (ASR) has advanced remarkably for standard speech; however, pathological speech from neurological conditions remains a significant challenge. We investigate speaker conditioning via Feature-wise Linear Modulation (FiLM), injecting x-vector-derived information into each transformer layer of a frozen ASR encoder to adapt internal representations to individual pathological speakers without modifying base model weights. We benchmark this for the ASR task against standard and parameter-efficient fine-tuning baselines, complemented by post-processing, on Spanish and English pathological speech. Additionally, we evaluate if the adapted model preserves the ability to answer speech-related questions. Results show that speaker-conditioned ASR is competitive with established adaptation strategies while retaining performance on non-conditioned speech.
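Feature-wise Linear Modulation scales and shifts each feature of a hidden vector using parameters predicted from a conditioning vector, here a speaker embedding. The sketch below shows that operation for a single hidden vector. The linear generators for `gamma` and `beta`, their dimensions, and all numeric values are illustrative assumptions, not the configuration used in the paper.

```python
def linear(weight, bias, s):
    """Affine map weight @ s + bias for nested-list weight and list inputs."""
    return [sum(w * x for w, x in zip(row, s)) + b for row, b in zip(weight, bias)]

def film(h, speaker_vec, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: h' = gamma(s) * h + beta(s), applied feature-wise."""
    gamma = linear(w_gamma, b_gamma, speaker_vec)
    beta = linear(w_beta, b_beta, speaker_vec)
    return [g * x + b for g, x, b in zip(gamma, h, beta)]

# Hypothetical sizes: 3 hidden features, 2-dimensional speaker vector.
h = [0.5, -1.0, 2.0]
speaker_vec = [1.0, -1.0]

# Initializing gamma to 1 and beta to 0 leaves the frozen encoder's output
# unchanged, a common starting point for training a conditioning module.
w_zero = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(film(h, speaker_vec, w_zero, [1.0, 1.0, 1.0], w_zero, [0.0, 0.0, 0.0]))
```

Only the small generator weights are trained in such a setup, so the base model weights remain untouched, consistent with the frozen-encoder setting described above.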
Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-regularized autoencoder bottleneck and a latent-side representation encoder. The bottleneck uses channel normalization and stochastic perturbation instead of KL-based variational training, yielding scale-controlled continuous latents for reconstruction and autoregressive generation. The representation encoder is trained on frozen autoencoder latents with RQ-MTP and frozen-LLM supervision. The resulting tokenizer provides high-dimensional representations for understanding while preserving normalized continuous latents as generation targets.
Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision--language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.
Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call test-time feedback (TTF): the mismatch between the training/validation loss and downstream metrics of interest, such as task success rate and generation quality, which grows with task length. While data curation, architecture, and objective design have been proposed to combat train-test shift in TTF settings, this paper proposes optimization as a new design axis to mitigate error accumulation. Specifically, we introduce a new optimization paradigm called double-preconditioning (DoPr) uniquely tailored to the challenges of TTF. DoPr combines gradient-wise preconditioning, as in Adam and Muon, with activation-wise preconditioning (AP), such as in KFAC. We show that the addition of AP yields a drop-in intervention for increasing downstream model performance across a range of TTF settings. Interestingly, these gains in test-time performance do not consistently accompany improvements in validation loss, opening new questions about how to properly evaluate models trained with one-step supervised objectives.
In this article, we introduce a novel low-altitude wireless network (LAWN), which is a reconfigurable, three-dimensional (3D) layered architecture. In particular, the LAWN integrates connectivity, sensing, control, and computing across aerial and terrestrial nodes that enable seamless operation in complex, dynamic, and mission-critical environments. Different from the conventional aerial communication systems, LAWN's distinctive feature is its tight integration of functional planes in which multiple functionalities continually reshape themselves to operate safely and efficiently in the low-altitude sky. With the LAWN, we discuss several enabling technologies, such as integrated sensing and communication (ISAC), semantic communication, and fully-actuated control systems. Finally, we identify potential applications and key cross-layer challenges. This article offers a comprehensive roadmap for future research and development in the low-altitude airspace.
Low-altitude wireless networks (LAWNs) have been envisioned as flexible and transformative platforms for enabling delay-sensitive control applications in Internet of Things (IoT) systems. In this work, we investigate the real-time wireless control over LAWNs, where an aerial drone is employed to serve multiple mobile automated guided vehicles (AGVs) via finite blocklength (FBL) transmission. Toward this end, we adopt the model predictive control (MPC) to ensure accurate trajectory tracking, while we analyze the communication reliability using the outage probability. Subsequently, we formulate an optimization problem to jointly determine control policy, transmit power allocation, and drone trajectory by accounting for the maximum travel distance and control input constraints. To address the resultant non-convex optimization problem, we first derive the closed-form expression of the outage probability under FBL transmission. Based on this, we reformulate the original problem as a quadratic programming (QP) problem, followed by developing an alternating optimization (AO) framework. Specifically, we employ the projected gradient descent (PGD) method and the successive convex approximation (SCA) technique to achieve computationally efficient sub-optimal solutions. Furthermore, we thoroughly analyze the convergence and computational complexity of the proposed algorithm. Extensive simulations and AirSim-based experiments are conducted to validate the superiority of our proposed approach compared to the baseline schemes in terms of control performance.
The design of many classical optimization algorithms is driven by the certification of linear convergence rates over classes of optimization problems. In this paper, we consider the problem of improving the average-case performance of an algorithm over a specific distribution of problem instances. While this task can be tackled by embedding trainable components into the algorithm updates, a key challenge is to preserve worst-case guarantees across the entire problem class. For classes of composite optimization problems, we show that all linearly convergent algorithms can be parametrized in terms of a baseline linearly convergent algorithm, and a set of trainable, exponentially-decaying modifications to its update rule; crucially, this parametrization excludes all, and only, the algorithms that do not converge linearly. Our results apply to improving the average-case performance of classical algorithms such as gradient descent for nonconvex, gradient-dominated functions; Nesterov's accelerated method for smooth, strongly convex functions; and projected gradient methods for optimization over polyhedral feasible sets. We illustrate how our characterization can be used for learning to optimize with linear convergence and feasibility guarantees. Numerical results showcase benefits over classical optimizers when solving ill-conditioned systems of linear equations and running a model predictive control scheme on a linear dynamical system.
Most existing robust control barrier functions (CBFs) can only handle matched disturbances, restricting their applications in real-world scenarios. While some recent advances extend robust CBFs to unmatched disturbances, they rely heavily on the differentiability of the disturbances and fail to accommodate the non-differentiable case for safety constraints with high relative degree. To address these limitations, this paper proposes a class of disturbance rejection CBFs (DRCBFs), including knowledge-based DRCBFs (kDRCBFs) and reciprocal-compensated DRCBFs (rDRCBFs). These two DRCBFs can strictly guarantee safety under general bounded disturbances, which include matched and unmatched, differentiable and non-differentiable disturbances as special cases. Moreover, rDRCBFs require no information about the disturbance. Simulation results illustrate that the proposed DRCBFs outperform existing robust CBFs.
Hadamard matrix-based aperture encoding is a method for producing synthetic aperture datasets with high Signal-to-Noise Ratios. Recently, the pulse inversion capabilities of bias-sensitive Top-Orthogonal to Bottom Electrode (TOBE) arrays have driven the development of multiple Hadamard-based sequences. These sequences produce high-quality static images but are sensitive to motion. This work introduces Recursive Aperture Decoded Imaging (READI) and Estimated Motion-Compensated Compounding (EMC2), which aim to reduce this sensitivity. READI is a novel decoding and beamforming technique for Hadamard aperture-encoded sequences that produces multiple low-resolution images from subsets of the full sequence. These READI images are less affected by motion and sum to form the complete high-resolution image. EMC2 describes the process of comparing these low-resolution images to estimate the underlying motion, then warping them to align before compounding. This produces a high-resolution image that is resilient to motion. READI with EMC2 is applied to the TOBE-based Fast Orthogonal Row-Column Electronic Scanning (FORCES) sequence. It is shown to fully restore images corrupted by probe motion and to recover tissue speckle and boundaries in images of a beating heart phantom. READI low-resolution images by themselves are demonstrated to be a marked improvement over a sparse Hadamard scheme with the same transmit count, and are able to recover blood speckle at a flow rate of 42 cm/s.
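As background for Hadamard aperture encoding, the sketch below shows the basic encode/decode round trip. It assumes that each transmit event fires all elements with the +1/-1 pattern of one Hadamard row and that decoding applies the transpose divided by the matrix size. READI's recursive subset decoding and EMC2's motion compensation are not reproduced, and the per-element values are made up.

```python
def hadamard(n):
    """Sylvester-construction Hadamard matrix; n must be a power of two."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def encode(H, x):
    """Each transmit event records a +/-1 weighted sum of all element signals."""
    return [sum(h * v for h, v in zip(row, x)) for row in H]

def decode(H, y):
    """Recover per-element signals using H^T H = n I."""
    n = len(H)
    return [sum(H[j][i] * y[j] for j in range(n)) / n for i in range(n)]

# Hypothetical per-element echo amplitudes for a 4-element aperture.
element_signals = [3, 0, -2, 5]
H = hadamard(4)
encoded = encode(H, element_signals)
print(decode(H, encoded))
```

Decoding needs every transmit event of the sequence, which is why tissue or probe motion during acquisition corrupts the result and motivates decoding from subsets.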
This paper addresses the problem of designing recommendation systems for social networks and e-commerce platforms from a control-theoretic perspective. We treat the design of recommendation systems as a state-feedback infinite-horizon optimal control problem with a performance index that (i) rewards alignment and engagement, (ii) penalizes polarization and large deviations from an uncontrolled baseline, and (iii) regularizes exposure across neighboring users. The recommendation entries are fed to the platform users, who are assumed to follow a networked, multi-topic, continuous-time opinion dynamics. We show that the designed control yields a stabilizing recommendation system under simple algebraic spectral conditions on the weights that encode the platform's preference for engagement, stability of preferences, polarization, and cross-user diversity. Conversely, we show that when ill-posed weights are selected in the optimal control problem (namely, when engagement is excessively rewarded), the closed-loop system can exhibit destabilizing, pathological behaviors that conflict with the design objectives.
Anisotropic image analysis is ubiquitous in medical and scientific imaging, and while the literature on the subject is extensive, the robustness to numerical rotations of numerous methods remains to be studied. Indeed, the principal directions and angular profile of a rotated image are often expected to rotate accordingly. In this work, we propose a new spectral method for the anisotropic analysis of images (EquivAnIA) using two established directional filters, namely cake wavelets and ridge filters. We show that it is robust to numerical rotations through extensive experiments on synthetic and real-world images containing geometric structures or textures, and we also apply it successfully to an angular image registration task. The code is available at this https URL.
During conversational interactions, humans subconsciously engage in concurrent thinking while listening to a speaker. Although this internal cognitive processing may not always manifest as explicit linguistic structures, it is instrumental in formulating high-quality responses. Inspired by this cognitive phenomenon, we propose a novel Full-duplex LAtent and Internal Reasoning method named FLAIR that conducts latent thinking simultaneously with speech perception. Unlike conventional "thinking" mechanisms in NLP, which require post-hoc generation, our approach aligns seamlessly with spoken dialogue systems: during the user's speaking phase, it recursively feeds the latent embedding output from the previous step into the next step, enabling continuous reasoning that strictly adheres to causality without introducing additional latency. To enable this latent reasoning, we design an Evidence Lower Bound-based objective that supports efficient supervised finetuning via teacher forcing, circumventing the need for explicit reasoning annotations. Experiments demonstrate the effectiveness of this think-while-listening design, which achieves competitive results on a range of speech benchmarks. Furthermore, FLAIR robustly handles conversational dynamics and attains competitive performance on full-duplex interaction metrics.
This paper presents a novel density control framework for multi-robot systems with spatial safety and energy sustainability guarantees. Stochastic robot motion is encoded through the Fokker-Planck Partial Differential Equation (PDE) at the density level. Control Lyapunov and control barrier functions are integrated with PDEs to enforce target density tracking, obstacle region avoidance, and energy sufficiency over multiple charging cycles. The resulting quadratic program enables fast in-the-loop implementation that adjusts commands in real time. A multi-robot experiment and extensive simulations were conducted to demonstrate the effectiveness of the controller under localization and motion uncertainties.
Challenging indoor and urban environments with severe multipath propagation and obstructed line-of-sight degrade classical radio positioning. Multipath-based simultaneous localization and mapping (MP-SLAM) addresses this by building and exploiting propagation maps for robust localization. Emerging distributed multiple-input multiple-output (D-MIMO)/extremely large-scale MIMO (XL-MIMO) infrastructures provide large spatial apertures and high-resolution sensing, especially when phase coherence is maintained across base stations, subarrays, or distributed arrays. We propose a scalable Bayesian direct MP-SLAM method for coherent data fusion in D-MIMO/XL-MIMO systems that jointly infers the environment while performing robust, high-accuracy localization directly from raw radio signals. While commonly used zero-mean Type-II likelihood functions inherently lead to noncoherent processing across distributed arrays and thus to aperture loss, the proposed phase-preserving nonzero-mean Type-II likelihood shares a complex mean across distributed arrays. This enables coherent fusion and preserves the distributed aperture gain, while the variance captures noncoherent signal power. The method is combined with a surface model that enables map-feature fusion across the distributed infrastructure and supports near-field propagation and visibility effects. Bayesian inference is performed using belief propagation by means of the sum-product algorithm on a factor graph with particle-based messages. Parallelizing over particles and arrays, the GPU-accelerated implementation achieves millisecond-level runtimes even in large or distributed infrastructures. Simulation results show that the proposed method achieves performance gains over existing noncoherent methods and approaches the corresponding posterior CRLB, highlighting the potential of coherent processing for high-resolution sensing and localization.
We present principles of algebraic diversity (AD), a group-theoretic approach to signal processing that exploits signal symmetry to extract more information per observation, complementing classical methods that use temporal and spatial diversity. The transformations under which a signal's statistics are invariant form a matched group; this group determines the natural transform for analysis, and averaging an estimator over the group action reduces variance without requiring additional snapshots. The viewpoint is broadened in five directions beyond the single-observation measurement of a companion paper. Rank promotion admits AD on scalar data streams and identifies the law of large numbers as the trivial-group case of a $(G, L)$ continuum combining sample-count with group-orbit averaging. An eigentensor hierarchy handles signals with nested symmetry. A blind group-matching methodology identifies the matched group from data via a polynomial-time generalized eigenvalue problem on the unitary Lie algebra, placing the DFT, DCT, and Karhunen-Loève transforms as distinguished points on a transform manifold. A cost-symmetry matching principle then extends AD from measurement to blind and adaptive signal processing generally; blind equalization is given as a detailed example, with the Constant Modulus Algorithm's residual phase ambiguity predicted analytically and matched within two degrees on 3GPP TDL multipath channels, and other blind problems in signal processing are mapped into the framework. Four theorems formalize a structural capacity $\kappa$, the Rényi-2 analog of Shannon and von Neumann's Rényi-1 entropies, quantifying how a signal's information is organized rather than how much information it contains. We also discuss AD's relationship to prior algebraic approaches, including invariant estimation, minimax robust estimation, algebraic signal processing, and compressed sensing.
Purpose: Access to electroencephalography (EEG) remains limited across low- and middle-income countries (LMICs) due to cost, infrastructure requirements, and a shortage of trained staff. This study evaluated the feasibility and clinical utility of a smartphone-based EEG system in a real-world setting. Methods: We conducted a multicenter observational study (November 2023 to April 2026) across 29 clinical sites in Kenya. A smartphone-based 27-lead EEG system enabled trained healthcare workers to acquire standardized recordings with remote expert interpretation. Results: 3,036 EEG sessions were performed. Male patients constituted 57.8% of the cohort, with representation across pediatric and adult populations. The most common referral indication was seizures or convulsions (68.5%). Overall, 2,915 (96%) recordings were interpretable, while 121 (4%) were uninterpretable, primarily due to high electrode impedance and insufficient recording duration. Uninterpretable recordings were significantly shorter than interpretable recordings (mean 18.5 vs. 33.8 minutes; median 15.1 vs. 31.6 minutes; p < 0.0001). Mean turnaround time for interpretation was 107 minutes. Among interpretable recordings, 917 (30.2%) were abnormal, including 701 (76.4%) with epileptiform abnormalities, 215 (23.4%) with non-epileptiform findings, and 1 (0.1%) indeterminate finding. Epileptiform abnormalities were highest in children aged 4-9 years (33.1%) and less frequent in adults (14-21%). Non-epileptiform abnormalities were more common in patients aged 60+ years (19.2%) compared to younger age groups (3-9%). Conclusion: Large-scale, point-of-care EEG acquisition by non-specialist operators in a resource-limited setting is feasible. Expansion of smartphone-based EEG systems may improve equitable access to neurological diagnosis and care in LMICs.
This study proposes a novel radar-centric signaling design and architecture for secure integrated sensing and communication (ISAC) systems. The proposed framework is designed to provide robust physical layer security for data transmission while simultaneously enhancing sensing privacy. It employs index modulation (IM) and phase coding (PC) over frequency-modulated continuous-wave (FMCW) radar chirps: IM provides an outer layer of data security, while PC is explicitly designed to perturb the resulting signal's ambiguity function (AF) to enhance sensing privacy. This design reduces the risk of unauthorized surveillance by rendering target velocity estimation practically infeasible for unauthorized passive sensing hardware (i.e., a sensing eavesdropper, S-Eve) and significantly impairing its range estimation capabilities. Furthermore, this study presents the transmitter and receiver architectures required for effective modulation and demodulation of the proposed ISAC signaling and for performing sensing at the legitimate sensing hardware. Simulation results show that the proposed approach achieves high data throughput while enhancing communication security and sensing privacy.
This study proposes a radar-centric integrated sensing and communication (ISAC) system utilizing a two-layer modulation scheme for vehicular networks. Frequency-modulated continuous-wave (FMCW) chirps are jointly modulated via phase modulation (PM) and index modulation (IM) to transmit data while maintaining sensing as the primary function. To support this, a novel radar signal processing technique is developed to mitigate the impacts of IM and PM on sensing accuracy, alongside a communication receiver architecture designed to successfully demodulate IM and PM data within FMCW chirps. System performance is evaluated through simulations in the 2.4 GHz and 24 GHz bands under Doppler effects, achieving communication throughputs of 25 Mbps and 50 Mbps, respectively. Furthermore, a proof-of-concept hardware implementation is realized, and experimental measurements via a loopback cable are performed to verify the feasibility of the architecture. Finally, the study evaluates the fundamental trade-off among communication throughput, sensing accuracy, and out-of-band emission, demonstrating the system's flexibility to dynamically adjust waveform parameters to meet varying operational requirements.
The Open Radio Access Network (O-RAN) architecture allows AI to be embedded directly into the RAN through modular xApps and rApps, yet creating these applications (collecting data, training models, writing code, and deploying them safely) remains slow and largely manual. Large Language Models (LLMs) offer strong reasoning and code-generation capabilities but are unsuited for the fast, deterministic inference required in real-time RAN control. We present a proof-of-concept Dual-Brain architecture that combines both strengths: an LLM-based orchestrator translates operator intents into data-collection policies and deployment code, while an automated ML engine, NeuralSmith, trains lightweight classifiers on demand via an API. We describe the architecture and provisioning workflow, share practical insights from a containerized O-RAN 5G SA testbed, and discuss open research directions.
Low-density EEG is more suitable for wearable and IoT-based brain sensing, but sparse electrode sampling often lacks sufficient spatial information to characterize cross-regional neural activity. EEG spatial super-resolution aims to recover dense-channel EEG from sparse recordings, yet remains challenging because channel missingness typically occurs at the whole-channel level, spatiotemporal dependencies over the full electrode layout are often underexplored, and the mapping from sparse to dense signals is inherently ambiguous. To address these issues, we propose TGSD, a topology-guided state-space diffusion framework for EEG spatial super-resolution. TGSD first employs a Hierarchical Spatial Prior Encoder to learn topology-aware priors over the complete electrode layout by integrating local geometric relationships with region-level contextual information. Based on these priors and sparse observations, a Conditional State-Space Diffusion Reconstructor progressively generates missing-channel signals through reverse diffusion, while alternating temporal and channel-wise state-space modeling captures long-range temporal dynamics and inter-channel dependencies in a unified framework. Experiments on the SEED and PhysioNet MM/I datasets show that TGSD consistently outperforms representative baselines under different super-resolution factors in both reconstruction fidelity and downstream classification performance. These results demonstrate the effectiveness of combining topology-aware spatial priors with conditional diffusion for enhancing practical low-density EEG sensing in wearable and IoT scenarios. The official implementation code is available at this https URL.
This contribution presents an experimental integrated real-time 8 x 8 distributed MIMO (D-MIMO) testbed for wideband backscatter communication (BSC) and wireless power transfer (WPT). The testbed operates in the 2.45 GHz band with coherent sampling at 200 MS/s, employs a backscatter link frequency of 40 kHz, and uses wideband 5G NR reference signals for excitation. We evaluate the testbed by exploiting the estimated channel state information (CSI) in two target applications: wireless power transfer towards the backscatter device (BD) and real-time positioning of a BD in an indoor environment. In conjunction with the baseband processing chain introduced, the testbed requires less than 2 ms of total airtime to excite the system and acquire the signals for subsequent synchronization and CSI estimation on uplink BSC signals. With the CSI, we demonstrate effective energy harvesting gains of up to 12 dB.
Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
This paper presents VaN3Twin, the first open-source, full-stack Network Digital Twin (NDT) framework for simulating the coexistence of multiple Vehicle-to-Everything (V2X) communication technologies with accurate physical-layer modeling via ray tracing. VaN3Twin extends the ms-van3t simulator by integrating Sionna Ray Tracer (RT) in the loop, enabling high-fidelity representation of wireless propagation, including diverse Line-of-Sight (LoS) conditions with focus on LoS blockage due to other vehicles' meshes, Doppler effect, and site-dependent effects (e.g., scattering and diffraction). Unlike conventional simulation tools, the proposed framework supports realistic coexistence analysis across DSRC and C-V2X technologies operating over shared spectrum. A dedicated interference tracking module captures cross-technology interference at the time-frequency resource block level and enhances signal-to-interference-plus-noise ratio (SINR) estimation by eliminating artifacts such as the bimodal behavior induced by separate LoS/NLoS propagation models. Compared to field measurements, VaN3Twin reduces application-layer disagreement by 50% in rural and over 70% in urban environments with respect to current state-of-the-art simulation tools, demonstrating its value for scalable and accurate digital twin-based V2X coexistence simulation.
Elliptically symmetric distributions are a classic example of a semiparametric model where the location vector and the scatter matrix (or a parameterization of them) are the two finite-dimensional parameters of interest, while the density generator represents an infinite-dimensional nuisance term. This basic representation of the elliptic model can be made more accurate, rich, and flexible by considering additional finite-dimensional nuisance parameters. Our aim is therefore to investigate the deep and counter-intuitive links between statistical efficiency in estimating the parameters of interest in the presence of both finite- and infinite-dimensional nuisance parameters. Previous seminal works have addressed this problem by leveraging a general result: if the statistical model has a specific group invariance, then the projection operator onto the semiparametric nuisance tangent space can be asymptotically expressed as a conditional expectation with respect to the maximal invariant sub-$\sigma$-algebra. In this article, we show that, for the statistical model of elliptical distributions, the projection operator can be explicitly computed without relying on the above-mentioned asymptotic approximation. This allows us to obtain original results also for the case in which the location vector and the scatter matrix are parameterized by a finite-dimensional vector that can be partitioned into two sub-vectors: one containing the parameters of interest and the other containing the nuisance parameters. As an example, we illustrate how the obtained results can be applied to the well-known "low-rank" parameterization. Furthermore, while the theoretical analysis will be developed for Real Elliptically Symmetric (RES) distributions, we show how to extend our results to the case of Circular and Non-Circular Complex Elliptically Symmetric (C-CES and NC-CES) distributions.
Inspired by recent developments in neural speech coding and diffusion-based language modeling, we tackle speech enhancement by modeling the conditional distribution of clean speech codes given noisy speech codes using absorbing discrete diffusion. The proposed approach, which we call ADDSE, leverages both the expressive latent space of neural audio codecs and the non-autoregressive sampling procedure of diffusion models. To efficiently model the hierarchical structure of residual vector quantization codes, we propose RQDiT, which combines techniques from RQ-Transformer and diffusion Transformers for non-autoregressive modeling. Results show competitive performance in terms of non-intrusive objective metrics on two datasets, especially at low signal-to-noise ratios and with few sampling steps. Code and audio examples are available online.
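As a rough illustration of the absorbing forward process mentioned above, the sketch below corrupts a grid of discrete codes by sending each token independently to a mask state. The mask id `MASK`, the function name `absorb`, the use of `t` directly as the masking probability, and the (depth, frames) layout of the example codes are all assumptions made for illustration; the conditioning on noisy speech codes and the RQDiT model are not reproduced here.

```python
import numpy as np

MASK = -1  # hypothetical id for the absorbing (mask) state


def absorb(codes, t, rng):
    """Forward corruption: each token independently jumps to MASK with probability t.

    t = 0 leaves the codes untouched; t = 1 masks every token.
    """
    codes = np.asarray(codes)
    masked = rng.random(codes.shape) < t
    return np.where(masked, MASK, codes)


rng = np.random.default_rng(0)
# Illustrative codes, e.g. 3 quantizer levels by 4 frames (assumed layout).
codes = np.arange(12).reshape(3, 4)
partially_masked = absorb(codes, 0.5, rng)
```

A denoiser trained on such corrupted inputs can fill in many masked positions per step, which is what makes sampling non-autoregressive.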
Coded caching (CC) can transform cache memory at network devices into an active communication resource and significantly enhance the Degrees of Freedom (DoF) of multi-input multi-output (MIMO) systems by jointly exploiting global caching and spatial multiplexing gains. Existing linearly decodable MIMO-CC designs, however, largely rely on symmetric stream allocation, where all scheduled users receive the same number of streams, which induces coarse DoF granularity and may leave spatial dimensions unused. This letter studies one-shot linearly decodable MIMO-CC delivery with arbitrary per-user stream allocations. We derive a sufficient stream-count decodability condition, expressed through per-user stream counts and multicast-codeword multiplicities, that generalizes the symmetric common-stream feasibility rule. Building on this condition, we develop a greedy multicast scheduling procedure with certified linear decodability, which redistributes coded multicast messages across transmission intervals to realize asymmetric stream allocations. Numerical results show that the proposed scheduler fills DoF-granularity gaps and improves finite-SNR symmetric rates over the state of the art.
We establish that temporal averaging over multiple observations is the degenerate case of algebraic group action with the trivial group $G=\{e\}$. A General Replacement Theorem proves that a group-averaged estimator from one snapshot achieves a subspace decomposition equivalent to that of multi-snapshot covariance estimation. The Trivial Group Embedding Theorem proves that the sample covariance is the accumulation of trivial-group estimates, with variance governed by a $(G,L)$ continuum as $1/(|G|\cdot L)$. The processing gain $10\log_{10}(M)$ dB equals the classical beamforming gain, establishing that this gain is a property of group order, not sensor count. The DFT, DCT, and KLT are unified as group-matched special cases. We conjecture a General Algebraic Averaging Theorem extending these results to arbitrary statistics, with variance governed by the effective group order $d_{\mathrm{eff}}$. Monte Carlo experiments on the first four sample moments across five group types confirm the conjecture to four-digit precision. The framework exploits the structure of information (representation-theoretic symmetry of the data object) rather than the content, complementing Shannon's theory. Five applications are demonstrated: single-snapshot MUSIC, massive MIMO, single-pulse waveform classification, graph signal processing, and analysis of transformer LLMs. Techniques for blind group matching are described.
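To make the idea of group averaging from a single snapshot concrete, the following minimal sketch averages the rank-one estimate $x x^H$ over the cyclic shift group. The choice of the cyclic group, the size `M = 8`, and the function name `group_averaged_covariance` are assumptions for illustration, not the paper's construction. The averaged matrix is circulant, so the DFT diagonalizes it, which is one way to see the DFT as the transform matched to this group.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8  # illustrative snapshot length (also the cyclic group order)
x = rng.normal(size=M) + 1j * rng.normal(size=M)  # a single snapshot


def group_averaged_covariance(x):
    """Average the rank-one estimate x x^H over all cyclic shifts of x."""
    M = len(x)
    R = np.zeros((M, M), dtype=complex)
    for g in range(M):
        xg = np.roll(x, g)            # action of the g-th group element
        R += np.outer(xg, xg.conj())
    return R / M


R = group_averaged_covariance(x)

# The average is invariant under the group action, i.e. circulant.
is_circulant = np.allclose(np.roll(np.roll(R, 1, axis=0), 1, axis=1), R)

# The unitary DFT matrix diagonalizes any circulant matrix.
F = np.fft.fft(np.eye(M)) / np.sqrt(M)
D = F @ R @ F.conj().T
off_diagonal_energy = np.linalg.norm(D - np.diag(np.diag(D)))
```

In this special case the diagonal of `D` equals `np.abs(np.fft.fft(x))**2 / M`, so the eigenvalues of the group-averaged estimate are the periodogram of the snapshot.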
We study day-ahead transmission topology control for high-voltage grid operation under $N-1$ security constraints. The operational task is to select, over a 24-hour horizon, a sequence of substation topologies obtained via busbar-coupler switching to relieve line overloads while limiting switching effort and topological complexity. We formulate this task as a sequential multi-objective optimization problem with four objectives used in TSO decision making: worst-case $N-1$ line loading, maximum topological depth, number of topology changes, and time spent outside the reference topology. We propose an exact block algorithm that exploits the temporal structure of topology plans: consecutive hours with the same topology are represented as blocks, enabling enumeration of the complete Pareto front over the admissible set of topologies under fixed operational bounds on depth and switching. We also develop a tailored NSGA-III-based evolutionary heuristic and evaluate it against the exact front. Using real operational data from the Dutch high-voltage transmission grid operated by TenneT, the block algorithm computes the exact front for a highly congested day in under three minutes after topology-level load-flow preprocessing. The exact front reveals low-switching plans with no DC $N-1$ thermal overloads that the tested evolutionary search fails to find. The proposed method, therefore, provides both a practical day-ahead decision-support tool for transmission operators and a benchmark for heuristic and learning-based topology-control methods.
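The block representation described above can be sketched as a run-length encoding of the hourly plan. The topology labels, the helper names `to_blocks` and `switching_objectives`, and the convention that a topology change is counted at each boundary between consecutive blocks are assumptions for illustration; the load-flow objectives and the Pareto enumeration are not shown.

```python
from itertools import groupby


def to_blocks(plan):
    """Run-length encode an hourly topology plan into (topology, hours) blocks."""
    return [(topo, len(list(run))) for topo, run in groupby(plan)]


def switching_objectives(plan, reference):
    """Return (number of topology changes, hours spent outside the reference)."""
    blocks = to_blocks(plan)
    changes = max(len(blocks) - 1, 0)  # one change per block boundary
    hours_off_reference = sum(hours for topo, hours in blocks if topo != reference)
    return changes, hours_off_reference


# Hypothetical 24-hour plan with two non-reference topologies.
plan = ["ref"] * 8 + ["T1"] * 6 + ["T2"] * 4 + ["ref"] * 6
blocks = to_blocks(plan)
changes, hours_off = switching_objectives(plan, "ref")
```

Working on blocks rather than on 24 separate hours means a plan with few switches is described by only a handful of (topology, duration) pairs, which is the temporal structure the abstract refers to.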
Preprocessing screening is often the most expensive part of a near-infrared spectroscopy calibration workflow. It works because smoothing, derivatives, detrending and related filters change the spectral directions seen by partial least squares (PLS) or Ridge regression, but a full external search repeatedly refits nearly the same linear model. This paper studies the case where that search can be collapsed into one calibration step. For a strictly linear preprocessing operator A acting on row spectra as XA^T, the transformed PLS cross-covariance satisfies (XA^T)^T Y = A X^T Y, and Ridge regression depends on the operator-induced kernel X A^T A X^T. These identities let a finite operator bank be screened inside the model while retaining original-wavelength coefficients, and the same identity extends to cheaply evaluated linear operator chains. Sample-adaptive or fitted corrections such as SNV, MSC, EMSC and ASLS are not strictly linear; we prove the boundary and keep them as fold-local branches. The cohort has 61 regression and 17 classification rows, with a strict paired regression denominator of N=32 for the eight paper variants. There, AOM-PLS reaches median RMSEP ratios of 0.991/0.990 (simple) and 0.985/1.002 (best) against PLS-default/PLS-HPO, and AOM-Ridge reaches 0.974/0.984 (simple) and 0.918/0.966 (best) against Ridge-default/Ridge-HPO. The operator-adaptive classifier AOM-PLS-DA improves balanced accuracy by a median 0.159 on N=13 datasets (12/13 wins). The practical result is the runtime gap: PLS-HPO takes a median 710.81 s per run, whereas AOM-PLS takes 1.18-1.63 s -- 436 to 602 times less PLS fitting time. Linear operator-adaptive calibration thus gives prediction quality comparable to exhaustive preprocessing screening, with orders-of-magnitude less fitting time for PLS.
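The two identities quoted above can be checked numerically. The sketch below uses random data and a first-difference filter as a stand-in for a linear preprocessing operator; the matrix sizes and the choice of `A` are assumptions for illustration and are not taken from the paper's operator bank.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 20, 50, 2              # samples, wavelengths, responses (illustrative)
X = rng.normal(size=(n, p))      # row spectra
Y = rng.normal(size=(n, q))

# Hypothetical linear operator: first difference along the wavelength axis.
A = np.eye(p) - np.eye(p, k=1)

XA = X @ A.T                     # preprocessed spectra

# PLS: the cross-covariance of the preprocessed data equals A applied to X^T Y,
# so X^T Y can be computed once and reused for every operator in a bank.
lhs = XA.T @ Y
rhs = A @ (X.T @ Y)
pls_identity_holds = np.allclose(lhs, rhs)

# Ridge: the Gram matrix of the preprocessed data is the operator-induced kernel.
K_direct = XA @ XA.T
K_induced = X @ (A.T @ A) @ X.T
ridge_identity_holds = np.allclose(K_direct, K_induced)
```

A sample-adaptive correction such as SNV rescales each row by its own statistics, so it cannot be written as a fixed matrix `A`; that is why such steps fall outside these identities.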
Safe physical interaction is critical for deploying robotic manipulators in human-robot interaction and contact-rich tasks, where uncertainty, external forces, and actuator limitations can compromise both performance and safety. We propose an online adaptive impedance control framework that enforces joint-state safety while achieving compliant interaction under uncertain dynamics. The approach combines a quadratic-program-based safety filter with a novel composed position-velocity non-smooth control barrier function (NCBF), enabling joint position and velocity constraints to be enforced through a unified relative-degree-one barrier. Unknown dynamics are compensated online using an interval type-2 fuzzy logic system, while actuator torque limits are handled through soft constraints with exact penalty recovery of feasible solutions. A disturbance-observer-enhanced safety mechanism improves robustness against modelling errors and external interaction forces. Using composite Lyapunov analysis, we prove forward invariance of the safe set and uniform ultimate boundedness of the impedance-tracking error. Simulations on a 7-DOF manipulator with severe parametric uncertainty and external interaction wrenches demonstrate safe constraint satisfaction and robust impedance tracking.
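For intuition about a QP-based safety filter, the sketch below solves the simplest case in closed form: minimize $\|u - u_{\mathrm{nom}}\|^2$ subject to a single affine barrier constraint $a^\top u \ge b$. The single-integrator joint model, the smooth barrier $h(x) = x_{\max} - x$, the gain `alpha`, and the function name `safety_filter` are all assumptions for illustration; the composed non-smooth barrier, fuzzy compensation, and soft torque constraints of the paper are not reproduced.

```python
import numpy as np


def safety_filter(u_nom, a, b):
    """Closed-form solution of min ||u - u_nom||^2 subject to a @ u >= b.

    Assumes a is nonzero. If the nominal command already satisfies the
    constraint it is returned unchanged; otherwise it is projected onto
    the constraint boundary.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(a, dtype=float)
    violation = b - a @ u_nom
    if violation <= 0.0:
        return u_nom
    return u_nom + (violation / (a @ a)) * a


# Toy joint with dynamics x_dot = u and upper position limit x_max (assumed).
x, x_max, alpha = 0.9, 1.0, 5.0
h = x_max - x                      # barrier value, non-negative inside the safe set
# Barrier condition h_dot >= -alpha * h becomes -u >= -alpha * h.
a, b = np.array([-1.0]), -alpha * h

u_safe = safety_filter([2.0], a, b)    # aggressive command, clipped to about 0.5
u_keep = safety_filter([0.2], a, b)    # already safe, passed through unchanged
```

With several simultaneous constraints the projection no longer has this one-line form, and a numerical QP solver is typically used instead.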