New articles on Electrical Engineering and Systems Science


[1] 2606.30675

Listening Between the Lines: Joint Learning of ASR Embeddings and LLM-Augmented Linguistics for Dementia Detection

Early detection of dementia through speech analysis offers a non-invasive screening alternative, but capturing both acoustic and linguistic biomarkers remains challenging. We propose a multimodal framework leveraging Whisper for dual-purpose extraction: acoustic representations from encoder outputs and transcripts via automatic speech recognition (ASR). For the acoustic pathway, temporal networks with attention pooling aggregate variable-length sequences into fixed-dimensional embeddings. For the linguistic pathway, we prompt a large language model (LLM) to extract interpretable features spanning lexical diversity, syntactic complexity, semantic coherence, and discourse patterns. A gated fusion network integrates both modalities. On ADReSS and ADReSSo, our method achieves F1-scores of 89.47% and 90.14%, demonstrating effective integration of acoustic and LLM-augmented linguistic features. Ablation shows that multimodal fusion consistently outperforms either modality alone.
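
The two aggregation steps named in this abstract, attention pooling over a variable-length sequence and gated fusion of two embeddings, can be sketched as follows. This is a minimal illustration only, not the authors' network: the attention scores and gate logits are passed in as plain numbers (in a trained model they would come from learned layers), and both embeddings are assumed to have already been projected to the same dimension. The function names are hypothetical.

```python
import math
from typing import List


def attention_pool(frames: List[List[float]], scores: List[float]) -> List[float]:
    """Collapse a variable-length sequence of frame vectors into one vector.

    `scores` holds one unnormalised attention score per frame (assumed to be
    produced elsewhere); a softmax turns them into weights that sum to 1.
    """
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(frames[0])
    return [
        sum(w * frame[d] for w, frame in zip(weights, frames)) / total
        for d in range(dim)
    ]


def gated_fusion(acoustic: List[float], linguistic: List[float],
                 gate_logits: List[float]) -> List[float]:
    """Blend two same-sized embeddings with a per-dimension sigmoid gate."""
    fused = []
    for a, l, g in zip(acoustic, linguistic, gate_logits):
        gate = 1.0 / (1.0 + math.exp(-g))  # gate in (0, 1)
        fused.append(gate * a + (1.0 - gate) * l)
    return fused
```

With equal scores the pooling reduces to a plain average, and a gate logit of zero weights both modalities equally.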


[2] 2606.30780

Detecting Audio Deepfakes on the Edge: Lightweight SSL-Based Detection in a Browser Plugin

Audio deepfakes are a growing challenge for the general public, as well as for journalists and fact-checkers. The latter need reliable tools to verify the authenticity of their sources, while at the same time keeping their information private. Commercial deepfake detection solutions rely on cloud-based processing, which raises privacy concerns. To solve this problem, we propose an on-device audio deepfake detection model. We show that a truncated self-supervised backbone with a simple logistic classifier is both very fast and often more accurate than existing solutions. Our solution outperforms the baseline AASIST by 10% and improves inference speed by 40%. We integrate this model into a browser plug-in, which allows journalists and fact-checkers to detect deepfakes easily and securely. Code for the plugin is available at this https URL.


[3] 2606.30843

TinyML for On-Device and Edge Analytics in Wireless Networks: A Survey of Deployments, Opportunities, and Concept-Drift Mitigation

Ubiquitous intelligence is essential for enabling real-time, adaptive, autonomous, and scalable operations in the next generation of wireless networks. However, this poses significant challenges in data management and energy consumption on the end-device/edge side, especially under dynamic environmental conditions. This has driven the adoption of tiny machine learning (tinyML), which offers data-driven optimization at the end-device/edge side. In this work, we survey and thoroughly discuss various tapped/untapped deployment possibilities of tinyML in wireless networks. We identify existing frameworks, commonly used to design tinyML algorithms, that could be utilized to solve a range of wireless network problems. We present a federated learning-based tinyML model update procedure, for both battery-powered and batteryless end-devices, to resolve the concept drift problem faced by tinyML models. Furthermore, we discuss update-aware checkpointing, a fault-tolerant bootloader, and an intermittent-aware modify operation, which could support federated learning-based tinyML model update in the case of batteryless end-devices. Overall, this paper spells out several areas where end-device/edge intelligence can be utilized in the next generation of wireless systems, as well as ways to mitigate the concept drift problem faced in the case of end-device intelligence.


[4] 2606.30877

A Systematic Approach to Multi-Agent AI from Advanced Regulatory Control Theory: Safe and Auditable LLM Operator Agents for Process Control

Recent literature shows that large language models (LLMs) are useful for general-purpose tasks yet perform poorly on domain-specific ones. One reason is the difficulty of supplying narrow context to a general-purpose model and of bounding the task it is asked to perform. It is possible to hypothesise that a multi-agent reformulation under process-control principles offers a route to address those points, since control theory provides a discipline of decomposing a system into elements of contained scope, each defending one controlled variable, with conflicts resolved by structural priority: MIN/MAX selector networks for CV-CV switching and split-range (split-parallel) logic for MV-MV switching. The present work proposes such a reformulation, derived from Advanced Regulatory Control (ARC) theory. Each feedback loop in the ARC chain is mapped to one specialised LLM operator agent carrying the loop's control-theoretic context (controlled variable, setpoint, chain priority, selector kind). The chain's interaction logic (MIN/MAX selectors, override paths) is encapsulated as a single orchestrator agent. Two orchestrator variants are tested: a deterministic rule chain, and a Claude-based LLM orchestrator at a slower tier. The control principles limit each agent's task and inform how its limitations are handled. The multi-agent system inherits the safety property of the ARC chain: every constraint conflict is resolved deterministically by the orchestrator, regardless of the LLM output. Evaluated on a dairy-barn ventilation case over a 4-day mixed-season scenario, Qwen 2.5 7B Instruct operator agents running offline on a 24 GB consumer GPU at a 5-minute cadence produce auditable trajectories, each paired with an operator-voice rationale that supports a control campaign logbook.
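
A deterministic MIN/MAX selector chain of the kind this abstract describes can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the paper's orchestrator: the three loops (a temperature loop, a minimum-ventilation floor, and a draft ceiling), the 0 to 100 percent fan range, and all function names are hypothetical. It shows how the final output stays bounded by the selectors even when an agent returns nothing usable.

```python
from typing import Optional


def sanitize(value: Optional[float], fallback: float) -> float:
    """Replace a missing or NaN agent proposal with a fallback, then clamp
    the result to the assumed actuator range of 0 to 100 percent."""
    if value is None or value != value:  # `value != value` is true only for NaN
        return fallback
    return min(100.0, max(0.0, value))


def orchestrate(temp_request: Optional[float],
                ventilation_floor: Optional[float],
                draft_ceiling: Optional[float],
                fallback: float = 50.0) -> float:
    """Resolve conflicting fan-speed proposals by fixed structural priority."""
    base = sanitize(temp_request, fallback)
    floor = sanitize(ventilation_floor, 0.0)
    ceiling = sanitize(draft_ceiling, 100.0)
    # MAX selector: the minimum-ventilation loop can only raise the output.
    after_max = max(base, floor)
    # MIN selector, applied last, so the ceiling has the highest priority.
    return min(after_max, ceiling)
```

Because the selectors are plain `min` and `max` calls applied in a fixed order, the same inputs always give the same output, which is what makes the resulting trajectory auditable.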


[5] 2606.30935

ShardNet: Training Neural Controllers with Hard, Non-Convex Constraints

While neural network control policies are powerful, their deployment on safety-critical systems depends on ensuring that they obey strict constraints. Existing work often treats safety as a metric to optimize for, which competes with other performance objectives, if training converges at all. Instead, we introduce ShardNet, a neural network architecture that strictly enforces unions of polyhedral constraints by construction, using a differentiable projection layer parameterized by a classification network. The key insight is to embed safety into the neural network's structure, allowing performance to be optimized independently because formal safety guarantees are always given. In contrast with existing neural architectures that can only enforce simple convex constraints, ShardNet enables the first safe-by-construction synthesis of forward-invariant neural network controllers on closed-loop systems where safety constraints are expressed as nonconvex unions of polyhedra or learned value function level sets. To support this, we also introduce a technique to verify and train such value functions correctly as rectified linear unit (ReLU) networks, which has not previously been possible. On double integrator benchmarks drawn from the literature, ShardNet policies maintain 100% safety on verified sets and achieve significantly lower objective loss compared to existing formal methods. Furthermore, our value function training technique also produces safe sets more than 3 times larger than existing verification approaches.


[6] 2606.30944

Preserving Speech-to-Text LLM Capabilities in Speech-to-Speech Generation

Strong speech-to-text (S2T) LLMs already provide robust speech perception and text reasoning, but adding speech-to-speech (S2S) output is challenging: fine-tuning the backbone can degrade the original S2T performance, while attaching a downstream talker reintroduces a serial text-to-speech bottleneck. We present PRIME-Speech, a frozen-backbone S2S conversion framework that trains only speech-generation modules. PRIME-Speech synchronizes a causal audio post-decoder with intermediate hidden states of the frozen backbone, so codec tokens are generated from the model's evolving reasoning trajectory rather than from completed text chunks. The post-decoder uses mixed hidden-state, text, and audio-history conditioning, and a training-time packing strategy with turn-level audio KV-cache and position reset stabilizes multi-turn spoken interaction without additional multi-turn S2S training data. Multi-token prediction further reduces the effective codec prediction rate and improves first-audio latency without modifying the reasoning path. Across speech translation, spoken QA, speech understanding, and multi-turn dialogue, PRIME-Speech preserves the S2T behavior of the frozen backbone while producing accurate, low-WER spoken responses.


[7] 2606.30993

Rate-Splitting Multiple Access Enabled Probabilistic Semantic Communication in UAV Networks

This article proposes an uncrewed aerial vehicle (UAV) downlink semantic communication framework, where probabilistic knowledge graphs (PKGs) are employed to model user equipment (UE) semantics and decompose semantic information into shared and private components. Leveraging the capability of rate-splitting multiple access (RSMA) in addressing such semantic structures, a PKG-assisted RSMA transmission scheme is developed to efficiently deliver multi-user semantic information under severe energy constraints and fast-varying UAV channels. To characterize the strongly coupled energy costs of communication, computation, and flight, a weighted energy minimization problem is formulated to jointly optimize the UAV trajectory, power allocation, beamforming design, and semantic compression ratio. The resulting non-convex problem is efficiently solved using an iterative semantic-aware weighted energy optimization (SWEO) algorithm that integrates Lagrangian dual decomposition and successive convex approximation. Furthermore, a semantic accuracy metric is proposed to quantify the reliability of reconstruction by assigning importance-based weights to informative KG triples. Extensive simulation results verify that the proposed framework achieves superior energy efficiency, enhanced semantic preservation, and consistently better performance than conventional RSMA, non-orthogonal multiple access (NOMA), and space division multiple access (SDMA) schemes in benchmarks across various network parameters.


[8] 2606.31052

Event-Triggered Gain Scheduling of 2 x 2 Linear Hyperbolic PDEs via Neural Operators (Full Version)

This paper introduces a new framework for event-triggered gain scheduling applied to linear hyperbolic Partial Differential Equations (PDEs) with time- and space-varying coefficients. The approach leverages neural operators to address the challenges of real-time control in such systems. At each triggering time, the control input is designed using the classical static backstepping control law, while the gains of the boundary controller are updated according to the triggering mechanism and the spatial variation of the coefficients. Neural operators are employed to learn the mapping between the system parameters in the PDEs and the corresponding backstepping kernels. By integrating neural operators into the event-triggered framework, we eliminate the need to repeatedly solve complex kernel equations at every triggering instant, thereby reducing computational overhead while ensuring closed-loop stability. The proposed method is validated through theoretical analysis and numerical simulations, demonstrating its effectiveness and strong potential for real-time control of time-varying hyperbolic PDE systems.


[9] 2606.31056

A Simplex-Inspired Architecture for Integrating Quantum Capabilities into Cyber-Physical Systems

Cyber-physical systems require accurate and reliable system models to ensure safe and efficient operation. Classical Gaussian Process Regression (GPR) provides uncertainty-aware predictions but suffers from high computational complexity, which limits its scalability in real-time applications. Quantum-assisted Gaussian process models reduce complexity in inference, but their practical use is constrained by noise and stability concerns in safety-critical environments. In this paper, we propose a hybrid classical-quantum system identification framework based on a Simplex architecture. The framework combines Quantum-Assisted Hilbert-Space Gaussian Process Regression (QA-HSGPR) as a high-performance module and classical GPR as a high-assurance module. A runtime monitor evaluates system safety and dynamically switches between the two models. Experiments on a Continuous Stirred-Tank Reactor benchmark demonstrate that the proposed framework enables a controllable trade-off between performance and safety for real-time cyber-physical systems.
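
The Simplex switching pattern described here (a runtime monitor choosing between a high-performance and a high-assurance module) can be sketched generically. The sketch assumes each model returns a (mean, standard deviation) pair and that the monitor accepts a prediction when its reported uncertainty is finite and below a threshold; the actual safety criterion used in the paper is not stated in the abstract, so this rule and all names are hypothetical.

```python
import math
from typing import Callable, Tuple

Prediction = Tuple[float, float]  # (mean, standard deviation)
Model = Callable[[float], Prediction]


def make_monitor(max_std: float) -> Callable[[Prediction], bool]:
    """Build a monitor that accepts a prediction only if its mean is finite
    and its standard deviation lies in [0, max_std]."""
    def monitor(pred: Prediction) -> bool:
        mean, std = pred
        return math.isfinite(mean) and 0.0 <= std <= max_std
    return monitor


def simplex_step(x: float, high_performance: Model, high_assurance: Model,
                 monitor: Callable[[Prediction], bool]) -> Tuple[Prediction, str]:
    """Use the high-performance model unless the monitor rejects its output,
    in which case fall back to the high-assurance model."""
    candidate = high_performance(x)
    if monitor(candidate):
        return candidate, "high-performance"
    return high_assurance(x), "high-assurance"
```

Raising or lowering `max_std` is one simple way to trade performance against assurance: a looser threshold keeps the fast module active more often.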


[10] 2606.31084

Accelerating Merge with Motion Vector Difference via Filter Difference Analysis for VVenC

Merge with Motion Vector Difference (MMVD) is a key coding tool in Versatile Video Coding for improving motion prediction accuracy. However, its exhaustive search strategy imposes a significant computational burden on the encoder. To address this issue, we propose a novel fast MMVD algorithm for the VVenC encoder based on fractional motion vector filter difference analysis. By approximating the 8-tap interpolation filter with a 2-tap filter, we derive a criterion based on spatial gradients and prediction residuals for estimating the potential gain of MMVD candidates. We further generalize this criterion to accommodate both shifted integer reference samples and 2D separable filtering. To minimize the overhead of the proposed method, we introduce implementation optimizations, including symmetric offset inference and cross-shaped downsampled dot-product computation. Compared with existing fast MMVD algorithms in VVenC, our method reduces the average MMVD search ratio from 21.07% to 11.05% and decreases the efficiency-complexity metric $\eta$ from 11.79 to 7.10 under the fast preset.


[11] 2606.31210

Due-to-Heatwaves Faults in Urban Distribution System: An Identification Approach

Distribution system faults occurring during heatwaves (HWs) are not all caused by the HW itself: concurrent factors such as asset ageing, mechanical defects, soil contamination, and operational constraints contribute independently. Hence, indiscriminately attributing all HW-period faults to thermal stress overestimates system vulnerability and misleads asset-management decisions. This paper proposes a systematic framework to identify and quantify the subset of summer faults directly attributable to HW occurrences (denoted Due-to-HW faults), by distinguishing them from Due-to-Others faults. HW events are first characterised through the Excess Heat Factor index. A covariance-based attribution criterion is then developed to distinguish faults whose occurrence is statistically consistent with HW-driven thermal mechanisms from those attributable to independent causes. Complementing the attribution framework, a time-delay model is introduced to estimate the lag between the beginning of a HW and fault occurrence by maximising the normalised covariance between hourly temperature series and shifted fault-duration series. Applied to six years of operational data from a real MV distribution network, the simulation results show that Due-to-HW faults constitute a significant yet variable proportion of total summer faults, underscoring the non-negligible impact of HW occurrences on summer fault statistics. Beyond documenting the deterioration of fault rate and Mean Time Between Failures across all seasons, the analysis confirms that Time-Between-Failures distributions depart significantly from the exponential assumption, with direct implications for the applicability of Poisson-based reliability models to distribution systems subject to recurrent HW stress.
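
The time-delay idea in this abstract, choosing the lag that maximises the normalised covariance between a temperature series and a shifted fault series, can be sketched as below. This is an illustration under stated assumptions rather than the paper's model: both series are taken to be equally long hourly lists, the normalisation used is the Pearson form (covariance divided by the product of standard deviations), and the function names are hypothetical.

```python
from statistics import fmean
from typing import List, Tuple


def normalised_covariance(x: List[float], y: List[float]) -> float:
    """Covariance of x and y divided by the product of their standard
    deviations; returns 0.0 if either series is constant."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def estimate_lag(temperature: List[float], faults: List[float],
                 max_lag: int) -> Tuple[int, float]:
    """Try lags 0..max_lag (in samples) and return the lag whose shifted
    fault series is most strongly aligned with temperature, plus its score.
    `max_lag` should leave at least two overlapping samples."""
    best_lag, best_score = 0, float("-inf")
    n = len(temperature)
    for lag in range(max_lag + 1):
        x = temperature[: n - lag]  # temperature at hour t
        y = faults[lag:]            # faults at hour t + lag
        score = normalised_covariance(x, y)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

On a synthetic fault series that is an exact copy of the temperature profile delayed by three hours, the search recovers a lag of 3 with a score of 1.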


[12] 2606.31228

FPGA-based LQG controller and hardware-in-the-loop simulator implementation for nanomechanical systems

We present an open-source framework for real-time Linear Quadratic Gaussian (LQG) control and hardware-in-the-loop (HIL) simulation on the affordable Red Pitaya STEMlab FPGA platform. The controller implements a discrete-time Kalman filter and Linear Quadratic Regulator (LQR) for systems with up to three coupled oscillatory degrees of freedom, targeting applications in levitated optomechanics, MEMS/NEMS, and related experimental platforms. Complementing the controller, the HIL simulator provides a configurable second-order stochastic plant with nonlinear input and output mappings, enabling realistic closed-loop testing under real-time and fixed-point constraints. A MATLAB-based workflow automates model configuration, controller synthesis, numerical scaling, and FPGA deployment without requiring specialized hardware expertise. As an end-to-end demonstration, we present the stabilization of a levitated nanoparticle in a two-dimensional double-well potential, illustrating the complete workflow from model definition and simulation to real-time feedback control.
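
To make the LQG structure concrete, here is a floating-point sketch of a discrete-time Kalman filter combined with an LQR gain for a scalar plant x[k+1] = a x[k] + b u[k] + w[k], y[k] = c x[k] + v[k]. It is a deliberately reduced illustration and not the framework's code: the real controller handles up to three coupled oscillators in fixed point on an FPGA, while this sketch uses one state, solves both Riccati equations by simple fixed-point iteration, and uses hypothetical function names.

```python
from typing import Tuple


def lqr_gain(a: float, b: float, q: float, r: float, iters: int = 500) -> float:
    """Iterate the scalar discrete-time control Riccati equation and return
    the state-feedback gain K (control law u = -K * x_hat)."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)


def kalman_gain(a: float, c: float, w_var: float, v_var: float,
                iters: int = 500) -> float:
    """Iterate the scalar filtering Riccati equation and return the
    steady-state measurement-update gain L."""
    s = w_var  # a-priori error variance
    for _ in range(iters):
        s = a * a * s - (a * c * s) ** 2 / (c * c * s + v_var) + w_var
    return s * c / (c * c * s + v_var)


def lqg_step(x_hat: float, y: float, a: float, b: float, c: float,
             K: float, L: float) -> Tuple[float, float]:
    """One controller cycle: correct the prediction with measurement y,
    compute the control, then predict the next state."""
    x_upd = x_hat + L * (y - c * x_hat)  # measurement update
    u = -K * x_upd                       # LQR feedback on the estimate
    x_pred = a * x_upd + b * u           # time update for the next cycle
    return u, x_pred
```

For an unstable plant with a = 1.2 and unit weights, both the closed-loop factor a - bK and the estimator factor a(1 - Lc) come out below 1 in magnitude, so a noise-free simulation converges to zero.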


[13] 2606.31303

Minimizing Quantized Semantic Age of Information (QSAoI) in Foundation Model-Based Semantic Communications

The emerging techniques of semantic communications and edge computing in 6G networks necessitate a paradigm shift toward co-designed semantic-aware and adaptive resource allocation for short-packet transmissions. However, there is a fundamental gap between the semantic layer and the physical layer under low-latency finite blocklength (FBL) effects. To bridge this gap, we introduce the Quantized Semantic Age of Information (QSAoI), a novel metric that rigorously captures the trade-offs among freshness and semantic efficiency of high-level features in real-time communication in the FBL regime. Guided by this metric, we propose a novel foundation model-based efficient co-designed framework to minimize the expected QSAoI over wireless fading channels in latency-constrained semantic communication. Specifically, we formulate a non-linear joint optimization problem to dynamically optimize the block-wise mixed-precision quantization (MPQ) strategy and the physical blocklength. To efficiently resolve this complex problem, we develop a high-efficiency low-complexity algorithm based on fixpoint inspection and bisection search. Extensive simulations validate that our proposed algorithm dynamically adapts the semantic quantization precision to varying channel conditions, effectively minimizing the expected QSAoI compared to baselines.


[14] 2606.31314

A Novel Method for Differential-Algebraic Dynamic Model Discovery in Power Systems: An LLM-Based Multi-Agent Collaborative Framework

With large-scale integration of emerging power electronic devices represented by grid-forming inverters, power system dynamics increasingly exhibit strong nonlinearity, multi-timescale coupling, and black-box control logic. These features hinder conventional parameter identification requiring known model structures and structure identification based on predefined function libraries, making complete differential-algebraic dynamic model recovery difficult under weak prior information. To address this challenge, this paper proposes an LLM-based multi-agent collaborative framework for differential-algebraic dynamic model discovery in power systems. It integrates heterogeneous exploratory agents, individual candidate model memories, parameter fitting and evaluation, and a coordinator agent. Under unified measurement-data constraints, agents generate candidate equation structures in parallel, while candidates are optimized, evaluated, retained, and summarized to provide closed-loop search guidance. The task is decomposed into differential equation structure discovery and algebraic closure discovery, enabling joint recovery of state dynamics, algebraic constraints, and key intermediate variables with incomplete prior information. Case studies on synchronous generators and grid-forming inverters show that the proposed method outperforms single-agent LLM-based discovery and conventional symbolic regression in reconstruction accuracy, generalization, search efficiency, and noise robustness. In the generator case, OOD MAPE reaches 0.19%; in the inverter case, discovery time is reduced by 25.7% compared with the single-agent LLM baseline.


[15] 2606.31343

Standardizing case study descriptions for multi-energy systems and networks modeling

Research on Multi-Energy Systems (MES) often relies on case studies with divergent hypotheses and terminologies, limiting comparability and slowing progress. Discussions at the ECOS 2025 conference highlighted the need for standardized reference case studies to facilitate reuse and comparison. While frameworks like the IEC 62559 standard and the Open Energy Platform (OEP) exist, their adoption for MES remains fragmented. This heterogeneity hinders collaboration and replicability, motivating efforts towards a unified description framework tailored to MES. This paper aims to address this gap by evaluating existing approaches in order to promote a standardized description framework for MES case studies. The goal is to enhance comparability, streamline research, and make a first step towards defining reference case studies and benchmarks in the domain. The study adopts a collaborative approach: after analysing existing description frameworks and selecting the most suitable one, the co-authors describe their own case studies, followed by cross-reviews to assess completeness, clarity, and openness of data/models. The description framework is adapted to emphasize MES-specific elements, such as system configuration and use case details. A checklist is developed to guide reviews. Preliminary results include a set of standardized case study descriptions and insights from cross-reviews on framework strengths/limitations. The diversity of case studies underscores the framework's flexibility, while feedback reveals opportunities for improvement and broader adoption. This work provides a foundation for standardized MES case study descriptions, fostering collaboration, comparability, and replicability. By reducing ambiguity and ensuring the availability of relevant information in a consistent format, it accelerates research and benchmarking in the field.


[16] 2606.31349

PGUDA: Pressure-Guided Unsupervised Domain Adaptation with Cross-Modal Knowledge Distillation for sEMG-Based Gesture Recognition

Surface electromyography (sEMG)-based gesture recognition has emerged as a promising technology for natural human-computer interaction. However, its practical deployment remains challenging due to severe performance degradation caused by feature distribution discrepancies across different subjects and recording sessions. Although domain adaptation (DA) techniques are commonly employed to mitigate such discrepancies, conventional methods often struggle to effectively align sEMG features, primarily due to their inherent stochasticity and the scarcity of labeled data. To address these limitations, this paper proposes a novel Pressure-Guided Unsupervised Domain Adaptation (PGUDA) framework, which leverages the robustness and stability of pressure signals to introduce a cross-modal knowledge distillation strategy that transfers consistent physical semantics across modalities. Specifically, a teacher network trained on pressure signals guides an sEMG student network on unlabeled target domains, thereby regularizing the representation learning process with transferable and modality-invariant knowledge. Extensive experiments conducted on a self-collected multimodal dataset involving eleven subjects validate the effectiveness of the proposed PGUDA framework. The results demonstrate that our proposed PGUDA achieves leading performance in both cross-subject and cross-session classification tasks, achieving average accuracies of 58.08% and substantially outperforming existing DA approaches. Notably, PGUDA exhibits remarkable label efficiency: it attains classification accuracy comparable to fully supervised benchmarks while requiring only 5% of labeled data for teacher network training. This framework offers a robust and data-efficient solution that can significantly reduce the calibration burden in practical sEMG-based gesture recognition systems.


[17] 2606.31365

Beyond Cross-Reconstruction: Probing-Based Disentanglement Evaluation for Acoustic Teleportation Codecs

Some neural audio codecs disentangle speech into latent subspaces encoding content, speaker identity, and acoustics, enabling acoustic teleportation and voice conversion. Existing evaluations rely on cross-reconstruction quality, which cannot reliably detect leakage across partitions. We extend a probing-based framework to assess disentanglement by regressing room-acoustic parameters (reverberation time, clarity, and direct-to-reverberant ratio) and classifying speaker identity, using the gap between intended and unintended partitions as the disentanglement measure. Applied to an acoustic teleportation codec, we find speaker identity is largely confined to its partition, while acoustics leak into the speech embeddings due to the training objective. Acoustic embeddings blindly estimate room parameters within 0.02 s of supervised baselines, indicating physically meaningful structure emerges without explicit supervision.


[18] 2606.31384

Continuous-Time Decentralized Online Estimation With Additive Noises

We study a decentralized online estimation problem with additive communication noises over a fixed digraph. Each node has a linear measurement of an unknown parameter with random measurement matrices and runs a continuous-time online estimation algorithm. We transform the convergence analysis of the algorithm into the stability analysis of a non-autonomous linear stochastic differential equation (SDE) with random time-varying coefficients, and establish asymptotic stability by numerical approximation theory. Based on the stability results, we show that the algorithm gains can be properly designed to ensure mean square convergence if the measurement matrices and the communication graph satisfy the stochastic spatial-temporal persistence of excitation condition. Furthermore, a special case where the measurement matrices contain a Markov chain is investigated, and the theoretical results are demonstrated by a numerical example.


[19] 2606.31396

Sensing-Limited Control Under Non-Designable Observation Mechanisms

We study the information-theoretic limits of controlling unstable linear systems through non-designable observation mechanisms. Unlike classical communication-constrained control, the information bottleneck lies in the observation mechanism rather than in a designable encoder-channel interface. For noiseless linear dynamics, we derive necessary conditions for mean-square observability and stabilizability, showing that the directed information rate from the unstable state process to the observation process must dominate the open-loop expansion rate of the unstable modes. We further show that this lower bound persists under additive process disturbances. In the Linear-Gaussian setting, although the unstable-state directed information rate remains intractable in closed form, we obtain an exact characterization of the full-state directed information rate, which upper-bounds the unstable-state quantity and yields computable necessary conditions. Under suitable posterior regularity conditions, we also establish sufficient conditions for asymptotic mean-square observability and, via certainty-equivalence control, asymptotic mean-square stabilizability. The key step is an entropy-to-error bridge: a strict surplus in directed information over the expansion rate forces posterior uncertainty to collapse and thereby drives the estimation error covariance to zero. These results identify a fundamental feasibility boundary for sensing-limited control and clarify how classical communication-based limits must be reinterpreted when the sensing interface is non-designable.
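
The necessary condition stated in this abstract compares an information rate against the open-loop expansion rate of the unstable modes. As a rough illustration, the sketch below assumes the expansion rate takes the familiar data-rate-theorem form, the sum of log2 of the magnitudes of the eigenvalues lying outside the unit circle; the abstract does not give the exact expression or the strictness of the inequality, so both are assumptions. The eigenvalues are supplied directly, and the function names are hypothetical.

```python
import math
from typing import Iterable


def expansion_rate_bits(eigenvalues: Iterable[complex]) -> float:
    """Sum of log2|lambda| over eigenvalues with |lambda| > 1, in bits per
    time step. Stable and marginal modes contribute nothing."""
    return sum(math.log2(abs(lam)) for lam in eigenvalues if abs(lam) > 1.0)


def necessary_condition_met(information_rate_bits: float,
                            eigenvalues: Iterable[complex]) -> bool:
    """Check the assumed necessary condition: the available information rate
    must be at least the expansion rate of the unstable modes."""
    return information_rate_bits >= expansion_rate_bits(eigenvalues)
```

For open-loop eigenvalues 2, 0.5 and -4 the assumed expansion rate is 1 + 2 = 3 bits per step, so an observation mechanism conveying only 2.5 bits per step would fail the check.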


[20] 2606.31400

Transformer-Hypernetwork-Controlled Deep-Unfolded Phase-Aware Channel Estimation Refinement for Phase-Drift-Robust Backscatter Links

This paper proposes a transformer-hypernetwork-controlled deep-unfolded phase-aware channel estimation refinement (THUNDER) for phase-drifting backscatter links. Residual carrier-phase drift across the pilot block renders the backscattered observation phase-nonstationary, and a closed-form phase-aware channel estimation (PACE) compensates only the first-order phase component, leaving a deterministic high signal-to-noise ratio (SNR) error floor. THUNDER suppresses this floor by initializing from PACE and refining the estimate through unfolded Gauss-Newton steps on the exact phase-exponential model. A transformer extracts pilot-wide phase context, and a hypernetwork generates bounded controls and pilot-reliability weights. Evaluations show an 8.9 dB normalized mean square error gain over the strongest learning-based channel estimation baseline.


[21] 2606.31412

Rethinking Energy Efficiency in Cell-Free Massive MIMO: The Role of Processing and Optical Fronthaul

Cell-free massive MIMO promises uniformly high performance by combining densely distributed radio units, coherent transmission, and centralized processing. Unlike earlier radio generations, it depends on dense fronthaul connectivity and a virtualized cloud-RAN architecture. In this setting, energy use is no longer driven primarily by active radio components; instead, fronthaul and processing play a dominant role, calling for a fresh perspective on what defines energy efficiency. This work introduces a modular power model that captures the interplay between radios, fronthaul, and cloud processing. The analysis highlights how design choices, such as functional splits and precoding strategies, shape both fronthaul data load and total power consumption. Centralized precoding provides stronger performance with less resource utilization, while flexible activation of radios and processing elements avoids unnecessary overhead. Overall, the energy efficiency of cell-free massive MIMO grows as antennas are more densely distributed across the coverage area, particularly when combined with end-to-end resource allocation.


[22] 2606.31426

Towards a Joint Task-Oriented and Generative Semantic Communication Framework for 6G Networks

Semantic Communication (SC) has emerged as a key enabler for 6G wireless systems by transmitting task-relevant meaning rather than raw data, thereby significantly reducing bandwidth consumption while preserving communication intent. In this work, we propose an end-to-end OFDM-based semantic communication framework that integrates a semantic encoder-decoder pipeline with a neural receiver operating over a 3GPP vehicular channel. The semantic encoder extracts the underlying meaning of a visual scene by transforming it into a graph-based representation consisting of object-level features and relational structure. At the receiver, the reconstructed scene graph is processed by a spatio-temporal graph neural network (ST-GNN)-based module for collision-risk estimation, enabling task-oriented inference. In parallel, a diffusion-based semantic decoder reconstructs the visual scene from the recovered semantics, providing dual functionality: safety prediction and image reconstruction. The proposed framework is evaluated in a MIMO configuration under varying SNR conditions. Experimental results show that it achieves up to 99.1% data compression relative to pixel-domain transmission, outperforming conventional compression-based methods (JPEG and HEVC) while preserving downstream inference performance. Furthermore, the diffusion-based reconstruction attains significantly lower Fréchet inception distance (FID) scores than existing semantic communication approaches, reflecting superior semantic and perceptual fidelity.


[23] 2606.31447

Sensing for Reliable UAV Communication: Robust Trajectory and Resource Optimization in Low-Altitude Networks

In low-altitude wireless networks, sensing-aided communication has emerged as a promising integrated sensing and communication (ISAC) paradigm for unmanned aerial vehicle (UAV) tracking and communication. This paper investigates reliable sensing-aided communication for multiple cellular-connected UAVs under mobility uncertainties. Specifically, we maximize the minimum outage capacity among UAVs by jointly optimizing their real-time predicted positions, as well as the base station (BS) transmit power and bandwidth allocations. To address the non-convex and intractable maximum tolerable outage probability (OP) constraints, two robust optimization schemes are proposed based on a continuous confidence ellipse (CE) and discretized inverse-whitened sectors (IWSs), respectively. For the CE-based scheme, an efficient algorithm is proposed to optimize the predicted UAV positions individually via block successive convex approximation, followed by convex resource allocation. For the IWS-based scheme, an IWS-based OP approximation is proposed to facilitate the robust optimization, based on which a low-complexity IWS selection method is proposed to decouple the optimization variables. Then, a similar sequential optimization algorithm is proposed based on the projected gradient descent approach. The two algorithms are further unified into a common trajectory-resource optimization framework, revealing a low-complexity structure for robust UAV trajectory and resource management. Simulation results validate the effectiveness of our proposed OP approximation, demonstrate the significant outage capacity improvement of the proposed robust optimization schemes over benchmark schemes, and illustrate the superiority of the IWS-based scheme over the CE-based scheme.


[24] 2606.31473

Von Mises Based Uncertainty Quantification for Closely Spaced Automotive Radar Targets

This work investigates uncertainty-aware deep learning approaches for direction of arrival (DOA) estimation in automotive radar, focusing on probabilistic modeling and downstream integration. A circular-statistics-based von Mises (VM) ensemble (ENS) is compared with an evidential deep learning (EDL) framework based on a normal-inverse-gamma formulation, yielding a Student-t predictive distribution in the Euclidean domain. The ENS framework produces angular predictions parameterized by (mu, kappa), enabling interpretable uncertainty aligned with directional geometry. Performance is evaluated under in-distribution and multiple out-of-distribution conditions using risk-coverage and ROC or AUROC analyses. Results indicate that ENS achieves lower uncertainty under nominal conditions and exhibits stronger sensitivity to severe perturbations, whereas EDL provides smoother uncertainty variation and slightly improved ranking consistency. Importantly, the ENS representation enables direct probabilistic integration into association modules via closed-form VM likelihoods, facilitating a unified detection-tracking pipeline. These findings highlight a trade-off between geometric consistency and statistical generality in uncertainty-aware DOA estimation.
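As a rough illustration of what a closed-form VM likelihood in an association step could look like, the sketch below scores a predicted (mu, kappa) pair against candidate track angles and picks the most likely one. The function names (`log_i0`, `vm_logpdf`, `associate`) and the argmax association rule are assumptions made for this example and are not taken from the paper; the Bessel function is evaluated with a truncated power series that is only adequate for moderate kappa.

```python
import math

def log_i0(kappa, terms=60):
    """Log of the modified Bessel function I0 via its power series.

    I0(x) = sum_k (x/2)^(2k) / (k!)^2. The truncation at `terms` is an
    assumption of this sketch and is only adequate for moderate kappa.
    """
    total, term = 1.0, 1.0
    for k in range(1, terms):
        term *= (kappa / 2.0) ** 2 / (k * k)
        total += term
    return math.log(total)

def vm_logpdf(theta, mu, kappa):
    """Von Mises log-density: kappa*cos(theta - mu) - log(2*pi*I0(kappa))."""
    return kappa * math.cos(theta - mu) - math.log(2.0 * math.pi) - log_i0(kappa)

def associate(det_mu, det_kappa, track_angles):
    """Hypothetical association rule: index of the track angle (radians)
    with the highest von Mises log-likelihood under the detection."""
    scores = [vm_logpdf(a, det_mu, det_kappa) for a in track_angles]
    return max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    # A detection near 0.1 rad with high concentration is matched to track 0.
    print(associate(0.1, 50.0, [0.0, 0.5]))
```

Because the density depends on the angle only through cos(theta - mu), the score wraps correctly at the +/-pi boundary, which a Gaussian likelihood on raw angles would not do.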


[25] 2606.31521

Distortion-Corrected Diffusion MRI Using Rotated-View EPI and Joint Field-Map/Image Estimation with Gaussian Primitives

Echo Planar Imaging (EPI) is the standard acquisition technique for diffusion and functional neuroimaging, enabling rapid imaging but suffering from geometric distortions caused by B0 field inhomogeneities. Existing correction methods first reconstruct distorted images using parallel imaging, then estimate the B0 field and correct the distortion in the image domain. In this sequential process, reconstruction artifacts at high acceleration factors and low SNR at high diffusion b-values degrade B0 estimation and limit the overall correction quality. We propose a physics-informed framework that jointly estimates the B0 field and distortion-free image directly from k-space data, without depending on an intermediate parallel-imaging reconstruction for the correction. The image and the B0 field are each represented as a superposition of Gaussian primitives embedded within an MRI physics forward model. The explicit, continuous parameterization captures both smooth regions and tissue boundaries and supports rotated-view EPI acquisitions without interpolation. The diffusion-weighted image is modeled as real and non-negative, with the image phase absorbed into a per-shot phase factor. Rotated views distribute distortions across multiple phase-encoding orientations, improving point spread function isotropy and providing stronger constraints for B0 estimation. On in vivo brain diffusion EPI, the proposed method attains the closest brain-boundary agreement with a distortion-free structural reference, with the largest improvement over sequential methods at high b-value and high acceleration. Extensive visual comparisons further show improved detail fidelity and noise suppression.


[26] 2606.31527

How Bilingual Are SSL Speech Models? Cross-Lingual Probing of Articulatory Encoding with Finnish and Russian EMA

SSL speech models capture rich phonetic, prosodic, and acoustic patterns from raw audio, yet how they encode articulatory information across diverse languages remains unclear. Using EMA data from bilingual Finnish-Russian speakers, we evaluate cross-lingual correlations between SSL latent representations and articulatory movements. Models achieve strong prediction performance (Pearson r up to 0.68) even with approximately 5 minutes of training data, with multilingual models outperforming monolingual ones. Intermediate layers encode articulatory features most effectively, and tongue movements are more predictable than lip movements. We also assess the impact of task type (read versus spontaneous speech) and language proficiency, finding higher accuracy for structured tasks and strong generalization across proficiency levels. These results enhance the interpretability of SSL models and show their potential for speech-technology applications.


[27] 2606.31552

Improving multichannel speech enhancement through accurate room-acoustic simulations

Room-acoustic simulations are widely used to augment training data for deep-learning-based speech enhancement. While most pipelines rely on simplified geometrical acoustics, wave-based approaches offer greater physical accuracy. In this work, we examine how simulation fidelity affects multichannel speech enhancement performance. To this end, we train SpatialNet on datasets augmented with different room-acoustic simulation methods and evaluate the resulting models on measured data. We compare lower-fidelity datasets based on geometrical acoustics with a high-fidelity dataset using advanced acoustic modelling and a hybrid combination of wave-based and geometrical acoustics simulations. Training on the high-fidelity dataset yields a relative reduction of up to 38% in median word error rate compared to the lower-fidelity alternatives. These results show that augmentation with high-fidelity room-acoustic simulations directly translates into improved multichannel speech enhancement performance.


[28] 2606.31566

Fast Risk Certification of Candidate Trajectories under Uncertain Time-Varying Constraints

This paper studies the certification of a fixed candidate trajectory on a finite certification grid under parametric uncertainty. For each constraint-time pair, we define a scalar measure of constraint violation and aggregate the resulting pointwise chance constraints into a worst-case Value-at-Risk (VaR) margin. The goal is not to generate a new trajectory, but to assess online whether a trajectory produced by a planner or predictive controller is sufficiently safe on the certification grid. Direct evaluation requires repeated uncertainty propagation and is often too expensive for computationally demanding models. We therefore adopt an offline-online scheme: offline, a surrogate of the constraint violation map along the candidate trajectory is constructed using polynomial chaos expansion (PCE) when the uncertainty law is known, or kernel regression when only sampled input-output data are available; online, the surrogate is sampled to evaluate conservative VaR bounds at low computational cost. On the theoretical side, we derive a finite-sample upper bound for the grid-based VaR margin using empirical quantiles, the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality, and a union bound over all constraint-time pairs, without assuming a parametric family for the underlying violation distribution. We also show how a uniform surrogate error bound transfers to the certified VaR margin. The approach is illustrated on a crystallization population balance model, where the surrogate-based risk estimates track direct Monte Carlo results while substantially reducing online evaluation time.
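The finite-sample ingredients named in the abstract (empirical quantiles, the DKW inequality, and a union bound over constraint-time pairs) can be combined along the following lines. This is a minimal sketch under stated assumptions, not the paper's construction: the function names, the dictionary layout of the samples, and the even split of the failure probability `delta` across pairs are all choices made for illustration.

```python
import math

def dkw_epsilon(n, delta):
    """Two-sided DKW radius: with probability >= 1 - delta the empirical
    CDF of n i.i.d. samples is within epsilon of the true CDF everywhere."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def var_upper_bound(samples, alpha, delta):
    """Conservative upper bound on the (1 - alpha)-quantile (VaR).

    Takes the empirical quantile at the inflated level 1 - alpha + epsilon.
    Returns infinity when the sample is too small for that level.
    """
    n = len(samples)
    level = 1.0 - alpha + dkw_epsilon(n, delta)
    if level > 1.0:
        return math.inf
    k = max(math.ceil(level * n), 1)  # smallest k with k/n >= level
    return sorted(samples)[k - 1]

def grid_var_margin(samples_by_pair, alpha, delta):
    """Worst case over all constraint-time pairs. The union bound is applied
    by giving each pair a failure budget of delta / K (an assumed split)."""
    num_pairs = len(samples_by_pair)
    return max(
        var_upper_bound(s, alpha, delta / num_pairs)
        for s in samples_by_pair.values()
    )

if __name__ == "__main__":
    # Hypothetical violation samples for two (constraint, time) pairs.
    data = {
        ("g1", 0): [float(x) for x in range(1000)],
        ("g1", 1): [float(x + 5) for x in range(1000)],
    }
    print(grid_var_margin(data, alpha=0.1, delta=0.05))
```

The bound makes no assumption about the shape of the violation distribution; the price is that small samples or many grid points inflate the quantile level and can make the bound uninformative.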


[29] 2606.31607

Uncertainty Quantification via Invariant-Measure Conformal Prediction

Uncertainty quantification for learned stochastic dynamical systems is essential in safety-critical tasks such as control and monitoring. Standard conformal prediction provides finite-sample coverage guarantees under exchangeability, but this assumption is typically violated in dynamical systems because trajectory data are temporally dependent, state distributions evolve, and recursive prediction errors accumulate. This paper proposes an invariant-measure conformal prediction (imCP) framework that calibrates uncertainty using independent samples from an invariant measure of the Markov process induced by the dynamics. This aligns calibration with the stationary operating regime and restores the statistical symmetry needed for rolling one-step split conformal guarantees. For recursive multi-step prediction, imCP combines conformal calibration with Lipschitz error propagation through the learned predictor to obtain explicit horizon-dependent bounds. These pre-deployment uncertainty tubes are suitable for rolling and receding-horizon applications, such as self-triggered control and fault detection, where uncertainty bounds must be computed before future residuals are observed. Numerical experiments show that imCP yields reliable bounds, while non-invariant calibration can become misaligned during deployment.
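For orientation, the sketch below shows the standard split conformal quantile for one-step residuals, followed by a generic geometric-sum propagation through a Lipschitz predictor to obtain a horizon-dependent tube radius. Both function names and the recursion e_{k+1} <= L * e_k + q are assumptions of this example; the paper's actual multi-step bound may take a different form, and the calibration residuals are assumed here to be independent draws, as they would be when sampled from the invariant measure.

```python
import math

def conformal_radius(cal_residuals, alpha):
    """Standard split conformal quantile of one-step residual norms.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest calibration residual,
    which gives >= 1 - alpha coverage for an exchangeable test point.
    """
    n = len(cal_residuals)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_residuals)[k - 1]

def tube_radius(q, lipschitz, horizon):
    """Generic worst-case error after `horizon` recursive steps, assuming
    e_{k+1} <= lipschitz * e_k + q with e_0 = 0 (illustrative recursion)."""
    return q * sum(lipschitz ** j for j in range(horizon))

if __name__ == "__main__":
    # Hypothetical calibration residuals and Lipschitz constant.
    residuals = [0.1 * i for i in range(1, 20)]
    q = conformal_radius(residuals, alpha=0.25)
    print([tube_radius(q, 0.9, h) for h in (1, 2, 5)])
```

The whole tube is computable before deployment because it depends only on the calibration residuals, the Lipschitz constant, and the horizon, not on residuals observed online.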


[30] 2606.31614

Automating Cause-Effect Specification with Knowledge Graphs and Large Language Models

Engineering specifications such as interlocks, alarm rationalization tables, and cause-and-effect (C&E) matrices remain central to process control and safety, yet their creation is still predominantly manual, document-driven, and prone to inconsistency. This paper presents a semantic-AI framework that automates the generation of C&E logic by combining a knowledge graph (KG) with a constrained large language model (LLM) layer. The KG builds on an established modular alignment ontology to represent process structure, operating modes, faults, symptoms, causes, and mitigation actions in a machine-interpretable form. The LLM then transforms this information into operator-ready safety narratives and Semantic Web Rule Language (SWRL) rules under strict ontology and vocabulary constraints, grounding the generated artifacts in the underlying semantic model. The workflow is demonstrated on a modular process plant, showing how engineering semantics, diagnostic relations, and machine-verifiable specifications can be generated from a unified knowledge representation with reduced manual effort.


[31] 2606.31635

A Tutorial on Autonomous Fault-Tolerant Control Using Knowledge-Grounded LLM Agents

Fault recovery in process plants still relies heavily on plant operators, especially when faults fall outside predefined supervisory logic. Operators interpret alarms, procedures, P&IDs, interlocks, and process trends, then decide how to move the plant to a safe operating mode without triggering a shutdown. This paper examines how Large Language Model (LLM) agents can support such recovery decisions. The proposed framework treats the LLM as a constrained supervisory planner. It uses plant-specific knowledge to propose recovery actions, and every proposal is checked by an external validator (symbolic or simulation-based) before actuation. The paper develops three design dimensions for applying the framework: the recovery patterns for which LLM agents are useful, the validation strategies that separate admissible from inadmissible proposals, and the deployment constraints imposed by latency, knowledge engineering, safety integration, and model lifecycle management. To make the framework directly usable, two openly available executable Python environments are provided. Both re-implement established case studies, a modular mixing module and a continuous stirred-tank reactor, extended with configurable faults and defined interfaces for custom recovery and validation methods.


[32] 2606.31690

Resource-Efficient WiFi CSI Sensing via Exploiting the Age of Samples

WiFi channel state information (CSI) sensing must coexist with data communications, which constrains the acquisition rate of fresh CSI measurements. To model this, we formulate CSI-based human activity and identity recognition under a sensing rate constraint that limits the fraction of time slots, within a measurement session, where CSI samples are available. This framework captures sensing-communication resource sharing and uncontrolled packet loss or traffic-driven irregularity. To satisfy the sensing constraint, two fixed CSI sampling policies are considered: a deterministic policy and a stochastic Bernoulli policy. We propose a low-cost age-aware WiFi sensing framework that explicitly incorporates sample freshness into the model training. The age of each retained CSI sample is first encoded and then fused with the CSI embedding via multiplicative fusion. On the NTU-Fi human activity recognition and person identification datasets, the proposed model consistently outperforms both a CSI-only baseline and the state-of-the-art time-aware attention model from the UniFi benchmark. For example, it yields up to a 10-percentage-point improvement over the UniFi method for person identification, with the largest gains observed under strict sensing budgets.


[33] 2606.31728

A Coalitional Stable and Fair Reward Allocation for Dynamic Virtual Power Plants

This paper establishes crucial cooperation criteria for the operation of Dynamic Virtual Power Plants (DVPPs). We propose a control design and reward allocation mechanism to enable and incentivize Distributed Energy Resources (DERs) to provide dynamic ancillary services (DAS). Our results illustrate how the cooperative aggregation of heterogeneous DERs leverages technical complementarities to outperform standalone DAS provision. The proposed reward allocation fulfills critical game-theoretic criteria, including individual rationality, coalitional stability, incentive compatibility, optimality, fairness and ex-post consistency. The control design and reward allocation are validated using a case study based on the Finnish power grid.


[34] 2606.31729

Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation

Text-to-speech (TTS) evaluation is an open challenge. While the primary target was "naturalness," recent fidelity gains shifted focus toward "appropriateness" and whether speech is correct for its context. In this work, we examine how perception changes when the expected downstream use varies. We measure the appropriateness and human-likeness of five SOTA TTS systems across five domains: AI assistant, reader, actor, animated character, and spontaneous speaker. Results show appropriateness varies across domains independently of naturalness. While systems shine at reading, expressive domains remain challenging, and optimizing for one can degrade others. Furthermore, naturalness scores tend to penalize stylized speech while rewarding spontaneity. Finally, our study also highlights blind spots in one-size-fits-all evaluation metrics across more expressive domains. We demonstrate that TTS performance is not "solved" but depends on the target domain, requiring context-aware evaluation.


[35] 2606.31730

A Fair and Transparent Framework for Speech-Based Depression Detection: Balancing Interpretability and Performance

While speech provides rich, non-invasive biomarkers for mental-health assessment, clinical adoption is limited by opaque models and potential demographic bias. In this work, we propose a methodological framework to evaluate robustness and interpretability for automated depression detection on the extended DAIC-WOZ dataset using low-complexity machine learning baselines (RF, SVM, and MLP) chosen to mitigate overfitting and enhance generalization in combination with human-understandable acoustic features (MFCCs, eGeMAPS). To balance accuracy with clinical trust, we leverage explainability methods (LIME and SHAP) for feature selection, validating our findings with statistical significance tests and demographic fairness analyses to mitigate spurious, artifact-driven correlations. Empirical results demonstrate that an optimized subset of explainable AI (XAI)-selected features combined with an MLP architecture achieves a state-of-the-art test accuracy of 82%. Ultimately, this work provides a transparent framework for robust and ethical assistive technologies that can be applied to any other binary task.


[36] 2606.31737

Dynamic Scheduling for Flexible Manufacturing Systems Based on Multi-Agent Deep Reinforcement Learning and Petri Nets

This paper investigates dynamic scheduling for flexible manufacturing systems (FMSs) subject to dynamic events, such as new order arrivals, temporary order cancellations, and machine failures. Traditional methods often face significant challenges in achieving real-time responsiveness under such conditions. To address this issue, the scheduling problem is formulated as a Markov decision process (MDP) with timed Petri nets, where the future evolution of the system depends exclusively on the current marking and the subsequently executed transitions, independent of historical trajectories. The state space and action space of the MDP are constructed using the notion of basis reachability graph (a compact state space representation) of Petri nets to alleviate the state explosion problem, thereby accelerating model training convergence. Meanwhile, a hierarchical dense reward function is constructed by integrating stepwise guidance with terminal evaluation. Then, a multi-agent proximal policy optimization algorithm is employed for model training under the centralized training and decentralized execution paradigm to improve scheduling efficiency. Numerical experiments are conducted involving typical dynamic events, and the results demonstrate that the proposed method can effectively handle dynamic events and achieve superior scheduling performance compared with conventional approaches.


[37] 2606.31739

Electric Field Attenuation Techniques for Inductive Wireless Charging of Medical Implants

Inductive wireless charging of implantable medical devices necessitates careful control of magnetic and electric field emissions to meet strict safety regulations while delivering sufficient power. When designing a comfortable wireless charger that can operate over distances of 10 cm or more, it is difficult not to exceed the most stringent E-field limit of 83 V/m. This paper investigates electric field attenuation techniques for mid-range wireless power transfer at 6.78 MHz. Using finite element analysis (FEA) tools such as Ansys® HFSS™, three mitigation strategies are evaluated: (1) a high-permittivity dielectric shielding layer to absorb and redistribute electric fields, (2) multiple resonant tuning capacitors distributed along the transmitter coil to lower the voltage swing and confine high E-field regions, and (3) alternative coil-array transmitter topologies to spatially localize more confined E-fields. The results show that each technique significantly reduces the E-field magnitude without substantially affecting the H-field. Shielding the transmit coil attenuates the peak E-field from its initial 1416 V/m to 496 V/m, approximately a 65% reduction. Distributing the tuning capacitance into sixteen smaller capacitors yields a drop from 1416 V/m to 231 V/m, approximately an 84% reduction. Both techniques preserve the required 8 A/m magnetic field. The third technique, a two-by-two coil array transmitter, reduced the E-field from 1416 V/m to 990 V/m (around a 30% reduction), though with a slight magnetic field redistribution. With all three methods combined, the E-field was successfully attenuated to 82 V/m, just below the strictest limit, without compromising power transfer efficiency. This research demonstrates a feasible approach and framework to safely extend the application of wireless charging for medical implants.
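The reduction percentages quoted above follow directly from the reported peak E-field values. The short check below recomputes them; the helper name `percent_reduction` and the dictionary labels are chosen for this example only.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before

if __name__ == "__main__":
    baseline = 1416.0  # initial peak E-field in V/m, as reported
    reported = {
        "dielectric shield": 496.0,
        "sixteen tuning capacitors": 231.0,
        "two-by-two coil array": 990.0,
        "all three combined": 82.0,
    }
    for name, value in reported.items():
        print(f"{name}: {percent_reduction(baseline, value):.1f}% reduction")
```

The combined figure of 82 V/m corresponds to a reduction of roughly 94% from the 1416 V/m baseline, which the abstract states only as an absolute value.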


[38] 2606.31743

Spatially Coupled Sparse Code Multiple Access (SC-SCMA): A Spectral Graph Approach

This paper presents a spatially coupled sparse code multiple access (SC-SCMA) framework to overcome the performance and scalability limitations of conventional SCMA systems. By analyzing the pairwise error probability associated with multi-user error patterns, we show that spatial coupling projects the superimposed SCMA codewords into a higher-dimensional effective signal space, leading to a strictly improved minimum Euclidean distance (MED) compared with conventional SCMA, while simultaneously enhancing the coding gain through global message propagation and the diversity gain through inter-block resource spreading. Such a distance gain is shown to be governed by the effective access dimensionality (EAD) induced by the coupled factor graph. With the aid of spectral graph theory, we establish a direct relationship between the spectral gap of the factor graph and a lower bound on the EAD, providing a computable structural metric that guarantees MED improvement under various error patterns. Building upon these theoretical insights, we introduce a low-complexity structure-aware codebook design approach, including a spectral-gap-oriented construction of spatially coupled factor matrices and a localized codebook optimization strategy that exploits the dominant error-inducing local user group. Simulation results validate the analysis and demonstrate that the proposed SC-SCMA consistently outperforms conventional SCMA in overloaded massive access channels.


[39] 2606.31744

A Conversational Agentic Interface to Physics-Based Household Digital Twins for Residential Energy Decision Support

Multiple actors around residential energy systems require accessible decision-support tools: homeowners and tenants for dwelling-level retrofit choices, consultants and municipal planners for building and district-level intervention assessment, and retailers and aggregators for estimating residential flexibility and coordinating distributed energy resources. However, existing pathways remain limited, since professional audits are costly and static, rule-of-thumb estimates lack household specificity, and high-fidelity simulation tools require specialized expertise. This paper presents a conversational agentic framework that makes physics-based household energy simulation accessible through natural language interaction. The proposed system integrates a Household Digital Twin (HDT), built on GridLAB-D and exposed through a REST-based microservices architecture, with a two-tier large language model (LLM) agentic layer that translates user requests into structured, schema-compliant simulation payloads. To improve reliability, the architecture combines intent routing, a domain-specific knowledge base, deterministic post-processing of simulation outputs, and tool-governed execution policies. The system is evaluated on a curated dataset of 45 prompts with increasing complexity, covering multiple households, seasons, and override scenarios. Results show 100% schema conformance, 96.1% field-level F1, 90.4% value accuracy, and a 95.6% end-to-end simulation success rate. The findings indicate that conversational agentic interfaces can substantially lower the usability barrier of physics-based household digital twins while preserving the reliability required for residential energy decision support.


[40] 2606.31756

Stability and Droop Characteristics Analysis of Observer-Synchronized Grid-Forming Control

This paper analyzes the stability and droop characteristics of Observer-Synchronized grid-forming control. First, a second-order nonlinear autonomous model is derived under the quasi-steady-state assumption. Based on the derived model, the equilibrium points and nonlinear stability properties are investigated using the qualitative theory of differential equations. Explicit parameter conditions are obtained to guarantee almost global asymptotic stability of the desired equilibrium. Furthermore, an analytical expression of the nonlinear droop characteristic is derived to reveal the relationship between active power and frequency. The theoretical analysis is validated through electromagnetic transient simulations and experiments.


[41] 2606.31911

Trade-Offs in Decentralized Gigantic MIMO with Hard-Boundary Constraints

To maintain the antenna apertures offered by 5G massive MIMO systems operating at the sub-6 GHz band, known as FR1, 6G base stations (BSs) using the upper-mid band, FR3, should increase the number of antennas by a factor of 4-8, giving rise to gigantic MIMO. This poses challenges in terms of processing complexity and interconnection bandwidth. The WAX framework, previously introduced for exploring trade-offs in decentralized architectures, may offer the flexibility needed to tackle these challenges. However, no results have been established on the applicability of this framework in the presence of hard-boundary constraints. The current work explores gigantic MIMO implementations based on a novel adaptation of the WAX framework, where the decentralized processing is performed by non-cooperating hardware modules. These modules may be implemented through state-of-the-art massive MIMO baseband units (BBUs). The results show the potential of the proposed framework towards exploiting trade-offs between complexity and performance in practical gigantic MIMO implementations.


[42] 2606.31962

Toward Efficient Sensing in Multi-Device ISCC by Removing Frequency Domain Redundancy

Integrated sensing, communication, and computation (ISCC) is envisioned as a key enabler for intelligent services in future wireless networks. However, in multi-device ISCC systems, directly offloading full orthogonal frequency division multiplexing (OFDM) sensing data to the edge may incur excessive overhead, thereby limiting sensing performance under practical resource constraints. In this paper, we propose a subcarrier selection-based sensing framework for multi-device ISCC systems, where frequency-domain redundancy in OFDM sensing data is removed during local preprocessing to reduce sensing data transmission and processing overhead. Based on the proposed framework, we establish analytical models for sensing accuracy, delay, and energy consumption, and formulate a sensing accuracy maximization problem under practical resource constraints. To solve this problem, we develop an alternating direction method of multipliers (ADMM)-based algorithm. Experiments on commodity wireless devices validate the effectiveness of the proposed framework and show that it consistently outperforms three baseline schemes under various resource constraints.


[43] 2606.32003

On the Comparison of Reinforcement Learning and Adaptive Control for Linear Systems under Packet Loss and Uncertainty

This paper presents a comparative study between Adaptive Quantized Control (AQC) and Deep Deterministic Policy Gradient (DDPG) reinforcement learning for uncertain linear systems with input quantization over communication channels subject to packet loss. The considered setting also includes dynamic switching from a nominal unstable system to a more unstable one during operation. The AQC is designed for unknown system dynamics using acknowledgment messages to compensate for packet losses, whereas the DDPG controller is trained using the nominal system model without acknowledgment messages. Numerical results show that the DDPG controller achieves faster transient responses and improved damping within its training environment. However, under model uncertainty, packet loss, and dynamic switching, the AQC consistently demonstrates superior robustness owing to its rigorous Lyapunov stability guarantees. These results highlight the trade-off between data-driven performance and model-based robustness, and provide insight into the applicability of reinforcement learning and adaptive control for networked uncertain systems.


[44] 2606.30646

ASR-Agnostic Multimodal Spectrotemporal Modeling for Early Dementia Detection

Speech recruits the same executive, attentional, and working memory processes underlying instrumental activities of daily living, or IADLs, providing a non-invasive proxy for cognitive assessment. Yet most speech-based dementia detection systems depend on transcription, discard within-recording temporal structure, and are validated on a single English corpus with known recording artifacts. We propose an ASR-agnostic framework operating directly on Mel spectrograms. Our key contribution is extracting spectrotemporal displacement fields from consecutive spectrogram frames, capturing shifting spectral energy patterns as digital biomarkers of cognitive decline. These features are fused with CNN-ConvGRU acoustic embeddings via a learned cross-attention mechanism and aggregated using a Transformer encoder with learnable query pooling. A composite temporal loss enforces smoothness and contrastive coherence across segments. We train independent models on English DementiaBank, Slovak EWA-DB, and Spanish Ivanova corpora, using clinical elicitation protocols taxing IADL-relevant cognitive domains. The Slovak model achieves 83.9% accuracy, and Spanish achieves, while the English baseline yields 53.2%, confirming known artifacts. Cross-lingual ablation studies reveal distinct fusion regimes: removing cross-attention collapses Spanish performance to 53.7%, below unimodal models, while the Slovak audio encoder alone outperforms the full model, 93.7% vs. 83.9%, and all English configurations remain near chance. Thus, multimodal fusion's value is corpus-dependent: essential when signal is distributed across modalities, counterproductive when one dominates, and irrelevant when no signal exists. Auxiliary temporal losses converge to language-invariant values, indicating cross-lingual architectural stability.


[45] 2606.30671

Enhancing BEST-RQ Pseudo-Label Quality through Online Refinement for Automatic Speech Recognition

BEST-RQ is a simple and effective self-supervised training method for speech representation learning that performs well on automatic speech recognition (ASR) tasks. It generates pseudo-labels using a fixed online quantization scheme, which simplifies training but provides weaker supervision than HuBERT-style models that iteratively refine pseudo-labels. In this work, we improve online pseudo-label generation while preserving simplicity. We propose three modifications: replacing the quantizer's linear projection with Principal Component Analysis (PCA), updating the codebook via iterative codebook refinement, and introducing an additional codebook updated via codebook distillation. We pre-train on the LibriSpeech 960-hour dataset and fine-tune using 100 hours of supervised LibriSpeech data. With all three modifications enabled, we achieve a 12% relative reduction in word error rate (WER) on the LibriSpeech test-other set, improving from 10.1% to 8.8%.


[46] 2606.30682

ALM2Vec: Learning Audio Embeddings for Universal Audio Retrieval with Large Audio-Language Models

Recent advances in language-audio retrieval have been largely driven by contrastive dual-encoder architectures that align audio and text in a shared embedding space. While effective, existing retrieval embeddings are primarily optimized for audio-caption matching, limiting their ability to support diverse retrieval objectives and controllable retrieval behaviors. We present ALM2Vec, a universal audio embedding framework derived from pretrained large audio-language models (LALMs). By transferring the audio understanding, instruction-following, and reasoning capabilities acquired through large-scale multimodal training, ALM2Vec learns a unified embedding space for retrieval across audio domains and task types. Beyond conventional text-audio retrieval, ALM2Vec incorporates natural-language instructions into the embedding process, enabling instruction-aware retrieval for scenarios such as audio question answering and aspect-conditioned retrieval. Experimental results show that ALM2Vec achieves competitive performance on standard audio and speech retrieval benchmarks while exhibiting promising compositional and controllable retrieval capabilities, highlighting its potential as a unified audio embedding model for retrieval across domains, tasks, and user intents.


[47] 2606.30700

BEST-RQ-2: Contextualize-Then-Predict, a Two-Step Approach for Self-Supervised Audio Representations

Self-supervised learning enables audio representations that transfer across domains and tasks. We present BEST-RQ-2, an evolution of BEST-RQ that retains frozen random-projection-based discrete targets while introducing a two-step contextualize-then-predict pretraining scheme. A ViT context encoder processes only the unmasked spectrogram regions, and a lightweight predictor infers targets for the masked regions; the predictor is discarded after pretraining. Replacing the original Conformer encoder with a ViT shifts performance across domains, slightly reducing speech performance while improving music and environmental sounds, with comparable average scores. The main improvement comes from decomposing masked prediction into separate contextualization and prediction stages. On the X-ARES and XARES-LLM benchmarks, BEST-RQ-2 consistently outperforms one-stage baselines in overall transfer while keeping inference compute unchanged. Code and model checkpoints are publicly available.


[48] 2606.30791

Probing-Guided Layer Selection from Self-Supervised Speech Models for Generalizable Audio Deepfake Detection

Audio deepfake detection systems often fail to generalize across domains because they rely on features tied to specific attacks or recording conditions. Self-supervised speech models offer rich multi-layer representations, yet existing approaches either use a single layer or fuse all layers indiscriminately, and only reveal layer importance after training. We propose a model-agnostic, two-stage methodology that identifies informative depth zones before any task-specific model is trained. In the first stage, lightweight XGBoost probes evaluate each transformer layer's cross-domain discriminative power, producing a layer ranking. In the second stage, a compact neural classifier fuses only the selected layers through per-layer attention pooling and a shared bottleneck projection, while the backbone remains frozen. Applied across three backbones, the probing reveals two key findings. First, informative layers cluster in depth zones rather than at uniquely optimal positions: within-zone substitutions fall within multi-seed noise, while zone violations degrade performance by up to 5x. Second, the probing produces backbone-specific selections rather than a fixed layer recipe. On XLS-R-300M, four probing-selected layers with 1.34M trainable parameters achieve 4.94 +/- 0.32% equal error rate on In-The-Wild and 5.07% cross-domain average over four shared datasets, a 28% relative improvement over the best prior frozen-backbone result (Xiao and Vu, 2025) using all 25 layers with identical training data.
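
For readers unfamiliar with the equal error rate (EER) reported above, the following is a minimal sketch of one common way to approximate it from detector scores. It is not the authors' evaluation code. The function name is hypothetical, and the convention that higher scores mean "bona fide" is an assumption.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    A sample is accepted as bona fide when its score is >= the threshold.
    Returns the mean of the false-acceptance and false-rejection rates at
    the threshold where the two rates are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for threshold in thresholds:
        # Bona fide samples wrongly rejected at this threshold.
        frr = sum(s < threshold for s in bonafide_scores) / len(bonafide_scores)
        # Spoofed samples wrongly accepted at this threshold.
        far = sum(s >= threshold for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With finitely many scores, the two error rates may never be exactly equal, so the value returned is an approximation at the closest crossing point.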


[49] 2606.30811

AVTok: 1D Unified Tokenization for Holistic Audio-Video Generation

Audio-video generation has recently gained unprecedented research attention, aiming to synthesize high-quality sounding video content with fine-grained synchronization and semantic alignment between the auditory and visual components. Prior methods predominantly adopt a dual-branch design with separate tokenization and generation modules per modality, neglecting the representation gap while necessitating intensive computational resources for proper training. Inspired by recent advancements in one-dimensional visual tokenization, we present AVTok, a novel unified tokenizer designed for holistic audio-video generation. AVTok features a dual-stream transformer-based architecture with shared encoder-decoder and modal-specific learnable queries to efficiently and effectively encode an audio-video pair into a compact one-dimensional latent representation with a unified codebook. To cope with the heterogeneous information imbalance that hinders AVTok from exploiting aligned audio-visual information, we devise a hierarchical training strategy to progressively realize reconstruction capabilities for each modality. Extensive experiments demonstrate that AVTok excels both in audio-video reconstruction and when integrated into downstream pipelines for audio-to-video, video-to-audio, and class-conditional joint audio-video generation. AVTok paves the way for the challenge of joint audio-video tokenization and provides a potential direction to build unified large multimodal models for audio-video generation.


[50] 2606.30829

Joint Chance Constrained Safe-Optimal Control

We consider the finite-time optimal control of stochastic systems subject to a probabilistic constraint on the trajectories' safety. Such formulations are known as joint chance constrained optimal control problems. The common practice is to jointly minimize the expected cost of all trajectories, safe and unsafe. This leads to policies which invite constraint violations to exploit low-cost unsafe trajectories. When constraints represent states of critical failure, such behaviour is undesirable. We demonstrate that this behaviour can be overcome by only minimizing the expected cost of safe trajectories. The underlying rationale follows a practical intuition: in many applications, the cost incurred by unsafe trajectories is irrelevant (e.g., the battery usage of a crashed quadcopter), and one is usually interested in minimizing the cost of trajectories that are safe. We show that this problem can be cast as a constrained Markov Decision Process over an augmented state space. This allows solving it via dynamic programming. We derive bounds on the policies' safety under errors resulting from gridding approximations when the system's state space is continuous. Finally, we empirically compare dynamic programming as well as reinforcement learning solutions on a simulated 2D unicycle system in cluttered reach-avoid environments.


[51] 2606.30849

SyncCache: Exploiting Asymmetric Dynamics for Fast Audio-Driven Portrait Animation

Diffusion Transformers (DiTs) have significantly advanced audio-driven portrait animation, but their high computational cost leads to substantial inference latency. Although training-free diffusion caching accelerates inference significantly, existing methods are primarily developed for text-conditioned generation and overlook the spatial and modality imbalances inherent in audio-driven portrait animation. In this paper, we propose SyncCache, a training-free caching acceleration method tailored for DiT-based portrait animation that explicitly exploits asymmetric dynamics. Specifically, high-frequency dynamics driven by audio conditions and concentrated in human regions are more challenging and critical to cache and reuse than the low-frequency visual background in portrait animation. First, we introduce Spatially-Asymmetric Probing to prioritize error sensitivity in dynamic human regions. Second, through Modality-Decoupled Caching, we bypass heavy DiT blocks by reusing stable inter-block residuals, while continuously recomputing lightweight audio blocks to preserve precise lip synchronization. Furthermore, we introduce a cache ratio to control cache capacity and formulate memory-adaptive cache selection as an offline dynamic programming problem without online overhead. Extensive experiments demonstrate that SyncCache achieves superior speed-quality trade-offs, delivering up to 4.12x acceleration on HunyuanVideo-Avatar and 3.75x on Wan-S2V with near-lossless visual fidelity and precise audio alignment.


[52] 2606.31055

Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output. Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th-95th percentile out-of-regime flags. On held-out human rows, pooled references over-flag state-conditioned $F_0$ expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.
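
The percentile-based protocol described above can be sketched as follows. This is a minimal illustration rather than the authors' code. The function names are hypothetical, and the mid-rank percentile definition is an assumption, since the abstract does not specify how ties are handled.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(reference, value):
    """Mid-rank percentile (0 to 100) of `value` within a reference sample."""
    ordered = sorted(reference)
    below = bisect_left(ordered, value)            # entries strictly below value
    below_or_equal = bisect_right(ordered, value)  # entries at or below value
    return 100.0 * (below + below_or_equal) / (2 * len(ordered))


def out_of_regime(reference, value, low=5.0, high=95.0):
    """Flag a metric value that falls outside the 5th-95th percentile band."""
    rank = percentile_rank(reference, value)
    return rank < low or rank > high


# Hypothetical matched reference stratum: 100 evenly spread metric values.
stratum = list(range(1, 101))
flags = sum(out_of_regime(stratum, v) for v in stratum)
```

Scoring the reference stratum against itself flags 10 of the 100 values, which is the nominal 10% rate implied by a 5th-95th percentile band. A well-matched reference should reproduce roughly that rate on held-out human data.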


[53] 2606.31105

Attacking UTMOS: Probing the Robustness of a Speech Quality Assessment Model

UTMOS has become one of the most commonly used deep neural network-based speech quality assessment (SQA) metrics in speech processing research. In this paper, we attack UTMOS to probe its robustness. Starting from high-quality speech samples, we optimize the input in two directions: a score-preserving attack, which degrades perceived quality while maintaining the predicted score, and a quality-preserving attack, which lowers the predicted score while maintaining perceived quality. We consider three input spaces: raw waveform, mel spectrogram with a HiFi-GAN vocoder, and the latent space of EnCodec, a neural audio codec. Experimental results show that score-preserving attacks are effective against UTMOS. Although perfect quality-preserving attacks are more difficult, optimization in the EnCodec latent space provides the best chance of success. These results reveal failure modes of UTMOS and highlight the importance of robustness analysis for DNN-based SQA metrics.


[54] 2606.31128

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

Speech editing aims to modify specific portions of an utterance while preserving the remaining speech. Existing approaches primarily focus on word-level content modification and typically treat content, speaker, and emotion editing as separate tasks, limiting both editing granularity and flexibility. We propose UniSAE, a unified speech attribute editing framework which supports composable speaker, emotion and content editing from sub-phoneme to word level within a single architecture. UniSAE introduces a Discrete Phonetic PosteriorGram (DPPG) representation that factorizes speech content into discrete tokens encoding phoneme identity, pronunciation variants, and duration, enabling direct phoneme- and sub-phoneme-level editing. For higher-level modifications, an autoregressive content transformer predicts edited DPPG sequences for word-level content editing. The edited sequences are rendered into speech by a diffusion-based acoustic decoder, conditioned on disentangled speaker and emotion representations. Experimental results demonstrate that the proposed unified framework supports precise speaker and emotion control, content editing at multiple granularities, and joint modification of all three attributes within a single framework.


[55] 2606.31137

A Bayesian Filtering Approach for Learning Lagrangian Dynamics from Noisy Measurements

This paper proposes a Bayesian filtering-based approach for learning the dynamics of a physical system from partial, noisy measurements. We model the system dynamics using a Lagrangian mechanics formulation. As in Lagrangian neural networks (LNNs), we parameterize the kinetic and potential energies with neural networks. The unknown external forces in the Lagrangian formulation are modeled as white Gaussian noise. The corresponding Euler--Lagrange equations then yield a continuous-time stochastic state-space model (SSM) that describes the system dynamics. The neural network parameters and system states are then jointly learned via a maximum-likelihood method using Gaussian-approximation-based Bayesian filters. The effectiveness of the proposed method is demonstrated on pendulum and Duffing oscillator examples, and its performance is compared with conventional LNNs and with approximate Bayesian filters using known system models.


[56] 2606.31199

Machine Learning-based Feedback Linearization Control of Quadrotor Subject to Unmodeled Dynamics

The control of agile quadrotors in dynamic and uncertain environments remains an open area of investigation, particularly when the system dynamics are only partially known or highly nonlinear. This work introduces a novel machine learning-based feedback-linearization control framework that employs a Gaussian Radial Basis Function (RBF) neural network (NN) to model and compensate for unmodeled dynamics in real time. The proposed controller leverages the universal approximation capability of RBF networks to model nonlinearities and uncertainties. An online adaptation of the RBF NN updates the network's weights without prior training. The control law is derived using Lyapunov stability theory, thereby guaranteeing closed-loop stability and providing a theoretical guarantee of asymptotic convergence for a trajectory tracking task. Gazebo simulation and real flight experiments are conducted using Bitcraze's Crazyflie 2.1 quadrotor subject to unmodeled air drag, actuator dynamics, and external disturbance. Despite incomplete prior knowledge of the dynamics and the presence of external disturbances such as air drag and drift in state estimation, the proposed controller improves trajectory tracking with rapid convergence and reduces position-norm and yaw orientation RMSE by more than $7.13\%$ and $49.27\%$, respectively, compared to a baseline feedback linearization controller.


[57] 2606.31247

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

Spoken language models (SLMs) extend LLMs to speech input and output. Existing SLMs represent speech at fixed frame rates (e.g., 25 or 12.5 Hz), ignoring the time-varying information density of speech and offering no flexibility to trade off quality for speed at inference time. Recent audio tokenizer research has proposed dynamic frame rate speech coding, which exploits this non-uniformity and enables two new capabilities: very low average frame rates and frame rate controllability. However, this technique has not yet been applied to SLMs. We introduce Flexible Spoken Language Model (FlexiSLM), the first SLM that supports dynamic and controllable frame rates on both speech input and output. Using dynamic frame rate representations, FlexiSLM outperforms fixed-frame-rate 7B models including Qwen2.5-Omni and Kimi-Audio at its high-quality operating points. We further verify that FlexiSLM can be accurately steered down to 4.0 Hz; at 6.25 Hz, it roughly halves inference time relative to 12.5 Hz while retaining strong speech-to-speech quality. Audio samples are available at this https URL.


[58] 2606.31259

SwiftAudio: Data-Efficient Caption-Only Distillation for One-Step Text-to-Audio Diffusion-based Generation

Diffusion-based text-to-audio (TTA) models achieve impressive synthesis quality but suffer from high inference latency due to iterative multi-step denoising. Existing one-step approaches alleviate this issue but still rely on paired text--audio data during distillation. To address these limitations, we propose SwiftAudio, a one-step TTA framework that performs audio-free distillation from a pretrained diffusion teacher using only text captions. Specifically, we adapt Variational Score Distillation (VSD) to the audio domain and introduce a temporal smoothness regularization objective to encourage coherent latent audio representations. This design enables the student model to inherit the teacher's generative prior without requiring paired audio supervision and allows effective training with only approximately 45K captions. Experiments on AudioCaps and Clotho demonstrate that SwiftAudio achieves state-of-the-art performance among strict one-step methods and substantially narrows the gap to multi-step diffusion systems. Project page: this https URL


[59] 2606.31301

Fundamental Limits of Quantized MIMO ISAC under Gaussian Signaling

We study a quantized multiple-input multiple-output (MIMO) integrated sensing and communication (ISAC) system in which the communication and sensing receivers each apply analog spatial combining followed by scalar subtractive dithered quantization. This quantization model leads to an additive effective-noise representation with non-Gaussian noise. We derive upper and lower bounds on the capacity of this channel. Numerical results show that these bounds are tight at low signal-to-noise ratios (SNR) and saturate at high SNR due to finite-resolution quantization. They also show that, despite the effective noise being non-Gaussian, independent and identically distributed (i.i.d.) isotropic Gaussian signaling achieves rates close to capacity. Focusing on i.i.d. Gaussian signaling, this paper also presents a closed-form expression for the linear minimum mean-squared error (LMMSE) achieved under a Kronecker sensing-channel model. Numerical results show that the LMMSE also saturates at high SNR, where the saturation level increases as the spatial combining ratio decreases, and for combining ratios below one, saturation occurs even without quantization.


[60] 2606.31321

Projection Operator Stochastic Equations for Non-Markovian Quantum Systems Under Continuous Measurement-Based Feedback

Quantum Markov models have been successfully used to accurately model various physical quantum systems in fields such as quantum optics, optomechanics and superconducting circuits and they provide the basis for (measurement-based) quantum feedback control. However, the quantum Markov assumption is a strong one and it is not expected to hold for general quantum systems of interest. The projection operator approach is one approach that has been developed to model non-Markovian quantum systems by considering its embedding in a larger Markovian quantum system, but mainly in the context of quantum master equations for the dynamics of the unmonitored reduced quantum state of a quantum system. This approach was recently adapted for continuously measured non-Markovian quantum systems, which enables open-loop control but did not yet consider the presence of feedback of the stochastic measurement record, deriving non-Markovian SDEs for the evolution of the projected state of the Markovian embedding. This paper generalizes these stochastic equations to the setting of stochastic feedback based on the continuous-measurement record and shows that the equations take the same form but that previously deterministic terms become stochastic ones which depend on the measurement record, as would be intuitively expected. The stochastic equations are obtained for a generalized class of measurements that includes continuous (possibly adaptive) homodyne and photon counting measurements.


[61] 2606.31337

Fundamentals of Optical Fiber Sensing Schemes Based on Coherent Optical Time Domain Reflectometry: Signal Under Dynamic Temperature Conditions

We present a theoretical, algorithmic, and experimental study of temperature sensing using $\phi$-OTDR with coherent detection. A physics-based model is developed to relate the measured Rayleigh backscattered signal to temperature variations along the fiber, showing that the phase evolution encodes the cumulative temperature change between the interrogator and the sensing location, while the amplitude exhibits only local sensitivity. Based on this insight, we propose robust algorithms for temperature-event detection and temperature-profile reconstruction. Experimental results demonstrate reliable recovery of temperature-induced perturbations in standard single-mode fibers using coherently detected $\phi$-OTDR.


[62] 2606.31338

Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models

Recent music audio-language models achieve high accuracy on instrument question-answering benchmarks, but it remains unclear whether this reflects robust audio grounding or benchmark-specific shortcuts. In this paper, we introduce an OpenMIC-derived diagnostic benchmark sequence for instrument grounding in music audio-language models, extending binary instrument-presence QA to genre-prior-reduced examples, confusable instrument discrimination, longer audio context, and temporal localization. Across these settings, high binary QA accuracy often fails to predict model behavior: models can exhibit option-position bias, confusable-instrument errors, and temporal response bias. These results suggest that instrument grounding should be evaluated with multi-axis diagnostic benchmarks rather than a single aggregate accuracy.


[63] 2606.31352

Dualformer: Efficient Feature Extractor for Complex-valued Blind Communication Signal Analysis

Designing effective feature extractors is critical for blind signal analysis tasks such as automatic modulation recognition (AMR), signal scheme recognition (SSR), and signal structure parsing (SSP). In this work, we propose a dual-channel neural network (DualNN) that efficiently exploits complex-valued signals through parameter sharing across IQ channels. Unlike traditional real-valued or complex-valued models, DualNN is a groundbreaking framework which shares the network parameters for processing the real and imaginary parts of the complex-valued signals, and is theoretically shown to reduce generalization error while preserving expressive capacity. Specifically, we propose a novel Transformer-based architecture to implement DualNN, called Dualformer. The Dualformer segments input signals into patch-level tokens and captures multi-granularity features, enabling robust performance across diverse signal analysis tasks. Furthermore, we conduct extensive experiments comparing Dualformer with three Transformer-based baselines and four conventional DL-based approaches. Results demonstrate consistent performance improvements on AMR, SSR, and SSP tasks. Besides, the modular design of DualNN allows it to generalize well to blind signal processing tasks such as blind source separation and low-SNR spectrum sensing. This work paves the way for a broader application of DualNN architectures in unsupervised and weakly supervised complex-valued signal analysis scenarios.


[64] 2606.31415

Ensuring Deterministic Timing in a Federated GNSS Correction Pipeline with Lingua Franca

Embedded systems that combine hardware interrupts, buffering, and distributed communication are often perceived as inherently asynchronous and difficult to analyze. However, such systems can exhibit a deterministic timing structure when modeled using explicit logical-time semantics. This paper presents a Global Navigation Satellite System (GNSS) correction-data pipeline implemented as a federated Lingua Franca (LF) application. The federated LF program decomposes the end-to-end pipeline into reactors with explicit time semantics, including a time-triggered GNSS receiver, a UART interrupt stream derived from baud rate and First-In First-Out (FIFO) buffer characteristics, a periodic forwarding task, and downstream processing with jitter monitoring. Federated execution and runtime logs validate the analytically derived deterministic timing structure, including interrupt cadence, ring-buffer evolution, packetization behavior, and physical--logical jitter, yielding a reproducible and predictable timing profile.
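
As a rough illustration of how an interrupt cadence can be derived from baud rate and FIFO characteristics, consider the sketch below. All parameter values are assumptions for illustration and do not come from the paper: 8N1 framing (10 bits on the wire per byte), a 115200 baud link, a FIFO that raises an interrupt every 8 received bytes, and a continuous back-to-back byte stream.

```python
def uart_interrupt_period_s(baud_rate, fifo_trigger_bytes, bits_per_frame=10):
    """Time between FIFO-threshold interrupts for a continuous byte stream.

    bits_per_frame=10 assumes 8N1 framing: 1 start bit, 8 data bits, 1 stop bit.
    """
    byte_time_s = bits_per_frame / baud_rate
    return fifo_trigger_bytes * byte_time_s


# Hypothetical configuration: 115200 baud, interrupt every 8 bytes.
period_s = uart_interrupt_period_s(baud_rate=115200, fifo_trigger_bytes=8)
```

Under these assumptions the period is 80 / 115200 seconds, roughly 694 microseconds. A fixed, analytically known period of this kind is what would let an interrupt stream be expressed with explicit time semantics rather than treated as asynchronous.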


[65] 2606.31595

Dilemmadata: On the Interoperability of Heterogeneous Roman Numeral Datasets

In recent years, there has been growing effort to annotate and collect large-scale corpora of Roman numeral analyses in support of data-driven studies in tonal harmony. We introduce dilemmadata, the first resource to reconcile two major collections, the AugmentedNet Dataset (AN) and the Distant Listening Corpus (DLC), making them interoperable through a shared note-wise TSV schema. The reconciliation confronts four families of dilemmata: annotation-standard (the two encode the same musical fact differently in terms of vocabulary size, syntax, conventions for chord extensions, inventory of special chord functions), representational (what counts as a row, and which information survives the conversion), toolchain (incompatible Python ecosystems built around music21 vs. ms3+dimcat), and curatorial (which pieces to include, exclude, or retain twice). We resolve each by deliberately transforming, augmenting, and omitting information, formalising the mismatches, preserving musical semantics, and flagging transformations that may subtly affect annotation fidelity. Consistency checks and qualitative inspections offer a preliminary assessment of post-conversion validity and a basis for critiquing the theoretical assumptions embedded in each original standard. After removing duplicates and merging the two collections, the resulting dilemmadata (1,621 pieces and approx. 2.8 M note-wise annotations) is the largest homogeneous Roman-numeral corpus currently available, albeit far from perfect. Crucially, we retain 84 pieces common to both corpora under each of their original analyses, yielding a shared reference set in which two equally legitimate analytical traditions can be compared note-for-note over identical musical material. Released on Zenodo, dilemmadata supports interoperability, comparative harmonization modeling, and future refinement of Roman-numeral encoding standards.


[66] 2606.31716

Gaussian Belief Propagation for Tracking With Unresolved Measurements

Unresolved measurements occur in many inference problems where two or more hidden processes may, at times, jointly generate a single measurement. For instance, such phenomena are encountered in multiobject tracking owing to the limited resolution capabilities of practical sensors; or in camera-aided autonomous driving due to shadowing or occlusions. Substantial performance degradation, such as track losses, is incurred when unresolved measurements are not accounted for. In this paper, we address multiobject tracking under a generalized unresolved measurement model, where any subset of objects may generate a single unresolved measurement according to a probabilistic model. Our innovation lies both in modeling and algorithm-design directions. First, we develop a probability distribution for object partitions based on a model of pairwise coupling of objects and subsequently a probability distribution for object-to-measurement association variables. This generic model incorporates sensor resolution capabilities, sensor detection, and sensor noise characteristics for object groups. Second, a generic Loopy Belief Propagation (LBP) method as well as a specialized Gaussian-LBP (GLBP) algorithm are proposed that perform object state inference under the aforementioned model. In contrast to direct marginalization methods, which involve a computational complexity of $O(m^n)$, for $m$ measurements and $n$ objects, the proposed GLBP algorithm achieves a computational complexity on the order of $O(m n 2^{n})$. Numerical results demonstrate the effectiveness of our proposed GLBP, with estimation performance that closely matches that of exact marginalization for only a fraction of the computational resources.
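
To give a feel for the gap between the two complexity orders quoted above, the sketch below simply evaluates the expressions $m^n$ and $m n 2^n$ for given problem sizes. The function name and example sizes are hypothetical. Constant factors hidden by the big-O notation are ignored, so the numbers indicate growth trends, not runtimes.

```python
def association_cost_orders(num_measurements, num_objects):
    """Return (m**n, m*n*2**n) for m measurements and n objects."""
    m, n = num_measurements, num_objects
    direct_marginalization = m ** n
    glbp = m * n * 2 ** n
    return direct_marginalization, glbp


# Hypothetical scene: 10 measurements, 8 objects.
direct, glbp = association_cost_orders(10, 8)
```

For 10 measurements and 8 objects, the expressions evaluate to 100,000,000 and 20,480. For very small scenes the ordering can reverse (2 measurements and 3 objects give 8 versus 48), so the advantage appears as the number of measurements grows.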


[67] 2606.31973

Semantic Leakage and Privacy Preservation in Relay-Assisted Semantic Communications

Semantic communication (SemCom) has emerged as a promising paradigm in which the transmission of task-relevant information is prioritized over raw data, enabling efficient and robust communication under resource and channel constraints. In this paper, the privacy implications of relay-assisted SemCom systems are studied, where the intermediate relay node operates directly on learned latent representations. It is shown that the relay, even without access to source data, can reliably infer semantic meaning and reconstruct signals with performance comparable to that of the legitimate receiver, revealing a fundamental privacy vulnerability of semantic representations. To address this issue, an iterative adversarial training framework is proposed in which a strong, adaptively trained eavesdropper at the relay is explicitly accounted for. The proposed approach alternates between optimizing the relay's eavesdropping function and the legitimate system, resulting in representations that preserve semantic decoding performance at the intended receiver while degrading semantic inference at the relay. The semantic accuracy gap between the legitimate receiver and the eavesdropper is significantly enlarged across channel conditions. Importantly, this protection is achieved in a stealthy manner, with high reconstruction fidelity maintained while semantic leakage is selectively suppressed.


[68] 2606.32010

Dual-Regime Absorbing Markov Chain Theory in Remote Estimation: Age-Minimizing Push Policies

For a remote estimation system, we study the optimization of age of incorrect information (AoII), which is a recently proposed semantic-aware information freshness metric. In particular, we assume an information source that observes a discrete-time finite-state Markov chain (DTMC), and occasionally transmits status update packets to a remote monitor which is tasked with remote estimation of the source. For the forward channel from the source to the monitor, we assume the channel delay to be modeled by a general discrete-time phase-type (DPH) distribution, whereas the reverse channel from the monitor to the source is assumed to be perfect, ensuring that the source has perfect information on the AoII and the remote estimate at the monitor, at all times. Push-based transmissions are initiated when AoII exceeds a threshold depending on the current estimation value, i.e., a multi-threshold policy. In this very general setting, our goal is to minimize a weighted sum of the time average of a polynomial function of AoII, depending on the remote estimate, and energy consumption from transmissions. We formulate the problem as a semi-Markov decision process (SMDP) with the same state space as the original DTMC to obtain the optimal multi-threshold policy, whereas the parameters of the SMDP are obtained by using a novel stochastic tool called dual-regime absorbing Markov chain (DR-AMC), and its corresponding absorption time distribution named dual-regime DPH (DR-DPH). The proposed method is validated with numerical examples using comparisons against other policies obtained by exhaustive search, and also various benchmark policies.


[69] 2410.11894

Automated Discovery of Operable Dynamics from Videos

Dynamical systems form the foundation of scientific discovery, traditionally modeled with predefined state variables such as the angle and angular velocity, and differential equations such as the equation of motion for a single pendulum. We introduce a framework that automatically discovers a low-dimensional and operable representation of system dynamics, including a set of compact state variables that preserve the smoothness of the system dynamics and a differentiable vector field, directly from video without requiring prior domain-specific knowledge. The prominence and effectiveness of the proposed approach are demonstrated through both quantitative and qualitative analyses of a range of dynamical systems, including the identification of stable equilibria, the prediction of natural frequencies, and the detection of chaotic and limit cycle behaviors. The results highlight the potential of our data-driven approach to advance automated scientific discovery.


[70] 2506.12997

MORIC: CSI Delay-Doppler Decomposition for Robust Wi-Fi-based Human Activity Recognition

The newly established IEEE 802.11bf Task Group aims to amend the WLAN standard to support advanced sensing applications such as human activity recognition (HAR). Although studies have demonstrated the potential of sub-7 GHz Wi-Fi Channel State Information (CSI) for HAR, existing methods often degrade substantially under realistic variations across users, environments, and sensing configurations. This work addresses the poor generalization of Wi-Fi-based HAR by extracting motion-centered representations that reduce dependence on static, environment-specific, and non-activity-related CSI magnitude and phase patterns. CSI signals are transformed into the delay-profile space and decomposed into multiple Doppler velocity projections, which are modeled as observations of a moving point's velocity from different unknown directions, analogous to virtual cameras observing the same motion with varying degrees of clarity. This yields a richer activity representation than either a single aggregated Doppler estimate or the spurious, environment-dependent CSI patterns used in prior works. Since these projections are unordered and may recur due to random multipath propagation, we introduce MORIC, a novel order- and repetition-invariant time-series classification model for robust Wi-Fi-based HAR. Experimental results on the collected dataset show that the proposed method outperforms state-of-the-art approaches in cross-user hand motion recognition, especially for challenging gestures. Incorporating only a few calibration samples further improves accuracy, demonstrating MORIC's adaptability and highlighting the potential of the proposed methodology for practical Wi-Fi sensing in real-world scenarios.


[71] 2506.23102

Region-Aware Multimodal Large Language Model via SlowFast Tokenization and Pseudo-Mask Guidance for 3D CT Report Generation

Current CT report generation frameworks predominantly rely on global feature representations, often failing to capture region-specific details and potentially missing certain abnormalities. To overcome this limitation, we propose MedRegion-CT, a region-focused multimodal large language model framework featuring three key innovations. First, we revisit the SlowFast strategy to jointly model global and fine-grained information and adapt it to the medical domain via a Region-based SlowFast Tokenizer that extracts tokens guided by clinically meaningful regions. Second, generated pseudo-masks guide the model to attend to diagnostically important anatomical regions, facilitating a systematic understanding of the overall scan context. Third, quantitative lesion information, including size, diameter, and spatial location, is encoded as structured textual prompts, enabling context-aware and clinically informed report generation. To enable rigorous evaluation, we validate our framework on multi-institutional structured report generation benchmarks. Experimental results demonstrate that MedRegion-CT achieves state-of-the-art performance, outperforming existing approaches in both linguistic quality and clinical accuracy. All code is publicly available at: this https URL.


[72] 2509.12698

Low-Altitude UAV Tracking via Sensing-Assisted Predictive Beamforming

Sensing-assisted predictive beamforming shows significant promise for enhancing various future unmanned aerial vehicle (UAV) applications in integrated sensing and communication (ISAC) systems. However, the impact of such a beamforming technique on communication reliability remains largely unexplored and is challenging to characterize. To fill this gap, this paper proposes a cellular-connected UAV tracking scheme leveraging extended Kalman filtering (EKF), where the predicted UAV trajectory, sensing duration ratio, and target constant received signal-to-noise ratio (SNR) are jointly optimized to maximize the outage capacity at each time slot. To address the implicit nature of the objective function, analytical outage probability (OP) approximations are proposed based on second-order Taylor expansions, providing an efficient and full characterization of outage capacity. Subsequently, an efficient algorithm is proposed based on a combination of bisection search and successive convex approximation (SCA) to address the non-convex optimization problem with guaranteed convergence. To further reduce computational complexity, a second efficient algorithm is developed based on alternating optimization (AO). Simulation results validate the accuracy of the derived OP approximations, the effectiveness of the proposed algorithms, and the significant outage capacity enhancement over various benchmarks. Furthermore, we show that the optimized predicted UAV trajectory tends to be parallel to the base station's uniform linear array antennas with a nonzero minimum distance, indicating a trade-off between decreasing path loss and enjoying wide beam coverage for outage capacity maximization.


[73] 2511.16757

Revisiting Audio-language Pretraining for Learning General-purpose Audio Representation

Audio-language pretraining (ALP) holds promise for learning general-purpose audio representation, yet remains underexplored. Crucially, there is no consensus on whether audio-language models can build effective general-purpose audio encoders, nor a systematic understanding of how pretraining objectives behave across diverse tasks and scales. We identify three key barriers: limited scale of audio-text corpora, limited coverage of audio attributes in existing caption corpora, and lack of systematic exploration and evaluation. To fill this gap, we present the first principled empirical study of ALP. We first introduce CaptionStew, a 10.7M caption dataset aggregating open-source audio-text corpora across multiple domains and captioning focuses. We then conduct the first comprehensive evaluation comparing contrastive and captioning objectives for learning audio representation across speech, music, and environmental sound tasks. Our results not only demonstrate that ALP yields competitive, transferable representations, but reveal critical trade-offs: contrastive learning offers superior data efficiency, while captioning exhibits better scalability. Furthermore, we find that the benefits of supervised initialization often diminish at larger scales, challenging common practices. By grounding these claims in empirical evidence, we establish a viable pathway toward general-purpose audio representation learning, guiding future research.


[74] 2512.05692

IMMPC: An Internal Model Based MPC for Rejecting Unknown Disturbances

Model predictive control (MPC) is a powerful control method that allows for the direct inclusion of state and input constraints into the controller design. However, errors in the model, e.g., caused by unknown disturbances, can lead to constraint violation, loss of feasibility, and deteriorated closed-loop performance. In this paper, we propose a new MPC scheme based on the internal model principle. This enables the MPC to reject unknown disturbances if the dynamics of the linear signal generator are known. We formulate the disturbance rejection problem as a stability problem to ensure feasibility, constraint satisfaction, and convergence to the optimal reachable output trajectory. The controller is validated on a four-tank system.


[75] 2601.08480

Quantitative Analysis of Proxy Tasks for Anomalous Sound Detection

Anomalous sound detection (ASD) typically involves self-supervised proxy tasks to learn feature representations from normal sound data, owing to the scarcity of anomalous samples. In ASD research, proxy tasks such as AutoEncoders operate under the explicit assumption that models trained on normal data will increase the reconstruction errors related to anomalies. A natural extension suggests that improved proxy task performance should improve ASD capability; however, this relationship has received little systematic attention. This study addresses this research gap by quantitatively analyzing the relationship between proxy task metrics and ASD performance across five configurations, namely, AutoEncoders, classification, source separation, contrastive learning, and pre-trained models. We evaluate the learned representations using linear probe (linear separability) and Mahalanobis distance (distributional compactness). Our experiments reveal that strong proxy performance does not necessarily improve anomalous sound detection performance. Specifically, classification tasks experience performance saturation owing to insufficient task difficulty, whereas contrastive learning fails to learn meaningful features owing to limited data diversity. Notably, source separation is the only task demonstrating a strong positive correlation, such that improved separation consistently improves anomaly detection. Based on these findings, we highlight the critical importance of task difficulty and objective alignment. Finally, we propose a three-stage alignment verification protocol to guide the design of highly effective proxy tasks for ASD systems.


[76] 2601.08758

M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. Such opaque reasoning processes lack reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at this https URL.


[77] 2601.12782

Sensing-Limited Control of Noiseless Linear Systems Under Nonlinear Observations

This paper investigates the fundamental information-theoretic limits for the control and sensing of noiseless linear dynamical systems subject to a broad class of nonlinear observations. We analyze the interactions between the control and sensing components by characterizing the minimum information flow required for stability. Specifically, we derive necessary conditions for mean-square observability and stabilizability, demonstrating that the average directed information rate from the state to the observations must exceed the intrinsic expansion rate of the unstable dynamics. Furthermore, to address the challenges posed by non-Gaussian distributions inherent to nonlinear observation channels, we establish sufficient conditions by imposing regularity assumptions, specifically log-concavity, on the system's probabilistic components. We show that under these conditions, the divergence of differential entropy implies the convergence of the estimation error, thereby closing the gap between information-theoretic bounds and estimation performance. By establishing these results, we unveil the fundamental performance limits imposed by the sensing layer, extending classical data-rate constraints to the more challenging regime of nonlinear observation models.
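
For orientation, the classical data-rate condition that this abstract says it extends can be sketched numerically. The snippet below assumes the standard linear, discrete-time result, in which the required information rate is the sum of log2 |lambda| over the unstable eigenvalues of the state matrix; the function name `min_rate_bits` is hypothetical, and the paper's own bound for nonlinear observations may take a different form.

```python
import math


def min_rate_bits(eigenvalues):
    """Classical lower bound (bits per time step) on the information rate
    needed to stabilise a discrete-time linear system: the sum of
    log2|lambda| over eigenvalues with magnitude greater than one.
    Illustrative assumption, not the paper's exact expression."""
    return sum(math.log2(abs(lam)) for lam in eigenvalues if abs(lam) > 1.0)


# Two unstable modes (2 and -4) and one stable mode (0.5).
rate = min_rate_bits([2.0, 0.5, -4.0])
print(f"required rate exceeds {rate:.1f} bits per step")
```

Stable modes contribute nothing to the bound, which is why only eigenvalues outside the unit circle are summed.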


[78] 2603.17415

Structured SIR: Efficient and Expressive Importance-Weighted Inference for High-Dimensional Image Registration

Image registration is an ill-posed dense vision task, where multiple solutions achieve similar loss values, motivating probabilistic inference. Variational inference has previously been employed to capture these distributions; however, restrictive assumptions about the posterior form can lead to poor characterisation, overconfidence and low-quality samples. More flexible posteriors are typically bottlenecked by the complexity of high-dimensional covariance matrices required for dense 3D image registration. In this work, we present a memory and computationally efficient inference method, Structured SIR, that enables expressive, multi-modal characterisation of uncertainty with high-quality samples. We propose the use of a Sampled Importance Resampling (SIR) algorithm with a novel memory-efficient high-dimensional covariance parameterisation as the sum of a low-rank covariance and a sparse, spatially structured Cholesky precision factor. This structure enables capturing complex spatial correlations while remaining computationally tractable. We evaluate the efficacy of this approach in 3D dense image registration of brain MRI data, which is a very high-dimensional problem. We demonstrate that our proposed method produces uncertainty estimates that are significantly better calibrated than those produced by variational methods, achieving equivalent or better accuracy. Crucially, we show that the model yields highly structured multi-modal posterior distributions, enabling effective and efficient uncertainty quantification.


[79] 2603.27831

Quantifying and Attributing Power Flexibility from GPU-Heavy Data Centers

The growth of GPU-heavy data centers has increased electricity demand and challenged grid stability. This paper investigates how an energy-aware job scheduling algorithm provides flexibility in GPU-heavy data centers. We develop a rolling-horizon optimization framework considering IT power and cooling dynamics with limited future job information. Compared with the first-in first-out baseline, we show that energy-aware scheduling brings latent power flexibility during peak-price periods. This flexibility is created through both thermal and computational mechanisms: cooling shifting can reliably reduce demand for short periods at relatively low incentive (\$30/MWh), and movement of backfilled jobs can often reduce demand at similar prices (\$30-300/MWh). Further reduction is possible through reordering or delaying jobs, but due to lost profits these actions come at higher prices (starting at \$600/MWh, more significantly above \$3000/MWh). Flexibility is achievable without knowing arriving jobs, but much greater flexibility can be achieved with perfect foresight of the future queue.


[80] 2604.09118

Efficient Uniform Feasible-Set Sampling for Approximate Linear MPC

Model Predictive Control (MPC) offers safe and near-optimal control but suffers from high computational costs. Approximate MPC (AMPC) mitigates this by learning a cheaper surrogate policy, typically by training a neural network on state-MPC input pairs. Generating training data is a major bottleneck, requiring solving the MPC for numerous states sampled from its feasible set. Since this feasible set is implicitly defined and unknown, efficient sampling is nontrivial but crucial. We propose the linear MPC Hit-and-Run (LMPC-HR) sampler for linear MPC with polyhedral constraints. We identify the feasible set boundaries along search directions, a crucial step within HR, by formulating the problem as a convex linear program, replacing expensive iterative searches with a single optimization step. A numerical study demonstrates that LMPC-HR reduces the computational cost of generating uniformly distributed samples from the feasible set by an order of magnitude compared to standard baselines.
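
The boundary search described in this abstract can be written as a single linear program. The formulation below is a sketch under assumed notation that does not appear in the abstract: the condensed linear MPC constraints are taken as $G U \le w + E x$ for an input sequence $U$ and initial state $x$, $x_0$ is the current feasible sample, and $d$ is the Hit-and-Run search direction. The feasible set is assumed to be bounded.

```latex
% Feasible set of the MPC (assumed condensed form):
%   X = { x : there exists U with  G U <= w + E x }
% Largest feasible step along direction d from x_0:
\begin{aligned}
t^{+} \;=\; \max_{t,\,U} \quad & t \\
\text{s.t.} \quad & G U - t\,E d \;\le\; w + E x_0 .
\end{aligned}
% t^{-} follows from the same constraints with the objective minimised.
% Hit-and-Run then draws t ~ Uniform[t^{-}, t^{+}] and sets x_1 = x_0 + t d.
```

Because $t$ and $U$ both enter the constraints linearly, each chord endpoint costs one LP solve rather than an iterative search over feasibility checks.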


[81] 2604.14410

Integrated Investment and Policy Planning for Power Systems via Differentiable Scenario Generation

We formulate a method to co-optimize power system capacity planning decisions and policy investments that shape electricity load patterns. To this end, we leverage a gradient-based solution technique that enables the efficient solution of operation-aware planning models. To compute gradients with respect to the conditions that define daily electricity demand profiles, we introduce and formalize the concept of differentiable scenario generation and show that generative machine learning models satisfy the mathematical requirements needed to compute consistent gradients. We demonstrate the feasibility of the proposed approach through numerical experiments using a diffusion model-based scenario generator and a stylized generation and capacity expansion planning model.


[82] 2604.15223

Eccentricity Confound in EEG-based Visual Attention Decoding from Gaze-Fixated Neural Tracking of Motion in Natural Videos

Objective. Decoding visual attention from brain signals during naturalistic video viewing has emerged as a new direction in brain-computer interface research. Current methods assume that stronger coupling between object motion and neural activity indicates higher attention, but this can be confounded by eye movement artifacts and stimulus properties. This study investigates how visual eccentricity (the distance between a visual object and the fixation point) affects neural responses when eye movement artifacts are controlled. Approach. EEG signals were recorded across three tasks that manipulated object eccentricity and attention conditions while participants maintained gaze fixation. Correlation analysis and match-mismatch decoding were performed to quantify the neural tracking of object motion. Main results. The analysis supports three conclusions: (1) neural tracking of object motion in natural videos works under gaze fixation; (2) the strength of this tracking under gaze fixation is predictive of attention; and (3) there exists a significant eccentricity confound in the EEG responses, with poorer neural tracking of motion at larger eccentricities. Significance. These results indicate that findings from previous free-viewing studies also reflect genuine neural processing rather than mere oculomotor artifacts. However, the identified eccentricity effect highlights a major limitation for current decoding approaches that assume coupling strength reflects attention levels alone.


[83] 2605.22425

Time-varying rPPG signal separation via block-sparse signal model

Remote photoplethysmography (rPPG) enables non-contact measurement of cardiac pulse signals by analyzing subtle color changes in facial videos. Nevertheless, extracting rPPG signals remains challenging because of their extremely weak signal strength and susceptibility to illumination noise. In this paper, we propose an rPPG signal extraction method that exploits the quasi-periodic characteristics of rPPG signals. Our approach models the quasi-periodicity of the rPPG signal, which arises from the stable cardiac cycle, as a block-sparse structure in the time-frequency domain. To incorporate the block-sparse model and enable adaptive signal separation under illumination fluctuations, we construct a time-varying signal separation framework. Experiments using a public dataset demonstrate the effectiveness of our method.


[84] 2605.23536

Utilizing Missed Detections in Directional Sensitivity-Based DOA Estimation

This paper introduces a signal strength-based direction of arrival (DOA) estimation approach for directional sensors that explicitly accounts for missed detections. In traditional phase-based DOA estimation frameworks, negative information from expected emitters that fall below the detection threshold falls outside the scope of standard measurement models. Unlike phase-based DOA estimation methods, the proposed approach relies only on received signal strength measurements. As a result, missed detections arise naturally from the sensing and detection process and convey valuable information via the known detection thresholds. By incorporating both detected signals and missed detections into the likelihood function, we develop a probabilistic estimation method that fully leverages the underlying measurement and detection models. Simulation results show that the proposed method significantly improves DOA estimation accuracy compared to baseline techniques, particularly in challenging scenarios with high missed-detection rates. Real-world experiments using Bluetooth Low Energy (BLE) signals and directional antennas further validate the effectiveness of the approach, demonstrating substantial performance gains. These findings highlight the value of modeling missed detections in sensor array processing and open new avenues for enhancing localization performance in wireless communication systems.
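
One common way to fold missed detections into a likelihood is the censored-measurement form sketched below. All symbols are assumptions introduced for illustration and are not taken from the paper: $\theta$ is the DOA, $g_i(\theta)$ is the expected received signal strength at sensor $i$ given its directional gain pattern, $\tau_i$ is that sensor's known detection threshold, and the noise is assumed Gaussian with standard deviation $\sigma$ in the dB domain.

```latex
% D = set of sensors that reported a detection with strength z_i
% M = set of sensors that reported nothing (missed detections)
L(\theta)
  \;=\; \prod_{i \in D} \frac{1}{\sigma}\,
        \phi\!\left(\frac{z_i - g_i(\theta)}{\sigma}\right)
  \;\times\; \prod_{j \in M}
        \Phi\!\left(\frac{\tau_j - g_j(\theta)}{\sigma}\right)
% \phi and \Phi are the standard normal pdf and cdf.
```

The second product is what carries the negative information: a direction $\theta$ that predicts a strong signal at a sensor that stayed silent is penalised, because $\Phi$ becomes small when $g_j(\theta)$ is well above $\tau_j$.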


[85] 2606.04869

Source Side Mitigation of AI Datacenter Power Fluctuations with a Hybrid Energy Storage System and Residual Differentiable Predictive Control

The rapid growth of hyperscale AI datacenters introduces structured, workload-driven active-power fluctuations at the point of interconnection. These fluctuations appear to the grid as time-varying disturbance injections that cannot be captured by conventional peak- or average-load representations. To reduce the residual power disturbance before it propagates into the bulk power system, this paper proposes a hybrid energy storage system with differentiable predictive control (HESS-DPC) framework for datacenter-side power smoothing. A workload-driven disturbance model is first established, representing the point-of-interconnection load deviation as the superposition of training and fine-tuning workloads to capture the structured forcing inputs that can excite generator frequency dynamics. A frequency-based rule-based controller then allocates this deviation between a battery energy storage system (BESS) and a supercapacitor (SC), assigning the energy-dominant component to the BESS and the fast-varying component to the SC. To overcome the anticipation and constraint limitations of fixed-frequency decomposition, a residual differentiable predictive control policy is trained offline to compute finite-horizon command corrections around the rule-based baseline while enforcing a one-step safeguard. Simulations on the NPCC 140-bus system show that HESS-DPC reduces grid-side residual deviations during workload transitions, improves SC state-of-charge sustainability over extended operation, and reduces generator peak-to-peak frequency deviations by more than 80 percent across all monitored generators, with the worst-affected generator response falling from 15.1 mHz to 1.3 mHz. These results confirm that local active-power smoothing at the datacenter point of interconnection can substantially mitigate frequency disturbances caused by AI workloads.


[86] 2606.29737

Effective Depth in Joint Source-Channel Coding: An Implicit Equilibrium Analysis

A fundamental design question in deep joint source-channel coding (Deep JSCC) remains insufficiently explored: given a channel signal-to-noise ratio (SNR), what effective computation depth is required for semantic reconstruction? Existing Deep JSCC systems typically employ fixed-depth neural architectures selected through empirical hyperparameter tuning, which may lead to unnecessary computation under favorable channel conditions and insufficient refinement under severe channel noise. This paper proposes Implicit-JSCC, an implicit equilibrium framework in which semantic encoding and decoding are formulated as fixed-point equilibrium processes. The effective encoder and decoder depths are determined by residual-based solver convergence rather than manually predefined layer numbers, while parameter sharing across equilibrium iterations enables depth-independent parameter complexity. To analyze the resulting effective-depth behavior, we develop a Gaussian-process-inspired kernel evolution framework that models equilibrium iterations as an effective-depth propagation process. Since channel noise is injected between the encoder and decoder, the analysis tracks channel-induced representation perturbations across receiver-side equilibrium iterations and derives a theory-guided depth-SNR relationship. After offline calibration of the system-specific parameters, the resulting model characterizes the required receiver-side refinement depth under different SNRs. Extensive experiments show that Implicit-JSCC achieves competitive reconstruction performance while enabling residual-based adaptive inference and controllable computation-quality tradeoffs. The depth-SNR model further provides a characterization of the SNR-dependent refinement depth required to reach a prescribed perturbation tolerance.


[87] 2310.05507

MEDUSA: Scalable Biometric Sensing in the Wild through Distributed MIMO Radars

Radar-based techniques for detecting vital signs have shown promise for continuous contactless vital sign sensing and healthcare applications. However, real-world indoor environments pose significant challenges for existing vital sign monitoring systems. These include signal blockage in non-line-of-sight (NLOS) situations, movement of human subjects, and alterations in location and orientation. Additionally, existing systems fail to address the challenge of tracking multiple targets simultaneously. To overcome these challenges, we present MEDUSA, a novel coherent ultra-wideband (UWB) based distributed multiple-input multiple-output (MIMO) radar system that allows users to customize and disperse the $16 \times 16$ array into sub-arrays. MEDUSA takes advantage of the diversity benefits of distributed yet wirelessly synchronized MIMO arrays to enable robust vital sign monitoring in real-world and daily living environments where human targets are moving and surrounded by obstacles. We have developed a scalable, self-supervised contrastive learning model which integrates seamlessly with our hardware platform. Each attention weight within the model corresponds to a specific antenna pair of Tx and Rx. The model proficiently recovers accurate vital sign waveforms by decomposing and correlating the mixed received signals, which comprise human motion, mobility, noise, and vital signs. Through extensive evaluations involving 21 participants and over 200 hours of collected data (3.75 TB in total, with 1.89 TB for static subjects and 1.86 TB for moving subjects), MEDUSA's performance has been validated, showing an average gain of 20% compared to existing systems employing COTS radar sensors. This demonstrates MEDUSA's spatial diversity gain for real-world vital sign monitoring, encompassing target and environmental dynamics in familiar and unfamiliar indoor environments.


[88] 2412.02798

Filterless Snapshot Hyperspectral Imaging using Guided Patch Diffusion

We consider the problem of reconstructing a $H\times W\times 31$ hyperspectral image from a $H\times W$ grayscale snapshot measurement that is captured using only a single diffractive lens and a filterless panchromatic photosensor. This problem is severely ill-posed, but we present a model that produces high-quality results in simulation and experiment. We make efficient use of limited training data by creating a conditional denoising diffusion model that operates on small patches in a shift-invariant manner. During inference, we synchronize per-patch hyperspectral predictions using guidance by physical consistency with the system's optical point spread function. Our experiments reveal that the patch size can be as small as the point spread function, with local optical cues being the main source of information about complete spectra. Also, by drawing multiple samples, our model provides per-pixel uncertainty estimates that strongly correlate with reconstruction error.


[89] 2504.20653

SysVCoder: An LLM-Driven Framework for Systematic Generation of System-Level Design

Recent advances in large language models (LLMs) have demonstrated strong potential in generating hardware designs using hardware description languages (HDLs) such as Verilog. However, existing LLM-based frameworks struggle to accurately capture the complexity of real-world architectural designs, particularly for large-scale systems with hierarchical, multi-level module instantiations. To address this issue, we present SysVCoder, an LLM-driven framework that enhances both the generation quality and efficiency of system-level design in Verilog. SysVCoder introduces a two-stage generation pipeline that leverages an intermediate representation to enable a more structured and accurate translation from natural language specifications to complex multi-module designs. Furthermore, we incorporate a rule-based alignment mechanism and a domain-specific retrieval-augmented generation strategy (DS-RAG) to enhance functional correctness by grounding LLM outputs in domain knowledge. We also present SysVDB, a comprehensive dataset comprising 60 system-level hardware designs along with their corresponding verification testbenches. Experimental results demonstrate that SysVCoder outperforms state-of-the-art frameworks such as CodeV and VeriGen by 30.7% and 38.3% in terms of functional correctness under the same base LLM. Notably, SysVCoder achieves performance comparable to NVIDIA's GPT-4 based VerilogCoder while using only a 7B-parameter model, reducing token consumption by 7.6x and synthesis latency by 37.5x. Both SysVCoder and SysVDB are made public at this https URL.


[90] 2506.20995

Step-by-Step Video-to-Audio Synthesis via Negative Audio Guidance

We propose a step-by-step video-to-audio (V2A) generation method that provides finer control over the generation process and more realistic audio synthesis. Inspired by traditional Foley workflows, our approach enables incremental generation of complementary sounds, allowing users to author multiple sound events induced by a video. To avoid the need for costly multi-reference video-audio datasets, each generation step is formulated as a negatively guided V2A process that discourages duplication of sounds already present in previously generated tracks. The guidance model is trained by finetuning a pre-trained V2A model on audio pairs from non-overlapping segments of the same video, encouraging it to leverage acoustic context while remaining visually grounded, and enabling training with standard single-reference audiovisual datasets. Objective and subjective evaluations demonstrate that our method enhances the separability of generated sounds at each step and improves the overall quality of the final composite audio, outperforming existing baselines. Our project page is available at: this https URL.


[91] 2508.08237

VGGSounder: Audio-Visual Evaluations for Foundation Models

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.


[92] 2603.02794

An Interpretable, Controllable Time-Varying IIR Denoiser for On-Device Assistive Hearing

We present TVF (Time-Varying Filtering), an interpretable, low-latency speech enhancement model for real-time, on-device assistive hearing. A lightweight neural controller predicts, in real time, the coefficients of a differentiable cascade of 35 second-order IIR filters (biquads), so the model tracks non-stationary noise while keeping a fully interpretable processing chain: every spectral modification is an explicit, adjustable equalizer curve rather than an opaque 'black-box' transform. Because the biquad cascade carries the signal processing, the controller can be made very small, driving the cascade with only 24k parameters at a 10.7 ms algorithmic latency, within hearing-aid budgets, and running entirely on-device so that audio never leaves the device. We also expose the suppression-versus-preservation trade-off as an explicit control: it can be set during training through the loss weighting, and adjusted at inference, with no retraining, by mixing the noisy input with the denoised output. On hearing-aid metrics (HASPI/HASQI) the 24k model stays within about 0.02 of DFNet3 (2.3M parameters, almost two orders of magnitude larger) while using about 29x fewer multiply-accumulates, although larger black-box models still lead on reference metrics such as PESQ. We present TVF as a proof of concept for a compact, interpretable, and controllable denoiser for on-device assistive hearing.
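
The two mechanisms named in this abstract, a biquad cascade with time-varying coefficients and an inference-time mix of noisy and denoised audio, can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names, the choice of peaking-EQ sections (standard audio-EQ-cookbook formulas), the frame-wise coefficient update, and the `wet` mixing parameter are all assumptions. In TVF the coefficients would come from the neural controller, which is omitted here.

```python
import math


def peaking_biquad(f0, gain_db, q, fs):
    """Peaking-EQ biquad coefficients (audio-EQ-cookbook form), normalised
    so that a0 == 1. Returns (b0, b1, b2, a1, a2). Assumed section type."""
    a = 10 ** (gain_db / 40)
    w0 = 2 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2 * q)
    b0, b1, b2 = 1 + alpha * a, -2 * math.cos(w0), 1 - alpha * a
    a0, a1, a2 = 1 + alpha / a, -2 * math.cos(w0), 1 - alpha / a
    return (b0 / a0, b1 / a0, b2 / a0, a1 / a0, a2 / a0)


def run_cascade(x, coeffs_per_frame, frame_len):
    """Filter x through a biquad cascade whose coefficients change per frame.
    coeffs_per_frame[k] is the list of section coefficients for frame k;
    filter state is carried across frame boundaries."""
    n_sections = len(coeffs_per_frame[0])
    state = [[0.0, 0.0] for _ in range(n_sections)]  # transposed direct form II
    y = []
    for n, sample in enumerate(x):
        frame = min(n // frame_len, len(coeffs_per_frame) - 1)
        s = sample
        for sec, (b0, b1, b2, a1, a2) in enumerate(coeffs_per_frame[frame]):
            z1, z2 = state[sec]
            out = b0 * s + z1
            state[sec][0] = b1 * s - a1 * out + z2
            state[sec][1] = b2 * s - a2 * out
            s = out  # output of this section feeds the next one
        y.append(s)
    return y


def mix(noisy, denoised, wet):
    """Inference-time trade-off: wet=0 keeps the noisy input, wet=1 keeps
    the fully denoised output (hypothetical parameter name)."""
    return [(1 - wet) * n + wet * d for n, d in zip(noisy, denoised)]


# Two frames of four samples: a 6 dB cut at 1 kHz, then a flat section.
fs = 16000.0
frames = [[peaking_biquad(1000.0, -6.0, 1.0, fs)],
          [peaking_biquad(1000.0, 0.0, 1.0, fs)]]
noisy = [1.0, 0.5, -0.25, 0.0, 0.3, -0.1, 0.2, 0.0]
denoised = run_cascade(noisy, frames, frame_len=4)
blended = mix(noisy, denoised, wet=0.7)
```

Each section is an ordinary equalizer band, so the gain, centre frequency, and Q applied at any instant can be read off directly, which is the interpretability argument the abstract makes.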


[93] 2603.16424

Early-Terminable Energy-Safe Iterative Coupling for Parallel Simulation of Partitioned Port-Hamiltonian Systems

Parallel simulation of robotic systems requires partitioning the dynamics into coupled subsystems. Finite-iteration coupling across the partition boundary can inject spurious energy, even when each subsystem is passive. We propose an early-terminable, energy-safe coupling interface for port-Hamiltonian subsystems based on Douglas-Rachford splitting in wave (scattering) coordinates. The wave-domain formulation reduces passivity to norm inequalities and coupling to orthogonality. Within this setting, the deep correspondence between monotone operator theory and discrete passivity can be exploited to construct a Douglas-Rachford inner iteration whose Fejér monotonicity provides algorithmic dissipation. Under passivity of the subsystem integrators and an impedance-tuning condition, the proposed method guarantees discrete passivity of the augmented storage for any finite inner-iteration budget and converges to the monolithic discretization as the budget increases. Experiments on a linear-Duffing coupled-oscillator benchmark support the finite-iteration energy inequality at numerical roundoff (1e-14 in double precision), with state-error metrics decreasing over the tested inner-iteration budgets.


[94] 2603.23297

Drop-In Perceptual Optimization for 3D Gaussian Splatting

Despite their output being ultimately consumed by human viewers, 3D Gaussian Splatting (3DGS) methods often rely on ad-hoc combinations of pixel-level losses, resulting in blurry renderings. To address this, we systematically explore perceptual optimization strategies for 3DGS by searching over a diverse set of distortion losses. We conduct the first-of-its-kind large-scale human subjective study on 3DGS, involving 39,320 pairwise ratings across several datasets and 3DGS frameworks. A regularized version of Wasserstein Distortion, which we call WD-R, emerges as the clear winner, excelling at recovering fine textures without incurring a higher splat count. WD-R is preferred by raters more than $2.3\times$ over the original 3DGS loss, and $1.5\times$ over the current best method Perceptual-GS. WD-R also consistently achieves state-of-the-art LPIPS, DISTS, and FID scores across various datasets, and generalizes across recent frameworks, such as Mip-Splatting and Scaffold-GS, where replacing the original loss with WD-R consistently enhances perceptual quality within a similar resource budget (number of splats for Mip-Splatting, model size for Scaffold-GS), and leads to reconstructions being preferred by human raters $1.8\times$ and $3.6\times$, respectively. We also find that this carries over to the task of 3DGS scene compression, with $\approx 50\%$ bitrate savings for comparable perceptual metric performance.


[95] 2604.01897

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To address these problems, we propose FastTurn, a unified framework for low-latency and robust turn detection. To reduce latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.


[96] 2604.04834

E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

Robotic Vision-Language-Action (VLA) models generalize well for open-ended manipulation, but their perception is fragile under sensing-stage degradations such as extreme low light, motion blur, and black clipping. We present E-VLA, an event-augmented VLA framework that improves manipulation robustness when conventional frame-based vision becomes unreliable. Instead of reconstructing images from events, E-VLA directly leverages motion and structural cues in event streams to preserve semantic perception and perception-action consistency under adverse conditions. We build an open-source teleoperation platform with a DAVIS346 event camera and collect a real-world synchronized RGB-event-action manipulation dataset across diverse tasks and illuminations. We also propose lightweight, pretrained-compatible event integration strategies and study event windowing for stable deployment. Experiments show that even a simple parameter-free fusion, i.e., overlaying accumulated event maps onto RGB images, could substantially improve robustness in dark and heavy-blur scenes: on Pick-Place at 20 lux, success increases from 0% (image-only) to 60% with overlay fusion and to 90% with our event adapter; under severe motion blur (1000 ms-exposure proxy), Pick-Place improves from 0% to 20-25%, and Sorting from 5% to 32.5%. Overall, E-VLA provides systematic evidence that event-driven perception can be effectively integrated into VLA models, pointing toward robust embodied intelligence beyond conventional frame-based imaging. Code and dataset will be available at this https URL.


[97] 2604.09344

DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio

Full-duplex dialogue audio, in which each speaker is recorded on a separate track, is an important resource for spoken dialogue research, but is difficult to collect at scale. Most in-the-wild two-speaker dialogue is available only as degraded monaural mixtures, making it unsuitable for systems requiring clean speaker-wise signals. We propose DialogueSidon, a model for joint restoration and separation of degraded monaural two-speaker dialogue audio. DialogueSidon combines a variational autoencoder (VAE), which operates on speech self-supervised learning (SSL) model features and compresses them into a compact latent space, with a diffusion-based latent predictor that recovers speaker-wise latent representations from the degraded mixture. Experiments on English, multilingual, and in-the-wild dialogue datasets show that DialogueSidon substantially improves intelligibility and separation quality over a baseline, while also achieving much faster inference.


[98] 2604.18546

Wasserstein Distributionally Robust Risk-Sensitive Estimation via Conditional Value-at-Risk

We propose a distributionally robust approach to risk-sensitive estimation of an unknown signal x from an observed signal y. The observation and unknown signal are modeled as random vectors whose joint probability distribution is unknown, but assumed to belong to a given type-2 Wasserstein ball of distributions, termed the ambiguity set. The performance of an estimator is measured according to the conditional value-at-risk (CVaR) of the squared estimation error. Within this framework, we study the problem of computing affine estimators that minimize the worst-case CVaR over all distributions in the given ambiguity set. As our main result, we show that, when the nominal distribution at the center of the Wasserstein ball is finitely supported, such estimators can be exactly computed by solving a tractable semidefinite program. We evaluate the proposed estimators on a wholesale electricity price forecasting task using real market data and show that they deliver lower out-of-sample CVaR of squared error compared to existing methods.
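To make the performance measure concrete, the following is a minimal sketch of the empirical CVaR of a set of squared errors, computed with the standard Rockafellar-Uryasev formula CVaR_alpha(L) = min_t { t + E[(L - t)^+] / (1 - alpha) }. The function name and the convention that `alpha` is the tail level (alpha = 0 gives the mean) are assumptions; the paper may use a different parameterization. This evaluates the risk measure on fixed samples only and is not the worst-case semidefinite program described in the abstract.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha in [0, 1) of a list of losses.

    Uses the Rockafellar-Uryasev objective. The objective is convex and
    piecewise linear in t with kinks at the samples, so its minimum is
    attained at one of the sample values.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    n = len(losses)

    def objective(t):
        excess = sum(max(loss - t, 0.0) for loss in losses)
        return t + excess / ((1.0 - alpha) * n)

    return min(objective(t) for t in losses)


# Hypothetical usage: squared errors of an estimator on four samples.
truth = [1.0, 2.0, 3.0, 4.0]
estimate = [1.0, 1.0, 1.0, 1.0]
squared_errors = [(x - x_hat) ** 2 for x, x_hat in zip(truth, estimate)]
tail_risk = empirical_cvar(squared_errors, alpha=0.5)
```

With equally weighted samples and alpha = 0.5, the value is the average of the worst half of the losses, which is why CVaR penalizes large estimation errors more heavily than the mean squared error does.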


[99] 2604.19569

Lyapunov-Certified Direct Switching Theory for Q-Learning

Q-learning is a fundamental algorithmic primitive in reinforcement learning. This paper develops a new framework for analyzing Q-learning from a switching linear system (SLS) viewpoint. In particular, we derive a stochastic SLS representation of the Q-learning error, and a finite-time error analysis through the joint spectral radius (JSR) of the corresponding SLS model, where the JSR is the exact worst-case exponential rate of the associated SLS. To the best of our knowledge, this is the first convergence rate analysis of standard Q-learning whose leading exponential rate is expressed through the JSR. The resulting rate is tied to the intrinsic worst-case exponential rate of the direct SLS representation and can be sharper than row-sum upper bounds when those bounds are conservative.


[100] 2605.13028

Local Conformal Calibration of Dynamics Uncertainty from Semantic Images

We introduce Observation-aware Conformal Uncertainty Local-Calibration (OCULAR), a conformal prediction-based algorithm that uses perception information to provide uncertainty quantification guarantees for unseen test-time environments. While previous conformal approaches lack the ability to discriminate between state-action space regions leading to higher or lower model mismatch, and require environment-specific data, our method uses data collected from visually similar environments to provably calibrate a linear Gaussian dynamics model of arbitrary fidelity. The prediction regions generated from OCULAR are guaranteed to contain the future system states with, at least, a user-set likelihood, despite both aleatoric and epistemic uncertainty -- i.e., uncertainty arising from both stochastic disturbances and lack of data. Our guarantees are non-asymptotic and distribution-free, not requiring strong assumptions about the unknown real system dynamics. Our calibration procedure enables distinguishing between observation-velocity-action inputs leading to higher and lower next-state-uncertainty, which is helpful for probabilistically-safe planning. We numerically validate our algorithm on a double-integrator system subject to random perturbations and significant model mismatch, using both a simplified sensor and a more realistic simulated camera. Our approach calibrates approximate uncertainty estimates both when in-distribution and out-of-distribution, producing volume-efficient prediction regions without requiring environment-specific data.
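For readers unfamiliar with conformal calibration, the sketch below shows the standard split-conformal quantile that underlies distribution-free coverage guarantees of this kind. It is not OCULAR's local, observation-aware procedure, which the abstract does not specify in detail; the function name and the use of scalar nonconformity scores (for example, prediction-error norms on held-out calibration data) are assumptions.

```python
import math


def conformal_quantile(scores, miscoverage):
    """Return the split-conformal threshold for calibration scores.

    With n exchangeable calibration scores, the ceil((n + 1) * (1 - miscoverage))-th
    smallest score bounds a new score with probability at least 1 - miscoverage.
    If n is too small for the requested level, the threshold is infinite.
    """
    if not 0.0 < miscoverage < 1.0:
        raise ValueError("miscoverage must lie in (0, 1)")
    n = len(scores)
    k = math.ceil((n + 1) * (1.0 - miscoverage))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]
```

A prediction region is then the set of next states whose score under the nominal dynamics model does not exceed this threshold. Calibrating the threshold separately for different inputs is what allows regions to grow or shrink with the local model mismatch.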


[101] 2606.06790

Learning All-Terrain Locomotion for a Planetary Rover with Actively Articulated Suspension

This paper presents ERNEST, a four-wheeled planetary rover concept equipped with a two-degree-of-freedom Active Gimbal Suspension that combines yaw and roll actuation to enable wheel reconfiguration, steering, and active load redistribution. A single neural network controller, trained to track a desired path across challenging terrain, fully unlocks the capabilities of this actuated suspension system for autonomous obstacle negotiation. A reinforcement learning framework is developed using the high-fidelity DARTS simulation engine, which combines rigid-contact dynamics and Bekker-Wong terramechanics, enabling the emergence of locomotion strategies adapted to loose-soil conditions. To obtain a single unified controller across heterogeneous terrains, a policy consolidation strategy merges the experience of terrain-specialized agents into one neural network, eliminating the need for explicit terrain classification and controller switching. The resulting controller operates on a combination of proprioceptive and exteroceptive feedback, including sparse stereo-derived terrain elevation, chassis attitude, joint states, and force-torque measurements. Zero-shot transfer to the physical rover is achieved through domain randomization, sensor noise injection, and model-to-real system identification. Experimental results demonstrate autonomous traversal of rock fields, a Bickler trap (bump obstacle), a wheel-high step, sand ripples, and sandy slopes. On a 20° sandy slope, the learned controller reduces the cost of transport by 37% on dry sand despite the additional actuation, and achieves superior performance on wet sand where the passive suspension becomes completely immobilized. A video accompanying this paper is available at this https URL


[102] 2606.10410

A Comprehensive Inference-Time Augmentation Framework in Physiological Signals: Application to PPG-Based AF Detection

Objective: Accurate classification of physiological signals in real-world deployments is challenged by sensor noise, motion artifacts, and distribution shifts between training and deployment data. Inference-time augmentation (ITA), which applies augmentations during inference rather than retraining, offers a simple, model-agnostic mechanism to improve robustness. However, ITA application to physiological signals has remained narrow in scope, relying on limited augmentation methods with fixed, unoptimized parameters. This work proposes a unified ITA framework to address that gap. Approach: The framework incorporates 13 augmentation methods spanning time-domain, amplitude-domain, frequency-domain, and artifact-injection transformations, with hyperparameters optimized via Bayesian optimization. We evaluate on atrial fibrillation (AF) detection from 30-second PPG signals using GPT-PPG and ResNet across five datasets comprising more than 400 patients and ${\sim}$9,800 hours of recording. Main results: Standard ITA consistently improved AUROC (up to 8.5% for GPT-PPG and 0.7% for ResNet) and AUPRC (up to 10.6% for GPT-PPG and 0.8% for ResNet). Selective ITA further reduced average FPR by up to 4.4% (GPT-PPG) and 1.3% (ResNet) on non-AF datasets. Significance: These findings establish ITA as a practical, model-agnostic approach for improving PPG-based AF classification reliability in deployment settings where retraining is not feasible, with broader applicability to physiological signal analysis.
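The basic ITA mechanism can be sketched as follows: the classifier is applied to the original signal and to several augmented copies, and the predictions are aggregated. The mean aggregation, the two toy augmentations, and all function names are illustrative assumptions; the paper's 13 augmentation methods, its Bayesian hyperparameter search, and its selective variant are not reproduced here.

```python
def scale(factor):
    """Hypothetical amplitude-domain augmentation: multiply every sample."""
    return lambda signal: [factor * v for v in signal]


def circular_shift(k):
    """Hypothetical time-domain augmentation: rotate the signal by k samples."""
    return lambda signal: signal[-k:] + signal[:-k]


def predict_with_ita(model, signal, augmentations):
    """Average the model's probability over the original and augmented views.

    `model` is any callable mapping a signal (list of floats) to a probability,
    so the procedure is model-agnostic and needs no retraining.
    """
    views = [signal] + [augment(signal) for augment in augmentations]
    probabilities = [model(view) for view in views]
    return sum(probabilities) / len(probabilities)
```

In use, `model` would wrap a trained classifier such as the GPT-PPG or ResNet models named in the abstract, and the augmentation parameters (here `factor` and `k`) would be the quantities tuned by the hyperparameter search.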


[103] 2606.14027

Same-Origin Policy for Agentic Browsers

Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at this https URL.
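As background, two URLs are same-origin when their scheme, host, and port all match. The sketch below illustrates that comparison with Python's standard library; it is a simplified illustration of the general rule, not SOPGuard's enforcement mechanism, and the function names and default-port table are assumptions introduced here.

```python
from urllib.parse import urlsplit

# Default ports for the two schemes handled in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) tuple that identifies a URL's origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Fill in the scheme's default port when the URL omits it.
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)


def same_origin(url_a, url_b):
    """True when both URLs share scheme, host, and port."""
    return origin(url_a) == origin(url_b)
```

Under this rule, `https://example.com/a` and `https://example.com:443/b` share an origin, while `https://example.com` and `https://api.example.com` do not. An agent that reads content from one of the latter pair and submits it to the other creates the kind of cross-origin flow the abstract is concerned with.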