New articles on Electrical Engineering and Systems Science


[1] 2606.14745

Compositional small-gain and small-phase stability analysis

We adapt the small-gain and small-phase stability analysis to systems composed of MIMO LTI subsystems via an iteration of the series, parallel, and feedback interconnections. Based on a certain set of parameters (including the phase sector and the Crawford number) of the subsystems, we bound the same set of parameters of each consecutive interconnection. This is called "composition of bounds." Composing the gains of subsystems in the same manner enables the small-gain-or-phase analysis of fairly complex systems. The method is illustrated with two examples that are not tractable by the classical small-phase analysis, and one other example that nevertheless benefits from the compositional approach.
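As a rough illustration of the gain half of this composition (not the authors' phase-sector or Crawford-number bounds), the standard induced-norm gain bounds for the three interconnections can be chained as below. The function names and numeric gains are hypothetical, and the feedback bound assumes the loop-gain product stays below one.

```python
def series_gain(g1, g2):
    """Bound for a series connection: ||G2 G1|| <= g1 * g2."""
    return g1 * g2


def parallel_gain(g1, g2):
    """Bound for a parallel connection: ||G1 + G2|| <= g1 + g2."""
    return g1 + g2


def feedback_gain(g_forward, g_return):
    """Bound for negative feedback G (I + H G)^{-1}.

    Valid only under the small-gain condition g_forward * g_return < 1,
    in which case the closed-loop gain is at most g / (1 - g * h).
    """
    loop = g_forward * g_return
    if loop >= 1.0:
        raise ValueError("small-gain condition violated: loop gain >= 1")
    return g_forward / (1.0 - loop)


# Hypothetical subsystem gains, composed step by step:
# a series pair in the forward path, a parallel pair in the return path.
forward = series_gain(0.5, 1.2)
ret = parallel_gain(0.3, 0.4)
composed = feedback_gain(forward, ret)
print(f"bound on closed-loop gain: {composed:.4f}")
```

Each call consumes only the bounds of its operands, which is what lets the analysis proceed one interconnection at a time.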


[2] 2606.14750

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show that Pixel-TTS achieves performance competitive with strong baselines, along with faster convergence and robust zero-shot generalization.


[3] 2606.14791

From Physics to Representation: Audio Learning with Synthetic Pre-training via Procedural Generation

Self-supervised learning advances audio representation for multimedia analysis. However, prevailing data-centric approaches rely on massive real-world corpora, increasing training costs, curation burdens, and privacy barriers. To address this, we present AudioPG, a procedural synthesis framework eliminating real audio recordings during pre-training. AudioPG trains a Transformer-based masked autoencoder on waveforms generated on-the-fly from basic acoustic primitives and composition rules. The encoder transfers effectively to real audio benchmarks, achieving 90.60% accuracy on ESC-50, 0.546 mAP on FSD50K, 88.17% on UrbanSound8K, and 97.03% on Speech Commands V2. Notably, pre-training completes in under 20 minutes on a single GPU. Latent space analysis reveals physical factors, including fundamental frequency and relative intensity, emerge in orthogonal subspaces, making representations linearly decodable. These results establish procedural synthesis as an efficient, interpretable pre-training signal when large-scale corpora are unavailable. Our code is publicly available.


[4] 2606.14808

Explainable Task-Oriented Token Communication for AI-Native 6G Networks

The integration of Foundation Models (FMs) and wireless communications is driving the evolution of image communication from bit-accurate transmission toward task-oriented transmission. However, existing task-oriented image communication methods still face three major challenges: insufficient task-oriented Token representation, inadequate collaboration between Visual Tokens and Task Tokens, and limited interpretability of task decisions. To address these challenges, we propose an Explainable Task-Oriented Token Communication (ET-TokenCom) framework. By treating Tokens as unified units for information representation and transmission, the proposed framework constructs an end-to-end communication link that spans visual perception, wireless transmission, and task reasoning. At the transmitter, the ET-TokenCom framework extracts Visual Tokens from images to preserve low-level visual information. Meanwhile, Task Tokens generated by the FM are introduced to represent the target information and decision intent required by the current task. A Cross-Modal Attention (CMA) fusion mechanism is further designed, enabling Task Tokens to explicitly guide the selection, weighting, and transmission of Visual Tokens. At the receiver, the framework integrates Token decoding with an explainable output mechanism, where attention heatmaps are generated to highlight critical perceptual regions under different task objectives and reveal the influence of Task Tokens on the outputs. Finally, simulation results validate the effectiveness and robustness of the proposed ET-TokenCom framework.


[5] 2606.14828

Leptomeningeal Collateral Detection on DSA via Vessel-Graph Neural Networks

Leptomeningeal collaterals (LMCs) are an important prognostic factor in acute ischemic stroke. Existing automated methods rely on CT angiography (CTA), but individual LMCs are often too small to be resolved on CTA, limiting these methods to coarse collateral scoring. Digital subtraction angiography (DSA) visualizes individual collaterals at superior resolution, yet current assessment remains subjective, relying on manual grading scales that suffer from poor inter-rater agreement. We present a framework that formulates collateral detection as the classification of individual vessel segments on a graph derived from DSA. A hybrid graph-pixel architecture combines a topology-aware graph branch with a dense pixel branch, fused in a shared node-probability space. In a five-fold cross-validation setting, the fused model achieves a PR-AUC of 0.434, outperforming the graph-only (0.403) and pixel-only (0.362) baselines. To our knowledge, this is the first method to enable the individualization of LMCs in DSA, allowing for precise per-vessel quantitative assessment. This integration shifts DSA assessment toward objective evaluation, supporting future biomarker and pattern discovery for individual LMCs.


[6] 2606.14897

An Analytical Methodology for Quantifying Airspace Conflict Rate and Complexity

Air traffic growth, advanced air mobility, and increasingly autonomous operations are driving the need for scalable and adaptive airspace design methodologies. Central to this challenge is quantifying how traffic flow structure and demand, governed in part by airspace geometry, influence conflict generation and operational complexity. This paper presents an analytical framework for computing conflict rate and conflict probability in structured airspace using stochastic flow models. Traffic streams are modeled as renewal processes with prescribed inter-arrival time distributions, while interactions between flows are captured through geometry-dependent minimum spacing constraints at merges and crossings. Within this formulation, closed-form upper bounds on the expected conflict rate and conflict probability per aircraft are derived as functions of flow configuration and demand. These metrics are interpreted as complementary measures of airspace complexity, reflecting controller workload and per-aircraft operational risk. The methodology is applied to representative hexagonal cell geometries with varying routing structures and flow distributions. Results reveal non-monotonic tradeoffs between routing flexibility, capacity, and conflict generation, with intermediate flow configurations outperforming both highly constrained and highly distributed cases. The proposed framework provides a tractable tool for evaluating airspace design alternatives and complexity-informed traffic management strategies.


[7] 2606.14907

Sparse Solution Trade-offs in GMP DPD: A Least Squares Thresholding Approach

Power amplifiers (PAs) in satellite communication systems introduce nonlinear distortion, degrading spectral fidelity. Digital pre-distortion linearizes the PA response, but full-complexity solutions are prohibitive under strict size, weight, and power (SWaP) constraints. We propose the use of Least Squares Thresholding (LST) and compare it against Orthogonal Matching Pursuit (OMP) and Matching Pursuit. LST achieves a 2.77x complexity reduction while maintaining near-identical linearization performance to OMP.


[8] 2606.14910

Moving Target SAR Imaging Using Planar Arrays And Multidimensional Chinese Remainder Theorem (MD-CRT)--Part I: A General Framework

In this two-part paper, we investigate synthetic aperture radar (SAR) moving target imaging using planar antenna arrays. For a target moving over a three-dimensional terrain, its accurate localization requires the joint estimation of the motion-induced cross-range shift and the target height. In Part I of this two-part paper, starting from the planar array imaging geometry and the corresponding signal model, we show that these two quantities can be unified into a two-dimensional parameter vector and represented, after two-dimensional discrete Fourier transform (2D-DFT) processing across the planar array, through a natural vector remainder formulation. We first develop a general 2D-DFT matrix modulus framework and show that, in the two-dimensional setting, the associated 2D-DFT matrix modulus affects the propagation of vector remainder errors. Under a fixed array geometry and antenna number constraint, we derive an optimal construction of this matrix modulus and adopt it in the subsequent analysis. Under this construction, a single planar array provides only a folded estimate when the true parameter vector lies outside its unambiguous range. To resolve this ambiguity, we develop a multi-subarray framework in which multiple planar subarrays generate multiple vector remainders with different matrix moduli, and the desired parameter vector is recovered through the multidimensional Chinese remainder theorem (MD-CRT). To account for practical errors introduced by 2D-DFT quantization and additive noise, we further introduce an approximate 2D-DFT peak model for non-integer frequency vectors, incorporate robust MD-CRT, and establish sufficient conditions together with explicit reconstruction error bounds for both noiseless and noisy settings. Numerical results verify that the proposed multi-subarray framework enlarges the unambiguous range compared with a single planar array.
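The ambiguity-resolution idea can be illustrated with the classical scalar Chinese remainder theorem. This is only a one-dimensional, noise-free analogue: the paper's MD-CRT works with vector remainders and matrix moduli and adds a robust variant, none of which is reproduced here. The function name and the numbers are hypothetical.

```python
from math import gcd


def crt(remainders, moduli):
    """Recover x in [0, prod(moduli)) from x mod m_i, for pairwise coprime m_i."""
    x, m = 0, 1
    for r_i, m_i in zip(remainders, moduli):
        if gcd(m, m_i) != 1:
            raise ValueError("moduli must be pairwise coprime")
        # Find t such that x + m * t is congruent to r_i modulo m_i.
        t = ((r_i - x) * pow(m, -1, m_i)) % m_i  # modular inverse needs Python 3.8+
        x += m * t
        m *= m_i
    return x


# Each hypothetical "subarray" only sees the true value folded into its own range.
true_value = 23
moduli = [5, 7]
folded = [true_value % m for m in moduli]  # what each subarray reports
recovered = crt(folded, moduli)            # unambiguous up to 5 * 7 = 35
print(folded, recovered)
```

Either modulus alone leaves the value ambiguous, while the pair extends the unambiguous range to the product of the moduli.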


[9] 2606.14911

Moving Target SAR Imaging Using Planar Arrays And Multidimensional Chinese Remainder Theorem (MD-CRT)--Part II: Two Subarray Designs

Based on the framework proposed in Part I, Part II of this two-part paper investigates two-subarray designs for moving target SAR imaging using planar antenna arrays and the multidimensional Chinese remainder theorem (MD-CRT). In Part II, we focus on the performance analysis and the detailed designs of the two planar subarrays. In particular, we study a common-scaling two-subarray design, under which the two subarrays share the same scaling factor in the MD-CRT formulation. Under this design, ambiguity resolution can be performed on a common integer frequency vector. As a result, the same unambiguous range as in the general two-subarray framework in Part I is preserved, while the sufficient conditions for robust recovery become weaker and the corresponding reconstruction error bounds become tighter. Within this common-scaling design, we compare the proposed planar array framework with a conventional separated scheme, in which the motion-induced cross-range shift is recovered by a one-dimensional CRT-based method and the target height is estimated by cross-track interferometric processing. Under the same platform size and minimum antenna spacing constraints, the proposed planar array framework can realize the common-scaling design, whereas the corresponding one-dimensional non-uniform linear array scheme does not admit such a design. With this design, the planar array framework leads to a weaker sufficient condition for robust recovery and thus performs better in moving target imaging. We also compare several planar array designs under fixed platform size and minimum antenna spacing. The analysis shows that recovery performance depends not only on the number of antennas but also on the array geometry. In particular, non-separable planar array geometries can provide better robustness than separable ones when their antenna numbers are comparable.


[10] 2606.14944

Prototype-Aware Fundamental Electromagnetic Limits on Wavefront Synthesis with Programmable Metasurfaces

Wavefront synthesis is a central objective in many applications of programmable metasurfaces (PMs), ranging from electromagnetic holography and computational imaging to massive backscatter communications. Yet, fundamental limits on the ability of a given real-world PM prototype to synthesize a desired output wavefront remain largely unknown. Here, we derive prototype-aware and electromagnetically consistent bounds on target-wavefront synthesis in reconfigurable MIMO wave systems whose programmability stems from tunable lumped elements. Our approach combines multiport network theory (MNT), experimentally estimated proxy MNT parameters, and semidefinite relaxation. We account for relevant practical aspects of typical real-world PMs, such as mutual coupling, binary programmability, and lossy tunable loads. We derive bounds on strength-agnostic wavefront-synthesis fidelity, shape-agnostic target-mode strength, and the strength-fidelity Pareto frontier using two complementary threshold sweeps. We evaluate these bounds for four experimental MIMO systems whose transfer functions are parametrized by a reconfigurable intelligent surface (RIS), involving up to 100 1-bit-programmable elements and radio environments ranging from rich scattering to free space. Our bounds yield practical insights such as the identification of unattainable performance regions and the close-to-optimality certification of certain optimization outcomes. Comparisons with feasible discrete-optimization benchmarks show that the bounds can often be closely approached in practice, indicating tightness. While demonstrated with a RIS prototype, our methodology applies broadly to lumped-element-reconfigurable wave systems, including dynamic metasurface antennas. Altogether, this work contributes to the development of a prototype-aware electromagnetic information theory for reconfigurable wave systems.


[11] 2606.14979

Robust Sampling-Based Covariance Steering for Aerocapture Guidance

Aerocapture is a maneuver where a spacecraft dives through the atmosphere of a planet or moon to reduce its velocity and prepare for orbital insertion. Aerocapture allows for higher cruise velocities and reduces fuel consumption, decreasing transit time and increasing payload mass. However, uncertainties in the atmospheric entry state and atmospheric density increase the risk of aerocapture. Dynamic nonlinearities and nonlinearities caused by the state-dependence of the atmospheric density pose additional challenges. This work develops a robust sampling-based covariance steering algorithm designed for aerocapture guidance. Our proposed algorithm leverages sampled nonlinear system trajectories to improve evaluation of the delta-V required for aerocapture and address nonlinearities caused by the aerocapture dynamics and atmospheric disturbances. We perform Monte Carlo simulations with dispersed entry and atmospheric conditions on aerocapture scenarios at Mars and Uranus and demonstrate a 5-15% reduction in the 99th-percentile, 99.7th-percentile, and worst-case delta-V required for aerocapture when compared against a state-of-the-art covariance steering algorithm.


[12] 2606.14993

A state-of-charge based formulation for storage participation in electricity markets: Technical Reference

We consider a storage device and develop a basic market design that represents the salient characteristics of storage such as state-of-charge (SOC) and round-trip efficiency. The contribution of this paper is a market design that reflects these technical characteristics, does not require bids and offers by the storage owner within the market horizon, but does require an end-of-horizon bid/offer for deviating the end-of-horizon SOC from the start-of-horizon SOC. Small examples are used to illustrate the market design, and large-scale implementation is considered. Several extensions are sketched in the appendices.


[13] 2606.15000

Polyp-D2ATL: Deep Domain-Adaptive Transfer Learning for Colorectal Polyp Classification under Label Distribution Shift

Early and highly accurate prediction of colorectal polyps, an important sign of one of the most dangerous types of cancer, can save lives. Despite advances in colorectal polyp classification, it remains challenging to build an automated polyp prediction system that can diagnose difficult-to-predict polyps with varied features in real scenarios while handling imbalanced data, label distribution shift, and cross-modality generalization. In this study, we propose Polyp-D2ATL, a novel framework with a dedicated training strategy that mitigates these limitations and effectively predicts the polyp classes of the NICE classification. Our extensive experiments on the PICCOLO validation and test sets demonstrate that Polyp-D2ATL significantly outperforms existing state-of-the-art models across various reliable metrics, achieving an accuracy of 82.38%, a Macro-F1 of 77.49%, and a specificity of 87.47% on the validation set, alongside consistent improvements on the held-out test set, which demonstrates the generalization capacity and clinical applicability of the proposed approach.


[14] 2606.15004

CREST: Deployment-Realistic Hardware-in-the-Loop NAS for Embedded Sensing Systems

Deploying neural networks on low-power microcontrollers (MCUs) requires selecting model architectures under tight memory, latency, and energy constraints. Existing workflows often simplify this process along one or more axes: static proxy costs such as FLOPs or parameters, treating one MCU as representative, and continuous-inference tests instead of deployed sensing schedules. These assumptions can mis-rank Pareto-front candidates, miss infeasible deployments, and obscure schedule-dependent energy. We present CREST (Cross-platform Runtime Evaluation and Search Tool), a deployment-realistic hardware-in-the-loop (HIL) neural architecture search (NAS) framework for MCU sensing systems. CREST keeps the optimizer, HIL measurement boundary, logging, and replay workflow fixed while exposing workload, model family, target backend, schedule, quantization, and scoring policy as configurable axes. This makes deployment effects experimentally separable within one reusable workflow. We evaluate CREST on inertial odometry and audio classification across three Arm Cortex-M targets. For inertial odometry, measured-energy HIL search reduces median per-inference energy by 41.7% versus FLOPs-based selection and 40.8% versus memory-traffic-based selection at similar error. FLOPs-based selection also chooses infeasible deployments on memory-constrained targets. On the STM32 N657 target, continuous-inference and duty-cycled searches produce different Pareto frontiers. For audio classification, the same application-level policy selects different DS-CNN architectures on different boards, and cross-board replay changes deployment cost substantially. Overall, CREST shows that deployment-realistic MCU NAS must jointly optimize model architecture, target platform, runtime schedule, and deployment policy rather than relying only on static proxy costs or continuous-inference measurements.
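The claim that proxy costs can mis-rank Pareto-front candidates can be made concrete with a small sketch. The candidate names and all numbers below are hypothetical and are not taken from CREST; the code only shows how the same models yield different Pareto fronts when the cost axis changes from a FLOPs proxy to a measured energy.

```python
def pareto_front(candidates):
    """Names of candidates not dominated by any other (minimizing error and cost)."""
    front = []
    for name, err, cost in candidates:
        dominated = any(
            (e2 <= err and c2 <= cost) and (e2 < err or c2 < cost)
            for _, e2, c2 in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical models: (name, error, FLOPs proxy, measured energy per inference).
models = [
    ("A", 0.10, 10.0, 5.0),
    ("B", 0.12, 8.0, 6.0),
    ("C", 0.15, 6.0, 3.0),
]

proxy_front = pareto_front([(n, err, flops) for n, err, flops, _ in models])
measured_front = pareto_front([(n, err, energy) for n, err, _, energy in models])
print(proxy_front, measured_front)
```

Here model B looks like a reasonable trade-off under the proxy but is dominated once measured energy is used, which is the kind of re-ranking the abstract describes.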


[15] 2606.15006

Symbol Error Analysis of Linear Receivers in Terahertz Channels under Channel-Noise Dependence

This paper develops a comprehensive framework for the performance analysis of linear detectors, namely zero-forcing (ZF) and minimum mean-square error (MMSE), under diverse terahertz (THz) channel conditions. Three fading models are considered: Rayleigh fading, the $\alpha$-$\mu$ distribution for indoor THz environments, and the mixture-gamma (MG) distribution for outdoor THz scenarios. Semi-analytical, approximate, and asymptotic expressions for the symbol error rate (SER) are derived, explicitly incorporating the correlation between the channel and the additive noise arising from hardware impairments. This correlation is characterized using both statistical approaches and copula-based methods to effectively capture complex dependency structures. The theoretical findings are validated through simulations, demonstrating strong agreement with the derived expressions and confirming the accuracy and robustness of the proposed framework. The results demonstrate the significant impact of channel-noise dependence on THz-band receiver performance and verify the expected performance degradation of biased MMSE receivers in point-to-point links employing higher-order quadrature amplitude modulation. Specifically, at a target SER of $10^{-3}$, a 70% correlation results in approximately a 6.5 dB degradation in the effective signal-to-noise ratio, with mismatched MMSE detection incurring an additional 1 dB loss compared to ZF. Nonetheless, MMSE offers enhanced numerical stability under severe channel fading conditions, where channel inversion causes noise amplification.


[16] 2606.15011

Interpretable and Frugal Learning Systems Employing Multiresolution Pyramids and Volterra Kernels

Deep learning models are widely used to process multidimensional signals such as time series, images, and volumetric medical images, but their learned representations often lack explicit signal structure and are difficult to inspect. This thesis develops model-based, signal-theoretic learning systems guided by data and task objectives. It combines multiresolution analysis, wavelets and filter banks, multirate representations, nonlinear Volterra systems, and neural computation graphs. Scale, directional geometry, memory, and nonlinear input-output interactions are represented as differentiable operator modules trainable by backpropagation. The design keeps intermediate variables tied to kernels, subbands, recursions, and transform-domain coefficients rather than only to opaque feature channels. The thesis formulates fast GPU-compatible D-dimensional convolution layers, multirate sampling layers, Volterra-kernel layers in natural and wavelet coefficient domains, rational polynomial cascade heads, stability-constrained multidimensional IIR filters, wavelet banks, and digital shearlet layers with learnable gains. These modules are composed into task-specific architectures for inverse modeling, classification, and segmentation across atmospheric, audio, texture, and medical-imaging problems. In microwave radiometric inversion, InVeRt retrieves vertical temperature and humidity profiles from microwave brightness temperature observations using learnable Volterra kernels in wavelet bases. Multiresolution filter-bank encoders with Volterra heads are used for efficient classification. WaveletViT and ShearViT serve as subband transformer blocks for WaveNETR and ShearNETR, direction-sensitive segmenters for image and MRI segmentation. MRILong deploys trained 3D T1-weighted brain MRI segmenter checkpoints for automatic segmentation and longitudinal analysis of ischemic stroke MRI volumes.


[17] 2606.15094

Adaptive Deep Koopman Operator for Vehicle Dynamics Modeling: A Physics-Informed and Tire-Force-Driven Approach

Accurate and adaptive modeling of vehicle dynamics is paramount for the safety of autonomous driving systems, particularly under extreme maneuvers and time-varying parameters. While Deep Koopman operator theory offers a promising global linearization framework, its online application faces a theoretical bottleneck: the high-dimensional lifted state space inherently induces a rank-deficient problem, rendering traditional recursive least squares based updates numerically unstable. To address this, we propose a novel tire-force-driven modeling framework with guaranteed online stability. First, an offline Deep Koopman model is constructed by embedding 7DOF dynamic equilibrium constraints into the learning objective, ensuring the structural fidelity and physical interpretability of the lifted manifold. Second, we theoretically reformulate the operator update in the rank-deficient space as a minimum-norm solution problem. A Physics-Informed Variable Step-Size Normalized Least Mean Squares (PI-VSS-NLMS) algorithm is proposed, which leverages the projection property of NLMS to act as a stable pseudo-inverse solver while incorporating an anchoring mechanism to suppress parameter drift. Extensive simulations on CarSim and Hardware-in-the-Loop validation on dSPACE MicroAutobox III confirm the superiority of the proposed algorithm. It achieves robust prediction accuracy under unseen excitations while guaranteeing real-time feasibility with an average execution time of 0.421 ms, thus bridging the gap between theoretical models and practical deployment.
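The projection property mentioned above can be seen in a plain NLMS update. The sketch below is the textbook NLMS step only; the variable step size, the physics-informed terms, and the anchoring mechanism of PI-VSS-NLMS are not reproduced, and the function name and data are hypothetical.

```python
def nlms_step(w, x, d, mu=1.0, eps=1e-12):
    """One NLMS update: w <- w + mu * e * x / (eps + x.x), with e = d - w.x.

    With mu = 1 and eps close to zero, the updated weights reproduce the
    target d exactly on the current regressor x, and the correction is
    parallel to x, i.e. the smallest change that achieves this.
    """
    e = d - sum(wi * xi for wi, xi in zip(w, x))
    norm = eps + sum(xi * xi for xi in x)
    w_new = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
    return w_new, e


# Underdetermined case: three weights, a single scalar measurement.
w0 = [0.0, 0.0, 0.0]
x = [1.0, 2.0, 2.0]
d = 9.0
w1, prior_error = nlms_step(w0, x, d)
posterior_error = d - sum(wi * xi for wi, xi in zip(w1, x))
print(w1, prior_error, posterior_error)
```

Because the update never requires inverting a covariance matrix, it stays well defined when the regressors do not span the full lifted space, which is the rank-deficient situation the abstract refers to.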


[18] 2606.15105

Optimal Ground-to-Air Interception with Time-Varying Acceleration Bounds

This paper proposes novel optimal-control-based guidance laws for ground-to-air missiles with time-varying acceleration bounds. In such engagements, as the missile climbs in altitude, its acceleration bound decreases, which may lead to acceleration saturation and significant miss distances if not explicitly accounted for. The proposed guidance laws incorporate hard acceleration command constraints directly into a linear-quadratic optimal-control framework, in contrast to conventional unbounded or softly constrained approaches. Analytically based guidance laws are developed for linear zero-order and first-order strictly proper missile dynamics with arbitrary-order linear target dynamics. Unlike the constant hard-bound case with minimum-phase missile dynamics, time-varying acceleration command bounds permit an initial unsaturated interval in which the proposed guidance laws can anticipate future saturation and reshape the acceleration profile accordingly. This enables earlier maneuvers when the missile possesses greater low-altitude maneuverability, fundamentally altering the structure of the optimal solution. The proposed approach is evaluated in nonlinear simulations and compared with equivalent unbounded and softly constrained optimal guidance laws. The results demonstrate substantially improved interception performance under saturation, reduced tuning requirements compared to softly constrained guidance laws, and enhanced capability in challenging engagement scenarios.


[19] 2606.15114

Vertical Sub-THz Channel Characterization: Sounder Implementation and Initial Measurements

We present a measurement-based characterization of indoor vertical ceiling-to-ground sub-THz channels in the 136-144 GHz band, motivated by ceiling-mounted radio-unit deployments for future distributed indoor networks. The measurements are performed using a vector network analyzer (VNA)-based channel sounder with a mechanically scanned planar virtual antenna array (VAA) at the receiver, enabling single-input single-output (SISO), small-array single-input multiple-output (SIMO), and large-array SIMO measurements in three indoor environments: an office, a laboratory, and a ventilation room. The small-array and large-array SIMO measurements synthesize 2 × 2 cm and 30 × 1 cm uniform rectangular arrays (URAs), respectively. The results show that the vertical links are generally dominated by a strong Line-of-Sight (LOS) component close to the ceiling-to-ground direction, but with clear environmental differences. The office and laboratory exhibit relatively limited delay dispersion, whereas the ventilation room shows stronger delayed multipath due to its corrugated metallic ceiling and surrounding metallic structures. The measured root mean square (RMS) delay spreads are 0.55-1.74 ns for the small-array measurements and 0.44-2.57 ns for the large-array measurements, smaller than those reported in several horizontal indoor sub-THz measurement campaigns at similar frequencies. However, the channel is not purely free-space. Repeatable second-order reflections involving the receiver table, ceiling, transmitter structure, and ceiling-mounted objects are observed in all environments. The large-array measurements further reveal spatial non-stationarity along the 30 cm aperture, with several multipath components visible only over limited parts of the array. These results show that ceiling materials, obstructions, and aperture-dependent variations matter in vertical sub-THz channel modeling.
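For readers unfamiliar with the metric, the RMS delay spread is the power-weighted standard deviation of the path delays in a power delay profile. The sketch below uses the standard definition; the function name and the two-path profile are hypothetical and are not measurement data from the paper.

```python
from math import sqrt


def rms_delay_spread(delays_ns, powers_lin):
    """Power-weighted standard deviation of delay (powers in linear units)."""
    total = sum(powers_lin)
    mean = sum(p * t for t, p in zip(delays_ns, powers_lin)) / total
    second = sum(p * t * t for t, p in zip(delays_ns, powers_lin)) / total
    # Guard against tiny negative values caused by floating-point rounding.
    return sqrt(max(second - mean * mean, 0.0))


# Hypothetical profile: two equal-power paths arriving 2 ns apart.
spread = rms_delay_spread([0.0, 2.0], [1.0, 1.0])
print(spread)
```

A profile dominated by a single strong LOS path gives a small value, while strong delayed multipath, as reported for the ventilation room, increases it.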


[20] 2606.15135

Differentially Private Consensus for Time-Delay Multi-agent Systems

This paper is concerned with the differentially private consensus problem for discrete-time multi-agent systems with communication delays. The purpose of the paper is to achieve differentially private consensus for such systems while protecting the entire delayed initial histories of all agents. A novel adjacency relation for delayed histories is introduced, and a Laplace-noise-based privacy mechanism is developed, where the noise variance is allowed to vary with time and even increase. By using the difference resolvent function method, decay estimates for the fundamental solutions of the delayed difference equations are derived. Based on these estimates and a backstepping technique, mean square weak consensus, mean square strong consensus, and almost sure strong consensus are established. The estimates for the fundamental solutions are also used to derive an explicit sensitivity bound. Furthermore, a constructive parameter design is provided to achieve a prescribed infinite-horizon $\epsilon^\star$-differential privacy level. Numerical simulations illustrate the theoretical results.
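A generic Laplace mechanism with a time-varying noise scale can be sketched as follows. This is not the paper's mechanism or its sensitivity bound: the decaying sensitivity sequence, the growing noise scale, and the function names are all hypothetical, and the budget uses only basic sequential composition (the sum of sensitivity over scale).

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample: difference of two i.i.d. exponentials with rate 1/scale."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privacy_budget(sensitivities, scales):
    """Total epsilon under basic composition: sum over k of sensitivity_k / scale_k."""
    return sum(s / b for s, b in zip(sensitivities, scales))


rng = random.Random(0)
horizon = 50
# Hypothetical sequences: sensitivity decays geometrically, noise scale grows slowly.
sensitivities = [0.5 ** k for k in range(horizon)]
scales = [1.05 ** k for k in range(horizon)]

true_state = 1.0
noisy_messages = [true_state + laplace_noise(b, rng) for b in scales]
epsilon = privacy_budget(sensitivities, scales)
print(f"epsilon over {horizon} steps: {epsilon:.4f}")
```

The budget stays finite over a long horizon only if the per-step ratios are summable, which is why a decay estimate for the sensitivity matters when the noise is allowed to vary in time.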


[21] 2606.15141

EChO-Agent: Evidence Chain Orchestration Agent for Audio Reasoning

While large audio-language models (LALMs) show promise on audio question answering, they fail to focus on question-relevant audio segments and to provide a clear, checkable reasoning process when dealing with complex audio reasoning. Reinforcement learning and tool-augmented prompting can help models better relate questions to audio but lack a reliable way to understand, integrate, and self-verify audio segments. To address this gap, we present EChO-Agent, a modular agent framework that reformulates complex audio QA as a workflow of planning, tool execution, evidence integration, and answer verification. Experiments on the MMAR benchmark show that EChO-Agent improves both accuracy and rubric scores over the baseline, and ablation studies show that evidence integration is the key factor.


[22] 2606.15147

On the Feasibility of Human Presence Detection Using Ceiling-Mounted Sub-THz Channel Sounding: Conference Room Measurement

This paper presents a measurement-based investigation on the feasibility of human presence detection using a ceiling-mounted sub-THz channel sounder operating from 134 to 146 GHz. Wideband channel measurements were conducted in an indoor conference room under empty-room, human-present, and water-filled mannequin scenarios across five spatial positions. The measurements were performed using a vector network analyzer combined with sub-THz frequency extenders. Two antenna beamwidth configurations were used, one with a highly directive horn antenna on the transmitter side and one with a less directive, open-waveguide transmitter. The measured channel responses were transformed into calibrated power delay profiles and analyzed using normalized channel variation metrics in the delay domain. The results show that human detection is strongly dependent on target position relative to the ceiling-mounted transmitter and receiver as well as on antenna beamwidth. Furthermore, repeated empty-room measurements reveal that small environmental changes, such as slight furniture displacement, introduce non-negligible channel variations that must be considered when evaluating detection performance. In the wide-beam open-waveguide configuration, the human-present measurements produced lower values of the delay-domain variation metric than the repeated empty-room baseline, whereas the water-filled mannequin produced values at or above this baseline across all positions. With the directive transmitter, the human response exceeded the baseline significantly but only at favorable positions, especially P1 and P2, showing that the sensing response remains spatially selective. These findings provide experimental insight into the capabilities and limitations of ceiling-mounted sub-THz sensing for future integrated sensing and communication systems.


[23] 2606.15187

VoxWatermark: A Large-Scale Benchmark for Audio Watermark Detection under Perturbations

With the rapid deployment of speech generation systems in open environments, providing verifiable source attribution and copyright accountability for audio content has become critical. A gap in current research is the lack of a unified benchmark that systematically compares different watermark injection methods under realistic distribution shifts. To address this, we build VoxWatermark by applying 10 watermarking methods (4 neural and 6 traditional) with unified injection and annotation on multilingual, multi-source corpora, and introducing no-box, black-box, and white-box perturbations to simulate real recording and transmission conditions. Based on this benchmark, we propose AudioWMD as a robust baseline detector for large-scale, multi-method, cross-distribution settings. Results show that injection-method diversity and distribution shifts affect detection stability, while validating the effectiveness and scalability of AudioWMD. Dataset and code are publicly available.


[24] 2606.15234

Surrogate-Assisted Framework for SI-Compliant Interconnect Design Optimization Using the Earth Mover's Distance

This work presents a deterministic, machine-assisted framework for SI-compliant PCB design based on the Earth Mover's Distance (EMD). In contrast to conventional surrogate-based optimization methods that rely on iterative black-box search procedures, the proposed approach follows an interpretable, sequential evaluation strategy. Neural surrogate models are first used to efficiently predict waveform describing features from topology-dependent design parameters. A decision tree then acts as a physically motivated quality gate that identifies SI-compliant waveforms according to predefined SI criteria. Within the resulting valid solution space, the Earth Mover's Distance is employed as a similarity metric to rank candidate designs according to their proximity to an ideal reference signal. This enables not only the deterministic identification of admissible parameter regions but also a transparent prioritization of physically superior solutions without inverse modeling or stochastic search procedures. The methodology is demonstrated using a large-scale set of simulated DDR3 fly-by waveforms. By combining surrogate prediction, interpretable classification, and EMD-based waveform evaluation, the framework provides an explainable and computationally efficient alternative to conventional optimization strategies for supporting PCB development with AI-based methods.


[25] 2606.15264

DuraMark: Duration-Embedded Watermarking in LLM-based TTS

Large language model (LLM)-based text-to-speech (TTS) models have achieved remarkable voice cloning capabilities, raising concerns about potential deepfake misuse. Speech watermarking mitigates this by embedding traceable information into generated speech. Mainstream watermarking methods operate at the signal level (waveform or spectrogram), rendering the watermark vulnerable to generative attacks (e.g., neural codec and vocoder). To address this, we propose DuraMark, a robust information-level watermarking framework. It utilizes syllable duration editing to achieve watermark embedding. Specifically, DuraMark integrates a duration-controllable LLM-based TTS model to edit syllable durations during synthesis, coupled with a duration extractor to extract these durations for detection. Experiments demonstrate DuraMark's superior robustness against generative attacks, significantly outperforming signal-level baselines. Audio samples are available at this https URL.


[26] 2606.15267

Dynamic Prosody Prediction in LLM-based TTS for Improving Speaker Similarity

Personalized text-to-speech (TTS) aims to clone the target speaker in the synthesized speech, imitating both the voice and speaking style. Current large language model (LLM)-based TTS methods ignore the style-specific prosodic patterns in generated speech, resulting in deficient style learning and thus limiting speaker similarity in synthesized speech. To this end, we investigate the prosody learning conditioned on the synthesized speech, and propose to predict the prosody of the current syllable based on previously predicted speech. Experimental results obtained on three datasets demonstrated the efficacy of the proposed dynamic prosody prediction method in enhancing the prosody learning capability, thereby improving the speaker similarity of the generated speech. Audio samples are available at this https URL.


[27] 2606.15284

CAP: Towards PPG Universal Representation Learning with Patient-level Supervision

Photoplethysmography (PPG) plays a central role in wearable health monitoring and clinical decision support. Yet existing approaches to universal PPG representation learning largely focus on signal-level objectives and often overlook patient-level health context, which limits generalization to complex clinical tasks and heterogeneous cohorts. To address this gap, we construct a large-scale paired PPG-EHR multimodal dataset by distilling fragmented medical histories and clinical records into cohesive, patient-level electronic health records (EHR). Building on this resource, we propose Clinical Anchored Pretraining for PPG (CAP). During pretraining, CAP performs cross-modal contrastive alignment that anchors PPG representations to patient-level clinical semantics, guiding the encoder beyond waveform fitting toward modeling consistency in a patient's overall physiological state. During downstream adaptation, the pretrained PPG encoder provides clinically grounded representations that strengthen inductive bias and improve robustness and transferability. Experiments demonstrate that CAP consistently outperforms strong baselines on four diverse downstream tasks. CAP achieves a particularly large gain on respiratory rate prediction (up to +87.6% relative improvement over the state-of-the-art baseline) and delivers an average relative +26.7% across all tasks. We further enhance the interpretability of our approach through comprehensive analyses, including ablations and multiple complementary visualizations of the learned representations. The code for our experiments is available at: this https URL .


[28] 2606.15311

Hamilton-Jacobi Reachability-Based Safe Reinforcement Learning for Emergency Collision Avoidance

Emergency collision avoidance under extreme driving conditions demands safety-critical control that accounts for both obstacle proximity and vehicle dynamic stability over a future time horizon, yet existing methods often rely on instantaneous or local safety evaluations. This paper proposes a safe reinforcement learning framework guided by a Hamilton-Jacobi (HJ) reachability based motion safety set that provides forward-looking safety supervision for constrained policy optimization. Specifically, a unified signed safety function is formulated by combining geometric collision margins and chassis stability limits, and is then extended through reachability analysis into a finite-horizon motion safety set that characterizes whether safety can be maintained under future vehicle state evolution. To enable practical computation, the motion safety set is approximated from offline extreme driving data, mitigating the computational burden of grid-based HJ solvers. The learned motion safety set is then embedded as a continuous safety cost into a constrained Markov decision process, and a PID-Lagrangian policy optimization scheme is employed to adaptively regulate the Lagrange multiplier for safety constraint enforcement. Simulation and real-vehicle experiments on low-adhesion obstacle-avoidance scenarios demonstrate that the proposed method achieves higher goal-reaching rates, produces smoother avoidance maneuvers, and maintains larger unified safety margins than baseline methods.


[29] 2606.15313

DDPO-VC: Speaker De-Identification via Diffusion Denoising Policy Optimization

A key challenge of speaker de-identification is the balance between privacy and utility. Many utility variables, such as the cognitive health status of the speaker, are correlated with privacy variables such as the speaker identity. This violates the independence assumption of disentanglement-based approaches, causing leakage of private information and loss of information useful for downstream tasks. To tackle this challenge, we propose a general framework, DDPO-VC, for speaker de-identification through reinforcement learning-based post-training with diffusion models. Learning from reward signals combining knowledge from privacy-focused and utility-focused teachers, our method outperforms various strong de-identification methods in both privacy preservation and cognitive utility on two commonly used dementia speech benchmarks. Code and a demo are available at this https URL and this https URL.


[30] 2606.15343

Generalized likelihood ratio test for magnetic anomaly detection: a geometrical approach

State-of-the-art approaches to magnetic anomaly detection rely on the generalized likelihood ratio test (GLRT). These approaches are based on the formulation of a parametric model of the source to be detected, expressed in a suitable functional basis. One of the primary objectives of this study is to demonstrate that, for a given measurement configuration, the signal is constrained to evolve within a restricted subset of the space generated by these functional bases. The parametric representation of the signal is identified as a semi-algebraic space which, for the dipole model used in this article, turns out to be a cone outside of which the estimated signal does not satisfy the physical equations. Thus, a second objective is to exploit this property to constrain the signal parameters in the GLRT to belong to the semi-algebraic space, in order to improve detection performance. The performance of the proposed algorithm is compared with that of conventional approaches; numerical simulations show that the proposed approach not only outperforms state-of-the-art methods but can even provide results close to those of the clear-seeing (optimal) receiver.


[31] 2606.15352

Chroma-gated, differentiable OKLCH interpolation: Continuous Oklab fallback for color-cast reduction

OKLCH -- the cylindrical (lightness, chroma, hue) form of Ottosson's Oklab color space -- is the interpolation space recommended by CSS Color 4 for gradients and color-mix(), and it is now broadly deployed. Its polar parameterization, however, casts color near the neutral axis in two ways: (1) an inter-hue detour between two chromatic endpoints that sweeps through an unintended hue (blue to yellow visibly passing through green), and (2) an off-line bow when one endpoint is achromatic. Existing remedies are uniformly two-valued -- a threshold switch that fires only at an achromatic endpoint -- so they address only (2); on chromatic pairs every one of them reduces to raw OKLCH, leaving the (1) inter-hue cast untreated. We introduce Continuous Oklab fallback (COFb), a one-parameter, differentiable chroma gate $w(C)=C^n/(C^n+\sigma^n)$ that continuously blends the OKLCH path toward the linear Oklab path as chroma falls. A single gate reduces the (1) cast that the two-valued family leaves untreated and unifies the handling of (1) and (2) without any endpoint test. We characterize a cast-hue trade-off frontier, adopt a default ($n=1$, the rational Michaelis-Menten form; $\sigma\approx0.19$ for a typical sRGB palette, from a normalization-independent cast-half criterion), and verify the gate's properties symbolically. At the default, COFb halves the inter-hue path detour (mean lateral deviation -49.5%, chroma-weighted hue excursion -35.5%). We also state the method's limits: on (2) alone the two-valued switch remains better, and like any Cartesian blend COFb does not preserve chroma. In deployment, COFb runs entirely in plain Oklab (a,b) to sRGB, so it serves as a fallback that delivers the same cast-reduced gradients where modern CSS color interpolation (color-mix(in oklch) and the like) is unavailable -- older engines, image and video pipelines, or GPU shaders.
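The chroma gate stated in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `chroma_gate` and `blend` are hypothetical, and the abstract does not say which chroma value feeds the gate or exactly how the two paths are mixed, so the linear mix below is an assumption. Only the formula $w(C)=C^n/(C^n+\sigma^n)$ and the defaults $n=1$, $\sigma\approx0.19$ come from the text.

```python
def chroma_gate(chroma, sigma=0.19, n=1.0):
    """Gate w(C) = C^n / (C^n + sigma^n).

    Returns 0 on the neutral axis, 0.5 at C == sigma, and approaches 1
    for large chroma. Defaults follow the abstract (n = 1, sigma ~ 0.19).
    """
    if chroma < 0 or sigma <= 0:
        raise ValueError("chroma must be >= 0 and sigma must be > 0")
    c_n = chroma ** n
    return c_n / (c_n + sigma ** n)


def blend(polar_ab, linear_ab, w):
    """Assumed mix: weight w on the OKLCH-path point, 1 - w on the Oklab-path point.

    Both points are (a, b) pairs in Oklab; low chroma (small w) pulls the
    result toward the straight-line Oklab path.
    """
    return tuple(w * p + (1.0 - w) * q for p, q in zip(polar_ab, linear_ab))


if __name__ == "__main__":
    for c in (0.0, 0.05, 0.19, 0.30):
        print(f"C = {c:.2f}  w = {chroma_gate(c):.3f}")
```

With $n=1$ the gate is the Michaelis-Menten form named in the abstract, so $\sigma$ is simply the chroma at which the two paths are weighted equally.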


[32] 2606.15366

Robust Conformal CBF and CLF Controllers via Iterative Policy Updates

Conformal prediction (CP) has been used to obtain probabilistic bounds on the error between a learned dynamics model and the true but unknown system. Such CP bounds can then be embedded into robust control Lyapunov function (CLF) and control barrier function (CBF) frameworks. However, such an approach does not retain stability/safety guarantees because of the distribution shift between the closed-loop trajectory distribution under the deployed CLF/CBF policy and the trajectory distribution from which the CP bound and its guarantees were derived. To address this issue, we propose an episodic framework that iteratively updates the robust conformal CLF/CBF policy while maintaining stability/safety guarantees across episodes. We achieve this by (1) using adversarially robust conformal prediction, and (2) quantifying a distribution shift budget that allows us to control how much the model error can increase across policy updates. This distribution shift budget is derived via a closed-loop trajectory sensitivity analysis, yielding an implicit and an explicit update rule for the CP bound. We analyze convergence of our algorithm, which we demonstrate on three case studies. To the best of our knowledge, these are the first results that provide stability/safety guarantees for robust conformal CBF/CLF policies.


[33] 2606.15408

Data Center Life Cycle Co-Design Optimization

Liquid cooled supercomputers dissipate tens of megawatts of waste heat through cooling plants organized as parallel subloops that serve coolant distribution units. The number of subloops and the assignment of units to them are design decisions fixed at construction, yet they have not been systematically optimized for facilities at this scale. As electricity grids decarbonize, embodied carbon becomes a larger share of facility life cycle emissions and the cost of an unnecessary subloop becomes harder to justify. We present a framework that integrates operational energy from a validated control optimizer based on sequential least squares programming, embodied carbon from a bill of materials, and expected unplanned downtime from a per subloop reliability model. The framework is applied to the Frontier supercomputer, evaluating all 611 ways of partitioning its 25 coolant distribution units into two through six subloops. The life cycle cost and carbon optimum is found at two subloops holding 14 and 11 units, achieving 3,320.7 tonnes of carbon dioxide equivalent and $3.99 million over a seven year horizon, a saving of 50.2 tonnes and $100,000 compared to the built four subloop configuration. The optimum remains on the Pareto front in all 15 scenarios of a one at a time sensitivity sweep. A semi-analytical decision rule generalizes the result, predicting four subloops for Aurora, two for El Capitan, and one for LUMI. When reliability is treated as a hard constraint set by operations policy, the four subloop Frontier deployment is consistent with the constrained optimum.
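The figure of 611 candidate designs quoted above is consistent with counting integer partitions of 25 into two through six parts, i.e. treating the coolant distribution units as interchangeable and recording only the subloop sizes. That reading is an assumption, since the abstract does not state the counting convention; the short Python check below (with a hypothetical helper name) shows the arithmetic under it.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def partitions_into_parts(n, k):
    """Number of ways to write n as a sum of exactly k positive integers, order ignored."""
    if k == 0:
        return 1 if n == 0 else 0
    if n < k:
        return 0
    # Either the smallest part is 1 (drop it), or every part is >= 2
    # (subtract 1 from each of the k parts).
    return partitions_into_parts(n - 1, k - 1) + partitions_into_parts(n - k, k)


# Subloop counts 2 through 6 for 25 interchangeable units.
per_k = {k: partitions_into_parts(25, k) for k in range(2, 7)}
total = sum(per_k.values())

if __name__ == "__main__":
    print(per_k, total)
```

Under this convention the reported optimum, 14 and 11 units, is one of the 12 two-subloop size splits.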


[34] 2606.15418

Minimum settling-time PI control of pure delay processes under a hard non-overshoot constraint: exact boundary-contact characterization and the role of the MID point

We solve, exactly, the problem of minimum settling-time PI control of a pure delay process K e^{-Ls} under the hard time-domain constraint of zero overshoot, y(t) <= 1 for all t. The closed loop is a neutral delay system whose step response is piecewise polynomial on the delay segments, with geometrically decaying jump discontinuities at the segment boundaries t = kL. The constrained optimum is characterized by an equioscillation-type contact structure whose active contacts sit at echo boundaries: kink maxima grazing the setpoint, jumps landing on the settling-band edge, and boundary troughs anchored to it. The number of contact equations equals the number of gains, so the optimum is exactly computable for every band delta. In a closed-form regime, delta in [(3-2 sqrt2)/4, (3-2 sqrt2)/2] approx [4.29%, 8.58%], the optimal gains are independent of delta: K Kp = 1 - sqrt2/2, K Ki L = sqrt2/2, and the optimal settling time is Ts*(delta) = (4 - sqrt2 - 2 sqrt(delta)) L. Outside this window the optimum solves an explicit two-equation polynomial system per regime, and Ts*(delta) is a staircase with exact flats at integer multiples of L from jump-landing pinning. As delta -> 0 the optimal gains converge to K Kp = e^{-2}, K Ki L = 4 e^{-2}, the generic multiplicity-induced-dominancy (GMID) point of the neutral quasipolynomial. The GMID response satisfies the hard constraint and uniquely maximizes the decay rate; yet at every finite delta the delta-adapted optimum strictly beats the fixed GMID tuning, by about 40% at delta = 2%. The MID point is thus the limit of the optimal gains without ever being the optimal tuning. A numerical extension to first-order-plus-time-delay plants quantifies the speed/robustness trade across Ms in [1.39, 1.76].
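The closed-form regime quoted above is easy to evaluate numerically. The Python sketch below only transcribes the expressions given in the abstract (the window for delta, K Kp, K Ki L, Ts*, and the GMID limit); the function names and the choice to raise an error outside the window are illustrative assumptions, and nothing here reproduces the polynomial systems that apply outside that window.

```python
import math

SQRT2 = math.sqrt(2.0)
# Settling-band window in which the optimal gains do not depend on delta.
DELTA_LO = (3.0 - 2.0 * SQRT2) / 4.0   # about 4.29 %
DELTA_HI = (3.0 - 2.0 * SQRT2) / 2.0   # about 8.58 %


def closed_form_pi(K, L, delta):
    """Return (Kp, Ki, Ts) for the plant K*exp(-L*s) when delta is inside the window."""
    if not (DELTA_LO <= delta <= DELTA_HI):
        raise ValueError("delta lies outside the closed-form window")
    Kp = (1.0 - SQRT2 / 2.0) / K          # from K*Kp = 1 - sqrt(2)/2
    Ki = (SQRT2 / 2.0) / (K * L)          # from K*Ki*L = sqrt(2)/2
    Ts = (4.0 - SQRT2 - 2.0 * math.sqrt(delta)) * L
    return Kp, Ki, Ts


def gmid_pi(K, L):
    """Limit gains as delta -> 0: K*Kp = e^-2 and K*Ki*L = 4*e^-2."""
    return math.exp(-2.0) / K, 4.0 * math.exp(-2.0) / (K * L)


if __name__ == "__main__":
    print(closed_form_pi(K=2.0, L=0.5, delta=0.05))
    print(gmid_pi(K=2.0, L=0.5))
```

Note that inside the window only the settling time varies with delta; the gains stay fixed.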


[35] 2606.15454

Phonetically Explainable Speech Deepfake Detection

Speech deepfake detection is predominantly treated as an opaque classification task where all temporal frames are aggregated equally. This ignores that different phonetic categories carry vastly different amounts of discriminative information. To address this, we propose a phoneme-guided cross-attention framework that transforms detection into an interpretable, phonetically grounded process. We factorize the spoofing posterior $P(\text{spoofed}\mid X, W)$, conditioned on the acoustic representation $X$ and the phonetic posteriorgram $W$. The resulting factorization can be written as $P(\text{spoofed} \mid X, W) = \sum_{i=1}^{M} w_i \cdot P(\text{spoofed} \mid X, Z = z_i)$, where $M$ denotes the number of phonetic classes, $P(\text{spoofed} \mid X, Z = z_i)$ is the spoofing probability for the $i$-th phonetic class $z_i$ conditioned on $X$, and each $w_i$ is the prevalence of phonetic class $z_i$ in the utterance. Our transformer-based architecture instantiates this through a cross-attention block in which phonetic queries selectively probe information in acoustic keys and values, with softmax-normalized pooling supplying explicit phone-presence weights. Unlike prior approaches that rely heavily on post-hoc explainability methods, our framework offers phonetic-explainability-by-design. We evaluate the framework on an LJSpeech-derived corpus, ASVspoof 2019 LA, and ASVspoof 5 Track 1. Per-phone importance rankings reveal that discriminative power concentrates on articulatory categories that generative models struggle to reproduce faithfully. Stops, fricatives, affricates, nasals, and silence-boundary closures rank most discriminative, while periodic vowels and semivowels rank lower. Beyond competitive performance, our model provides structural interpretability, yielding an inspectable per-articulatory category breakdown of the final verdict.
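The factorization in the abstract above is a weighted sum over phonetic classes, which can be illustrated with plain Python. This is a toy sketch: the function names and the three-class example values are hypothetical, and in the paper the weights and per-class probabilities come from a cross-attention model rather than being supplied by hand.

```python
def per_class_contributions(weights, per_class_probs):
    """Return the terms w_i * P(spoofed | X, Z = z_i), one per phonetic class."""
    if len(weights) != len(per_class_probs):
        raise ValueError("weights and per_class_probs must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-6:
        raise ValueError("weights must be non-negative and sum to 1")
    if any(not (0.0 <= p <= 1.0) for p in per_class_probs):
        raise ValueError("per-class probabilities must lie in [0, 1]")
    return [w * p for w, p in zip(weights, per_class_probs)]


def spoof_posterior(weights, per_class_probs):
    """P(spoofed | X, W) as the sum of the per-class contributions."""
    return sum(per_class_contributions(weights, per_class_probs))


if __name__ == "__main__":
    # Hypothetical utterance with three phonetic classes.
    w = [0.5, 0.3, 0.2]
    p = [0.9, 0.2, 0.5]
    print(per_class_contributions(w, p), spoof_posterior(w, p))
```

Because the verdict is a sum of per-class terms, each term can be inspected directly, which is the sense in which the decision is explainable by design.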


[36] 2606.15477

Universal adaptive beamforming: A Bayesian approach

We present a Bayesian universal beamforming framework for adaptive array processing in dynamic underwater acoustic environments with unknown and time-varying propagation geometry. Motivated by ideas from universal prediction and estimation, the proposed approach discretizes the angular domain into a finite set of steering hypotheses and recursively computes posterior probabilities over competing spatial models using observation-dependent likelihood functions. For Gaussian observation models, the posterior update reduces to an exponential-weights recursion driven by hypothesis-dependent beamformer evidence metrics. The resulting framework performs soft spatial inference and adaptive beamforming by continuously redistributing posterior probability across competing steering hypotheses while forming posterior-weighted combinations of branch outputs. The formulation naturally connects to classical adaptive beamformers including matched filtering and minimum mean-square error (MMSE) beamforming. In addition, the framework is extended toward broadband underwater acoustic communication receivers through frequency-domain beamformer synthesis and adaptive equalization. Posterior probabilities are updated according to branch-specific equalization errors, enabling joint spatial-temporal adaptation under multipath propagation, Doppler-induced distortions, and time-varying channel conditions. Experimental results using MACE data demonstrate reliable communication performance with low overhead, low data detection mean-squared error, and zero observed bit errors.
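The exponential-weights recursion mentioned above can be sketched for a generic Gaussian error model. The Python example below is an assumption-laden illustration, not the paper's algorithm: it uses the squared magnitude of a per-branch error as the evidence metric, a single shared noise variance, and hypothetical function names.

```python
import math


def update_posterior(prior, errors, noise_var=1.0):
    """One Bayes step: posterior_i is proportional to prior_i * exp(-|e_i|^2 / (2 * noise_var)).

    prior must contain strictly positive probabilities; errors holds one
    (real or complex) error per steering hypothesis.
    """
    log_post = [math.log(p) - abs(e) ** 2 / (2.0 * noise_var)
                for p, e in zip(prior, errors)]
    peak = max(log_post)                      # subtract the max for numerical stability
    unnorm = [math.exp(lp - peak) for lp in log_post]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def combine(posterior, branch_outputs):
    """Posterior-weighted combination of the branch outputs."""
    return sum(p * y for p, y in zip(posterior, branch_outputs))


if __name__ == "__main__":
    posterior = [1.0 / 3.0] * 3
    for errors in ([2.0, 0.1, 1.0], [1.5, 0.2, 0.9]):
        posterior = update_posterior(posterior, errors)
    print(posterior)
```

Repeated updates concentrate probability on the hypothesis with consistently small error while keeping the others available if the geometry changes.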


[37] 2606.15505

Positive-Real Identification of Sparse Mori-Hamiltonians from Partial Observations

Discovering the governing equations of a physical system from data is a central goal across the sciences, yet in most experiments only a few states are accessible while the rest stay hidden. Existing approaches treat this partial observability as an obstacle to be removed by first reconstructing the hidden state -- a step that is ill-posed under noise and that discards the physical constraints, such as energy conservation, that the true dynamics obey. We show that for conservative (Hamiltonian) systems no reconstruction is needed: projecting the dynamics onto the measured coordinates yields a memory kernel that we prove to be a lossless positive-real rational matrix, whose poles are the hidden natural frequencies and whose positive-semidefinite residues encode the couplings. The governing equation -- and the underlying Hamiltonian -- can therefore be read directly from the autocorrelation of the measured signal, with guarantees of uniqueness and physical passivity, and without neural networks. We validate the approach on linear, nonlinear, and chaotic systems under realistic noise. By recovering interpretable equations of motion that conserve energy by construction from partial measurements, the method offers a common tool for problems spanning mechanics, fluid and plasma physics, and beyond.


[38] 2606.15638

MambAdapter: Lightweight Mamba-Based Adapters for Parameter-Efficient Transfer Learning in Speech and Audio

Fine-tuning Transformer-based foundation models has become the dominant strategy for domain adaptation in audio and speech processing. To reduce the computational and memory costs of this process, parameter-efficient transfer learning (PETL) methods have been widely explored. Meanwhile, Mamba, a recent state-space model, has emerged as a promising alternative to Transformers for sequence modeling. In this work, we present MambAdapter, a parameter-efficient transfer learning approach that integrates Mamba into low-rank bottleneck adapters. Our design combines parameter sharing across adapters with the injection of a lightweight Mamba module, enabling more effective modeling of audio features. We demonstrate that MambAdapter matches or outperforms strong PETL baselines on four audio classification tasks and five speech recognition languages, even when operating under reduced parameter budgets.


[39] 2606.15692

A Surface-based Multimodal Framework for Multitask Analysis in Alzheimer's Disease

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder, and longitudinal analysis is critical for early detection and effective intervention. Developing models capable of multimodal and multitask analysis enables a more comprehensive understanding of AD progression. However, multimodal learning remains challenged by cross-modal misalignment, non-Euclidean surface representations of cortical data, and limited data availability in small-sample clinical settings. In this work, we propose an augmented spherical data-driven multimodal framework for multitask AD analysis. A spherical diffusion model is first trained to generate paired cortical thickness and Tau PET Standardized Uptake Value Ratio (SUVR) data, enabling structurally consistent multimodal augmentation on cortical surfaces while preserving anatomical correspondence. The augmented data are subsequently used to train a contrastive learning model that learns aligned and fused cross-modal representations. This design strengthens multimodal integration and encourages more balanced representation learning. The learned imaging features are further integrated with tabular cognitive assessments and demographic variables, and processed using an in-context learning model to perform both classification and regression tasks without task-specific fine-tuning. Experiments on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset ($n = 802$) demonstrate consistent performance improvements across five diagnostic and longitudinal tasks, outperforming six baseline models.


[40] 2606.15706

Nodal Frequency Stability-Constrained UC & ED for Renewable-Dominated Power Systems

In modern power systems with high shares of renewable, inverter-based resources (IBRs), frequency stability becomes more complex due to the fast dynamics of IBRs and frequency trajectories that vary significantly from bus to bus. In this paper, we present an optimization framework for unit commitment and economic dispatch with endogenous frequency stability constraints at each bus. Two approaches for mitigating excessively low instantaneous frequency values in the event of the largest generator contingency are proposed: 1) by introducing a constraint requiring more thermal generation, and 2) by constraining the maximum power output of the generator that had the largest power output in the incumbent solution. Both approaches proved effective in eliminating dispatch scenarios that resulted in instantaneous frequencies below 58 Hz, while the second approach minimized the difference in production cost values from the non-stability-constrained case. Overall, the results indicate that the proposed optimization framework is a more effective alternative to frequency stability-constrained unit commitment and economic dispatch (UC & ED) than those based on the center-of-inertia (COI) principle.


[41] 2606.15742

Deep Learning-Based Automatic Modulation Classification Using GRU Networks

Automatic modulation classification (AMC) plays a critical role in modern wireless communication systems, particularly in non-cooperative scenarios where prior knowledge of the transmitted signal is unavailable. In this study, a gated recurrent unit (GRU)-based deep learning framework is investigated for the classification of digital modulation schemes by exploiting the temporal characteristics of received signals. The proposed approach operates directly on in-phase and quadrature (I/Q) signal representations and aims to learn discriminative features in a data-driven manner without relying on handcrafted feature extraction. The performance of the proposed model is evaluated for BPSK, QPSK, and 16PSK modulation schemes under additive white Gaussian noise (AWGN) channel conditions across a wide range of signal-to-noise ratio (SNR) levels. The obtained results demonstrate that the GRU-based model achieves reliable classification performance, with overall accuracy improving from 55.3% at -10 dB SNR to 98.5% at 15 dB SNR. In particular, the model exhibits strong performance at moderate and high SNR levels, while maintaining reasonable accuracy even under challenging low SNR conditions. These findings suggest that GRU-based architectures provide a promising and computationally efficient solution for modulation classification tasks. The presented results represent an initial step toward more comprehensive studies, including extensions to fading channel environments, additional modulation schemes, and real-time implementations using hardware platforms.


[42] 2606.15746

Age and Stability Trade-offs in Remote Monitoring Systems

Timely information is important in a wide variety of Internet of Things (IoT) services in which a shared server must manage two competing tasks: (i) processing a queue of jobs, and (ii) generating status updates to a remote monitor. This creates a fundamental trade-off between queue stability and data freshness. In this work, we model this scheduling decision as a Markov Decision Process (MDP) with the objective of minimizing a weighted sum of the average Age of Information (AoI) and the average queue length. We show that the optimal scheduling strategy is a queue-dependent age threshold which is monotonic. The shape of the switching curve differs according to different priority regimes. Finally, we compare the optimal MDP policy against heuristic policies.


[43] 2606.15775

Uncertainty-Aware Haptic Signal Estimation for Reliable and Resource Efficient Tactile Internet

The Tactile Internet aims to enable real-time remote haptic interaction; however, the high sampling rates required for transparency in haptic control often lead to severe congestion in multi-user wireless environments. This paper proposes the Agile AI-empowered Haptic (A2HAP) framework, which integrates VarxHAP, a novel probabilistic neural network for joint force and uncertainty estimation, with an error-resilient controller. By employing a hierarchical gating architecture, the system dynamically adapts transmission thresholds to balance model confidence against reliability targets. Simulation results demonstrate that A2HAP suppresses packet rates by up to 45% during peak traffic and reduces resource block consumption by 25% on average. Consequently, the framework supports a 20% increase in user capacity compared to state-of-the-art methods while maintaining the ultra-reliability required for stable teleoperation.


[44] 2606.15813

AdaTT: Text-Guided Instrument Timbre Transfer with Target-Adaptive Structural Control

This paper addresses timbral ambiguity in instrument timbre transfer under fine-grained structural conditions. We argue this issue stems from instrument-specific expressive details in these conditions, which conflict with the target timbral properties. For example, imposing a violin's pitch-dominant vibrato contours onto a flute, which naturally exhibits loudness-dominant vibrato, impairs timbral fidelity. We propose AdaTT, a target-adaptive system that ensures high timbral fidelity across diverse timbre transfer scenarios within the ControlNet scheme. It selectively scales the frame-wise influence of pitch and loudness controls via text prompts to match the target instrument's identity. We also present a semi-automatic data construction pipeline to teach the model which expressive details to transform or preserve. Results show AdaTT achieves superior timbral fidelity and naturalness while retaining score-level content. Audio samples are available at this https URL.


[45] 2606.15826

Geometrically Constrained Decentralized Independent Vector Analysis for Distributed Microphone Arrays

This paper proposes a geometrically constrained decentralized independent vector analysis (GC-Dec-IVA) method for distributed microphone arrays. The recently proposed Dec-IVA method enables source separation by exchanging only power-related statistics to exploit cross-array information. However, this initial attempt often provides negligible improvement over applying IVA locally at each array, mainly due to the potential permutation inconsistency among arrays and the strong cross-array dependency implied by its source model. To address these limitations, we incorporate direction-of-arrival (DOA) information to derive GC-Dec-IVA, which mitigates permutation mismatch across arrays and enhances source alignment. Furthermore, a new source model is introduced to weaken cross-array dependency, improving robustness against permutation inconsistency in noisy environments. Experiments show that the proposed method improves both the separation performance and cross-array permutation consistency.


[46] 2606.15845

An Optimal Power Management Policy for Hydrogen-based Hybrid Aero Engines

This paper presents a power management policy for a hydrogen-based hybrid aero engine combining a gas turbine and a solid oxide fuel cell (SOFC). Specifically, we first identify a quadratic quasi-steady-state model of the propulsion system and formulate the minimum-fuel optimal control problem as a function of the power split between gas turbine and SOFC that captures the interconnections between the components and accounts for their operational limits. Second, leveraging the Karush-Kuhn-Tucker optimality conditions and partial convexity and monotonicity model properties, we compute the globally optimal steady-state power split for the different phases of the flight in closed form. Finally, we verify this power management policy with a high-fidelity integrated static model across different flight phases, revealing less than 1.5 % normalized root mean square error in power allocation and less than 0.7 % in predicted fuel consumption. Our results show that the optimal power management policy can be translated into a heuristic control law requesting the highest SOFC power that does not exceed its maximum operating temperature, ultimately paving the way for minimal-effort on-board implementations.


[47] 2606.15851

Fast Convergence and Robustness for Two-Layered Forgetting Recursive Least Square under Finite Excitation

Under nonpersistent excitation (non-PE) conditions, conventional methods such as exponential forgetting (EF) or directional forgetting (DF) recursive least squares (RLS) that rely on direct regressor vectors exhibit inherent limitations in terms of stability guarantees for parameter errors, robustness to system changes, and convergence rates. To address these limitations, this study introduces a novel two-layer forgetting RLS (TLF-RLS) identification method based on an augmented regressor matrix constructed using DF, which ensures global exponential stability and enhances robustness under non-PE conditions. However, the convergence rate of the parameter is strongly dependent on the forgetting factor because of the introduction of EF in the outer layer, which causes an estimation windup under non-PE conditions. To address this issue, a novel reconfiguration-based EF (ReEF) algorithm is proposed, which is achieved through variable- and matrix-based forgetting related to the magnitude of the eigenvalues of the current covariance matrix. Theoretical analysis indicates that TLF-RLS with the ReEF algorithm guarantees uniform ultimate boundedness of the condition number under mild assumptions. Consequently, the proposed method resolves the trade-off between fast parameter convergence and robustness in both transient and steady-state responses under changes in system characteristics. Numerical simulations of three aforementioned cases demonstrate the effectiveness of the proposed method.
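
For orientation, the conventional EF-RLS baseline that the abstract contrasts against can be sketched as follows. This is a minimal textbook form, not the proposed TLF-RLS or ReEF algorithm; the function name `ef_rls`, the forgetting factor 0.98, and the initial covariance `p0` are illustrative assumptions. The second run feeds all-zero regressors to show the covariance windup that occurs without excitation.

```python
import numpy as np

def ef_rls(regressors, outputs, forgetting=0.98, p0=1e3):
    """Textbook exponential-forgetting RLS (baseline only, not TLF-RLS)."""
    n = regressors.shape[1]
    theta = np.zeros(n)            # parameter estimate
    P = p0 * np.eye(n)             # covariance matrix
    for phi, y in zip(regressors, outputs):
        err = y - phi @ theta                          # a priori prediction error
        gain = P @ phi / (forgetting + phi @ P @ phi)  # update gain
        theta = theta + gain * err
        P = (P - np.outer(gain, phi @ P)) / forgetting
    return theta, P

rng = np.random.default_rng(0)
theta_true = np.array([1.5, -0.7])

# Case 1: exciting regressors, noise-free data. The estimate converges.
Phi = rng.standard_normal((200, 2))
theta_hat, P_excited = ef_rls(Phi, Phi @ theta_true)

# Case 2: no excitation at all (assumed extreme case). P is divided by the
# forgetting factor at every step and grows without bound: windup.
theta_idle, P_idle = ef_rls(np.zeros((100, 2)), np.zeros(100))
```

In the unexcited run the gain is zero at every step, so the estimate never moves while the covariance grows geometrically. That is the behavior the abstract refers to as estimation windup.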


[48] 2606.15856

Early Anomaly-Onset Detection based on Wigner--Ville Distribution Slice Spectra: A Transmission-Grid Test Case

Operational disturbance monitoring in power networks requires decisions to be made from waveform windows as they arrive, rather than from completed records after the event. This study evaluates full-vector Wigner--Ville Distribution Slice (WVDS) spectra for sequential anomaly-onset detection in high-voltage grid-voltage waveforms. The approach keeps the bilinear midpoint interaction structure of the Wigner--Ville distribution and represents each 128-sample voltage window by a 128-dimensional slice spectrum, avoiding manually selected fault-frequency markers. WVDS is used with a baseline-normalized deviation (BND) score and is compared against the BND of Fast Fourier Transform (FFT-BND), raw-window autoencoders, FFT autoencoders, and WVDS autoencoders under the same thresholding and three-window persistence rule. A synthetic autoencoder--clustering teacher is used to select RTE fault records that start from an initially normal region and then transition to anomalous behavior. On the filtered test set, FFT-BND achieves the highest sensitivity, whereas WVDS-BND provides the lowest false-alarm operating point, reducing record-level pre-onset false alarms to 0.69%. The autoencoder comparison follows the same selectivity pattern: WVDS reconstruction decreases false alarms relative to FFT reconstruction but misses more examples. The results indicate that preserved WVD cross-term information can form a selective representation for online grid-waveform anomaly monitoring when false alarms are costly.


[49] 2606.15942

Stability Analysis in Multi-Constraint Safety Filters for Linear Systems

Multi-constraint safety filters based on control barrier functions for linear systems with affine state constraints yield continuous piecewise-affine closed-loop dynamics and may introduce boundary equilibria and unstable active-set modes. Although they guarantee forward invariance, they can change nominal stability, and it remains unclear when unstable modes cause divergence versus bounded, convergent behavior. This paper develops a geometric framework to separate these cases: leveraging explicit active-set realizations, we show that equilibria associated with nonempty active sets lie on the corresponding constraint faces and that any unstable directions are tangent to those faces due to exponential enforcement of the active constraints. We characterize mode stability via a minimum-phase test, certify divergence under fixed active sets using recession cones, and derive tractable linear-matrix-inequality conditions for global exponential stability or boundedness using Lyapunov and LaSalle arguments.


[50] 2606.15948

Artificial Intelligence for Power-Converter-Rich Electrical Systems: A Review

Power-converter-rich electrical systems, formed by renewable generation, electrified transportation, and inverter-based resources, exhibit strongly nonlinear dynamics, multi-physics design tradeoffs, fast control requirements, and growing reliability and cybersecurity constraints. These characteristics strain workflows that rely only on physics-based modeling, sequential optimization, and rule-based operation. This paper reviews artificial intelligence (AI) for power-converter-rich electrical systems through a life-cycle and deployment-readiness perspective. The literature is organized across converter design, real-time control, system-level operation, and compliance-oriented governance. For design, we examine surrogate modeling, topology and parameter synthesis, EMI/EMC-aware optimization, reliability-oriented design, and knowledge-assisted workflows. For control, we compare supervised learning, reinforcement learning, learning-augmented predictive control, and safety-constrained learning according to their role in closed-loop implementation. For operations, we focus on microgrid coordination, forecasting, distribution-system observability, privacy-preserving coordination, and cyber-resilient operation where converter-interfaced resources shape the operating problem. Across these stages, the review emphasizes deployment-critical gaps, including stability certification, constraint satisfaction, interpretability, extrapolation, data efficiency, sim-to-real transfer, embedded latency, cybersecurity, privacy, and standards alignment. The resulting taxonomy is intended to clarify where AI is already useful as an engineering support tool and where further validation is needed before autonomous or safety-critical deployment.


[51] 2606.15968

Bridging the SEA Gap: An Initial Benchmark for Neural Audio Codec-Synthesized Speech Deepfakes in South-East Asian Languages

Codecfakes (CFs) are a type of speech deepfake generated through Audio Language Models (ALMs), with Neural Audio Codecs (NACs) forming the core mechanism for speech encoding and generation. CFs exhibit distributional characteristics that differ from vocoder-based deepfakes, causing detectors trained on vocoder data to generalize poorly to CF detection. Although this has led to the development of CF detection benchmarks, existing resources are largely confined to English -- and to a limited extent Chinese -- leaving South-East Asian (SEA) languages unexplored. To bridge this gap, we introduce SEA-CF, the first large-scale benchmark for CF detection spanning multiple SEA languages, diverse speaker profiles, and a wide range of NAC architectures. SEA-CF is constructed by synthesizing publicly available real speech corpora. Our experiments show that state-of-the-art (SOTA) CF detectors trained on English-centric datasets fail to generalize to SEA speech due to language-specific phonetic structures, tonal variations, and rich prosodic diversity. We further conduct a comprehensive zero-shot and fine-tuned evaluation of recent SOTA ALMs on SEA-CF. Fine-tuning the ALMs improves performance; however, their scale makes them impractical for real-world application, particularly in low-resource and latency-constrained settings. To address this limitation, we propose a novel small-ALM, GARUDA, tailored for CF detection, which delivers strong performance while remaining lightweight. Extensive evaluations demonstrate that the proposed small-ALM outperforms strong end-to-end and ALM-based baselines, establishing a new, practical direction for robust CF detection in SEA languages and beyond.


[52] 2606.15973

An auscultation location specific study on the relationship between expiratory-to-inspiratory acoustic patterns and spirometric airflow limitation across age and gender in asthmatic patients

Asthma causes expiratory airflow limitation and is clinically assessed using spirometry, which provides the FEV1/FVC ratio representing the proportion of air exhaled in the first second relative to total forced vital capacity. Prior studies suggest that respiratory sounds recorded at posterior sites (Left Lower, Left Upper, Right Upper, Right Lower) reflect regional airflow patterns. In this study, we investigate the relationship between the expiratory-to-inspiratory (E/I) spectral power ratio and FEV1/FVC in 141 participants aged 20-60 years using Spearman correlation across frequency subbands. The 100-200 Hz and 200-400 Hz bands showed significant correlations. Overall, lower posterior sites showed stronger associations; younger adults showed stronger correlations at the Left Lower site, whereas older adults showed stronger correlations at the Left Upper site. Gender-stratified analysis showed stronger Left Lower correlations in males and stronger Left Upper correlations in females.


[53] 2606.16016

SparseCol: A 1320 BTOPS/W Precision-scalable NPU Exploiting Training-free Structured Bit-level Sparsity and Dynamic Dataflow

Bit-serial computation enables sequential processing of data at the bit level, providing several advantages, such as scalable computational precision. This approach has gained significant attention, especially for exploiting bit-level sparsity in AI workloads. While current bit-serial processors leverage bit-level sparsity to eliminate the computation associated with zero bits, they face a fundamental trade-off: either they suffer from low memory-access and computation efficiency caused by irregular patterns of non-zero bits, or they incur substantial area overhead from complex online scheduling mechanisms required to reorganize bit-level data and preserve memory access and computation regularity. Therefore, we present the SparseCol processor, designed to harness extensive bit sparsity while maintaining high hardware utilization across various AI applications, including CNNs, RNNs, and transformers. In contrast to traditional methods, SparseCol exploits structured bit-level sparsity, denoted by bit-column sparsity, without requiring any re-training. Furthermore, SparseCol implements a dynamic dataflow architecture that tackles hardware under-utilization issues commonly found in existing bit-serial solutions. Fabricated in a 16 nm CMOS node, SparseCol delivers 1320 BTOPS/W (BTOPS represents Binary Tera-Operations Per Second, calculated as #W bits x #A bits TOPS) peak efficiency while maintaining accuracy, outperforming SotA sparse processors in terms of efficiency by 6.8x. Comprehensive evaluations on CNN classification tasks and transformer architectures demonstrate system-level efficiencies of 745.02 BTOPS/W and 850.5 BTOPS/W, respectively.
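
The general idea of bit-serial computation with structured skipping can be illustrated in a few lines. This is a generic software sketch under assumptions, not SparseCol's hardware dataflow: the abstract does not define bit-column sparsity precisely, so the code simply skips any bit position that is zero across a whole group of weights. The name `bit_serial_dot` and the unsigned 8-bit format are hypothetical.

```python
def bit_serial_dot(weights, activations, bits=8):
    """Dot product of unsigned integers, one weight bit-plane per step."""
    total = 0
    planes_processed = 0
    for b in range(bits):
        # Bit-plane b holds bit b of every weight in the group.
        plane = [(w >> b) & 1 for w in weights]
        if not any(plane):
            continue  # the whole plane is zero, so the step is skipped
        planes_processed += 1
        partial = sum(a for bit, a in zip(plane, activations) if bit)
        total += partial << b  # weight the partial sum by 2**b
    return total, planes_processed

weights = [4, 12, 8, 0]       # only bit positions 2 and 3 are ever set
activations = [3, 1, 7, 9]
result, planes = bit_serial_dot(weights, activations)
```

Because the skip decision is made per bit position for the whole group, the remaining work stays regular, which is the kind of trade-off between sparsity and regularity the abstract describes.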


[54] 2606.16068

Anisotropic Template Ansätze for Robust Positive Invariance under State-Dependent Uncertainty

We establish sufficient conditions for robust positive invariance under state- and input-dependent disturbances with anisotropic covariance structure. The proposed ansatz maps a fixed ellipsoidal template through a GP-derived positive-definite matrix field, subsuming scalar homothetic scaling while retaining finite graph-based verification. The resulting LMI conditions couple the learned field to Schur-stable dynamics; an isotropic fallback with inflation factor $r=1/(1-\gamma_{\mathrm{cl}})$ proves admissibility. During each learning epoch the field is frozen, so online tube evaluation is one GP covariance query and a small matrix square root, with no online set iteration or LMI solve. Quadrotor simulations show a $195\times$ reduction in 3D velocity-tube volume and a $2.1{\times}10^5$ reduction in the joint 7D velocity-control subspace relative to a non-adaptive homothetic baseline. This extended version adds full proofs, a separated offline/online complexity analysis, and controller-sweep, contraction, and projection-area studies.


[55] 2606.16069

Data Scarcity in Gas Load Profiling: Generalized Proxy-Guided Load and Temporal Disaggregation

The electrification of heating systems is a critical pathway for decarbonizing the building sector; however, the development of sustainable strategies is often hindered by the lack of granular thermal load profiles. The nature of this problem is such that available data are extremely scarce, irregular, and low-frequency, rendering conventional data-hungry machine learning approaches impractical or unreliable. High-resolution gas metering is rarely available, as it falls outside standard utility business requirements. To bridge this gap and enable data-driven AI analytics under severe data limitations, this paper presents the Generalized Proxy-Guided Load and Temporal Disaggregation framework. Validated on a dataset of 11 multi-unit residential buildings over 18 months, the framework achieves a mean squared percentage error (MSPE) of 6.37% for reconstructed total gas consumption. The methodology operates through a four-stage process: (i) Generalized Occupancy Proxy Extraction via weather normalization to isolate behavioral signals from hourly electricity data; (ii) Unified Segmentation and Normalized Pooling to mitigate the statistical limitations of sparse billing data; (iii) Unified Baseload Parameter Estimation enhanced by local calibration to ensure building-specific accuracy; and (iv) Component-Wise Temporal Disaggregation to reconstruct distinct baseload and heating profiles. Overall, the proposed framework effectively bridges the resolution gap, transforming low-frequency utility data into high-fidelity training sets suitable for scalable, data-driven decarbonization modeling.


[56] 2606.16107

Variable-Rate Deep Image Compression based on Low-Rank Adaptation by Progressive Learning

In the digital age, image compression is crucial for numerous applications, including web media, streaming services, high-resolution medical imaging, and connected vehicle networks, enabling efficient data storage and transmission. With the increasing demand for high-quality image communication, the need for advanced compression techniques becomes increasingly critical. Numerous Deep Image Compression (DIC) techniques have recently been introduced, showing impressive performance compared to traditional standards. However, variable-rate image compression remains an unresolved issue. Some DIC methods deploy multiple networks to attain different compression rates, whereas others use a single model, which often results in higher computational complexity and reduced performance. This work proposes a progressive learning approach for variable-rate image compression based on Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method. We introduce an additional LoRA Rate-Adaptive Module (LoRAM) in DIC methods. Due to the re-parameterized merging of LoRA, our proposed method does not introduce additional computational complexity during inference. Compared to methods utilizing multiple models, comprehensive experiments demonstrate that our approach achieves competitive performance, saving 99% in parameter storage, 90% in datasets, and 97% in training steps.
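
The claim that merged LoRA weights add no inference cost follows from linearity, and a small numerical check makes this concrete. This is a minimal sketch on a dense layer with arbitrary sizes; the paper's LoRAM presumably sits inside a convolutional codec, and the usual LoRA scaling factor is omitted here for brevity. Both simplifications are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 16, 32, 4

W = rng.standard_normal((d_out, d_in))   # frozen base weight
B = rng.standard_normal((d_out, rank))   # trainable low-rank factors
A = rng.standard_normal((rank, d_in))
x = rng.standard_normal(d_in)

# Training-time form: base path plus a separate low-rank adapter path.
y_adapter = W @ x + B @ (A @ x)

# Re-parameterized merge, done once offline for a given rate point.
W_merged = W + B @ A

# Inference: a single matrix product, the same cost as the base layer.
y_merged = W_merged @ x

extra_params = B.size + A.size           # stored per rate point
base_params = W.size
```

Only the two small factors need to be stored per rate point, which is where parameter-storage savings of the kind reported in the abstract would come from.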


[57] 2606.16115

Stabilizing Short Duration Speaker Verification through Neural Re-scoring with Hybrid Enrollment

Short-duration speaker verification (SDSV) is crucial for personalized keyword spotting, where test utterances are typically shorter than three seconds. Limited speech duration results in unstable speaker representations and increased sensitivity to noise and phoneme variations, thereby degrading performance. To investigate this issue, we construct VoxPhrase, a large-scale SDSV corpus automatically segmented from the VoxCeleb dataset. Our analysis shows that text-dependent (TD) enrollment is constrained by duration and yields unstable speaker representations. In contrast, although text-independent (TI) enrollment introduces content mismatch, its representations become more stable as the enrollment duration increases. Accordingly, we propose a hybrid-enrollment neural re-scoring framework that combines TD and TI enrollment and performs frame-level comparison via parallel cross-attention. Experiments on VoxPhrase demonstrate consistent improvements across multiple speaker models.


[58] 2606.16116

Distributed Safe Consensus Under Asymmetric Input and Time-Varying Output Constraints

This paper studies safe distributed consensus for single-integrator multi-agent systems over connected undirected graphs under simultaneous asymmetric actuator constraints and output safety constraints. Each agent is equipped with continuously differentiable asymmetric actuator dynamics that map a commanded control signal to the realized plant input while keeping the latter strictly inside a prescribed admissible interval. To address output safety, a barrier-coordinate transformation is introduced over a common time-varying safe interval, and a distributed synchronization law is designed in the transformed coordinates. The resulting controller integrates a graph-based coordination layer with an actuator-side tracking layer, thereby enabling simultaneous enforcement of input admissibility, forward invariance of the safe output set, and asymptotic synchronization. For compact admissible sets of initial conditions, it is shown that the closed-loop solution is complete, all signals remain bounded, the actuator inputs remain strictly within their asymmetric bounds, and the agent outputs remain inside the prescribed safe interval for all time. Moreover, the transformed synchronization errors converge exponentially to zero, and the original agent outputs asymptotically synchronize to a designer-selected admissible trajectory embedded in the common safe interval. Numerical simulations validate the proposed framework and demonstrate safe consensus under both asymmetric actuation bounds and time-varying output constraints.


[59] 2606.16117

Geospatial sensitivity of transmission-constrained ACOPF to generator retirement

The US faces a growing resource adequacy challenge: new loads are being added at unprecedented scale while aging generating assets are being retired. In transmission-constrained grids, it is difficult to determine which units can be safely retired and which cannot be retired and instead require lifetime extensions until new generation can be built. Historically, this analysis was prohibitively time-consuming. Transmission-constrained AC optimal power flow (ACOPF) is computationally intensive, and a thorough comparison and prioritization of generators could require hundreds or thousands of scenarios. We present an HPC-enabled framework that enables computation and geospatial mapping of the effects of generator retirement in terms of voltage magnitude and angle effects in the steady state. Specifically, our framework detects the effects of generator retirement using a simple k-nearest-neighbors model and a voltage-class-adjusted neighbor model. We demonstrate the results on over 8,000 generator retirement scenarios for a 70,000-bus transmission-constrained synthetic grid.


[60] 2606.16132

Neural Network-Enabled Codebook Design for Phased Array Calibration with Arbitrary Array Sizes

Array calibration is critical to achieving accurate beamforming in millimeter-wave (mmWave) antenna-in-package (AiP) phased arrays, where over-the-air (OTA) calibration in ALL-ON mode is a standard requirement. For practical calibration measurements, two core metrics are paramount: efficiency (defined by measurement time) and reliability (robustness, governed by the condition number of the phased array calibration codebook). In this work, we propose a neural network-enabled codebook generation method for phased array calibration compatible with arrays of arbitrary sizes. Codebooks generated via the proposed method achieve low condition numbers while requiring the minimum number of measurements, outperforming state-of-the-art calibration approaches. Practical measurements on a 26-GHz AiP phased array validate the effectiveness and robustness of the proposed method, with superior performance in both array calibration accuracy and beamforming quality.
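
To see why the condition number of the calibration codebook governs robustness, it helps to compare two all-on codebooks numerically. This is an illustrative sketch, not the paper's neural-network design: it assumes ideal continuous phase control and a square codebook with one measurement per element, and the helper names are hypothetical.

```python
import numpy as np

def dft_codebook(n):
    """n x n codebook of unit-modulus phase settings (every element on)."""
    k = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(k, k) / n)

def random_phase_codebook(n, seed=0):
    """n x n codebook with independent uniformly random phases."""
    rng = np.random.default_rng(seed)
    return np.exp(2j * np.pi * rng.random((n, n)))

n = 7  # an arbitrary array size that is not a power of two
cond_dft = np.linalg.cond(dft_codebook(n))
cond_random = np.linalg.cond(random_phase_codebook(n))
```

The DFT codebook is a scaled unitary matrix, so its condition number is 1, the lowest possible value, and measurement noise is not amplified when the codebook is inverted. The random codebook is worse conditioned. Real phase shifters are quantized, so the ideal DFT phases are generally unavailable for arbitrary sizes, which this sketch ignores.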


[61] 2606.16157

PALM: Single-Station Super-Resolved Small-Scale Radio-Map Localization by Path-Atom Matching

Localizing from a single base station is a longstanding goal, since it removes the synchronized anchors that geometric methods require. A radio map (RM) answers a position query from this one-station survey, yet classical RMs store coarse received power and match it by correlation, ignoring the small-scale path structure a ray tracer provides. We instead build a small-scale RM and show that cell identification, rather than candidate generation, is its information-limited bottleneck. We propose path-atom localization by matching (PALM), which super-resolves a coarse angle-delay observation into scored atoms and matches them to a ray-traced RM by an exact marginal likelihood. The score marginalizes atom reality inside the logarithm, and we prove that the common posterior-scaled surrogate is a Jensen lower bound whose deficit grows with the number of strong paths. We match on the absolute delay axis under a clock nuisance, since relative delays jump across shadowing boundaries, and we prove a unit-gradient law, a capped miss cost, a minimum-mean-square local centroid, and finite-sample conformal coverage. On the real DeepMIMO campus scenario, PALM localizes to a 1.7 meter median from a single base station, cuts the ninetieth-percentile error of received-power RM matching by 34 to 62 percent, and halves the single-snapshot median to 7 meters.


[62] 2606.16170

Synergizing Global Pattern Learning and Time Order Characterization in Mobile Channel Prediction: An RWKV-Based Approach

Owing to the potential to reduce pilot overhead and mitigate channel aging, channel prediction is emerging as an important research topic in wireless communications. Meanwhile, deep neural networks are becoming a foundational technology for high-precision prediction thanks to their excellent non-linear representation capabilities. In this paper, we conceive a task-driven prediction network, which aims to deeply synergize the following two functions: learning global patterns for shareable features across adjacent time slots and structurally encoding time order to characterize the inherent causality within the channel dynamics. To improve channel prediction accuracy, we employ RWKV (receptance weighted key value) as the network backbone and adapt it to the task's specific characteristics, utilizing its deep interleaved learning architecture to extract global patterns across multiple channel samples and leveraging its unique exponential decay to characterize temporal order. These task-driven designs significantly improve the learning efficiency of the prediction network. Comprehensive experimental evaluations demonstrate the superiority of the proposed method over current data-driven methods, such as long short-term memory and Transformer, in the channel prediction task, including gains of 1.84 to 4.29 dB in normalized mean squared error and 2.6 to 10.5 percentage points in cosine correlation.


[63] 2606.16171

Data-driven Control with Real-time Uncertainty Compensation for Multi-Fuel Engines

Multi-fuel compression ignition (CI) engines offer superior power density and fuel flexibility. However, achieving consistent and optimal combustion phasing across a wide range of operating conditions remains a major challenge, particularly in the presence of modeling uncertainties. This paper presents a novel, data-driven real-time uncertainty compensation framework for combustion control in multi-fuel CI engines. The proposed approach introduces a pseudo-engine speed that enables dynamic adaptation of control inputs in response to uncertainty affecting the engine. To model the underlying combustion process, a Gaussian Process Regression (GPR) model is first trained on available input-output data, capturing the nonlinear and fuel-dependent behavior across varying operating conditions. Control inputs are then synthesized through model inversion of the learned GPR surrogate and augmented with an uncertainty compensator designed to mitigate deviations caused by dynamic variations in operating conditions and model inaccuracies. This integrated control strategy allows for real-time input corrections within a finite number of combustion cycles. Theoretical analysis establishes finite-time convergence guarantees for the proposed controller. Simulation results demonstrate that the proposed method steers the combustion phasing to the desired value in real-time, providing a scalable and adaptive control solution for multi-fuel CI engine operation.


[64] 2606.16186

MOSAIC: Mobile Object Segmentation under Adverse Imaging Conditions for Rapid L-PBF Keyhole Behavior Characterization

In laser powder bed fusion (L-PBF) processes, the rapid evolution of gas and fluid interactions complicates our ability to properly monitor or control the process, with unstable keyholes leading to porosity and spatter formation. High-speed operando x-ray imaging of the keyhole has been used to better understand the impact of these interactions on the monitoring and control of the L-PBF process. MOSAIC, a Mobile Object Segmentation algorithm for experiments under Adverse Imaging Conditions, is designed to perform rapid analysis of keyhole dynamics during active beamline experimentation without needing time-consuming manual labeling or model training. Validation studies performed on 12 unique samples proved the robustness of MOSAIC with an average F1 score of 0.894 and a precision of 0.953 when compared to manually segmented images, performing as well as or better than the SAM and YOLO machine learning methods tested. MOSAIC is efficient, processing frames cropped to a moving window approximately 150x250 pixels at 19.9 milliseconds per image on CPU, compared to 54 and 5284 milliseconds per image for inference on CPU for YOLO and SAM models.


[65] 2606.16282

Geometry-Driven Islanding Detection and Fault Classification for Grid-Forming Inverters: A Normally Hyperbolic Invariant Manifold Framework with Physics-Derived Thresholds

This paper presents a geometry-driven detection and fault-classification framework for grid-forming (GFM) inverters based on normally hyperbolic invariant manifolds (NAIM) and stochastic hypothesis testing. The GFM droop manifold $\mathcal{M}_0$ is identified as a NAIM of the closed-loop dynamics. Transverse fluctuations under grid noise are modeled as an Ornstein--Uhlenbeck process, and the long-run covariance is obtained from the algebraic Lyapunov equation. The detection statistic $D_t=T_w\bar{\xi}_{\perp}^{\top}\Sigma_{\mathrm{long}}^{-1}\bar{\xi}_{\perp}$ converges to $\chi^2(2)$ under the null hypothesis, yielding the tuning-free threshold $D_{\alpha}=-2\ln\alpha$ and an asymptotically exact false-alarm rate $\alpha$. A factor-of-2 error in earlier formulations is corrected and validated using 8,000 Monte Carlo realizations over nine window lengths and three significance levels. The Berry--Esseen bound $d_{\mathrm{KS}}\leq1.6704/(\beta T_w)$ is confirmed empirically. The minimum window condition $T_w\geq10/\beta_{\min}\approx1.0$ s, where $\beta_{\min}=\min(\omega_f,\omega_v)$, satisfies the IEEE 1547-2018 two-second detection requirement. A co-design theorem shows that increasing $(\omega_f,\omega_v)$ simultaneously enlarges the Fenichel spectral gap, tightens the null covariance, and reduces the false-alarm rate. Modal decomposition separates frequency and voltage contributions, enabling classification of islanding and voltage faults without additional sensors. Case studies confirm correct acceptance of normal operation, rapid detection of soft islanding, and accurate identification of a 10\% voltage sag.
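
The threshold $D_{\alpha}=-2\ln\alpha$ follows from the closed-form tail of the $\chi^2(2)$ distribution, $P(\chi^2(2)>D)=e^{-D/2}$. The sketch below checks this numerically. It tests only the limiting null distribution with synthetic Gaussian samples, not the inverter model or the windowed statistic $D_t$ itself; the function names, trial count, and $\alpha=0.05$ are illustrative assumptions.

```python
import math
import random

def chi2_2dof_threshold(alpha):
    """Threshold D with P(chi2(2) > D) = alpha, from the tail exp(-D/2)."""
    return -2.0 * math.log(alpha)

def empirical_false_alarm(alpha, trials=200_000, seed=0):
    """Fraction of chi2(2) samples that exceed the threshold."""
    rng = random.Random(seed)
    threshold = chi2_2dof_threshold(alpha)
    # The sum of squares of two independent standard normals is chi2(2).
    hits = sum(
        rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 > threshold
        for _ in range(trials)
    )
    return hits / trials

alpha = 0.05
threshold = chi2_2dof_threshold(alpha)
rate = empirical_false_alarm(alpha)
```

For $\alpha=0.05$ the threshold is about 5.99, and the empirical exceedance rate should land close to $\alpha$. No tuning parameter enters the threshold, which matches the abstract's description of it as tuning-free.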


[66] 2606.16303

An Adjoint-based Neural Regulator for Real-Time Optimal Control with State Constraints

This paper introduces a learning-based control framework for real-time constrained optimal control of nonlinear systems with safety guarantees based on Pontryagin's Minimum Principle. The approach learns a neural co-state (adjoint) policy that encodes optimality through the system Hamiltonian, rather than directly approximating a control law. Feasibility is enforced separately at runtime through an efficient convex projection that incorporates actuator limits and safety constraints expressed as control barrier functions. We refer to this framework as an adjoint-based neural regulator (ANR) as it yields a controller that satisfies constraints while retaining the optimality structure encoded by the learned adjoint. We demonstrate the effectiveness of the proposed framework on nonlinear constrained control tasks using a unicycle model. The ANR achieves performance on par with nonlinear model predictive control at more than two orders of magnitude lower computational cost, while exhibiting near-invariant performance across unseen scenarios, thus significantly outperforming reinforcement learning methods in out-of-training-distribution regimes.


[67] 2606.16306

Graph Diffusion-Advection Operator for Directed Graph Signal Processing

Graph signal processing (GSP) provides a framework for analyzing data on irregular domains, with applications in neuroscience, finance, chemistry, and social sciences. Classical GSP primarily models symmetric relationships using undirected graphs, yet many real-world systems exhibit asymmetric interactions, motivating extensions to directed graphs. Central to directed GSP is the graph shift operator, typically defined via the directed graph Laplacian. Building on the well-known link between the undirected graph Laplacian and the diffusion operator, we establish a correspondence between the directed graph Laplacian and the diffusion-advection operator. This perspective opens new avenues for addressing crucial points such as frequency ordering, smoothness definition, and the design of spectral and graph filters. Specifically, we introduce two new orderings of frequencies based on the modulus and argument of the eigenvalues, naturally leading to new definitions of smoothness. Then we present two kernels reflecting diffusive and advective processes, namely the heat and transport kernels, respectively. Finally, we propose novel graph filters obtained by composing diffusive and advective parts, which approximate ideal spectral filters accurately and characterize the evolution of graph signals in richer ways. All aforementioned developments are illustrated on both synthetic and real graphs, including an application to temperature graph and signals.
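
A small example shows why directed graphs call for new frequency orderings: the directed Laplacian is not symmetric, so its eigenvalues are complex. The sketch below uses a directed cycle and the out-degree convention $L = D_{\mathrm{out}} - A$, both of which are assumptions, as are the two sort keys; the abstract does not spell out the paper's exact ordering definitions.

```python
import numpy as np

def directed_cycle_laplacian(n):
    """Out-degree Laplacian L = D_out - A of the cycle 0 -> 1 -> ... -> 0."""
    A = np.roll(np.eye(n), 1, axis=1)    # A[i, (i + 1) % n] = 1
    return np.diag(A.sum(axis=1)) - A

L = directed_cycle_laplacian(6)
eigvals = np.linalg.eigvals(L)           # complex, since L is not symmetric

# Candidate ordering 1: by modulus, read here as a diffusion-like rate.
by_modulus = eigvals[np.argsort(np.abs(eigvals))]

# Candidate ordering 2: by absolute argument, read here as how oscillatory
# (advection-like) a mode is. The argument of the numerically zero
# eigenvalue is not meaningful.
by_argument = eigvals[np.argsort(np.abs(np.angle(eigvals)))]
```

For this graph the eigenvalues are $1-e^{2\pi i k/6}$, so the moduli range from 0 to 2 and most eigenvalues have a nonzero imaginary part. An undirected Laplacian would yield only real eigenvalues and a single natural ordering.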


[68] 2606.16321

Sustainable Heating with Karma: A Simulation Study of the KTH Live-In Lab

Space heating in buildings accounts for 10% of the global CO2 footprint. The widespread adoption of energy-efficient heating technology, e.g., heat pumps, could help reduce this figure, but technology alone may not suffice to reach carbon neutrality. Additionally, human occupants have an important role to play by adopting sustainable heating behaviors, e.g., avoiding excessive window opening in the winter or (pre-)heating their units while clean energy is abundant. Thus far, demand response policies aimed at promoting these behaviors have been monetary, which discriminates against low-income households and exposes human occupants who do not actively engage with real-time control signals to financial risks. This paper instead investigates the suitability of a non-monetary karma economy for promoting sustainable heating behaviors. Karma leverages the repeated and dynamic nature of heating energy allocations to attain climate targets both fairly and efficiently over time without resorting to financial means. As a first step towards experimentally validating the karma concept with real human occupants in the KTH Live-In Lab, we perform a simulation study on a digital model of the Live-In Lab. The study provides initial estimates of expected effects to guide the design of human-in-the-loop experiments, and assists with designing and tuning the karma economy in this context. As a specific example, we investigate how incorporating consumption memory in the form of karma affects window opening behaviors in comparison to conventional memory-less heating operation.


[69] 2606.16362

Input-Dependent Fisher Information for Local Sensitivity Analysis of Medical Image Classifiers

Deep neural networks have achieved strong performance in medical image classification, but often behave like black boxes. Commonly used post-hoc interpretation methods often provide heuristic visualizations whose relationship to the classifier's predictive distribution is indirect. This work introduces a local sensitivity analysis framework based on the input-dependent Fisher Information Matrix (iFIM) of a trained classifier. The iFIM characterizes how the classifier's predictive distribution changes under infinitesimal perturbations of the input image. By using a Gram-matrix formulation, the nonzero eigenspectrum of the iFIM can be recovered without explicitly forming the full image-dimensional Fisher matrix. The leading iFIM eigenspace is then used to project an input image into a high local-sensitivity component and its orthogonal component. These components provide a model-intrinsic description of local predictive sensitivity, rather than a conventional pixel-wise attribution heatmap or a causal segmentation of task-relevant anatomy. The framework is evaluated on controlled and clinical medical image classification tasks using multiple classifier architectures. Perturbation-based experiments show that high-sensitivity iFIM components are more strongly coupled to changes in predictive confidence and classification performance than lower-sensitivity complementary components. The results support the iFIM framework as a principled tool for analyzing local decision sensitivity and for complementing existing attribution-based interpretability methods in medical imaging.


[70] 2606.16374

Transient-Safe Platooning via Dynamic Headway

Managing autonomous vehicle platoons requires a delicate balance between string stability and rigorous safety. This challenge is intensified by aggressive transients, such as highway merging. Although Constant Time Headway (CTH) spacing is the industry standard for Cooperative Adaptive Cruise Control, it lacks formal safety guarantees during significant velocity deviations. This letter proposes a computationally efficient control framework that considers a linear time-invariant model for the dynamics of each vehicle, while ensuring formal transient safety and stability. By introducing a spacing policy that naturally converges to CTH at steady state, we establish platoon safety as an inductive property. We derive a non-linear and saturated control law for the lead follower and provide sufficient initial conditions to guarantee velocity non-negativity and safety throughout the platoon for any CTH-based followers' control law. Numerical examples indicate the proposed methodology may be applicable even under non-nominal setups.


[71] 2606.16422

Performance Analysis of AFDM Under In-Phase and Quadrature Imbalance at Receiver

Affine Frequency Division Multiplexing (AFDM) is a chirp-based multicarrier waveform that achieves full diversity in doubly selective channels while requiring reduced pilot overhead. It is regarded as a highly promising candidate for sixth-generation (6G) mobile communication waveforms in high-mobility scenarios. However, AFDM deployment remains subject to hardware impairments, particularly the in-phase and quadrature (IQ) imbalance commonly encountered in direct conversion transceivers. This paper investigates the impact of receiver IQ imbalance on the bit error rate (BER) performance of AFDM systems. A mathematical model of AFDM under receiver IQ imbalance is first established, where the resulting inter-carrier interference (ICI) in the discrete affine Fourier transform (DAFT) domain is explicitly characterized. Moreover, a closed-form expression for the BER is derived under the influence of receiver IQ imbalance in an M-QAM-AFDM system over an AWGN channel. Numerical simulation results validate the accuracy of the theoretical analysis, while also indicating that under identical IQ imbalance conditions, AFDM exhibits more pronounced BER degradation compared to OFDM. The results provide fundamental insights into the sensitivity of AFDM to receiver IQ imbalance and offer guidance for practical system design.


[72] 2606.16435

Unified Audio Generation and Editing via Joint Condition Modeling and Progressive Training

With the growing focus on audio in multimedia applications, numerous advanced works on audio generation have emerged. Existing studies typically treat text-to-audio (TTA) and other related audio generation tasks, such as instruction-based audio editing, as independent challenges, adopting task-specific architectures or modules. This absence of a unified modeling paradigm substantially increases the overhead and complexity of building a system for both audio generation and editing, while also leading to limited scalability. To address this issue, we introduce AudioWeave, a unified model for TTA and audio editing without additional task-specific components. Specifically, we propose a joint condition modeling approach with a factorized position embedding, enabling the diffusion transformer backbone to operate under heterogeneous inputs of TTA and audio editing. We further propose a progressive multistage training strategy to mitigate task competition and catastrophic forgetting caused by interference among multiple tasks. This in turn helps maintain the performance of each individual task and may even lead to improvements in certain aspects. Experimental results on the TTA task and six audio editing tasks show that our unified model achieves competitive performance with task-specific models, laying the groundwork for further exploration of unified audio generation models.


[73] 2606.16464

Towards Robust Generative Speech Enhancement Using Vector Quantisation-Based Neural Audio Codec

This work investigates modelling strategies in continuous and discrete latent spaces in the vector quantisation (VQ)-based neural audio codec (NAC) speech enhancement (SE), along with the role of VQ regularisation. We propose cNAC-SE and dNAC-SE frameworks that predict continuous representations and discrete tokens in latent space, respectively. Theoretical analysis and visualisations in latent space are performed to exhibit their inherent modelling mechanisms. Experimental results show that the fully fine-tuned cNAC-SE model consistently outperforms all dNAC-SE variants across diverse test conditions and achieves leading performance among established generative approaches in DNS-MOS metrics. Comparison with the discriminative counterpart shows that VQ enhances robustness through an intrinsic effect of clean-prior-constrained regularisation, independent of discrete token processing. This highlights the transferable value of VQ regularisation to other continuous modelling methods.


[74] 2606.16468

GreenBox: Prototyping of an Automatic Road Accident Detection System with Real-Time Notification SMS

The Internet of Things (IoT) project, called "GreenBox", proposes the development of a prototype for the detection of road accidents, using sensors and actuators connected to an Arduino Uno and a Global System for Mobile Communications (GSM) card with 4G support, SIM7600G-H (global version) from DF-Robot. This system sends Short Message Service (SMS) messages to pre-established contacts, alerting, for example, family or friends, so that they immediately contact emergency entities in the event of an accident and provide them with all the necessary information, such as the location of the vehicle. The sensors include four push-buttons, each with a 10 kΩ resistor to define its default logical state, representing impact sensors for each side of the vehicle; two water level sensors for the engine compartment and trunk; and a gyroscope/accelerometer to detect rollover from a 70-degree inclination. The prototype also has a relay that is activated to turn off the engine in the event of a detected accident, preventing further damage. The GSM board has a Global Positioning System (GPS) antenna attached, allowing it to locate the vehicle and determine its speed moments before the possible accident. The system also allows the vehicle to be turned off via SMS in case of theft. The entire project is prepared for the owner, for example the driver of the vehicle, to cancel all automation via SMS in case of false accident detection. Rollover detection is calculated using the arctangent of the accelerometer values, and instructions for sending notifications are carried out by AT (Attention) commands between the microcontroller and the SIM7600 shield.
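
As a rough illustration of the arctangent-based rollover check described above, the following Python sketch derives roll and pitch from static accelerometer readings and compares them with the 70-degree threshold. The axis convention, the function names, and the use of both roll and pitch are assumptions made for illustration; the abstract does not specify them, and the prototype itself runs on an Arduino rather than in Python.

```python
import math

# Threshold taken from the abstract; everything else here is illustrative.
ROLLOVER_THRESHOLD_DEG = 70.0


def tilt_angles_deg(ax, ay, az):
    """Return (roll, pitch) in degrees from accelerometer readings.

    Assumes a vehicle at rest or moving steadily, so that the measured
    acceleration is dominated by gravity, with z pointing up when level.
    """
    roll = math.degrees(math.atan2(ay, az))
    pitch = math.degrees(math.atan2(-ax, math.hypot(ay, az)))
    return roll, pitch


def is_rollover(ax, ay, az, threshold_deg=ROLLOVER_THRESHOLD_DEG):
    """Flag a rollover when either tilt angle reaches the threshold."""
    roll, pitch = tilt_angles_deg(ax, ay, az)
    return abs(roll) >= threshold_deg or abs(pitch) >= threshold_deg
```

For example, a level vehicle reading (0, 0, 1) g is not flagged, while a vehicle lying on its side reading (0, 1, 0) g gives a roll of 90 degrees and is flagged.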


[75] 2606.16486

Can Optimal Dispatch Models Recreate Reality? A Retrospective Analysis of Europe's 2022 Energy Crisis Using PyPSA-Eur

Electricity prices result from the complex interplay of supply and demand, which depends on variable renewable energy production, fuel costs, CO$_2$ price, and grid this http URL 2020 and 2024, COVID-19 shifted demand and disrupted supply chains and operations in Europe, while Russia's invasion of Ukraine constrained gas supply, causing exceptional volatility in gas prices. It remains unclear whether optimal dispatch models can reliably replicate historical hourly prices during crises, given the rapid fluctuations in fuel prices and operators' limited foresight. In this work, we ask whether an optimal dispatch model, parametrised with historical data on demand, fuel, and CO$_2$ prices, can reproduce the observed market outcomes during this period. We perform hourly hindcasts of electricity generation in 35 countries from 2020 to 2024 using PyPSA-Eur and compare the nodal marginal electricity prices with historical ENTSO-E market prices using the Symmetric Mean Absolute Percentage Error (SMAPE). The scenarios compare static vs dynamic assumptions on fuel and CO$_2$ prices, as well as perfect foresight vs a two-week rolling-horizon optimization. Combining high-resolution fuel and CO$_2$ price time series with limited foresight substantially improves hindcast accuracy, yielding an average SMAPE of 20.76% based on the daily load-weighted average price for all of Europe. While improvements relative to the scenario with perfect foresight and static price inputs occur, they are most pronounced during periods of high fuel-price volatility, when marginal-cost swings propagate to electricity prices. In 2022, the optimal generation mix in most countries shows substantially less natural gas and more coal than historically observed. Other discrepancies can be attributed to the model's omission of real-world policy interventions, other dispatch constraints, and generator outages.
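
For readers unfamiliar with the SMAPE metric mentioned above, here is a minimal Python sketch of one common definition. Several SMAPE variants exist and the abstract does not state which one the authors use, so the formula, the function name, and the handling of zero denominators are assumptions.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent.

    Uses the variant mean(2 * |p - a| / (|a| + |p|)) * 100.
    Pairs where both values are zero contribute zero error
    (an assumption; other conventions exist).
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        if denom > 0:
            total += 2.0 * abs(p - a) / denom
    return 100.0 * total / len(actual)
```

With this definition, historical prices of 100 and 200 against modelled prices of 110 and 180 give a SMAPE of roughly 10%.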


[76] 2606.16492

On the Lyapunov equation with the state matrix in companion form

We study the continuous-time Lyapunov equation under the assumption that the state matrix is a Hurwitz companion matrix. The standard Lyapunov theory implies that the unique solution $X$ is positive semidefinite. Motivated by positive systems, we investigate the question of whether $X$ is entrywise nonnegative. We prove that this is the case when the companion matrix has only real eigenvalues. The proof reduces each entry of $X$ to a quadratic form associated with a class of Cauchy-like matrices whose entries are expressed in terms of elementary symmetric polynomials. The required nonnegativity then follows from the positive semidefiniteness of these Cauchy-like matrices. We also discuss a stronger total-positivity property: total nonnegativity does not hold in general, but it is recovered under an additional sign condition on the expansion of the forcing vector in the eigenbasis of $A^\top$.


[77] 2606.16539

Decoding while Adapting: Zero-Shot Online Speaker Adaptation via Audio-Textual Prompts for Elderly Speech Recognition

This paper proposes a novel cross-utterance audio-textual prompts based speaker adaptation approach for elderly speech recognition. It enables zero-shot, real-time adaptation to unseen speakers. Speech and text embeddings are extracted from the current and a few preceding utterances, before being fused in a cross-modal manner to produce compact speaker prompts that are more consistent than i/x-vectors and ECAPA-TDNN features. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed online adaptation outperforms the speaker-independent (SI) model by statistically significant word error rate (WER) or character error rate (CER) reductions of 0.61% and 1.22% absolute (2.99% and 4.48% relative). Real-time factor (RTF) speed-up ratios of up to 9.83 times are obtained over offline batch-mode adaptation.


[78] 2606.16546

Confidence Score Guided Incremental and Speaker Adaptive Pseudo-Labeling for Semi-Supervised Elderly Speech Recognition

This paper proposes a novel confidence score guided incremental and speaker adaptive pseudo-labeling approach for semi-supervised elderly speech recognition. It facilitates higher-quality pseudo-label selection and progressive refinement, while also mitigating speaker heterogeneity. A confidence estimation module is designed to rank the reliability of untranscribed data, enabling a curriculum learning trajectory that progressively folds in unlabeled data subsets from high to low confidence. Speaker-specific characteristics are captured through speaker adaptive training with learnable prompts. Experiments on the English DementiaBank Pitt and Cantonese JCCOCC MoCA elderly speech datasets suggest that the proposed method outperforms the semi-supervised baseline, which uses neither confidence score guided incremental nor speaker adaptive pseudo-labeling, by statistically significant word error rate (WER) or character error rate (CER) reductions of 1.45% and 2.27% absolute (6.21% and 6.98% relative).


[79] 2606.16551

Learning Input-Channel Permutation Equivariance for Multi-Channel Source Separation: Reducing Bleeding in Small Music Ensembles

Microphone bleed is a persistent challenge in small ensembles and orchestral recordings, where close microphones intended for individual instruments also capture leakage from nearby sources. This overlap degrades track isolation and complicates mixing. This paper addresses the bleeding problem by making channel-permutation-equivariance a core learning principle. During training, we apply the same random permutation to the input microphone channels and their corresponding reference targets. This discourages reliance on fixed channel-instrument associations and improves robustness to changes in the recording setup and even in the recorded instruments. The proposed model is trained on synthetic ensembles with diverse simulated room acoustics and microphone placements, and evaluated on unseen simulated conditions and real URMP recordings. The results show that permutation-aware training consistently improves SDR and reduces bleeding under unseen conditions compared with non-permutation baselines. The findings highlight permutation-equivariance as a simple, data-centric strategy for robust debleeding and practical multi-channel source separation in music production workflows.
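
The permutation step described above can be sketched in a few lines of Python. This is only an illustration under the assumption that each microphone channel has exactly one index-aligned reference target; the function name and data layout are hypothetical and not taken from the paper.

```python
import random


def permute_example(mics, targets, rng):
    """Apply one shared random permutation to inputs and targets.

    mics and targets are lists of per-channel signals where targets[i]
    is the clean reference for the instrument close-miked by mics[i].
    Returns the permuted inputs, the permuted targets, and the permutation.
    """
    if len(mics) != len(targets):
        raise ValueError("mics and targets must have the same channel count")
    perm = list(range(len(mics)))
    rng.shuffle(perm)
    permuted_mics = [mics[i] for i in perm]
    permuted_targets = [targets[i] for i in perm]
    return permuted_mics, permuted_targets, perm
```

Because the same permutation is applied to both lists, each input channel stays paired with its own reference while its position changes from one training example to the next, so a model cannot rely on a fixed channel-to-instrument mapping.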


[80] 2606.16581

Optimizing Multiple Feature Types for Image Inpainting in the Linear and Nonlinear Setting

Inpainting-based compression stores a carefully optimized subset of the full image data and reconstructs the missing data by inpainting. The quality of these lossy codecs depends decisively on the stored data. So far, these data consist almost exclusively of pixel locations along with their grayscale or color values. In the present paper, we present a general theory and a practical framework that allows incorporating arbitrary features which can be described by linear or nonlinear equations. This includes, e.g., derivatives of arbitrary order or local integrals. Our features can be combined with linear or nonlinear inpainting operators. Moreover, we present an algorithm that automatically optimizes the location and the type of the selected feature. The approach of allowing different types of optimized features turns inpainting-based compression into a more general, versatile and powerful paradigm. Our experiments report a consistent quality gain when increasing the number of feature types from 1 to 5. With the same amount of stored data, the average peak signal-to-noise improvement is 2.76 dB for harmonic (homogeneous diffusion) inpainting, and 1.82 dB for edge-enhancing diffusion inpainting.


[81] 2606.16607

Context-Aware Markov VAE for CSI Compression in Wireless Systems

This paper considers neural channel state information (CSI) compression for time-varying massive multiple-input multiple-output (MIMO) channels in frequency division duplex (FDD) systems with limited feedback resources. The main challenge lies in obtaining a compact and efficient representation of the CSI given that it exhibits strong temporal correlation across successive snapshots. Existing memoryless compression models do not exploit this property, while simple temporal extensions often incorporate multiple observations without explicitly modeling the latent dynamics. We propose a context-aware compression framework based on a k-memory Markov variational autoencoder (k-MMVAE), which uses a finite temporal window to capture the evolution of CSI in the latent space. The model introduces Markov-structured latent dynamics with finite memory, enabling efficient use of temporal dependencies for compression. Simulation results show that the proposed approach improves target CSI reconstruction performance compared to memoryless and weakly sequential baselines, particularly at low and moderate compression rates. These results suggest that explicit latent temporal modeling can provide an effective mechanism for CSI compression under limited feedback constraints.


[82] 2606.16618

Acoustic, VOC, and Multimodal Stress Source Localization in the Internet of Plants

The Internet of Plants (IoP) treats distributed plant networks as bio-sensing infrastructure for environmental monitoring, but spatial localization of stress sources within such networks remains unaddressed. Plant stress signals have fundamentally different spatial dynamics: acoustic emissions propagate omnidirectionally and independently of wind, while volatile organic compound (VOC) plumes are narrow and advection-dominated. We propose a two-stage, coarse-to-fine localization pipeline for a network of "agent plants", bio-hybrid sensing nodes embedded in the canopy. Stage 1 localizes the source via time-difference-of-arrival (TDOA) multilateration on acoustic time-of-arrival (ToA) readings; Stage 2 refines this estimate using a closed-form, steady-state Green's function model of VOC dispersion. A VOC informativeness gate and an inverse-variance fusion rule combine the two estimates according to their across-trial reliability, with graceful degradation to the TDOA-only estimate when no informative VOC signal is detected. We evaluate TDOA-only, VOC-only, and fused approaches on a new open-source dataset of 52 scenarios generated via a finite-volume advection-diffusion solver and a ray-based acoustic attenuation model. Across network densities of 1 to 50 agent plants, TDOA multilateration achieves sub-meter mean absolute error (MAE) once three or more agents are within acoustic range, far outperforming VOC-only localization (MAE $> 3$ m at all densities). Fusion differences from the TDOA-only estimate are small and statistically indistinguishable from noise in most cases. The pipeline is robust to physical parameter perturbations, ToA noise, the VOC gate threshold, and the bounding radius. TDOA localization is deployable with current acoustic hardware, whereas VOC localization remains a forward-looking capability pending advances in compact biochemical sensors.
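
The gated inverse-variance fusion mentioned above can be illustrated with a short Python sketch. It assumes 2-D position estimates and a single scalar variance per estimate, and it treats the gate decision as a boolean supplied by the caller; the paper's actual gate criterion, variance estimates, and interfaces are not given in the abstract, so all names here are hypothetical.

```python
def fuse_estimates(tdoa_xy, tdoa_var, voc_xy, voc_var, voc_informative):
    """Combine two position estimates by inverse-variance weighting.

    Falls back to the TDOA-only estimate when the VOC signal is judged
    uninformative (the gate decision is supplied by the caller).
    """
    if not voc_informative:
        return tuple(tdoa_xy)
    if tdoa_var <= 0 or voc_var <= 0:
        raise ValueError("variances must be positive")
    w_tdoa = 1.0 / tdoa_var
    w_voc = 1.0 / voc_var
    total = w_tdoa + w_voc
    # Each coordinate is a weighted mean; the lower-variance estimate dominates.
    return tuple(
        (w_tdoa * t + w_voc * v) / total for t, v in zip(tdoa_xy, voc_xy)
    )
```

With a TDOA variance much smaller than the VOC variance, the fused position stays close to the TDOA estimate, which is consistent with the abstract's observation that fusion differs little from TDOA-only localization.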


[83] 2606.16628

XL-ChannelDiff: An Efficient Diffusion-Based Multi-Domain Near-Field Channel Extrapolation Framework for XL-MIMO Systems

Accurate channel state information (CSI) acquisition is essential for unleashing the performance gains of extremely large-scale multiple-input multiple-output (XL-MIMO) systems. However, in near-field regions, CSI acquisition is much more challenging than in the far field due to the high-dimensional channel representation and spherical wavefront propagation. To address this, we propose an efficient multi-domain near-field channel extrapolation framework for XL-MIMO systems. Leveraging the conditional denoising diffusion implicit model (CDDIM), our approach enables accurate channel extrapolation across the antenna, frequency, and spatial domains. Specifically, we design a physics-aware CDDIM backbone that incorporates position-embedded patch tokenization and a mask-guided multi-head attention mechanism, enabling the model to exploit position-dependent channel correlations induced by near-field spherical-wave propagation. To ensure high-fidelity extrapolation, we incorporate a Wasserstein GAN (WGAN) discriminator that provides adversarial supervision to the CDDIM during both the training and reverse sampling phases. Additionally, a RePaint-style refinement scheme is introduced to optimize the sampling trajectory, further boosting extrapolation accuracy. Extensive experiments demonstrate that the proposed framework achieves superior extrapolation accuracy and robust generalization across diverse domains, varied configurations, and severe masking conditions.


[84] 2606.16631

Exponential Weighting Model Predictive Control with Observer for Modular Multilevel Converters

In this article, we propose a model predictive control (MPC) scheme with an exponential cost function, along with an observer for the Modular Multilevel Converter (MMC), to enhance converter dynamic performance. In particular, as the prediction horizon $(N_P)$ increases, the numerical conditioning deteriorates rapidly. This work uses an appropriately weighted cost function to overcome the limitations of a large $N_P$. We further analyze the effects of constraints, observing that the designed MPC strictly adheres to them and that the control variable influences the MMC plant's response. The presence of the observer improves the prediction of the output, particularly for setpoint changes in the reference signal. We also analyze the prescribed performance, which provides a priori guarantees of closed-loop stability for the proposed controller.


[85] 2606.16635

Closed-loop Optimal Fault Detection for Uncertain Systems

Faults compromise the reliability and safety of complex engineering systems. The aim of this article is to address the problem of robust fault detection filter design for continuous-time linear time-invariant uncertain systems in open-loop or closed-loop configurations. The developed method offers a unified approach to handle parametric and dynamic uncertainties by solving a single Riccati equation, based on a worst-case disturbance and uncertainty model. This worst-case model is obtained by nonlinear optimization and application of the boundary Nevanlinna-Pick method. The efficacy of the proposed approach is demonstrated using an uncertain model of an experimental reticle stage used in the lithography industry. The results illustrate that an optimal compromise is achieved between sensitivity to faults on the one hand and rejection of modelling uncertainties and disturbances on the other. This capability enables clear differentiation between faults and undesired effects in residuals, thereby enhancing fault detection reliability and ultimately contributing to improved safety and performance of machines.


[86] 2606.16657

Towards mm-Level Accurate UWB Radar: High-Accuracy Phase-Based Obstacle Detection through Multi-Channel Fusion

Accurate, tag-free distance estimation with ultrawideband (UWB) radar is essential for applications such as autonomous guided vehicles, robotics, and environment characterization. For tag-based localization systems, phase-based UWB signal processing techniques have demonstrated sub-wavelength ranging precision, but these approaches are not applicable to passive (tagless) radar setups with weak reflections, mixed multipath conditions, and the absence of a known time-of-flight (ToF) first-path reference. This paper demonstrates for the first time that phase information can be effectively exploited in a fully passive UWB radar setting. We introduce a signal processing framework that extracts reliable distance information by combining coarse amplitude-based estimates with high-resolution phase changes across multiple frequency channels. By referencing phase measurements with the line-of-sight component, the method compensates for hardware-induced phase drift, while the use of multichannel frequency diversity enables disambiguation of periodic phase information and improves robustness against frequency-specific channel degradation such as Fresnel zones. The proposed approach is validated on a robot equipped with a bistatic UWB radar using DW3000 devices and evaluated in a realistic metallic industrial environment. Experimental results show that our work consistently achieves centimeter-level accuracy even at high speeds, with a median error of 1.69 cm, significantly outperforming existing UWB radar approaches that rely only on amplitude information and achieve ~10 cm accuracy. We further show how multi-channel fusion exploits uncorrelated channel degradation to reduce the error by more than 40% compared to single-channel operation, and outline how phase modeling and fusion can be pushed toward sub-centimeter accuracy.


[87] 2606.16668

CraBERT: Efficient Phoneme Encoder Pre-Training via Cascade Fusion of Subword Representations for Text-to-Speech

This paper introduces CraBERT, a pre-trained phoneme encoder (PPEnc) designed for efficient pre-training in text-to-speech (TTS). CraBERT employs a cascade-fusion architecture and a subword-phoneme alignment algorithm to integrate representations from a pre-trained subword-level BERT into a phoneme-level BERT. This design provides prior word- and sentence-level information, reducing the amount of pre-training required by the phoneme encoder. Subjective listening evaluations show that CraBERT achieves MOS values comparable to existing PPEncs after approximately one epoch of pre-training, whereas the baselines in our comparison are pre-trained for approximately ten epochs. These results demonstrate that CraBERT can efficiently learn representations suitable for improving the perceived naturalness and prosody of synthesized speech.


[88] 2606.16717

Sensing-Assisted Predictive Beamforming for UAV-Enabled Ocean Monitoring Networks

This paper investigates a sensing-assisted predictive beamforming framework for UAV-buoy maritime monitoring by explicitly accounting for wave-induced buoy dynamics and residual sea clutter. A frame-based UAV mission workflow is first established, where the UAV transmits integrated sensing and communication signals to acquire buoy echoes and to support subsequent uplink beam alignment. To characterize short-horizon buoy motion, a correlated-acceleration state-space model is developed by combining a Singer process for wave-driven excitation with a slowly varying current-drift term. Given the resulting nonlinear reflection, Doppler, and delay measurements, the posterior Fisher information matrix and the corresponding posterior Cramér-Rao bound (PCRB) are derived, and the predicted horizontal-position PCRB is adopted as the sensing metric. A per-frame worst-buoy design is then formulated to jointly optimize sensing power allocation and UAV position under uplink-rate, UAV-power, and mobility constraints. By exploiting a Schur-complement reformulation and a lagged successive convex approximation, the resulting subproblem is converted into a convex conic program with tractable complexity. Simulation results show that the proposed scheme maintains robust prediction and communication performance under denser buoy deployments and harsher sea conditions, and outperforms several baseline designs. In particular, the pronounced root mean square error (RMSE) degradation of the communication-only benchmark confirms that sensing-assisted state refinement is essential for accurate predictive beamforming in dynamic maritime environments. Compared with a full first-order Taylor expansion method, it achieves a more attractive performance-complexity tradeoff for online deployment.


[89] 2606.16750

Data-Aided Channel and Doppler Estimation for mMIMO LEO SatComs with Uncompensated Doppler

This paper presents a framework for estimating and tracking massive multiple-input multiple-output (mMIMO) low-Earth-orbit (LEO) satellite channels under uncompensated Doppler. The approach begins with a pilot-based minimum mean square error (MMSE) estimate, followed by Doppler estimation and data-aided channel estimation using either a decision-directed MMSE (DD-MMSE) or an expectation-maximization (EM)-based estimator. The proposed framework achieves improved channel and Doppler estimation accuracy compared to existing methods. Results demonstrate that the DD-MMSE variant offers lower complexity, while the EM variant provides higher estimation accuracy.


[90] 2606.16815

A Perception vs. Distortion Perspective on Score-Based Generative Channel Estimation

Driven by their remarkable success in computer vision and inverse problem solving, score-based models are increasingly applied to wireless communications, where they show promise across a range of physical-layer tasks. However, despite this growing interest, the current literature often lacks a rigorous analysis of when score-matching offers a tangible advantage over traditional discriminative learning. This paper aims to address this gap through the use-case of channel estimation, a fundamental inverse problem in wireless systems. We present a theoretically grounded interpretation of score-based channel estimation through the lens of the perception-distortion tradeoff, identifying the conditions where score matching excels as well as its key limitations. In particular, by modeling downstream wireless tasks (e.g., capacity maximization) as functionals of the channel estimation process, we quantify the excess risk incurred by standard distortion-minimization approaches. Extensive numerical results show that under high predictive uncertainty, the large excess risk gap can be offset by score-based estimation, enabling near Bayesian-optimal precoding via the learned posterior, whereas in the low predictive uncertainty regime, discriminative distortion-minimization approaches are preferable due to lower complexity and more efficient use of model capacity.


[91] 2606.16835

Conditioning Deep Anatomical Prior Knowledge for Reconstruction of Multispectral Optoacoustic Tomography Images

Accurately delineating tissues and reconstructing their chromophore compositions from Multispectral Optoacoustic Tomography (MSOT) images is a key challenge in optoacoustic imaging. The difficulty arises because light fluence distributions within tissue intrinsically depend on spectral optical properties, making the inverse problem inherently ill-posed. Currently, there is a lack of studies leveraging a priori probabilistic anatomical knowledge to guide tissue segmentation and infer chromophore composition. Moreover, most current studies address these two tasks sequentially, which can result in errors accumulating through the process. To address these issues, we present Anatomical Priors for Reconstruction of Optoacoustic Tomography (APRECOT), a method that leverages probabilistic models of anatomical structures and tissue properties to enable simultaneous segmentation of tissues and reconstruction of their bulk chromophore compositions. In this proof-of-concept using in-silico data, we show that incorporating probabilistic anatomical context strongly improves the accuracy of bulk chromophore concentration estimation compared to reference methods that do not use any anatomical context or use sequential strategies. This work represents an essential step towards an MSOT imaging mode that directly provides clinically relevant information, such as imaging tissue oxygenation dynamics or disease-related changes in tissue composition.


[92] 2606.16927

Communication Channel Modelling of Unmanned Aerial Vehicles

Wireless communication channel characterization for unmanned aerial vehicles (UAVs) is essential for reliable control, data transmission, and mission performance in civil, industrial, and defence applications. Channel behaviour is examined using a measurement-based approach that captures both large-scale propagation effects, represented by path loss, and small-scale characteristics, represented by the channel impulse response (CIR) and power delay profile (PDP). An SDR-based channel sounding system is employed to collect and process in-phase and quadrature (IQ) data, enabling the extraction of key channel parameters. Following system verification, measurements are conducted in ground-to-ground (G2G), air-to-ground (A2G), and air-to-air (A2A) scenarios. The results demonstrate that path loss alone is insufficient to describe UAV communication channels, as CIR and PDP provide additional insight into multipath propagation and delay-domain behaviour. The findings indicate that realistic UAV channel models should incorporate both large-scale and small-scale channel statistics. Further improvements may be achieved through increased sounding bandwidth, enhanced synchronization, measurements in a wider range of environments, and more detailed analysis of Doppler effects.
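As a rough illustration of the small-scale quantities named above, the following sketch computes a power delay profile from a sampled channel impulse response and derives the mean excess delay and RMS delay spread from it. The tap values and delays are invented for illustration, and the RMS delay spread is an assumed example of a "key channel parameter"; the abstract does not say which parameters the authors extract.

```python
import math

def power_delay_profile(cir_taps):
    """PDP as the squared magnitude of each complex CIR tap."""
    return [abs(h) ** 2 for h in cir_taps]

def delay_statistics(delays_s, pdp):
    """Return (mean excess delay, RMS delay spread) in seconds."""
    total_power = sum(pdp)
    mean_delay = sum(t * p for t, p in zip(delays_s, pdp)) / total_power
    second_moment = sum(t * t * p for t, p in zip(delays_s, pdp)) / total_power
    rms_spread = math.sqrt(max(second_moment - mean_delay ** 2, 0.0))
    return mean_delay, rms_spread

# Hypothetical two-path channel: a direct path and a weaker echo 1 microsecond later.
taps = [1.0 + 0.0j, 0.0 + 0.5j]
delays = [0.0, 1e-6]

pdp = power_delay_profile(taps)
mean_delay, rms_spread = delay_statistics(delays, pdp)
```

In this toy case the echo carries a quarter of the direct-path power, so the delay statistics are dominated by the direct path.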


[93] 2606.16930

Learning practically stabilizing output-feedback nonlinear controllers

This paper addresses the problem of learning an output-feedback surrogate controller offline that approximates a given, possibly computationally expensive, nonlinear controller-observer pair. The surrogate is modeled as a recurrent dynamical system and is trained to imitate closed-loop input/output trajectories generated by the given controller. Beyond imitation accuracy, the offline training problem promotes input-to-state practical stability by incorporating estimated state trajectories to learn a candidate Lyapunov function. The approach is validated on a nonlinear continuous stirred tank reactor, where constraint satisfaction and practical stability are assessed through a probabilistic validation approach. The numerical results highlight the benefit of jointly learning the Lyapunov function by comparing against an imitation-only baseline.


[94] 2606.17001

Sandbox-Enabled Digital Twin for Cyber-Physical Systems

Cyber-physical system (CPS) controllers are vulnerable to faults and malicious attacks, including failures triggered only under complex plant conditions, yet pre-deployment validation typically relies on plant models or digital twins that exercise the controller's I/O as a black box. Side-channels, used to detect those run-time behavioral anomalies, are complementary but also open-loop, detached from I/O instrumentation, and driven by synthetic inputs rather than realistic plant feedback. We present a closed-loop digital twin framework that bridges this gap by hosting an unmodified controller binary within the SaMOSA Linux sandbox with its I/O rerouted to an external plant simulator, allowing coupled capture of simulated plant conditions and events alongside the controller's behavioral side-channels. The framework captures four time-synchronized controller side-channels (hardware performance counters, system calls, disk activity, and network activity) alongside plant state, and uses orchestration hooks for repeatable, parameterized runs. We demonstrate the framework on an OpenPLC runtime executing a Structured Text control program against a Modbus-connected IEEE 14-bus power-system model, and also discuss the application to robotics systems. The captured side-channels correlate controller behavior with simulated plant events, establishing an observability foundation for online testing, coverage analysis, and anomaly detection in CPS controllers.


[95] 2606.17004

Data-Driven Personalization of Automated Insulin Delivery

Automated insulin delivery (AID) systems are often tuned for the population and offer limited online adaptation to the inter- and intrapatient variability in insulin needs caused by meal patterns, physical activity, and fluctuations in insulin sensitivity. We present a real-time, data-driven personalization approach that adapts controller parameters using the subject's daily glycemic data. The adaptation is formulated as projected gradient descent on a daily risk metric, where the gradient estimation is designed to attenuate noise and metabolic variability. We use contraction theory to validate the optimization framework and convergence of the closed-loop system under adaptation. In silico experiments on the 100-adult cohort of the FDA-accepted UVA/Padova T1D simulator show that our method improves glycemic risk and increases time-in-range (TIR, 70-180 mg/dL) by 2%, 3%, and 4% after 4, 8, and 17 weeks, respectively, under variability in meal timing, meal size, and insulin sensitivity.
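To make the phrase "projected gradient descent on a daily risk metric" concrete, here is a minimal generic sketch: one gradient step per simulated day, followed by projection onto box constraints. The quadratic `toy_risk`, the parameter bounds, the step size, and the finite-difference gradient are all assumptions chosen for illustration; the abstract does not specify the authors' risk metric, parameterization, or gradient estimator.

```python
def project(theta, lower, upper):
    """Clip each parameter into its admissible interval [lower, upper]."""
    return [min(max(t, lo), hi) for t, lo, hi in zip(theta, lower, upper)]

def finite_difference_gradient(risk, theta, delta=1e-3):
    """Central-difference gradient (a simple stand-in for a noise-attenuating estimator)."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += delta
        down[i] -= delta
        grad.append((risk(up) - risk(down)) / (2 * delta))
    return grad

def personalize(risk, theta0, lower, upper, step=0.1, days=200):
    """One projected gradient step per 'day'."""
    theta = project(theta0, lower, upper)
    for _ in range(days):
        grad = finite_difference_gradient(risk, theta)
        stepped = [t - step * g for t, g in zip(theta, grad)]
        theta = project(stepped, lower, upper)
    return theta

def toy_risk(theta):
    # Hypothetical risk with unconstrained minimizer (2.0, -1.0).
    return (theta[0] - 2.0) ** 2 + (theta[1] + 1.0) ** 2

theta = personalize(toy_risk, [0.0, 0.0], lower=[0.0, 0.0], upper=[1.5, 1.0])
```

Because the unconstrained minimizer lies outside the box, the iterates settle on the boundary point nearest to it, which is the behavior the projection step is meant to guarantee.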


[96] 2606.17050

Optimal Bounded Thrust Powered Descent with Analytical Ground-Collision Avoidance

The paper proposes a new approach to address the bounded-thrust powered-descent problem while ensuring ground-collision avoidance. A time-dependent polynomial approximation of the mass is employed to formulate a bounded linear-quadratic optimal-control problem that minimizes the thrust-acceleration control effort, terminal miss, and terminal velocity error. The resulting approximation is used to impose a hard constraint on the horizontal thrust profile while keeping the vertical thrust profile unconstrained. The key idea is a hierarchical separation of the thrust allocation, which enables analytical ground-collision avoidance under bounded thrust. Unlike existing bounded-thrust powered-descent approaches based on numerical optimization and trajectory-shaping constraints, the proposed method provides explicit analytical collision-avoidance conditions. Building on this formulation, the guidance law predicts the switching times between saturated and unsaturated arcs and shapes the thrust-acceleration profile to achieve a soft landing, even when the controller remains saturated over extended portions of the trajectory. Owing to its analytical nature, the guidance law is computationally efficient, and its continuous thrust profile facilitates real-time implementation. The proposed method was evaluated over a grid of perturbed initial conditions in realistic simulations, demonstrating accurate collision-free soft-landing performance. The results highlight the importance of combining saturation-aware guidance with ground-collision avoidance under bounded thrust.


[97] 2606.14713

Recurring Public Transit Schedules: Stable Identification from GTFS and Similarity Analysis

Public transit schedules contain recurring service structures: many dates share the same passenger-facing timetable, while others differ because of weekends, holidays, seasons, or special events. GTFS does not encode these recurring schedules directly, so downstream scheduling, assignment, and mismatch models lack an explicit recurrent supply object. This paper formalizes recurring schedules as DayTypes, defined by the complete set of Route Pattern trips operating on a date, and develops a stable GTFS-based extraction method using H3 route-pattern keys. We then define a schedule-comparison framework with exact, time-tolerant, and structural-comparability metrics that distinguish strict timetable differences from small timing shifts and larger service changes. Validation on Japanese and Canadian GTFS feeds shows substantial schedule compression, a median of four nonempty DayTypes per agency in the pairwise-analysis sample, hierarchical nesting between service classes, and country-level differences in the persistence of exact disjointness. The resulting DayTypes provide compact Route-Pattern-time schedule sets for timetable synchronization, vehicle scheduling, demand assignment, and schedule-consolidation analysis.


[98] 2606.14722

A Comparative and Hybrid Study of CNN and Transformer Models for Multi-Class Virus Classification in Transmission Electron Microscopy

The automatic recognition of virus particles in transmission electron microscopy (TEM) images remains a demanding task, primarily owing to strong inter-class similarity, scale variability, and pronounced class imbalance. In this study, several convolutional neural networks and transformer-based architectures were comparatively evaluated for the classification of 22 virus categories using the TEM virus dataset. All models were trained under identical preprocessing and optimization conditions, and imbalance effects were mitigated through a weighted cross-entropy formulation. Performance was quantified using overall accuracy together with macro-averaged precision, recall, and F1 score. Among standalone models, the Swin Transformer achieved the highest accuracy (0.8831) and macro-F1 score (0.8444), followed by DeiT (accuracy 0.8669). Convolutional architectures exhibited comparatively lower balanced performance, with ResNet50 demonstrating substantial degradation (accuracy 0.5887) under imbalanced conditions. To exploit complementary representational properties, decision-level hybrid strategies were implemented. The performance-weighted hybrid attained an accuracy of 0.8831 and the highest macro-F1 score (0.8528), slightly surpassing the equal-weight hybrid configuration. These observations indicate that architectural heterogeneity contributes to improved inter-class balance without sacrificing overall predictive accuracy. Future work may explore scale-aware representations, feature-level fusion mechanisms, and expanded TEM datasets to further enhance robustness and generalization in virus identification tasks.
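One common way to build a decision-level hybrid is to average the per-class probabilities of the individual models, with weights that reflect each model's validation performance. The sketch below shows that idea only; the weights and probability vectors are hypothetical, and the abstract does not state how the authors derive their weights or combine the outputs.

```python
def fuse_predictions(prob_sets, weights):
    """Weighted average of per-model class probabilities.

    prob_sets: one probability vector per model, all over the same classes.
    weights:   one non-negative weight per model (e.g. a validation score).
    Returns (predicted class index, fused probability vector).
    """
    total = sum(weights)
    n_classes = len(prob_sets[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, prob_sets)) / total
        for c in range(n_classes)
    ]
    return fused.index(max(fused)), fused

# Hypothetical outputs of two models for one image over two classes.
model_a = [0.6, 0.4]
model_b = [0.3, 0.7]

equal_label, equal_probs = fuse_predictions([model_a, model_b], [1.0, 1.0])
weighted_label, weighted_probs = fuse_predictions([model_a, model_b], [3.0, 1.0])
```

With equal weights the second model's stronger vote wins; giving the first model three times the weight flips the decision, which is the kind of difference an equal-weight versus performance-weighted comparison probes.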


[99] 2606.14739

An RRAM-based Hardware Implementation of a Radial Basis Function Neuron for Edge Classifiers

The deployment of modern machine learning (ML) solutions on resource-constrained edge devices highlights implementation challenges. This is especially true for extreme edge applications that include safety-critical components, such as autonomous navigation tasks. This paper demonstrates an artificial neural network (ANN) design leveraging Metal-Oxide Resistive RAM (RRAM)-based Analogue Content Addressable Memory (ACAM) as an efficient hardware substrate for performing metric-based classification and online adaptation on the edge. The proposed design is based on a custom Template piXeL (TXL) cell used for building the ACAM module, where each TXL cell acts as a configurable receptive field neuron. These cells employ a Radial Basis activation function to calculate the distance of an input from the programmed receptive field. The TXL can be organised into dense arrays for calculating the distance of a high-dimensional input against all stored prototypes, effectively performing fast and energy-efficient similarity search. This hardware engine enables on-the-fly learning, where the receptive field parameters can be tuned to track domain shift. Through simulation of the proposed TXL-RBF classifier we can achieve 89.1% accuracy on the MNIST dataset while consuming 185 fJ per cell per operation when operating at 100 MHz.
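In software terms, metric-based classification with radial basis neurons can be sketched as follows: each stored prototype responds most strongly to inputs near its centre, and the label of the strongest response wins. The Gaussian activation, the `width` parameter, and the two-dimensional prototypes are assumptions for illustration; the analogue response of the TXL cell itself is not described in the abstract.

```python
import math

def rbf_activation(x, center, width):
    """Gaussian radial basis response: 1.0 at the centre, decaying with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-squared_distance / (2.0 * width ** 2))

def classify(x, prototypes, width=1.0):
    """Return the label of the prototype with the strongest response.

    prototypes: list of (label, center) pairs.
    """
    best_label, _ = max(prototypes, key=lambda p: rbf_activation(x, p[1], width))
    return best_label

# Hypothetical stored prototypes for two classes.
prototypes = [("A", [0.0, 0.0]), ("B", [4.0, 4.0])]

label_near_a = classify([0.5, -0.2], prototypes)
label_near_b = classify([3.6, 4.1], prototypes)
```

Tracking domain shift in this picture amounts to nudging a prototype's centre or width toward recent inputs, which is the role the abstract assigns to tunable receptive field parameters.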


[100] 2606.14784

LLM-Based Synthetic Ground Truth Generation for Audio-Based Emotion Classification via In-Context Learning

Understanding human states and interaction dynamics is a core goal of human-computer interaction (HCI). As interaction paradigms become more immersive, virtual reality (VR) has emerged as a powerful platform for studying collaborative work. In such settings, evaluating team collaboration states, including team performance and team resilience, requires continuous and reliable inference of latent team-level cognitive and affective states from multi-modal sensor data, such as speech signals. However, generating ground truth labels for these latent states remains challenging due to sensor-induced noise, contextual variability, and sparse expert annotations. Traditional self-reporting approaches provide only static and delayed measurements and are therefore insufficient for capturing dynamic team processes reflected in continuous speech data. In this work, we propose a large language model (LLM)-driven, agentic inference workflow for automated emotion-related synthetic ground truth generation from streaming speech data in multi-user VR environments. Leveraging the generalization capabilities of LLMs, we use In-Context Learning (ICL) with few-shot demonstrations of paired audio-based samples and their corresponding transcriptions. ICL tends to achieve task adaptation comparable to model fine-tuning while circumventing the computational overhead of parameter updates. To construct informative and robust in-context prompts, we adopt a retrieval-based selection strategy that dynamically identifies relevant audio demonstrations based on similarity in the acoustic feature space.


[101] 2606.14788

Unifying Acoustic Features and Text with Multimodal LLMs for Neurodegenerative Screening

Voice-based screening offers a scalable and non-invasive way to assess neurodegenerative diseases such as Alzheimer's disease (AD) and Parkinson's disease (PD), but their staging remains challenging due to the difficulty of integrating heterogeneous data. This paper presents NeurMLLM, an efficient multimodal generative framework for neurodegenerative disease staging. NeurMLLM first encodes the spectrograms and Mel-frequency cepstral coefficients of audio data with vision transformers and projects their representations into the embedding space of a large language model (LLM), where they are concatenated with transcript and demographic instruction tokens as a single unified sequence. The LLM is then instruction-tuned via Low-Rank Adaptation using task prompts to autoregressively predict a constrained label token, enabling a generative classification. By evaluating on the Bridge2AI-Voice dataset for fine-grained staging of AD and PD, we observe that NeurMLLM achieves strong performance, consistently outperforming classical machine learning methods and existing LLM-based approaches. The results show the high potential of multimodal LLMs in neurodegenerative disease staging, improving staging accuracy and supporting accessible deployment.


[102] 2606.14800

Bridging data-driven priors via the score function for posterior sampling -- Comparative review and experimental study

This paper reviews how a diverse set of popular data-driven priors commonly used in Bayesian inverse problems can be unified through their respective score functions. By framing these priors under this common perspective, we show that they can benefit from their straightforward and effective integration into a recently proposed sampling algorithm. The applicability of this common framework is illustrated by considering several data-driven priors, namely regularization-by-denoising, normalizing flow-based priors, score-based generative models, and convex-ridge regularizers. For these four particular priors, the performance of the method is evaluated when conducting image inpainting and single image super-resolution. These results, as well as those obtained when restoring real images acquired in a geological context, demonstrate the efficiency of the method. This unified framework proves versatile enough to handle any posterior distribution defined by a broad class of score function-based priors, beyond the specific cases considered in this paper.


[103] 2606.14820

Spectro-Temporal Interference Confounds Phase Encoding in Spatial Audio Foundation Models

Recent spatial self-supervised audio models achieve high performance on localization tasks, raising questions about their encoding of microsecond interaural phase fine structures. We propose a psychoacoustic benchmark based on the binaural masking level difference (BMLD) to evaluate this. Using an equalization-cancellation baseline and a GCC-PHAT positive control, we evaluate nine frozen audio models spanning binaural SSL, monaural SSL, and neural audio codecs. Four monaural negative controls yield zero BMLD, confirming binaural specificity. Two general-purpose binaural SSL models exhibit minimal phase sensitivity, while dedicated binaural spatial SSL models achieve BMLD comparable to the analytical baseline. Progressive physical ablations show that general-purpose binaural SSL models rely on spectro-temporal interference textures rather than cross-channel phase computation. High detection rates in speech reflect a confounding reliance on broadband envelopes rather than genuine phase encoding.


[104] 2606.14922

An Empirical Study on Learning Latent Representations for Emotional Speech Synthesis

Over the last few years, speech synthesis has improved dramatically thanks to deep learning. A growing number of deep learning-based TTS systems produce voices with high intelligibility and naturalness. Controlling expressiveness, however, remains a major challenge, and generating speech in different styles or manners has recently received considerable attention from the community. This paper presents our solutions to the emotional speech synthesis (ESS) task at VLSP 2022, which requires generating humanlike, natural-sounding voice from a given input text with a desired emotional expression. By integrating a speaker embedding and a prosody bottleneck into FastSpeech 2, our systems show promise in generating emotional speech for a single speaker (Sub-task 1) and in transferring speaking styles from another speaker to a target speaker with only neutral, non-expressive data while retaining the target speaker's identity (Sub-task 2).


[105] 2606.14966

Controller and Control Architecture Co-Design via Mixed-Integer System-Level Synthesis

We study controller and control-architecture co-design for dynamic output-feedback systems. The architecture selects active sensors and actuators, sensor-to-actuator links, and link delays, with costs for hardware activation and communication latency. Direct optimization over controller transfer matrices and discrete links is mixed-integer nonconvex; common alternatives fix the architecture, use regularization, or restrict the controller information pattern to a quadratically invariant (QI) class. We instead optimize finite-horizon output-feedback system-level synthesis (OF-SLS) responses. Binary variables select sensors, actuators, links, and delays, and indicator constraints zero unavailable FIR response blocks before the selected delays. For implementation-local OF-SLS architectures, this gives an exact mixed-integer convex program over a prescribed finite delay menu. A global solve certifies the best architecture-response pair for the chosen delay menu, FIR horizon, admissible architecture set, and scalarization weight. The same encoding gives a QI controller-support reference problem. In a vehicle-platoon benchmark, 99 of 8748 architectures are QI-compatible. At equal architecture cost, the selected non-QI OF-SLS architecture reduces performance loss by a factor of 3.8 relative to the best QI architecture and outperforms regularization-based and canonical information-flow baselines.


[106] 2606.14976

Real-time nonlinear model predictive control framework for event-triggered switching in industrial batch polymerization process

Controlling batch polymerization is challenging because the absence of a steady operating point prevents standard linearization; the dynamics are intrinsically nonlinear; and multi-phase operation induces state-triggered switching. This study systematically combines four established real-time NMPC ingredients (smooth mode blending, advanced-step warm starts, variable scaling, and a capped iteration budget) to attain real-time feasibility without ad hoc switching heuristics. We provide practice-oriented guidance for selecting smoothing gains and locating switching surfaces, and we make explicit the approximations introduced by smoothing such that, with appropriate tuning, the smoothed and original switching logic are numerically indistinguishable at solver-tolerance levels. All results are obtained in closed-loop simulation using an industrial gas-liquid polymerization benchmark with estimator-in-the-loop, compared against PID and conventional NMPC baselines. Results show improved constraint satisfaction and shorter batch duration under bounded computation, while an ablation study quantifies the contribution of each component.


[107] 2606.15024

Resilient Consensus in Agentic AI

Large language model (LLM) agents are increasingly deployed in multi-agent systems where they must coordinate and agree on shared decisions. We ask whether classical resilient consensus theory, developed for deterministic agents, transfers to LLM agents that may behave adversarially. Framing LLM agreement as a Byzantine consensus game, we run controlled experiments on complete and general communication graphs. We find that prompted LLM agents fail to reach agreement that is achievable in principle: consensus can fail even in settings where classical theory guarantees that a convergent algorithm exists, and this failure persists across temperatures and horizons. At the same time, wrapping the agents with classical resilient consensus filters improves agreement. The benefit of filtering depends on how much robustness the underlying topology already provides. Our results suggest that classical resilient consensus theory is a useful lens for the safety of agentic AI.
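For readers unfamiliar with classical resilient consensus filters, the sketch below shows one well-known style of filter, in the spirit of mean-subsequence-reduced (MSR) algorithms: each normal agent discards up to `f` of the most extreme values above and below its own before averaging. The abstract does not say which filters the authors wrap around the LLM agents, so the choice of filter, the four-agent complete graph, and the adversary's constant value are all assumptions.

```python
def msr_update(own, neighbor_values, f):
    """One filtered averaging step for a normal agent.

    Drops up to f of the largest values above `own` and up to f of the
    smallest values below `own`, then averages what remains with `own`.
    """
    higher = sorted(v for v in neighbor_values if v > own)
    lower = sorted(v for v in neighbor_values if v < own)
    equal = [v for v in neighbor_values if v == own]
    kept_higher = higher[:max(len(higher) - f, 0)]  # discard the f largest
    kept_lower = lower[f:]                          # discard the f smallest
    kept = [own] + equal + kept_lower + kept_higher
    return sum(kept) / len(kept)

def simulate(normal_values, adversary_value, f, rounds):
    """Complete graph: every normal agent hears all other agents each round."""
    values = list(normal_values)
    for _ in range(rounds):
        values = [
            msr_update(v, values[:i] + values[i + 1:] + [adversary_value], f)
            for i, v in enumerate(values)
        ]
    return values

# Three normal agents and one adversary that always reports 100.0.
final_values = simulate([0.0, 1.0, 2.0], adversary_value=100.0, f=1, rounds=50)
```

In this toy run the adversary's value is always discarded as an extreme, so the normal agents converge to a common value inside the range of their own initial values.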


[108] 2606.15047

BT-MTD: Bus Traversal-based Moving Target Defense for Smart Grid

Moving Target Defense (MTD) is a proactive security strategy designed to enhance cyber-resilience by dynamically altering system parameters, thereby preventing adversaries from acquiring the critical information needed to execute stealth attacks. In this paper, we consider the case in which the operator modifies the admittance of branches to enable MTD, and focus on the problem of effectively protecting the system with fewer branch admittance modifications and shorter computational time. Specifically, we identify, via theoretical analysis, the ineffectual branches whose admittance modifications do not contribute to the improvement of MTD effectiveness. Building on these insights, we propose the Bus Traversal-based MTD (BT-MTD), which is a bus-oriented algorithm that traverses the buses of the network according to analytically derived guidelines. The performance of the BT-MTD is evaluated and compared with four existing strategies on standard IEEE test systems, demonstrating its robustness and superior performance in effectiveness, efficiency, and computational cost. The code of BT-MTD is available at: this https URL.


[109] 2606.15083

REGRID-QAOA: A Resource-Efficient Graph-Reduced Hybrid QAOA Framework for Physics-Constrained Power System Islanding

Quantum computing has rapidly emerged as a powerful paradigm for tackling computationally demanding problems. In particular, quantum optimization shows strong promise for hard combinatorial problems in power systems, where increasing distributed energy penetration heightens the need for intentional islanding to maintain grid reliability and resilience. However, power system islanding is an NP-hard combinatorial optimization problem that becomes computationally prohibitive for classical solvers as network size grows, motivating the use of quantum computing as a promising alternative pipeline. This study develops a resource-efficient hybrid QAOA islanding framework that brings physics-constrained power-system partitioning into the quantum optimization workflow. The framework combines coherency-informed graph reduction, physics-aware constraint modeling, and structured post-processing to efficiently convert shallow-circuit QAOA samples into high-quality feasible islanding decisions without deep circuits or large shot budgets. The proposed framework is validated on the standard IEEE benchmark systems (9-, 14-, 24-, 30-, 39-, and 57-bus), demonstrating that the hybrid workflow achieves Gurobi-optimal solution quality with a clear quantum resource advantage over vanilla QAOA, while the resulting islanding solutions satisfy all physical feasibility requirements after network separation. This study establishes QAOA-based islanding as a viable quantum approach for critical infrastructure, with structured post-processing as the key enabler of quantum resource efficiency.


[110] 2606.15088

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.


[111] 2606.15149

AUDEDIT: Inversion-Free Text-Guided Editing with Pretrained Audio Flow Models

We introduce AudEdit, an inversion-free method for text-guided editing of real audio with a pretrained rectified-flow audio generator. Text-to-audio systems such as Stable Audio 3 already expose audio-to-audio editing by noising an input recording and denoising it under a new prompt, but this inversion-style route must trade prompt adherence against preservation of rhythm, transients, timbre, and long-range musical structure. Motivated by recent inversion-free flow editing in computer vision, we develop an audio-specific direct source-to-target ordinary differential equation for one-dimensional Stable Audio 3 latents: at each flow step, we compare the target- and source-conditioned velocity fields under a shared stochastic source marginal, and update the edited latent by their difference. The resulting editor requires no training, no paired edit data, no optimization, and no access to internal attention maps. Across sound-effect and music editing sets built from FSD50K and the Song Describer Dataset, AudEdit improves CLAP text alignment and audio preservation over SDEdit, ODE inversion, and FireFlow; for example, on sound effects it raises target-text CLAP similarity from 0.42 to 0.52 over the strongest baseline while reducing FAD from 65.70 to 50.37.


[112] 2606.15181

Revisiting The PBH Test: Fast Uncontrollability Certificates via Krylov Methods

This letter revisits the classical PBH test through the lens of finite-horizon reachability. By casting state transfer as a minimum-energy primal optimization problem, we show that unreachable state-space maneuvers admit dual infeasibility certificates. These certificates are computable without forming the controllability matrix, meaning that uncontrollability can be efficiently certified. We prove that any such certificate is a linear combination of uncontrollable generalized eigenvectors, thereby providing a spectral interpretation without a global eigendecomposition. We also devise algorithms based on Krylov subspace methods that extract some of the uncontrollable PBH modes from a certificate and demonstrate favorable scaling on large dynamic networks with thousands of nodes.
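As background, the classical PBH eigenvector test says that a pair (A, B) is uncontrollable exactly when some nonzero left eigenvector w of A satisfies w^T B = 0. The sketch below only verifies a candidate (w, lambda) pair against that textbook condition; it is not the letter's dual certificate or its Krylov-based extraction, and the diagonal example system is invented for illustration.

```python
def row_times_matrix(w, M):
    """Compute the row vector w^T M for a matrix given as a list of rows."""
    return [sum(w[i] * M[i][j] for i in range(len(w))) for j in range(len(M[0]))]

def is_pbh_certificate(A, B, w, lam, tol=1e-9):
    """Check w != 0, w^T A = lam * w^T, and w^T B = 0 up to a tolerance."""
    if all(abs(x) <= tol for x in w):
        return False
    wA = row_times_matrix(w, A)
    wB = row_times_matrix(w, B)
    is_left_eigenvector = all(abs(a - lam * x) <= tol for a, x in zip(wA, w))
    is_blocked_by_input = all(abs(b) <= tol for b in wB)
    return is_left_eigenvector and is_blocked_by_input

# Hypothetical system: the input only drives the first state,
# so the mode at eigenvalue 2.0 cannot be influenced.
A = [[1.0, 0.0],
     [0.0, 2.0]]
B = [[1.0],
     [0.0]]

uncontrollable_mode = is_pbh_certificate(A, B, w=[0.0, 1.0], lam=2.0)
controllable_mode = is_pbh_certificate(A, B, w=[1.0, 0.0], lam=1.0)
```

Checking a given pair costs only two vector-matrix products, which hints at why certificates can be cheaper to verify than forming and rank-testing the full controllability matrix.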


[113] 2606.15186

FreeSonic: Training-Free Temporal-Aware Decoupled Attention for Precise Audio Editing

Text-to-audio (TTA) generation has made significant strides, yet achieving precise and consistent audio editing remains a major challenge: existing methods struggle to balance temporal consistency with background preservation. In this paper, we propose FreeSonic, a training-free framework leveraging the state-of-the-art Rectified Flow-based TangoFlux model. FreeSonic utilizes an optimized inversion-reverse process and joint text-audio attention maps for precise target segment extraction. For content editing, a novel scheduled attention decoupling confines modifications to target regions while preserving original acoustic context. Furthermore, task-oriented noise injection enhances versatility for tasks such as audio removal and non-rigid replacement. Extensive experimental results demonstrate that FreeSonic achieves a superior balance by providing a high-fidelity and efficient solution for precise and consistent audio editing. Project and demos: this https URL


[114] 2606.15434

A Bilateral Teleoperation Framework for Dexterous Manipulation

Dexterous teleoperation requires precise arm-hand coordination, low-latency feedback, and robust interaction in real-world contact-rich environments. This paper presents a modular bilateral teleoperation framework that integrates operator-side input interfaces with a robot-side dexterous hand and compliant robotic arm in a unified control architecture. The system supports position-based hand retargeting, differential arm control, multi-scale haptic feedback, and shared control for stable manipulation. We validate the framework through a real-world dexterous manipulation task, highlighting coordinated arm-hand control and contact-aware interaction. Beyond feasibility, we identify key design insights related to cross-embodiment mismatch, haptic feedback granularity, and shared control. The proposed platform provides a practical teleoperation system and a foundation for collecting high-quality demonstrations for future learning-from-demonstration research.


[115] 2606.15435

On Type Deception in Linear-Quadratic Differential Games

We consider two-player linear-quadratic differential games of incomplete information, in which one player has a private type initially unknown to the other. The typed player has incentive to conceal their type, while the uninformed player has the potential to infer it during play. Any ex-ante equilibrium in this setting will decompose into a deceptive, pooling phase, and a complete-information, revelatory phase. We demonstrate how to solve both phases via nested Riccati equations. Candidate equilibria are then found by maximizing the game value over a scalar revelation time, for which we provide a gradient in the case of time-homogeneous system matrices. We conclude by demonstrating our framework in a pursuit-evasion game with time-varying control advantages, finding interior optimal revelation times that confirm deception has quantifiable ex-ante value.


[116] 2606.15436

Beyond Classification: A Cough Regression Benchmark for Respiratory Acoustic Foundation Models

Respiratory acoustic foundation models (FMs) excel at cough classification, yet their ability to predict continuous health quantities from cough audio remains largely unexplored, despite the clinical value of passive age, BMI, and disease probability estimation in settings where physical measurements are unavailable. We introduce the multi-model, multi-target cough regression benchmark evaluating five FMs (OPERA-CT, OPERA-CE, OPERA-GT, HeAR, M2D+Resp) across six targets on three datasets under subject-disjoint protocols, comparing linear, MLP-small, and full MLP regression heads. MLP-small beats the mean-predictor baseline on all tasks and linear probing in 23 of 30 model x task cases, with full MLP overfitting on small clinical data but recovering on larger sets, revealing a dataset size x head-capacity trade-off. HeAR leads within-dataset age regression on Coswara (9.12 yr MAE); its CIDRZ result is excluded from headline claims owing to possible HeAR-CIDRZ pretraining overlap. OPERA-GT is favored over OPERA-CT on age in all three datasets, with the CIDRZ margin within seed variance, extending a generative-pretraining advantage from breath to cough. HeAR and M2D+Resp reach near-full performance at N = 50 samples while OPERA models require N = 400. Cross-dataset transfer is strongly asymmetric as large diverse data generalises to small clinical populations (CoughVID to CIDRZ: -0.17 yr) but not vice versa (CIDRZ to Coswara: +2.43 yr, +26.6%).


[117] 2606.15450

Kernel Density Estimation by Spectral Decomposition: Data-Driven Tapering and Superposition

Kernel density estimation depends largely on one choice, the smoothing bandwidth. We treat bandwidth selection and density estimation in the characteristic-function domain, where the cyclic group-averaged covariance of the binned data has the squared empirical characteristic function as its spectrum: the true characteristic function sits over a sampling-noise floor of $1/n$, and the bandwidth is the spectral cutoff where the two meet. Several methods follow. An automatic selector strips the floor and minimizes a frequency-domain error criterion, matching the rule of thumb on smooth densities and approaching the best fixed bandwidth on multimodal ones. An adaptive estimator generalizes the fixed kernel to the per-frequency optimal Wiener taper, matching or surpassing the best fixed bandwidth on most standard densities, including sharply peaked and comb-like cases where fixed bandwidths fail; deconvolution under known measurement error follows in the same domain. Because the Wiener estimator resolves sharp structure but does not fit smooth bases as economically as a mixture, a Gaussian mixture is combined with it two ways, a piecewise partition and a superposition of a smooth base and a band-limited residual, the default. A data-driven floor read from the spectrum replaces the assumed $1/n$ floor and stays robust on heaped and rounded data. On the Marron-Wand benchmark scored by exact integrated squared error, the advantage emerges with sample size, a bias-variance tradeoff: the spectral estimators carry low bias but pay in variance, so cross-validation leads at $n=100$ while the Wiener filter and superposition take the top two ranks at $n=5000$. The methods are validated on six real datasets (CRSP returns, NHANES self-reports, CMS dimuon and SDSS spectra, a random-beacon stream, and UNSW-NB15 traffic) and on a synthetic-data quality check. All experiments are reproducible.


[118] 2606.15500

LLM4RTL: Tool-Assisted LLM for RTL Generation

Large language models (LLMs) have facilitated impressive progress in software engineering, code generation, tooling, and systems. Concurrently, a significant body of research has developed that explores a growing variety of methods and systems for applying LLMs to hardware and chip design (e.g., systems for RTL code generation based on functional description). However, when it comes to open Verilog/RTL code generation, we need high-quality training samples to build specialized and more effective LLM systems through fine-tuning or low-rank adaptation. Here, we propose a "judge-renew-check-renew-check" (JRCRC) pipeline which updates a current public dataset using a hierarchy of state-of-the-art commercial LLMs differing in their costs and capabilities in RTL code generation. This approach achieves a cost-effective mechanism for filtering and refining code-generation samples into a higher-quality training dataset. Our experiments also identify some common weaknesses of LLMs in rule-based reasoning and logic, and consequently, in RTL code generation. Having identified these weaknesses, we develop an architecture for incorporating pre-processing tools to dynamically assist the LLMs in inferring logical relationships from tabular data formats. With our tool-assisted architecture for RTL code generation, we achieve significant overall performance gains in the VerilogEval benchmark and outperform many state-of-the-art methods. Our LLM4RTL system achieves performance comparable to that of GPT-4O using a significantly smaller LLM.


[119] 2606.15540

AP-GRPO: Anchor-Gated Phonetic Alignment with Policy Optimization for Pathological Speech Reconstruction

Pathological speech from patients with neurodegenerative and neuromotor disorders is often acoustically distorted and linguistically fragmented, making pathological speech reconstruction necessary to recover intended textual content from distorted and incomplete speech recordings. Crucially, such recordings are rarely uniformly degraded: some words or short phrases remain reliable and can serve as audible anchors for reconstructing the corrupted surrounding content. We introduce Anchor-gated Phonetic Group Relative Policy Optimization (AP-GRPO), a GRPO framework with phonetic reward that aligns speech language models (SLMs) through audible-anchor preservation and inter-anchor phonetic compatibility to the original speech signal. AP-GRPO consists of: (i) an anchor-gated reward that matches reliable audible anchors in clear regions; and (ii) an inter-anchor phonetic alignment reward that evaluates whether recovered contents are phonetically supported by the corresponding corrupted inter-anchor speech span. Across four disease conditions, AP-GRPO improves faithful speech reconstruction, and the learned anchor constraint automatically adapts to each condition and thus reveals interpretable disease-specific profiles: conditions with severe articulatory degradation require stronger anchor enforcement, whereas milder impairment or linguistically impaired conditions rely more on phonetic alignment for inter-anchor recovery.


[120] 2606.15594

Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.


[121] 2606.15634

Sparse Channel Estimation for SIM-based mmWave Near-Field Communications

In this paper, we address the channel estimation (CE) problem in SIM-based multi-user (MU) millimeter-wave (mmWave) near-field communication systems. To address the severe path loss and blockage in mmWave communication systems, many meta-atoms are typically integrated into each layer of the SIM. Then, the number of radio frequency (RF) chains at the base station (BS) is fewer than that of meta-atoms per layer, resulting in an underdetermined problem. Additionally, the increase in the number of meta-atoms in each layer expands the SIM's near-field region, leading to the user equipment (UEs) being mostly situated in this region, necessitating precise modeling of the channel under the spherical wavefront assumption. To address these issues, we introduce a compressed sensing (CS)-based CE protocol to tackle the underdetermined problem. In contrast to the traditional CS-based estimation framework, we investigate a polar-domain channel representation to tackle the severe energy spread effect of the classical angular-domain channel representation in near-field communication systems. Specifically, we design a novel polar-domain transform matrix for uniform planar arrays (UPAs), thereby transforming the CE problem into a sparse recovery task of the paths' support set and complex gains. To overcome the limitations of the sparse Bayesian learning (SBL) framework in tackling high-dimensional dictionaries, we propose a low-complexity polar-domain SBL (LCPD-SBL) algorithm, which significantly reduces computational complexity without compromising estimation accuracy.


[122] 2606.15744

Comparative Performance Analysis of NIST PQC Standards: From STM32 Software Limitations to FPGA-SoC Acceleration

The rapid advancement of quantum computing poses a significant threat to classical public-key cryptographic systems, necessitating the transition to Post-Quantum Cryptography (PQC). This study investigates the implementation challenges of NIST-standardized signature schemes on resource-constrained embedded hardware. We present a comparative analysis of SPHINCS+ and CRYSTALS-Dilithium on an ARM Cortex-M4 (STM32F407G) microcontroller. Our findings reveal that SPHINCS+ is practically unusable in this software-only environment, with impractical execution times. Furthermore, the reference Dilithium implementation failed to execute entirely on the MCU due to severe RAM and timing constraints. To overcome these hardware limitations, we integrated a hardware-accelerated Dilithium core onto a Xilinx Zynq-7000 ZedBoard SoC. By implementing a specialized Number Theoretic Transform (NTT) accelerator in the FPGA fabric, we achieved successful execution with performance rates for key generation and signature generation at millisecond levels. These results demonstrate that while pure software PQC is non-viable for standard microcontrollers, a hardware-software codesign approach provides the necessary efficiency for quantum-resistant embedded systems.


[123] 2606.15749

OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning

Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view--BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human--model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.


[124] 2606.15751

Acoustic Prompting via Stage-wise Modulation for Few-Shot Learning in Audio Language Models

Audio-Language Models (ALMs) have shown remarkable success in zero-shot audio classification by aligning audio waveforms with text. Recent efforts to improve downstream performance focus on learning optimal text prompts. However, previous approaches focus on the text encoder, leaving the potential of learnable prompts within the audio encoder unexplored. In this paper, we propose a novel framework that introduces trainable prompts into the audio encoder to capture task-specific acoustic features. We demonstrate that integrating audio-side prompt learning with existing text-side approaches enhances few-shot adaptation. Extensive experiments across 11 datasets show that integrating our method as a plug-and-play module alongside existing text prompt tuning generally leads to performance improvements. These findings suggest that explicitly modulating the audio representation space effectively complements text-only prompting approaches. The code is available at this https URL.


[125] 2606.15831

An Integrated System for Real-Time Student Assessment and Career Guidance Using Neural Networks in Computing Disciplines

Many undergraduate students in Computer Science (CS) and Software Engineering (SWE) struggle to identify suitable career paths, particularly when their academic performance, abilities, and interests do not fully align. To address this issue, this study proposes an AI-driven Student Assessment and Career Prediction System that integrates a Career Guidance Expert (CGE) system with a Web-Based Student Assessment (WBSA) platform. Within the integrated framework, CGE enhances personalized career recommendations using AI while also assisting students after graduation in identifying suitable jobs, research domains, and higher study opportunities aligned with their skills and interests. The WBSA platform further strengthens interaction between students and faculty through assessments, personalized tasks, mentorship activities, and a secure real-time chat application. The CGE system employs a Multilayer Perceptron (MLP) model trained on real-world academic and extracurricular data collected from university students using the snowball sampling method, achieving a validation accuracy of 94.71% in predicting personalized career paths. A pre-survey was conducted across universities to evaluate the proposed model before deployment. The WBSA system was developed as a modern web application using technologies such as this http URL, this http URL, and PostgreSQL to ensure scalability, responsiveness, and secure data management. Supported by a secure cloud-based infrastructure, the platform provides reliable performance while assisting graduates in selecting a suitable career path in the IT sector. In addition, a post-survey involving both students and faculty was conducted to gather feedback and further improve the overall effectiveness and usability of the system.


[126] 2606.15834

AIChilles: Automatically Uncovering Hidden Weaknesses in AI-Evolved Systems

The computer systems community has recently seen growing interest in AI-driven system evolution, where AI agents iteratively rewrite systems. Frameworks such as AdaEvolve and Engram report 12-60% score improvements over human-designed algorithms. While these results are promising, there are practical concerns that these AI-evolved programs may perform worse on unseen workloads and exhibit scalability regressions. Given the speed and scale of AI-generated code, we need automated mechanisms to uncover such hidden weaknesses in AI-evolved systems programs. To this end, we develop AIChilles, which takes as input a baseline program $P$ and an AI-evolved program $P'$ and searches for valid workloads where $P'$ regresses relative to $P$ in correctness, runtime, memory usage, or output quality. To tackle the diversity in system applications, weakness types and potential bugs, AIChilles combines deterministic workload-parameter extraction, agent-based constraint inference, differential oracles, and code-frequency coverage to discover diverse failures. Across five system applications and 30 AI-evolved programs, AIChilles finds 49 distinct hidden weaknesses. We also show that explicitly including AIChilles in the AI-driven development lifecycle can mitigate several of these weaknesses.


[127] 2606.15888

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.


[128] 2606.15952

SINR-Aware Base Station Deployment in Wide Area IoT Sensor Networks

The rapid expansion of Internet of Things (IoT) applications necessitates the effective deployment of base stations (BSs) to enable consistent connectivity across large geographic areas under interference-limited conditions. Existing techniques typically use distance-based or binary coverage models; however, these abstractions fail to account for the influence of co-channel interference on the quality of communication in dense deployments. In this paper, we investigate the Signal-to-Interference-plus-Noise Ratio (SINR)-aware Base Station Deployment (BSD) problem in wide-area IoT sensor networks. The objective is to determine a minimum-cost subset of BSs from a predefined set of candidate BSs such that every IoT sensor is covered by at least one BS and a target SINR threshold is satisfied. The problem is formulated as a combinatorial optimization problem, which is NP-hard. Theoretical analysis establishes that the proposed coverage function is monotone and submodular, enabling the SINR-aware greedy algorithm to achieve a (1-1/e)-approximation to the optimal solution while maintaining a polynomial-time computational complexity. Numerical evaluations on a real water distribution network dataset demonstrate that the proposed SINR-aware greedy algorithm achieves near-optimal base station deployment while significantly reducing computational effort. Compared with the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO) algorithms, the proposed approach attains complete sensor coverage with deployment costs within 12.3% of the best-performing metaheuristic solution while requiring up to 190 times lower execution time.
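The selection step described above can be illustrated with a generic cost-aware greedy rule. This is a minimal sketch, not the authors' algorithm: the names `sinr_coverage` and `greedy_deploy`, the `gain` matrix of received powers, and the worst-case assumption that every other candidate BS interferes are all hypothetical choices made for illustration, and no approximation guarantee is claimed for this particular variant.

```python
def sinr_coverage(gain, noise, threshold):
    """Return, per candidate BS, the set of sensors whose SINR meets the threshold.

    gain[b][s] is the received power at sensor s from candidate BS b (linear units).
    Assumption: interference is computed as if all other candidates transmit.
    """
    n_bs, n_sensors = len(gain), len(gain[0])
    cover = []
    for b in range(n_bs):
        covered = set()
        for s in range(n_sensors):
            interference = sum(gain[o][s] for o in range(n_bs) if o != b)
            if gain[b][s] / (interference + noise) >= threshold:
                covered.add(s)
        cover.append(covered)
    return cover


def greedy_deploy(cover, cost, n_sensors):
    """Repeatedly pick the BS with the most newly covered sensors per unit cost."""
    uncovered = set(range(n_sensors))
    chosen = []
    while uncovered:
        candidates = [b for b in range(len(cover))
                      if b not in chosen and cover[b] & uncovered]
        if not candidates:
            raise ValueError("remaining sensors cannot be covered by any candidate BS")
        best = max(candidates, key=lambda b: len(cover[b] & uncovered) / cost[b])
        chosen.append(best)
        uncovered -= cover[best]
    return chosen
```

Precomputing the coverage sets under a fixed interference assumption keeps the marginal-gain computation simple; a deployment-dependent interference model would need the coverage sets to be re-evaluated as BSs are added.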


[129] 2606.15997

Friction Characterization of a Cable-Driven Differential Actuation System for Lower-Limb Exoskeletons

Lower-limb exoskeletons require actuation systems that can provide accurate joint torque control while preserving low mass and encumbrance. Conventional architectures often rely on independently actuated joints and joint-level torque sensors, increasing system complexity and weight. This paper presents a novel differential actuation architecture for hip-knee flexion/extension, enabling cooperative torque sharing between two motors via a linear differential mapping between motor and joint. To compensate for transmission losses, a model-based friction estimation strategy is developed and experimentally implemented, allowing accurate joint torque estimation without the need for torque sensors. The proposed solution is validated on a physical prototype, demonstrating the feasibility of sensorless torque estimation in a differentially actuated hip-knee module of a lower-limb exoskeleton.


[130] 2606.16057

A Smart-Scheduled Hybrid (SSH) EKF-FGO State Estimation

Reliable state estimation in robotics and control requires balancing estimation accuracy against computational cost. While filtering-based methods such as the Extended Kalman Filter (EKF) provide efficient real-time updates, and optimisation-based formulations using factor graphs improve global consistency, the role of optimisation scheduling is often treated implicitly rather than examined as an explicit design variable. This paper presents an experimental study that explicitly isolates optimisation scheduling using a Smart Scheduled Hybrid (SSH) EKF-FGO framework as a controlled testbed. By combining EKF-based state propagation with periodically invoked batch optimisation and holding solver structure and effort fixed, the main contribution of this work is the experimental characterisation of optimisation scheduling as an independent design variable governing the trade-off between intermediate estimation accuracy and computational cost. Simulation results in a planar SLAM environment show that scheduling strongly influences pre-optimisation drift, transient error behaviour, and runtime. In particular, the results identify operating regimes in which most of the benefit of global optimisation can be retained at a fraction of the computational cost, highlighting optimisation scheduling as an under-explored yet critical consideration in hybrid state estimation systems.


[131] 2606.16305

Extended Kalman Filter-Based State Estimation for a Nine-Compartment Nonlinear Epidemic Model -- Convergence Analysis and In-Silico Benchmark Calibrated on the COVID-19 Third Wave in Italy

This paper addresses real-time state estimation for a nine-compartment nonlinear COVID-19 epidemic model with two co-circulating strains, a super-spreader subpopulation, vaccination with waning immunity, hospitalization, and mortality. Time-varying transmission and vaccination rates are known inputs from a companion calibration, leaving the reconstruction of all nine states from three routinely reported observables: hospitalizations H, fatalities F, and vaccinated stock V. The contributions are theoretical rather than in the filter recursion. First, a Lie-derivative observability analysis yields, via a six-step derivation, the closed-form determinant |det(O9)| = delta_w * gamma_a^2 * kappa * rho2 * w1^2 * (delta_i - delta_p)^2 * |r1 - r2|, showing the level-2 codistribution is rank-deficient at the calibrated symmetric parameters (delta_i = delta_p, r1 = r2); the third Lie derivative restores full rank 9, with r2 the symmetry-breaking parameter. Second, an EKF is designed on the Euler-discretized dynamics with a closed-form 9x9 Jacobian and Joseph covariance update. Third, local exponential mean-square boundedness of the error is proved as a full theorem via the Reif-Gunther-Yaz-Unbehauen hypotheses, exploiting the bilinear drift and linear output to obtain a global-radius quadratic remainder bound that extends to bilinear-drift, linear-output systems. Fourth, the noise covariances are designed from calibration residuals and assessed by NEES and innovation-whiteness tests. All experiments use synthetic measurements from the calibrated model, so reported RMSE values (0.07%-2.72%) are methodology benchmarks, not predictive accuracy. A parameter-mismatch study shows measured and directly-coupled channels stay accurate under model error up to +/-30% while indirectly observed states degrade gracefully. The framework provides the state-feedback basis for future Model Predictive Control.
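The Joseph covariance update mentioned above can be sketched as follows. This is a minimal illustration of the standard Joseph form with a linear output, not the paper's code: the function name `ekf_update_joseph` and the argument names are assumptions, and the nine-state, three-output shapes are only implied by the abstract.

```python
import numpy as np


def ekf_update_joseph(x_pred, P_pred, y, H, R):
    """EKF measurement update with a linear output y = H x + v, v ~ N(0, R).

    Uses the Joseph form P = (I - K H) P (I - K H)^T + K R K^T, which keeps the
    covariance symmetric and positive semidefinite under rounding errors.
    """
    innovation = y - H @ x_pred
    S = H @ P_pred @ H.T + R                    # innovation covariance
    K = np.linalg.solve(S, H @ P_pred).T        # K = P H^T S^{-1} (S, P symmetric)
    x_new = x_pred + K @ innovation
    I_KH = np.eye(x_pred.shape[0]) - K @ H
    P_new = I_KH @ P_pred @ I_KH.T + K @ R @ K.T
    return x_new, P_new
```

For the optimal gain the Joseph form agrees with the shorter expression (I - K H) P; it is usually preferred because it stays well behaved when the gain is perturbed numerically.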


[132] 2606.16327

ArtBoost: Synthetic Articulatory Data Augmentation for Acoustic-to-Articulatory Inversion

Recent acoustic-to-articulatory inversion (AAI) models rely on electromagnetic articulography (EMA) data, which are costly and limited in scale. To address this limitation, we propose ArtBoost, a novel data augmentation strategy that leverages large-scale speech--mesh datasets originally developed for speech-driven 3D facial animation to improve AAI under limited EMA supervision. ArtBoost extracts pseudo articulatory trajectories from visible facial anchors and uses them for pre-training before fine-tuning on real EMA data. Experiments show consistent improvements in PCC and RMSE. Trajectory analyses confirm that the pseudo articulatory signals reflect physically meaningful visible articulatory dynamics. Additional evaluations across different AAI architectures demonstrate stable performance gains, indicating that ArtBoost can be integrated into diverse AAI models. These results suggest that speech--mesh data provide an effective and scalable source of articulatory supervision for AAI. Project page: this https URL


[133] 2606.16345

Output-Feedback Boundary Control of Reaction--Diffusion PDEs on Arbitrary Lipschitz Domains: A Target-Domain Approach

We present a domain-extension framework for output-feedback boundary stabilization of reaction-diffusion equations on arbitrary bounded Lipschitz domains, including non-convex and multiply connected geometries. The plant is posed on an irregular domain whose boundary has actuated and uncontrolled portions. Just as backstepping transforms the plant dynamics into a stable target system, the method embeds the plant in a target domain, such as a ball or a rectangle, where a stabilizing design is already known. Every boundary portion through which the extension proceeds must carry actuation and the complementary collocated measurement. Uncontrolled portions are allowed when they are shared with the target boundary and have the same boundary-condition type. The gap between the two domains is filled with a virtual copy of the plant dynamics, coupled to the plant through interface conditions, and the concatenated state evolves exactly as the known closed loop on the target domain. Well-posedness and exponential stability of the physical state follow by restriction. The offline design data are inherited from the target design and are closed-form for constant-coefficient plants on balls and rectangles. Online simulation of the virtual PDE has the same computational character as a full-order PDE observer, a standard component of output-feedback designs. A new explicit Neumann-actuated backstepping law on n-balls enlarges the available target designs. Output feedback is obtained by lifting the target-domain observer, driven by a collocated interface measurement relayed through the virtual domain. Numerical experiments on star-shaped, horseshoe, and multiply connected domains, with a partitioned plant/controller implementation and a shared-wall cavity, test the designs.


[134] 2606.16396

SP$^3$: Spherical Priors for Plug-and-Play Restoration

In this paper, we introduce SP$^3$, a novel Plug-and-Play algorithm that accelerates maximum a posteriori image restoration by replacing denoisers with Spherical Encoders (SE) as generative priors. SP$^3$ approximates the intractable proximal prior step by utilizing the SE's tightly structured latent space as a robust projection onto the natural image manifold. Alternating this projection with a closed-form data-consistency step, via Half-Quadratic Splitting, achieves stable convergence without requiring gradient computation during inference. This unique formulation unlocks "anytime" restoration capabilities, producing sharp, plausible images from the first iteration. Evaluations across a variety of image restoration tasks demonstrate that SP$^3$ achieves perceptual quality comparable to state-of-the-art zero-shot diffusion and flow methods while being $3$-$630\times$ faster.


[135] 2606.16412

An Asymmetric Formula for Interval Consonance and its Relation to Harmonic Coincidence

Euler's Gradus Suavitatis (1739) assigns a dissonance value to a musical interval p/q by the formula G(p/q) = 1 + \Omega^(p) + \Omega^(q), where \Omega^(n) = \sum_i e_i(p_i - 1) sums the weighted prime exponents of n. We propose the simpler asymmetric formula f(p/q) = p + \Omega^(q), which treats numerator and denominator differently and performs comparably on standard consonance data. We also show that, under a model in which harmonics are integer-indexed and counted uniformly up to a fixed truncation level, Gradus is equivalent to a weighted harmonic coincidence count with weights w(n) = \Omega^(n), connecting it to Galileo's earlier pulse-coincidence model (1638). The formula naturally generates a coprime integer triangle T(n,k) = n + \Omega^(k), whose rightmost diagonal gives the two-stage dissonance of the superparticular (consecutive-harmonic) intervals. The formula f admits a simple two-stage interpretation in terms of harmonic context and partial recognition, which we offer as a speculative perceptual hypothesis.
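The two formulas quoted above are easy to evaluate by hand or in code. The sketch below follows the definitions as stated in the abstract; the function names are hypothetical, and reducing p/q to lowest terms before evaluation is an assumption made here so that equivalent fractions give the same value.

```python
from math import gcd


def weighted_prime_exponents(n: int) -> int:
    """Return sum_i e_i * (p_i - 1) for n = prod_i p_i ** e_i (0 for n = 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    total = 0
    d = 2
    while d * d <= n:
        while n % d == 0:       # each factor d contributes (d - 1) once per exponent
            total += d - 1
            n //= d
        d += 1
    if n > 1:                   # remaining prime factor larger than sqrt(original n)
        total += n - 1
    return total


def gradus(p: int, q: int) -> int:
    """Euler's Gradus for the interval p/q: 1 + Omega(p) + Omega(q)."""
    g = gcd(p, q)
    return 1 + weighted_prime_exponents(p // g) + weighted_prime_exponents(q // g)


def asymmetric(p: int, q: int) -> int:
    """The asymmetric formula from the abstract: f(p/q) = p + Omega(q)."""
    g = gcd(p, q)
    return p // g + weighted_prime_exponents(q // g)
```

For the perfect fifth 3/2, for example, both formulas give 4: the Gradus is 1 + 2 + 1 and the asymmetric value is 3 + 1.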


[136] 2606.16417

Joycent: Diffusion-based Accent TTS without Accented Phone Prediction

Accent text-to-speech (TTS) aims to synthesize speech with target accents. Existing accent TTS systems typically rely on a two-stage pipeline that first converts standard phone sequences into accented phone sequences and then synthesizes accented speech. However, such approaches suffer from error accumulation and require paired standard-accented phone sequence data, which is often limited in practice. Moreover, text-based accented phone representations are insufficient to model acoustic accent characteristics such as prosody and rhythm. In this work, we propose Joycent, a diffusion-based accent TTS model that synthesizes accented speech directly from standard phone sequences and speech references without accented phone prediction. Joycent integrates accent and speaker representations through conditional layer normalization (CLN) in the text encoder. We introduce WhisAID, a Mandarin accent identification model trained on accented Mandarin speech to extract accent representations. Experimental results show that Joycent improves accentedness while preserving speaker identity compared with baseline systems. We release our code and demos at: this https URL.


[137] 2606.16466

Information aging in massive MIMO systems affected by phase noise

In massive MIMO systems, phase noise can spoil the performance of the usual receiver techniques. The problem arises because of the aging of phase-noise information based on pilots. In this paper, in a realistic 5G uplink scenario, we quantify the impact of information aging and we propose an iterative receiver based on expectation-maximization (EM). Simulation results show that the iterative receiver is robust to information aging related to phase noise.


[138] 2606.16480

HOLO-MPPI: Multi-Scenario Motion Planning via Hierarchical Policy Optimization

Robots deployed in the real world must plan motions across diverse scenarios without per-scenario retuning. End-to-end reinforcement learning (RL) can generalize across scenarios but often becomes brittle under distribution shift, reward misspecification, and stochastic interactions. Model predictive path integral (MPPI) control enables strong real-time refinement without gradients, but its performance depends on a well-shaped sampling prior, while manually designing the priors does not scale to multi-scenario deployment. We present HOLO-MPPI (High-level Offline, Low-level Online MPPI), a multi-scenario motion planning framework that combines high-level policy learning with low-level stochastic optimal control. Offline, we learn a high-level policy that proposes scenario-robust plans in an abstract action space, with a learned world model for online rollout. Online, the policy serves as a data-driven prior generator that parameterizes MPPI's sampling distribution conditioned on the current observation and goal. MPPI then optimizes low-level control sequences around this prior in real time to adapt to local disturbances. We instantiate HOLO-MPPI in autonomous driving by designing an effective high-level action space and tailored model architectures. Our evaluation across diverse driving scenarios shows that HOLO-MPPI improves upon MPPI and end-to-end RL baselines while maintaining real-time control.
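
The low-level step described above, refining a control sequence around a prior, can be illustrated with the standard MPPI importance-weighted update. This is a hedged sketch of generic MPPI only: the function name, the temperature parameter, and the idea that the prior comes from a learned policy are assumptions for illustration, and nothing here reproduces the HOLO-MPPI architecture.

```python
import math


def mppi_update(prior, perturbations, costs, temperature=1.0):
    """Return the prior control sequence shifted by cost-weighted perturbations.

    prior:          list of nominal controls (e.g. proposed by a policy)
    perturbations:  list of sampled noise sequences, each as long as prior
    costs:          rollout cost of (prior + perturbation) for each sample
    temperature:    softmax temperature; lower values favour the best sample
    """
    best = min(costs)  # subtract the minimum for numerical stability
    weights = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t, u in enumerate(prior)
    ]


if __name__ == "__main__":
    prior = [1.0, 1.0]
    noise = [[0.5, -0.5], [-0.5, 0.5]]
    # The first sample is much cheaper, so the update moves towards it.
    print(mppi_update(prior, noise, costs=[0.0, 5.0]))
```

No gradients of the cost are needed: only rollout costs of sampled sequences enter the update, which is why the quality of the sampling prior matters.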


[139] 2606.16485

Robust Koopman MPC with Sets Updates for Time Delayed Systems

Koopman operators have shown significant potential in designing linear model predictive control (MPC) schemes for nonlinear systems on a lifted observable space. Recent advances have tackled the robust Koopman MPC design issue in the presence of modeling errors, relying on the prior estimation of the modeling uncertainty set. However, deriving a robust positively invariant set using a precalculated uncertainty set can be conservative because the uncertainty set bound is time-varying and dependent on the state and control. Additionally, no existing Koopman MPC design has addressed the closed-loop robustness challenge for nonlinear time delayed systems. This article therefore presents a robust adaptive Koopman MPC approach with online updates of uncertainty sets for a class of nonlinear time delayed systems. The unknown nonlinear time delayed system is first modeled in a data-driven manner to derive a lifted time delayed Koopman model in the feature space. By analyzing fundamental properties such as controllability and observability, a robust tube-based MPC algorithm is designed for the time delayed Koopman model. The robust adaptive Koopman MPC algorithm with online updates of the uncertainty sets is then presented to reduce conservatism. Closed-loop robustness under exogenous disturbances and asymptotic convergence in the nominal scenario are proven. Finally, numerical examples verify the effectiveness of the proposed approach.


[140] 2606.16558

ROSA-RL: Uncertainty-Aware Roundabout Optimized Speed Advisory with Reinforcement Learning

Roundabouts challenge automated driving in mixed traffic, as heterogeneous and non-deterministic human behavior, unknown driving intentions, and high interaction complexity create uncertainty about whether the conflict zone will be blocked or available at the moment of entry. We present ROSA-RL -- uncertainty-aware Roundabout Optimized Speed Advisory with Reinforcement Learning. It enables safe and efficient roundabout entry for automated and human-driven vehicles in mixed traffic through probabilistic conflict forecasting. A Transformer-based model predicts conflict zone occupancy over a five-second horizon, capturing multi-agent interactions to anticipate upcoming conflicts and available gaps. The prediction outputs encode uncertainty in future motion and intent, and augment the state of a classical RL framework, enabling uncertainty-aware speed coordination. Evaluated in simulations grounded in real-world data, ROSA-RL can effectively handle uncertainty and outperform a comparable model-based baseline, closing the gap to an ideal setting assuming fully known occupancy while improving traffic efficiency and safety. The source code of this work is available under: this http URL.


[141] 2606.16567

TNODEV: Toolbox for Neural ODE Verification

Neural ordinary differential equations (neural ODE) have started to appear in safety-critical settings such as continuous-time controllers for cyber-physical systems and classifiers integrated into automated decision pipelines, raising the question of whether their behavior can be formally verified. Existing tools dedicated to neural ODE provide only a single reachability call without iterative input set refinement, limiting the precision of their verdicts to whatever one reachability call can deliver. We present TNODEV, the first sound formal verifier for neural ODE that integrates a falsification checker, a fast interval-based reachability backend based on continuous-time mixed monotonicity, a verification and refinement loop with three input-set splitting heuristics, and a parallel scheduler in a single end-to-end pipeline. TNODEV supports safe-set inclusion verification on pure neural ODE, neural ODE in closed loop with a neural network controller, and general neural ODE (GNODE), with the safe set specified either as an interval or as the half-space intersection induced by a target classification label. We evaluate TNODEV on a range of benchmarks across safe-set inclusion and classification-robustness properties, including a direct reachability comparison against NNV 2.0 and CORA and a verification comparison against NNV 2.0 on MNIST general neural ODE classifiers.


[142] 2606.16644

Enhancing Secret Key Generation for UAV Communications via Codeword Reconstruction

With the rapid advancement of unmanned aerial vehicles (UAVs), ensuring the security of communication links among UAVs has become crucial. In this paper, we propose a novel physical layer key generation scheme based on channel codeword reconstruction. In UAV communications, the high mobility of aerial nodes leads to short channel coherence time, which together with noise causes inevitable channel estimation errors. These errors significantly degrade the performance of wireless channel-based key generation. Therefore, we propose a codeword construction algorithm that achieves a polarization characteristic, which effectively segregates reliable keys from unreliable ones. Compared to the existing quantization-based key generation scheme, our approach maximizes the utilization of raw channel information and employs soft-decision decoding to generate keys. Simulation results demonstrate that the proposed scheme reduces the key disagreement rate for legitimate users and increases the number of consistently generated keys. Furthermore, our method ensures a lower key consistency rate for the eavesdropper, which guarantees system security.


[143] 2606.16951

Simulation-Based Multi-Fillet Evaluation of Woody Breast Poultry Fillets

Woody breast (WB) is a myopathy in modern broiler chickens that causes the breast muscle to become unusually stiff and fibrous, leading to decreased meat quality and significant economic losses. State-of-the-art automated WB detection relies on a side-view imaging system to analyze the bending behavior of a single fillet as it falls off a conveyor belt. While highly accurate, this approach is constrained by its single-fillet field of view, creating throughput bottlenecks on commercial processing lines. In this paper, we address this limitation via a novel multi-fillet detection architecture utilizing a top-down camera configuration. To validate our approach, we first develop a high-fidelity digital twin of an industrial conveyor system. Next, we synthesize a diverse dataset of 3D fillet meshes and model their viscoelastic bending dynamics using a physics-based simulation engine. Lastly, a continuous 2D shape deformation score is extracted from the top-down perspective as the simulated fillets traverse the roller precipice. Experimental results demonstrate that the top-down shape score effectively captures the contour changes of the fillets as they bend, providing a robust and scalable alternative to a side-view imaging system for simultaneous multi-fillet WB evaluation.


[144] 2606.16969

Probing Low Frame Rate Degradation in Neural Audio Codecs

Low frame rates in neural audio codecs are attractive for autoregressive speech synthesis, where the generation cost scales linearly with the sequence length. Recent work has demonstrated that codecs can operate at 12.5 Hz and below, but the mechanisms underlying low frame rate degradation remain insufficiently understood. We investigate these mechanisms through a controlled frame rate ablation. We reproduce a quality cliff at 6.25 Hz reported in previous works and evaluate candidate explanations: phonemic collisions and codebook saturation, neither of which shows evidence of a fundamental barrier. The cliff is instead caused by suboptimal training configuration: fixed clip duration during training yields too few tokens at low frame rates, starving the decoder of inter-token context. Once corrected, WER degrades smoothly with phonemic load down to 3.1 Hz and 1.6 Hz, suggesting the inference-time efficiency gains of low frame rate codecs are more accessible than previously assumed.
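
The token-starvation argument is simple arithmetic: tokens per clip equal clip duration times frame rate. The sketch below works this through with a hypothetical 4-second training clip. The abstract does not state the clip duration actually used, so that value and the helper names are assumptions.

```python
def tokens_per_clip(clip_seconds: float, frame_rate_hz: float) -> float:
    """Number of codec tokens produced for one training clip."""
    return clip_seconds * frame_rate_hz


def clip_seconds_for(target_tokens: float, frame_rate_hz: float) -> float:
    """Clip duration needed to keep the token count at target_tokens."""
    return target_tokens / frame_rate_hz


if __name__ == "__main__":
    CLIP_SECONDS = 4.0  # hypothetical fixed clip duration, not from the paper
    for rate in (12.5, 6.25, 3.1, 1.6):  # frame rates named in the abstract
        n = tokens_per_clip(CLIP_SECONDS, rate)
        print(f"{rate:>5} Hz -> {n:.1f} tokens per {CLIP_SECONDS:.0f} s clip")
    # Doubling the clip length restores the 12.5 Hz token count at 6.25 Hz.
    print(clip_seconds_for(50.0, 6.25), "s needed for 50 tokens at 6.25 Hz")
```

Under this assumption, halving the frame rate halves the context the decoder sees per clip unless the clip duration is scaled up in proportion.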


[145] 2606.16972

When Should a Robot Replan? Regret-Guided Update Scheduling in Time-Varying MDPs

Robots operating in non-stationary environments must continually adapt their policies as the dynamics drift, but onboard energy and compute budgets cap how often a full state estimation and re-planning step can be performed. This raises a question: when, along a horizon, should a robot spend its limited budget? We formulate this problem in time-varying Markov decision processes (TVMDPs) with a known bound on the rate of transition drift. We model execution as a skip-update scheme in which, at chosen update times, the agent estimates the transition kernel by maximum likelihood and computes a finite-horizon policy, and between updates reuses this policy under a propagated state estimate. We analyze the dynamic regret of this scheme and show how it grows during skip intervals in terms of the properties of the TVMDP and the skip lengths; the resulting bound answers the opening question via an online, regret-guided update rule that allocates the budget adaptively. We evaluate the rule in a simulated Mars-rover navigation task with time-varying slip dynamics and on a Crazyflie quadrotor in indoor obstacle fields. Adaptive allocation outperforms other budgeted baselines.


[146] 2606.16978

Task-Error Residual Learning for Real-Robot Five-Ball Juggling

For residual learning that refines existing behavior, sample efficiency depends on two things: how much information each rollout returns, and how efficiently the learner uses that information. Reinforcement learning's standard scalar reward carries far less information than the directional task error that defines the task. Random exploration further discards whatever information each rollout returns. Through residual learning with directional task-error supervision and a task error model that drives sample selection, we achieve stable three-, four-, and five-ball juggling on anthropomorphic Barrett WAM arms. Despite planning and controlling through a simple, idealized stack, the system converges from the second attempt. The first attempt drops, after which task error decreases monotonically without further failures. In comparison, five-ball juggling typically takes humans years of practice. We compare residual learners across two ternary axes, the directional information in the learning feedback and the commitment of the analytic prior, spanning Newton-style Jacobian updates, Composite Bayesian Optimization, and stochastic search methods. Both axes prove necessary: neither directional feedback nor an informative prior suffices alone, and the simplest method that combines them, a fixed-Jacobian Newton update, is the most reliable. The learned residual tolerates substantial prior misalignment and degraded joint tracking, affecting mainly convergence speed. The bottleneck for residual learning on real robots is therefore the information content of the supervision signal and how the learner uses it, not the accuracy of the surrounding stack. Video documentation of all experiments is available at this https URL.


[147] 2606.16985

Dynestyx: A Probabilistic Programming Library for Dynamical Systems

State-space models (SSMs) are the standard formalism for Bayesian treatment of dynamical systems, with natural applications in statistics, signal processing, and machine learning. Despite their importance in both theory and application, dynamical systems have proven difficult to incorporate in modern probabilistic programming languages (PPLs), making state-of-the-art methods less accessible to practitioners and introducing friction in following the "Bayesian workflow." We introduce dynestyx, a probabilistic programming library with first-class support for SSMs, including state-of-the-art methods in the estimation of both states and parameters. Through a single, unified interface, users may specify arbitrary priors for discrete-time or continuous-time dynamical systems, perform inference over mixed-effect data, and make state and parameter estimates with principled uncertainty quantification.


[148] 2606.17006

TuneJury: An Open Metric for Improving Music Generation Preference Alignment

We introduce TuneJury, an open, instance-level pairwise reward model for text-to-music that predicts a music preference score from a text prompt and an audio clip. The released checkpoint is trained on publicly available human-preference labels covering arena-style (A vs. B) votes, metric-alignment preference pairs, crowdsourced pairwise comparisons, and expert aesthetic ratings. The predicted score margin between two clips is well calibrated on our held-out test split, supporting data filtering via a simple score threshold. TuneJury generalizes to both held-out test pairs and out-of-distribution benchmarks, remaining competitive with prior baselines on the latter. For generators released after training, we introduce anchor calibration, a post-hoc, per-system Bradley-Terry calibration that recovers agreement at substantially better data efficiency than from-scratch retraining. The same frozen reward drives consistent reward-axis gains across three downstream applications: inference-time best-of-N selection, DITTO-style latent optimization, and expert-iteration post-training. TuneJury is available at this https URL.
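
For readers unfamiliar with Bradley-Terry models, the sketch below shows how a score margin maps to a preference probability and how a margin threshold could filter pairs. It is a generic illustration, not TuneJury's code or API: the function names are hypothetical, and both the logistic link on raw scores and thresholding on the absolute margin are assumptions.

```python
import math


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that clip A is preferred over clip B."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))


def filter_pairs(pairs, min_margin: float):
    """Keep only (score_a, score_b) pairs whose margin is at least min_margin."""
    return [(a, b) for a, b in pairs if abs(a - b) >= min_margin]


if __name__ == "__main__":
    scored_pairs = [(0.9, 0.1), (0.5, 0.45)]  # made-up scores for illustration
    for a, b in scored_pairs:
        print(f"margin {a - b:+.2f} -> P(A preferred) = "
              f"{preference_probability(a, b):.3f}")
    print(filter_pairs(scored_pairs, min_margin=0.2))
```

Pairs with a small margin sit close to a 50/50 preference, so discarding them is one plausible way to use a calibrated margin for data filtering.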


[149] 2310.08679

Data-driven invariant set for nonlinear systems with application to command governors

This paper presents a novel approach to synthesize positive invariant sets for unmodeled nonlinear systems using direct data-driven techniques. The data-driven invariant sets are used to design a data-driven command governor that selects a command for the closed-loop system to enforce constraints. Using basis functions, we solve a semi-definite program to learn a sum-of-squares Lyapunov-like function whose unity level-set is a constraint admissible positive invariant set, which determines the constraint admissible states and input commands. Leveraging Lipschitz properties of the system, we prove that tightening the model-based design ensures robustness of the invariant set to the inherent plant uncertainty in a data-driven framework. To mitigate the curse of dimensionality, we reformulate the semi-definite program as a linear program. We validate our approach through two examples: First, we present an illustrative example where we can analytically compute the maximum positive invariant set and compare with the presented data-driven invariant set. Second, we present a practical autonomous driving scenario to demonstrate the utility of the presented method for nonlinear systems.


[150] 2311.03501

Maximum A Posteriori Direction-of-Arrival Estimation via Mixed-Integer Semidefinite Programming

We propose a joint sparse maximum a posteriori (MAP) estimator for DOA estimation from multiple snapshots, reformulated as a mixed-integer semidefinite program (MISDP). This enables efficient computation of globally optimal solutions using off-the-shelf MISDP solvers based on the branch-and-bound method. Unlike other nonconvex approaches for joint sparse recovery, such as the greedy methods and sparse Bayesian learning techniques, it provides a solution with an optimality assessment even with early termination. Additionally, we present a more scalable approximate solution approach for the MISDP problem based on randomized rounding. Numerical simulations demonstrate the improved threshold behavior, resolution, and robustness of our proposed method against popular DOA estimation methods. In particular, the proposed method applied with the randomized rounding algorithm exhibits a superior estimation performance at a significantly reduced running time, compared to the deterministic maximum likelihood (DML) estimator.


[151] 2407.09534

DFS-based fast crack pre-detection

This paper develops a computationally efficient pre-detection method for cracks in three-dimensional CT images of concrete. Instead of attempting full voxel-wise crack segmentation, the method focuses on locating cubic subregions where crack structures are likely to be present and should be analyzed further. The proposed pipeline combines multiscale Maximal Hessian Entry filtering with graph-based connectivity analysis. After binarization, each subregion is represented by the boundary face with the largest number of foreground pixels, which transforms the local detection problem from a three-dimensional image task into a two-dimensional graph problem. A sparse lattice graph is constructed on the selected face, and Depth-First Search is applied to detect connected components corresponding to possible crack cross-sections. The choice of mesh size is justified by a probabilistic upper bound on a lattice-miss event. Experiments on semi-synthetic and real CT data show that the method gives fast, interpretable crack pre-localization while avoiding exhaustive analysis of the full image.
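
The graph step can be sketched as an iterative depth-first search over a subsampled pixel lattice. This is a minimal illustration under stated assumptions, not the paper's construction: the lattice spacing `step`, the 4-neighbour connectivity between lattice nodes, and the function name are all hypothetical.

```python
def lattice_components(face, step=2):
    """Connected components of foreground pixels sampled on a sparse lattice.

    face: 2D list of 0/1 values (a binarized boundary face)
    step: lattice spacing in pixels (assumed; larger means sparser and faster)
    """
    rows = range(0, len(face), step)
    cols = range(0, len(face[0]), step)
    nodes = {(r, c) for r in rows for c in cols if face[r][c]}
    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # iterative DFS avoids recursion-depth limits
            r, c = stack.pop()
            component.append((r, c))
            neighbours = ((r + step, c), (r - step, c), (r, c + step), (r, c - step))
            for nb in neighbours:
                if nb in nodes and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        components.append(component)
    return components


if __name__ == "__main__":
    face = [[0] * 7 for _ in range(5)]
    for r in range(5):
        face[r][4] = 1  # a vertical crack-like line
    face[0][0] = 1      # an isolated noise pixel
    for comp in lattice_components(face, step=2):
        print(len(comp), sorted(comp))
```

With spacing 2 only every other pixel is visited, so structures thinner than the spacing can be missed; that is the kind of lattice-miss event the abstract bounds probabilistically.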


[152] 2411.05824

Navigating Distribution Shifts in Medical Image Analysis: A Survey

Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on others from different hospitals or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints -- such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols -- with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained, and further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.


[153] 2505.05647

A New k-Space Model for Non-Cartesian Fourier Imaging

For the past several decades, it has been popular to reconstruct Fourier imaging data using model-based approaches that can easily incorporate physical constraints and advanced regularization/machine learning priors. The most common modeling approach is to represent the continuous image as a linear combination of shifted "voxel" basis functions. Although well-studied and widely-deployed, this voxel-based model is associated with longstanding limitations, including high computational costs, slow convergence, and a propensity for artifacts. In this work, we reexamine this model from a fresh perspective, identifying new issues that may have been previously overlooked (including undesirable approximation, wrap-around, and nullspace characteristics). Our insights motivate us to propose a new model that is more resilient to the limitations (old and new) of the previous approach. Specifically, the new model is based on a Fourier-domain basis expansion rather than the standard image-domain voxel-based approach. Illustrative results, which are presented in the context of non-Cartesian MRI reconstruction, demonstrate that the new model enables improved image quality (reduced artifacts) and/or reduced computational complexity (faster computations and improved convergence).


[154] 2508.05279

Passive Lifted FIR Filters for Nonlinear System Identification

Passivity is a fundamental property of physical systems. In data-driven modeling, ensuring that a learned model preserves this structural property is critical to avoiding instability in closed loop. Although linear passive system identification is well-established, nonlinear extensions remain challenging. We propose nonlinear operators defined through passivity-preserving lifting of linear passive FIR filters. Passivity is enforced efficiently through frequency-domain constraints, and the nonlinear lifting includes output feedback for expressivity. Numerical and real-world experiments demonstrate the framework's capabilities, including the computational advantage of frequency-domain constraints over LMI-based alternatives.


[155] 2508.17742

EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation and Diagnostic Analyses of EEG Foundation Models

Electroencephalography foundation models (EEG-FMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that render cross-model comparisons unreliable, while a lack of diagnostic analyses obscures the internal mechanisms driving transfer efficiency and scaling behaviors. To address this, we introduce EEG-FM-Bench, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and incorporates diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, supported by tools for gradient and representation analysis. Our experiments and analysis reveal several critical insights: (1) multi-task learning often acts as a useful regularizer that mitigates overfitting in data-scarce EEG contexts, although negative transfer can arise under specific task paradigms; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) under released checkpoints and a matched downstream protocol, model or data scale alone does not fully explain transfer performance, while objective alignment, adaptation compatibility, and EEG-specific design appear to be important factors. This benchmark enables fair comparison and reproducible analysis, providing a step toward more interpretable analysis of EEG-FMs. Code is available at this https URL.


[156] 2509.14959

Discrete optimal transport is a strong audio adversarial attack

In this paper, we investigate discrete optimal transport (DOT) as a black-box attack against modern automatic speaker verification (ASV) and anti-spoofing countermeasure (CM) systems. Our attack operates as a post-processing distribution-alignment step. Frame-level WavLM embeddings of generated speech (or another person's speech) are aligned to an unpaired bona fide speech pool using entropic optimal transport and a top-k barycentric projection, followed by neural vocoding. Unlike gradient-based attacks, the proposed method requires no access to model parameters, gradients, or training data. Experiments on ASVspoof2019 and ASVspoof5 demonstrate that the DOT attack substantially increases CM EER and substantially degrades ASV performance across multiple spoofing attacks. The attack transfers across datasets and remains effective after CM fine-tuning. Analysis using speaker similarity, Fréchet Audio Distance, and visualization of embedding distributions suggests that DOT succeeds by shifting source speech toward bona fide regions of the representation space rather than by maximizing speaker similarity. These results indicate that optimal-transport-based distribution alignment represents a previously underexplored attack vector for contemporary ASV and anti-spoofing systems.


[157] 2509.21425

Quaternionic Pole Placement via Companion Forms and the Ackermann Formula

We present an extension of state-feedback pole placement for quaternionic systems, based on companion forms and the Ackermann formula. For controllable single-input quaternionic LTI models, we define a companion polynomial that annihilates its companion matrix, characterize spectra via right-eigenvalue similarity classes, and prove coefficient-matching design in controllable coordinates. We then derive a coordinate-free Ackermann gain expression valid for real target polynomials, and state its scope and limitations. Short examples demonstrate correctness, practical use, and numerical simplicity.


[158] 2509.22167

Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis

Mel-spectrograms have been widely used in zero-shot text-to-speech (TTS), but their inherent redundancy leads to inefficiency in text-speech alignment. Compact VAE-based latent representations have emerged as a stronger alternative but exhibit an optimization dilemma: higher-dimensional latents improve reconstruction quality and speaker similarity but degrade intelligibility, while lower-dimensional latents improve intelligibility at the cost of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, which uses semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems and vanilla acoustic VAE baselines with improved training efficiency. Demo and codes: this https URL


[159] 2511.05522

AIRMap: AI-Generated Radio Maps for Wireless Digital Twins

Accurate, low-latency channel modeling is essential for real-time wireless network simulation and digital-twin applications. Traditional modeling methods like ray tracing are, however, computationally demanding and unsuited to modeling dynamic conditions. In this paper, we propose AIRMap, a deep-learning framework for ultra-fast radio-map estimation, along with an automated pipeline for creating the largest radio-map dataset to date. AIRMap uses a single-input U-Net autoencoder that processes only a 2D elevation map of terrain and building heights. Trained on 1.2M Boston-area samples and validated across four distinct urban and rural environments with varying terrain and building density, AIRMap predicts path gain with under 4 dB RMSE in 4 ms per inference on an NVIDIA L40S, over 100x faster than GPU-accelerated ray-tracing-based radio maps. A lightweight calibration using just 20% of field measurements reduces the median error to approximately 5%, significantly outperforming traditional simulators, which exceed 50% error. Integration into the Colosseum emulator and the Sionna SYS platform demonstrates near-zero error in spectral efficiency and block-error rate compared to measurement-based channels. These findings validate AIRMap's potential for scalable, accurate, and real-time radio map estimation in wireless digital twins.


[160] 2511.09140

LMMSE-Optimal Pilot Pattern Design Based on Covariance Matrix Approximation for OFDM Channel Estimation in Doubly Dispersive Channel

This paper investigates the optimal pilot pattern design, in the linear minimum mean square error (LMMSE) estimator sense, for OFDM systems in doubly dispersive channels. To enable analytical tractability, the channel covariance matrix is decomposed into the Kronecker product of two Hermitian Toeplitz matrices corresponding to the delay and Doppler domains. By invoking the Szegö limit theorem, these matrices are shown to be approximately diagonalizable by discrete Fourier transform (DFT) matrices. Based on this structure, the LMMSE channel estimation error is reformulated into a compact analytical form, from which a closed-form lower bound is derived. Furthermore, we establish the condition under which this bound is achieved by a lattice-based pilot pattern. Numerical results verify that the proposed matrix approximation introduces negligible error and examples of the proposed lattice design are given.


[161] 2511.17433

The Iberian Blackout: A Black Swan or a Gray Rhino? A Protection-Aware Dynamic Voltage Security Assessment

On 28 April 2025, the Iberian mainland power system collapsed after a rapid voltage rise, widespread generation disconnections, and loss of synchronism. The ENTSO-E Expert Panel final report attributes the blackout to multiple interacting factors including ineffective voltage control, fixed power factor reactive behavior, fast generation ramps, protection settings not aligned with requirements, slow or unavailable reactive absorption, and limited observability outside the transmission system. This paper uses the incident as a motivating case for a broader operational voltage security problem: given the present grid state, can the next plausible trip, ramp, topology action, or shunt action push protected downstream voltages above relay thresholds before available voltage controls can respond? We develop a protection-aware dynamic voltage security assessment for this question. Starting from a nonlinear hybrid differential-algebraic equation (DAE) model, we derive mode-wise finite-window voltage maps that include automatic voltage regulators (AVRs), inverter-based resources (IBRs), static synchronous compensators (STATCOMs), high-voltage direct-current (HVDC) links, loads, shunts, transformers, protection functions, and limiter behavior whenever the corresponding models are available. We define normalized overvoltage margin erosion at the protection measurement side and time-resolved lower bounds on useful control response. We then develop a monotone pickup cascade screen, robust data-limited certificates under uncertain relay and protected-voltage data, and a mitigation optimization that computes the minimum fast reactive action needed to keep protected voltages below relay thresholds. Case studies on a 2000-bus mechanism replica and multiple dynamic benchmark systems show that the screen predicts nonlinear cascade propagation.


[162] 2512.03977

An Information Theory of Finite Abstractions and their Fundamental Scalability Limits

Finite abstractions are discrete approximations of dynamical systems, such that the set of abstraction trajectories contains all system trajectories. There is a consensus that abstractions suffer from the curse of dimensionality: for the same "accuracy" (how closely the abstraction represents the system), the abstraction size scales poorly with system dimensions. And yet, after decades of research on abstractions, there are no formal results on their accuracy-size tradeoff. In this work, we derive a statistical, quantitative theory of abstractions' accuracy-size tradeoff and uncover fundamental limits on their scalability, through rate-distortion theory -- the information theory of lossy compression. Abstractions are viewed as encoder-decoder pairs, encoding trajectories of dynamical systems. Rate measures abstraction size, while distortion describes accuracy, defined as the spatial average deviation between abstract trajectories and system ones. We obtain a fundamental lower bound on the minimum achievable abstraction distortion, given the system dynamics and the abstraction size; and vice-versa a lower bound on the minimum size, for given distortion. The bound depends on the complexity of the dynamics, through trajectory entropy. We demonstrate its tightness on some dynamical systems. Finally, we showcase how this new theory enables constructing minimal abstractions, optimizing the size-accuracy tradeoff, through an example on a chaotic system.


[163] 2601.14759

Improved GPR-Based CSI Acquisition via Spatial-Correlation Kernel

Accurate channel estimation with low pilot overhead and computational complexity is key to efficiently utilizing multi-antenna wireless systems. Motivated by the evolution from purely statistical descriptions toward physics- and geometry-aware propagation models, this work focuses on incorporating channel information into a Gaussian process regression (GPR) framework for improving the channel estimation accuracy. In this work, we propose a GPR-based channel estimation framework along with a novel Spatial-correlation (SC) kernel that explicitly captures the channel's second-order statistics. We derive a closed-form expression of the proposed SC-based GPR estimator and prove that its posterior mean is optimal in terms of minimum mean-square error (MMSE) under the same second-order statistics, without requiring the underlying channel distribution to be Gaussian. Our analysis reveals that, with up to 50% pilot overhead reduction, the proposed method achieves the lowest normalized mean-square error, the highest empirical 95% credible-interval coverage, and superior preservation of spectral efficiency compared to benchmark estimators, while maintaining lower computational complexity than the conventional MMSE estimator.


[164] 2602.01394

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in WER across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream detection of the acoustic scene. Code and pretrained models will become available upon acceptance. Demo page: this https URL


[165] 2602.03718

A Narrowband Fully-Analog Multi-Antenna Transmitter

This paper proposes a narrowband fully-analog $N$-antenna transmitter that emulates the functionality of a narrowband fully-digital $N$-antenna transmitter. Specifically, in symbol interval $m$, the proposed fully-analog transmitter synthesizes an arbitrary complex excitation vector $\boldsymbol{x}[m]\in\mathbb{C}^N$ with prescribed total power $\|\boldsymbol{x}[m]\|_2^2=P$ from a single RF tone, using only tunable phase-control elements embedded in a passive interferometric programmable network. The programmable network is excited through one input port while the remaining $N-1$ input ports are impedance matched. In the ideal lossless case, the network transfer is unitary and therefore redistributes RF power among antenna ports without dissipative amplitude control. The synthesis task is posed as a unitary state-preparation problem: program a unitary family so that $\boldsymbol{V}(\boldsymbol{\varphi}[m])\boldsymbol{e}_1=\boldsymbol{c}[m]$, where $\boldsymbol{c}[m]=\boldsymbol{x}[m]/\sqrt{P}$ and $\|\boldsymbol{c}[m]\|_2=1$. We provide a parameter-minimal realization and a closed-form programming rule: a balanced binary magnitude-splitting tree allocates the desired per-antenna magnitudes $|c_n|$ using $N-1$ tunable split ratios, and a per-antenna output phase bank assigns the target phases using $N$ tunable phase shifts. The resulting architecture uses exactly $2N-1$ real tunable degrees of freedom and admits a deterministic $O(N)$ programming procedure with no iterative optimization, enabling symbol-by-symbol updates. Using representative COTS components, we model the compute-excluded RF-front-end DC power of the proposed fully-analog transmitter and compare it against an equivalent COTS fully-digital array. For $N\le 16$, the comparison indicates significant RF-front-end power savings for the fully-analog architecture under a common delivered antenna-port power normalization.
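The closed-form programming idea can be sketched as follows: a binary tree of power-split ratios sets the magnitudes and a per-antenna phase bank sets the phases, for $2N-1$ real parameters in total. This is a minimal numerical sketch under stated assumptions, not the paper's rule: the names `program_tree` and `synthesize`, the preorder storage of split ratios, and the use of power fractions as the tunable quantity are all hypothetical, and the mapping from a split ratio to a physical phase-control setting is not modeled.

```python
import cmath
import math


def program_tree(c):
    """Return (splits, phases) for a unit-norm complex target vector c."""
    n = len(c)
    # Prefix sums of per-antenna power give each subtree's power in O(1).
    prefix = [0.0]
    for z in c:
        prefix.append(prefix[-1] + abs(z) ** 2)
    splits = []  # one power fraction per internal node, stored in preorder

    def visit(lo, hi):
        if hi - lo == 1:
            return  # leaf: a single antenna port, nothing to split
        mid = (lo + hi) // 2
        total = prefix[hi] - prefix[lo]
        left = prefix[mid] - prefix[lo]
        # Fraction of incoming power routed to the left subtree
        # (arbitrary 0.5 when the subtree carries no power).
        splits.append(left / total if total > 0 else 0.5)
        visit(lo, mid)
        visit(mid, hi)

    visit(0, n)
    phases = [cmath.phase(z) for z in c]  # output phase bank, one per antenna
    return splits, phases


def synthesize(splits, phases):
    """Rebuild the excitation from the 2N-1 parameters (ideal lossless case)."""
    n = len(phases)
    out = [0j] * n
    ratios = iter(splits)

    def visit(lo, hi, power):
        if hi - lo == 1:
            out[lo] = math.sqrt(power) * cmath.exp(1j * phases[lo])
            return
        mid = (lo + hi) // 2
        r = next(ratios)  # same preorder as in program_tree
        visit(lo, mid, power * r)
        visit(mid, hi, power * (1.0 - r))

    visit(0, n, 1.0)  # unit input power at the single excited port
    return out


if __name__ == "__main__":
    target = [0.5, 0.5j, -0.5, 0.5]  # unit-norm example with N = 4
    s, p = program_tree(target)
    print(len(s) + len(p), synthesize(s, p))
```

Each internal node and each antenna is visited once, so the sketch runs in linear time in $N$ with no iterative optimization, matching the parameter count stated in the abstract.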


[166] 2602.07169

ML-Enabled Deformable Matched Filters for Band-Limit Compensation in Free-Space Optics

This paper proposes a neural-network-assisted deformable matched filtering (DMF) framework for carrier-less amplitude and phase (CAP) modulation operating under bandwidth-limited channel conditions. Instead of replacing the analytically derived CAP matched filter, the proposed receiver learns a residual deformation of the nominal matched filter based on a compact set of physically motivated signal features extracted from the received waveform. A total of 16 time-domain, frequency-domain, and memory-related features are used to provide a low-dimensional representation of bandwidth-induced pulse distortion. These features are mapped by a fully connected neural network to complex-valued matched filter coefficients, enabling adaptive pulse-shape compensation prior to symbol-rate sampling. The network is trained end-to-end using a differentiable loss function based on error vector magnitude (EVM). Experimental results obtained using a hardware-in-the-loop CAP transmission system demonstrate that the proposed DMF significantly outperforms conventional fixed matched filtering under severe bandwidth constraints, without requiring decision feedback or increasing receiver latency.


[167] 2602.07300

Distributed Omniscient Observers for Multi-Agent Systems: Design and Applications

This paper proposes distributed omniscient observers for both heterogeneous and homogeneous linear multi-agent systems, such that each agent can correctly estimate the states of all agents. The observer design is based on local input-output information available to each agent, and knowledge of the global communication graph among agents is not necessarily required. The proposed observers can contribute to distributed Nash equilibrium seeking in multi-player games and the emergence of self-organized social behaviors in artificial swarms. Simulation results demonstrate that artificial swarms can emulate animal social behaviors, including sheepdog herding and honeybee dance-based navigation.


[168] 2602.11547

H.265/HEVC Video Steganalysis Based on CU Block Structure Gradients and IPM Mapping

Existing H.265/HEVC video steganalysis research mainly focuses on detecting steganography based on motion vectors, intra prediction modes, and transform coefficients. However, there is currently no effective steganalysis method capable of detecting steganography based on Coding Unit (CU) block structure. To address this issue, we propose, for the first time, an H.265/HEVC video steganalysis algorithm based on CU block structure gradients and intra prediction mode (IPM) mapping. The proposed method first constructs a new gradient map to explicitly describe changes in CU block structure, and combines it with a block-level mapping representation of IPM, so that it can jointly model the structural perturbations introduced by steganography based on CU block structure. Then, we design a novel steganalysis network called GradIPMFormer, whose core innovation is an integrated architecture that combines convolutional local embedding with Transformer-based token modeling to jointly capture local CU boundary perturbations and long-range cross-CU structural dependencies, thereby effectively enhancing the capability to perceive CU block structure embedding. Experimental results show that under different quantization parameters and resolution settings, the proposed method consistently achieves superior detection performance across multiple steganography methods based on CU block structure. This study provides a new CU block structure steganalysis paradigm for H.265/HEVC and has significant research value for covert communication security detection.


[169] 2603.08215

Skill-Evolving Grounded Reasoning for Free-Text Promptable 3D Medical Image Segmentation

Free-text promptable 3D medical image segmentation offers an intuitive and clinically flexible interaction paradigm. However, current methods are highly sensitive to linguistic variability: minor changes in phrasing can cause substantial performance degradation despite identical clinical intent. Existing approaches attempt to improve robustness through stronger vision-language fusion or larger vocabularies, yet they lack mechanisms to consistently align ambiguous free-form expressions with anatomically grounded representations. We propose Skill-Evolving grounded Reasoning (SEER), a novel framework for free-text promptable 3D medical image segmentation that explicitly bridges linguistic variability and anatomical precision through a reasoning-driven design. First, we curate the SEER-Trace dataset, which pairs raw clinical requests with image-grounded, skill-tagged reasoning traces, establishing a reproducible benchmark. Second, SEER constructs an evidence-aligned target representation via a vision-language reasoning chain that verifies clinical intent against image-derived anatomical evidence, thereby enforcing semantic consistency before voxel-level decoding. Third, we introduce SEER-Loop, a dynamic skill-evolving strategy that distills high-reward reasoning trajectories into reusable skill artifacts and progressively integrates them into subsequent inference, enabling structured self-refinement and improved robustness to diverse linguistic expressions. Extensive experiments demonstrate superior performance of SEER over state-of-the-art baselines. Under linguistic perturbations, SEER reduces performance variance by 81.94% and improves worst-case Dice by 18.60%. Project page: this https URL.


[170] 2603.19499

Geometric Performance Analysis of Doppler-Based Positioning with a Single LEO Satellite

Low Earth Orbit (LEO) satellites have gained increasing attention as potential signal sources for Positioning, Navigation and Timing (PNT) applications. However, while most existing studies focus on multi-satellite LEO constellations, the fundamental positioning performance achievable with a single LEO satellite remains less extensively explored. This paper analyzes the geometric characteristics and positioning performance of single-satellite Doppler positioning through a theoretical analysis of the Dilution of Precision (DOP) and extensive numerical simulations. The results reveal a strong directional error behavior, with severe error in the cross-track direction but significantly less error along the satellite track, reflecting an intrinsic geometric limitation of single-satellite LEO positioning. While these features were already identified at the early stages of satellite PNT missions, the present work provides an in-depth analysis and unveils the fundamental limitations and characteristics that could make LEO-based Doppler positioning feasible nowadays, using only a single satellite. In this way, the results of this work not only provide valuable insights into the role of observational geometry in Doppler navigation, but also offer guidance for optimizing geometric configurations in future small or single-satellite LEO constellations for strategic applications.


[171] 2603.27998

HRIR-Former: Grid-Free Time-Domain Reconstruction of Head-Related Impulse Responses with a Spatially Encoded Transformer

Individualized head-related impulse responses (HRIRs) enable binaural rendering, but dense per-listener measurements are costly. We address HRIR spatial up-sampling from sparse per-listener measurements: given a few measured HRIRs for a listener, predict HRIRs at unmeasured target directions. Prior learning methods often work in the frequency domain, rely on minimum-phase assumptions or separate timing models, and use a fixed direction grid, which can degrade temporal fidelity and spatial continuity. We propose HRIR-Former, a time-domain, grid-free binaural Transformer for reconstructing HRIRs at arbitrary directions from sparse inputs. It uses sinusoidal spatial features, a Conv1D refinement module, and auxiliary interaural time difference (ITD) and interaural level difference (ILD) heads. On SONICOM, it improves normalized mean squared error (NMSE), cosine distance, and ITD/ILD errors over prior methods; ablations validate modules and show minimum-phase preprocessing is unnecessary.


[172] 2604.17362

FARM: Foundational Aerial Radio Map for Intelligent Low-Altitude Networking

Precise aerial radio environment characterization is vital for low-altitude airspace planning. However, existing datasets and construction methods lack the high-resolution granularity required for complex aerial spaces, particularly failing to capture spatial variations across both horizontal and vertical dimensions. To address these gaps, this paper introduces FARM, a pioneering foundation model for unified aerial radio map (ARM) construction. FARM is supported by our newly curated, high-granularity full-domain ARM dataset, which features multi-band and multi-antenna configurations, effectively filling a critical void in comprehensive low-altitude radio data. Structurally, FARM leverages a masked autoencoder to extract deep latent representations of the aerial radio environment, which subsequently guide a diffusion-based decoder to synthesize high-fidelity signal distributions through only a few iterative refinement steps. Benefiting from this design, the architecture seamlessly accommodates both condition-based and condition-free ARM construction, providing robust support for diverse signal and environmental priors. Extensive experiments demonstrate that FARM significantly outperforms state-of-the-art benchmarks while exhibiting strong cross-scenario generalization. Crucially, we validate the transferability of FARM on a real-world dataset collected from field tests, proving its robust deployment capability. Ultimately, FARM serves as a foundational infrastructure for the low-altitude economy by enabling autonomous aerial logistics and intelligent urban networking.


[173] 2605.04749

Spatial-Magnifier: Spatial upsampling for multichannel speech enhancement

While the spatial directivity of multichannel speech enhancement algorithms improves with the number of microphones, fitting large capture arrays into real-world edge devices is typically limited by physical constraints. To overcome this limitation, we propose Spatial-Magnifier, a neural network designed to generate virtual microphone (VM) signals from a limited set of real microphone (RM) measurements. Moreover, we introduce the Spatial Audio Representation Learning (SARL) framework, which leverages estimated VM signals and features to condition a downstream speech enhancement system. Experimental results demonstrate that the proposed framework outperforms existing spatial upsampling baselines across various speech extraction systems, including end-to-end multichannel speech enhancement and neural beamforming. The proposed method nearly recovers the oracle performance achieved when all microphones are available.


[174] 2605.20830

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

Recent advances in text-to-speech (TTS) models show impressive speech naturalness and quality, yet the role of large-scale open data in driving this progress remains underexplored. In this work, we introduce Raon-OpenTTS, an open TTS model that performs competitively with state-of-the-art closed-data TTS models, and Raon-OpenTTS-Pool, a large-scale open dataset for reproducible TTS training. Raon-OpenTTS-Pool consists of 615K hours of 240M speech segments aggregated from publicly available English speech corpora and web-sourced recordings. With a model-based filtering pipeline applied to Raon-OpenTTS-Pool, we derive Raon-OpenTTS-Core, a curated, high-quality subset of 510K hours and 194M speech segments. Using Raon-OpenTTS-Core, we train Raon-OpenTTS, a series of diffusion transformer (DiT)-based TTS models from 0.3B to 1B parameters. On multiple benchmarks, Raon-OpenTTS-1B shows comparable performance to state-of-the-art models such as Qwen3-TTS and CosyVoice 3, which are trained on several million hours of proprietary speech data. Notably, on Seed-TTS-Eval, Raon-OpenTTS-1B achieves a word error rate (WER) of 1.78% and a speaker similarity (SIM) of 0.749, ranking second on WER and first on SIM among recent open-weight TTS baselines. On CV3-Hard-EN, Raon-OpenTTS-1B achieves a WER of 6.15% and a SIM of 0.775, ranking first on both metrics. Furthermore, to support robust evaluation, we introduce Raon-OpenTTS-Eval, a structured benchmark for assessing TTS robustness across diverse acoustic conditions including clean, noisy, in-the-wild, and expressive speech. On Raon-OpenTTS-Eval, Raon-OpenTTS-1B achieves the best average WER and SIM among all evaluated models, and the second-best human preference, as measured by comparative mean opinion score (CMOS). Our data pool, filtering pipeline, training code, and checkpoints are publicly available at this https URL.


[175] 2606.08898

Few-shot Class-variable Incremental Audio Classification via Prototype Adaptation and Pseudo Class-variable Training

In the task of few-shot class-incremental audio classification, the number of classes is assumed to always increase, without considering the possibility of a decrease. In practice, however, the number of classes can either increase or decrease. In this paper, we investigate the problem of Few-shot Class-variable Incremental Audio Classification (FCIAC), in which the number of classes increases or decreases. We propose an FCIAC method using prototype adaptation and pseudo class-variable training. The model in our method consists of an encoder and a classifier. The classifier is initialized by a class-variable prototype adaptation network, whose structure changes dynamically as the set of classes changes. In addition, we design a pseudo class-variable training strategy to enhance the model's adaptability to changing classes. Experiments on three public datasets show that our method exceeds previous methods in average accuracy. The code is at: this https URL.


[176] 2606.09095

Joint Antenna Placement and Power Allocation for RSMA-Enabled Pinching Antenna Systems

This letter investigates a rate-splitting multiple access (RSMA)-enabled multi-user pinching antenna system (PASS). A fairness-aware sum-rate maximization problem is formulated to jointly optimize pinching antenna locations and common/private stream power allocation. The resulting mixed discrete-continuous non-convex problem is addressed using an alternating optimization framework that combines greedy antenna placement with successive convex approximation (SCA)-based power allocation. Numerical results demonstrate that the proposed RSMA-enabled PASS significantly improves achievable sum-rate, user fairness, and bit error rate (BER) performance compared with conventional non-RSMA PASS schemes.


[177] 2606.09098

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

Video dubbing is a cornerstone of multimedia content creation, aiming to synthesize synchronized acoustic sequences for visual streams. While Text-to-Speech (TTS) and Text-to-Audio (TTA) generation have each achieved remarkable progress, existing dubbing systems remain confined to isolated speech synthesis without incorporating sound effects and ambient audio, forcing practitioners to rely on fragmented workflows and laborious manual post-mixing. To address this limitation, we present HoliDubber, a holistic video dubbing framework that moves beyond speech-only generation by enabling the joint synthesis of speech and sound effects from a single text prompt. Specifically, HoliDubber adopts a patch-based autoregressive diffusion transformer architecture, where a causal language model autoregressively models aggregated patch embeddings to capture global temporal structure, and a Diffusion Transformer decoder generates high-fidelity continuous tokens within each patch, following a divide-and-conquer strategy. To achieve cross-modal alignment, visual features are encoded into patch-level representations and fused with audio patches via cross-attention, enabling the model to ground speech generation in the speaker's visual articulation dynamics. In addition, we introduce HoliDub-Bench, a benchmark curated from established datasets with synchronized video-text-audio triplets designed for holistic dubbing evaluation. Extensive experiments demonstrate that HoliDubber significantly outperforms existing methods across multiple benchmarks in speech quality, synchronization, and speaker similarity. Furthermore, results on HoliDub-Bench validate the effectiveness of joint speech-and-sound generation, establishing a new paradigm for holistic video dubbing in complex acoustic scenes. Demo page: this https URL


[178] 2606.12336

Analysis of a Distributed Optimization-Based Control Architecture for Inverter-Interfaced Virtual Power Plants

We develop a large-signal stability analysis for a sampled-data, optimization-based secondary controller for inverter-interfaced distributed energy resources in virtual power plants.


[179] 2606.12707

Storage and Transport Capacity Design for a Self-Reliable Two-Node Stochastic Resource System

We study a two-node stochastic resource system operating over a finite horizon. Each node experiences uncertain supply and demand and is equipped with finite storage. The objective is to ensure that resource levels remain within prescribed limits with high probability. To this end, we formulate a chance-constrained capacity-design problem in which resources can be exchanged through a capacity-limited transport link. We characterize the minimum storage required at each node, derive the optimal transport policy, and quantify the trade-off between storage and transport capacities. Our results show the existence of a critical transport-capacity threshold that enables full risk pooling between the nodes. Moreover, this threshold decreases with the operating horizon, implying that full-pooling performance can be achieved with progressively smaller transport capacity over longer horizons.


[180] 2606.12791

The GIST 2217-Bus Test System: A Public-Data Synthetic Model of the Korean Power Grid

No model of the Korean transmission system at native resolution is publicly available, which makes reproducible research on one of the world's most distinctive grids difficult: the grid is an islanded interconnection with extreme separation between generation and the Seoul Metropolitan Area load center, low renewable penetration, and heavy reliance on extra-high-voltage (EHV) transmission. Working strictly from public data, and for research purposes only, we present the GIST 2217-bus test system, a geographically grounded synthetic model of the Korean grid. Unlike fully synthetic cases, whose lines match no real corridor, and aggregated public Korean models, it derives its 345 and 154 kV layout from the OpenStreetMap/OpenInfraMap power layer by a multi-source shortest-path reassembly of overhead-line geometry, gap-fills unreachable substations with a geographic minimum-spanning-tree backbone, and calibrates the aggregate circuit length to published national statistics (94/100/109% at 765/345/154 kV). The model spans 2217 buses, 512 generation and renewable sources (144 GW), 3708 AC line circuits plus four high-voltage direct-current (HVDC) converter links, 3324 transformers, and reactive resources (shunts and 11 FACTS devices), serialized to a PSS/E-compatible CSV schema. The model is distributed as a frozen operating point (taps, setpoints, and bus voltages settled once offline and baked into the data), so a single deterministic pandapower Newton-Raphson pass (with reactive limit enforcement and HVDC converter settling) reproduces an 85 GW high-demand snapshot at a single connected operating point (mean transmission voltage 0.995 pu, 2.6% losses), structurally consistent with the independent public KPG193 model. The dataset, maps, and tooling are released as a citable platform for power flow, planning, and decarbonization studies.


[181] 2606.13919

GMN4AD: Graph Matching Network for Alzheimer's Disease Diagnosis with Test-Time Domain Adaptation using Multi-centered Structure Magnetic Resonance Imaging

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that affects millions of older adults, with prevalence expected to rise significantly in the coming years. Early diagnosis, particularly during the mild cognitive impairment (MCI) stage, is critical for timely intervention. Structural Magnetic Resonance Imaging (sMRI) has emerged as a key modality for detecting AD-related brain changes, but traditional graph-based approaches often struggle with modality and inter-site heterogeneity, limiting diagnostic performance. In this paper, we propose Graph Matching Network for Alzheimer's Disease Diagnosis (GMN4AD), designed to model interactions between heterogeneous brain graphs derived from neuroimaging data. Unlike conventional methods that treat each brain graph independently, GMN4AD leverages graph matching to capture cross-graph relationships, enhancing diagnostic precision. Furthermore, we introduce a test-time domain adaptation strategy that combines contrastive learning to mitigate domain shifts during inference. Extensive experiments on three public AD datasets demonstrate that GMN4AD achieves superior performance compared to state-of-the-art methods, offering a robust and generalizable solution for AD diagnosis.


[182] 2606.14114

Digital Twin-Based Channel Generation Toolchain and Foundation Model for Low-Altitude XL-MIMO

The rapid development of the low-altitude economy (LAE) has created growing demand for reliable aerial communication systems. Extremely large-scale multiple-input multiple-output (XL-MIMO) is a promising enabler for such systems due to its high spatial resolution and robust connectivity. However, three-dimensional (3D) mobility together with near-field propagation makes it difficult to obtain dedicated high-fidelity wireless datasets, hindering systematic algorithm development and evaluation. To address this issue, we develop LAETwin-XL, a digital twin (DT)-based toolchain and dataset for XL-MIMO research in LAE scenarios. Built on the Sionna ray-tracing (RT) module, the proposed toolchain simulates near-field and far-field channels with diverse wireless labels for practical environments. Building on this dataset, we further develop a conditional denoising diffusion implicit model (CDDIM)-based generative foundation model that is pretrained to learn transferable XL-MIMO channel representations from incomplete channel observations. Unlike conventional task-specific or foundation models that rely on relatively complete channel inputs, the proposed model can generatively infer informative channel representations from partially observed channels. Experimental results demonstrate that the proposed framework achieves effective zero-shot channel extrapolation performance. Furthermore, using lightweight task heads and limited training data, it enables parameter-efficient transfer to various downstream tasks (e.g., channel estimation, classification, and localization), delivering high accuracy and robustness even under sparse antenna observations. The codes and dataset are available at this https URL.


[183] 2310.05507

MEDUSA: Scalable Biometric Sensing in the Wild through Distributed MIMO Radars

Radar-based techniques for detecting vital signs have shown promise for continuous contactless vital sign sensing and healthcare applications. However, real-world indoor environments pose significant challenges for existing vital sign monitoring systems. These include signal blockage in non-line-of-sight (NLOS) situations, movement of human subjects, and alterations in location and orientation. Additionally, these existing systems fail to address the challenge of tracking multiple targets simultaneously. To overcome these challenges, we present MEDUSA, a novel coherent ultra-wideband (UWB) based distributed multiple-input multiple-output (MIMO) radar system that allows users to customize and disperse the $16 \times 16$ array into sub-arrays. MEDUSA takes advantage of the diversity benefits of distributed yet wirelessly synchronized MIMO arrays to enable robust vital sign monitoring in real-world and daily living environments where human targets are moving and surrounded by obstacles. We have developed a scalable, self-supervised contrastive learning model that integrates seamlessly with our hardware platform. Each attention weight within the model corresponds to a specific antenna pair of Tx and Rx. The model recovers accurate vital sign waveforms by decomposing and correlating the mixed received signals, which comprise human motion, mobility, noise, and vital signs. Through extensive evaluations involving 21 participants and over 200 hours of collected data (3.75 TB in total, with 1.89 TB for static subjects and 1.86 TB for moving subjects), MEDUSA's performance has been validated, showing an average gain of 20% compared to existing systems employing COTS radar sensors. This demonstrates MEDUSA's spatial diversity gain for real-world vital sign monitoring, encompassing target and environmental dynamics in familiar and unfamiliar indoor environments.


[184] 2405.07636

Nonlinear Network Identifiability with Full Excitations

We derive conditions for the identifiability of nonlinear networks characterized by additive dynamics at the level of the edges when all the nodes are excited. In contrast to linear systems, we show that the measurement of all sinks is necessary and sufficient for the identifiability of directed acyclic graphs, under the assumption that dynamics are described by twice continuously differentiable functions without constant terms (i.e., $f(0)=0$). But if constant terms are present, then the identifiability is impossible as soon as one node has more than one in-neighbor. In the case of general digraphs that may contain cycles, we consider additively separable functions for the analysis of the identifiability, and we show that the measurement of one node of all the sinks of the condensation digraph is necessary and sufficient. Several examples are added to illustrate the results.


[185] 2410.16089

Multi-Sensor Fusion for UAV Classification Based on Feature Maps of Image and Radar Data

The unique cost, flexibility, speed, and efficiency of modern UAVs make them an attractive choice in many applications in contemporary society. This, however, causes an ever-increasing number of reported malicious or accidental incidents, making the development of UAV detection and classification mechanisms essential. We propose a methodology for developing a system that fuses already processed multi-sensor data into a new Deep Neural Network to increase its classification accuracy towards UAV detection. The DNN model fuses high-level features extracted from individual object detection and classification models associated with thermal, optronic, and radar data. Additionally, emphasis is given to the model's Convolutional Neural Network (CNN) based architecture, which combines the features of the three sensor modalities by stacking the extracted image features of the thermal and optronic sensors, achieving higher classification accuracy than each sensor alone.


[186] 2412.00107

Virtual Sensing to Enable Real-Time Monitoring of Inaccessible Locations & Unmeasurable Parameters

Real-time monitoring of safety-critical interior states remains an open problem in energy systems where physical instrumentation is infeasible. Existing approaches rely on explicit governing equations, finite-dimensional state vectors, or per-instance retraining, which prevents mesh-independent, field-level inference at arbitrary interior coordinates under real-time constraints. We introduce operator-based virtual sensing for nuclear-grade thermal-fluid systems: we use the neural-operator framework to learn solution operators that map sparse boundary measurements to coupled internal fields in physically inaccessible regions, framing the problem class explicitly to distinguish it from classical state estimation and pointwise soft sensing. We instantiate this framework with MIMONet, a branch-trunk operator extended with three practical choices: multi-modal branch encoders for heterogeneous (scalar and function-valued) inputs; multiplicative branch fusion to preserve the bilinear PDE coupling structure; and shared-latent multi-field decoding with per-channel basis projections at the trunk's final layer. Evaluated across escalating complexity, from canonical lid-driven cavity flow to pressurized water reactor subchannels to fully coupled heat exchangers, MIMONet achieves below 5% relative errors and sub-millisecond inference on data-center accelerators (0.35 ms / 46 mJ per heat-exchanger inference on an NVIDIA H200, and sub-millisecond across the A40-H200-GH200 range), while remaining stable under 50% sensor noise. By staying accurate as geometric confinement and physics coupling intensify, MIMONet shows that operator-based virtual sensing can restore observability where physical instrumentation fails, establishing simulation-based feasibility within the evaluated operating envelopes as a step toward future experimental and cross-solver validation for safety-critical energy systems.


[187] 2501.01908

Training-Free Adversarial Robustness in Computational MRI

Deep learning (DL) methods have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining. In this work, we propose a novel approach for mitigating adversarial attacks on MRI reconstruction models without any retraining. Based on the idea of cyclic measurement consistency, we devise a novel mitigation objective that is minimized in a small ball around the attack input. Results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods. We also introduce a practically relevant scenario for small adversarial perturbations that models impulse noise in raw data, which relates to herringbone artifacts, and show the applicability of our approach in this setting. Finally, we show our mitigation approach remains effective in two realistic extension scenarios: a blind setup, where the attack strength or algorithm is not known to the user; and an adaptive attack setup, where the attacker has full knowledge of the defense strategy.


[188] 2501.04988

Intelligent Sailing Model for Open Sea Navigation

Autonomous vessels potentially enhance safety and reliability of seaborne trade. To facilitate the development of autonomous vessels, simulations are required to model realistic interactions with other vessels. However, modeling realistic interactive maritime traffic is challenging due to the unstructured environment, coarsely specified traffic rules, and largely varying vessel types. Currently, there is no standard for simulating interactive maritime environments in order to rigorously benchmark autonomous vessel algorithms. In this paper, we introduce the first intelligent sailing model (ISM), which simulates rule-compliant vessels for navigation on the open sea. An ISM vessel reacts to other traffic participants according to maritime traffic rules while at the same time solving a motion planning task characterized by waypoints. In particular, the ISM monitors the applicable rules, generates rule-compliant waypoints accordingly, and utilizes a model predictive control for tracking the waypoints. We evaluate the ISM in two environments: interactive traffic with only ISM vessels and mixed traffic where some vessel trajectories are from recorded real-world maritime traffic data or handcrafted for criticality. Our results show that simulations with many ISM vessels of different vessel types are rule-compliant and scalable. We tested 4,049 critical traffic scenarios. For interactive traffic with ISM vessels, no collisions occurred while goal-reaching rates of about 97 percent were achieved.


[189] 2501.17615

Cross-lingual Embedding Clustering for Hierarchical Softmax in Low-Resource Multilingual Speech Recognition

We present a novel approach centered on the decoding stage of Automatic Speech Recognition (ASR) that enhances multilingual performance, especially for low-resource languages. It utilizes a cross-lingual embedding clustering method to construct a hierarchical Softmax (H-Softmax) decoder, which enables similar tokens across different languages to share similar decoder representations. It addresses the limitations of the previous Huffman-based H-Softmax method, which relied on shallow features in token similarity assessments. Through experiments on a downsampled dataset of 15 languages, we demonstrate the effectiveness of our approach in improving low-resource multilingual ASR accuracy.
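As a rough illustration of the factorization behind a hierarchical Softmax decoder, the sketch below computes token probabilities in two levels, as P(cluster) times P(token | cluster). The cluster assignment, the logits, and the function names are hypothetical; the abstract does not specify the tree depth or how clusters are derived from the cross-lingual embeddings.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def two_level_hsoftmax(cluster_logits, token_logits, token_to_cluster):
    """Return P(token) = P(cluster) * P(token | cluster) for every token."""
    token_logits = np.asarray(token_logits, dtype=float)
    token_to_cluster = np.asarray(token_to_cluster)
    p_cluster = softmax(cluster_logits)
    probs = np.zeros(len(token_logits))
    for c in range(len(p_cluster)):
        members = np.where(token_to_cluster == c)[0]
        if members.size == 0:
            continue  # an empty cluster contributes no token probability
        # Normalize only among the tokens that share this cluster.
        probs[members] = p_cluster[c] * softmax(token_logits[members])
    return probs

# Hypothetical vocabulary of 5 tokens grouped into 2 clusters.
probs = two_level_hsoftmax(
    cluster_logits=[0.2, -0.1],
    token_logits=[1.0, 0.5, -0.3, 0.0, 2.0],
    token_to_cluster=[0, 0, 1, 1, 1],
)
```

Tokens placed in the same cluster compete only with each other at the second level, which is why the quality of the clustering matters for the decoder.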


[190] 2505.04397

PURe: A Plug-and-Play Product-Unit Residual Module for Vision Networks

Modern vision networks are dominated by additive local transformations, whereas explicit multiplicative local interactions remain underexplored. Product units offer a direct approach to modeling such interactions, but their use in deep architectures has been limited by optimization instability. In this work, we propose PURe, a Product-Unit Residual Module for deep vision networks. PURe is built around a 2D Product Unit with a real-valued log-domain formulation that makes multiplicative local aggregation practical within deep residual hierarchies. The resulting module serves as a drop-in replacement for native residual units. We instantiate PURe in residual CNNs for image classification and in 2D residual encoder-decoder networks for slice-based segmentation on volumetric CT data. Across Galaxy10 DECaLS, ImageNet, and CIFAR-10, PURe consistently improves residual CNNs and yields a more favorable accuracy-parameter trade-off, allowing moderately deep models to match or surpass substantially deeper ResNet baselines with much smaller parameter budgets. On the AMOS benchmark, PURe also improves slice-based CT segmentation under 3D case-level evaluation. These results show that explicit multiplicative local interaction is a practical and effective design primitive for deep residual vision networks.


[191] 2506.16738

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization

With the rapid progress of speech language models (SLMs), discrete speech tokens have emerged as a core interface between speech and text, enabling unified modeling across modalities. Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models (LMs). In particular, previous methods use self-supervised learning (SSL) teachers such as HuBERT to extract semantic representations, which are then distilled into a semantic quantizer to suppress acoustic redundancy as well as capture content-related latent structures. However, these tokenizers often operate at relatively high frame rates, producing token sequences significantly longer than their textual counterparts and hindering seamless integration with pretrained LMs. Although recent methods attempt to reduce the token rate by applying uniform average pooling to SSL features, this can over-smooth content-bearing regions and dilute the structural information, thereby potentially limiting the LM alignment. To address this, we propose LM-SPT, an LM-aligned speech tokenization method based on semantic speech-resynthesis distillation. Instead of directly matching teacher and student features via pooling, LM-SPT resynthesizes speech from semantic tokens only and minimizes the discrepancy between representations extracted from the original and resynthesized waveforms using a frozen, LM-aligned speech encoder. This indirect supervision avoids rigid temporal alignment and encourages dedicated semantic units that are more semantically aligned with LMs under reduced frame rates. Experimental results show that the proposed LM-SPT consistently outperforms previous semantic-enhanced speech tokenizers when applied to SLMs for the tasks of automatic speech recognition and text-to-speech, even without compromising the speech reconstruction fidelity at the codec level.


[192] 2506.21613

ChildGuard: A Specialized Dataset for Combatting Child-Targeted Hate Speech

The mental health industry faces growing concerns regarding hate speech directed at children on social media, as exposure to such content can contribute to adverse psychological outcomes during critical stages of development. Current hate speech datasets and detection systems provide limited support for child-focused applications because they are primarily designed for adults and lack dedicated representations of age-specific characteristics associated with hate speech directed at children. To address this gap, we introduce ChildGuard, a large-scale English dataset for child-targeted hate speech containing 351,877 annotated instances collected from X (formerly Twitter), Reddit, and YouTube. The dataset covers three age groups: younger children (under 11), pre-teens (11-12), and teens (13-17). ChildGuard contains two subsets: a contextual subset (157K) and a lexical subset (194K). Evaluation using recent transformer-based models and LLMs achieves a best Macro-F1 of 82.07%, decreasing to 79.41%, 79.24%, 76.04%, and 74.88% on the younger children, contextual, implicit hate, and cross-subset settings, respectively.


[193] 2507.01113

Stannic: Systolic STochAstic ONliNe SchedulIng AcCelerator

Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workloads due to scheduling overhead, lack of adaptability to stochastic workloads, and suboptimal resource utilization. The scheduling problem further compounds in the context of shared HPC clusters, where job arrivals and processing times are inherently stochastic. Prediction of these elements is possible, but it introduces additional overhead. To perform this complex scheduling, we developed two FPGA-assisted hardware accelerator microarchitectures, Hercules and Stannic. Hercules adopts a task-centric abstraction of stochastic scheduling, whereas Stannic inherits a schedule-centric abstraction. These hardware-assisted solutions leverage parallelism, pre-calculation, and spatial memory access to significantly accelerate scheduling. We accelerate a non-preemptive stochastic online scheduling algorithm to produce heterogeneity-aware schedules in near real time. With Hercules, we achieved a speedup of up to 1060x over a baseline C/C++ implementation, demonstrating the efficacy of a hardware-assisted acceleration for heterogeneity-aware stochastic scheduling. With Stannic, we further improved efficiency, achieving a 7.5x reduction in latency per computation iteration and a 14x increase in the target heterogeneous system size. Experimental results show that the resulting schedules demonstrate efficient machine utilization and low average job latency in stochastic contexts.


[194] 2507.07879

LISTEN: Lightweight Industrial Sound-representable Transformer for Edge Notification

Deep learning-based machine listening is broadening the scope of industrial acoustic analysis, yet its widespread implementation on live shop floors is hindered by the reliance on large, task-specific annotated datasets for every new task. While emerging general-purpose sound foundation models aim to alleviate data dependency, they reveal critical dilemmas in practice. General-purpose sound foundation models are computationally expensive and fail in industrial scenarios characterized by tonal harmonics, broadband noise, and transient fault events, making instant, on-site deployment impractical. These challenges combined mean that a practical, end-to-end system for deploying a sound foundation model on a live shop floor has remained elusive. To address this challenge, this study introduces LISTEN (Lightweight Industrial Sound-representable Transformer for Edge Notification), the first lightweight foundation model specialized for industrial sound. Through Knowledge Distillation (KD) from the large-scale teacher model IMPACT (Industrial Machine Perception via Acoustic Cognitive Transformer), we construct LISTEN optimized for resource-constrained edge environments. By freezing the backbone and training only a shallow head on minimal target-process data, rather than performing full fine-tuning or retraining, LISTEN achieves nearly identical performance to IMPACT across diverse manufacturing processes. This study further demonstrates a complete system for real-time machine monitoring, encompassing data acquisition with Industrial Internet of Things (IIoT) devices, rapid model adaptation using minimal annotated data, and real-time monitoring on a low-cost edge device. By validating the entire system on a live CNC machine, this work establishes the first feasible end-to-end system for deploying a lightweight industrial sound foundation model in an active industrial environment.


[195] 2509.16975

Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs

Automatic mean opinion score (MOS) prediction serves as a principled alternative to both subjective listening tests and objective metrics, providing scalable and consistent audio evaluation. Inspired by the LLM-as-Judge paradigm, recent multimodal large language models offer strong perceptual modeling and reasoning capabilities, enabling audio quality assessment. In this work, we address the challenging problem of audio editing evaluation and propose the first natural language-based automated evaluation framework built upon Qwen2-Audio. Two caption-based fine-tuning tasks are introduced to enhance multi-audio understanding, together with a designed Chain-of-Thought prompting strategy to encourage structured, step-by-step reasoning. Experiments show that our framework produces interpretable and logically consistent text-based evaluations, aligning closely with human judgments while outperforming existing baselines. The code and demo are available at this https URL.


[196] 2510.01175

On the Benefits of Weight Normalization for Overparameterized Matrix Sensing

While normalization techniques are widely used in deep learning, their theoretical understanding remains relatively limited. In this work, we establish the benefits of (generalized) weight normalization (WN) applied to the overparameterized matrix sensing problem. We prove that WN with Riemannian optimization achieves linear convergence, yielding an exponential speedup over standard methods that do not use WN. Our analysis further demonstrates that both iteration and sample complexity improve polynomially as the level of overparameterization increases. To the best of our knowledge, this work provides the first characterization of how WN leverages overparameterization for faster convergence in matrix sensing.


[197] 2510.07096

Modeling Sarcastic Speech: Semantic and Prosodic Cues in a Speech Synthesis Framework

Sarcasm is a pragmatic phenomenon in which speakers convey meanings that diverge from literal content, relying on an interaction between semantics and prosodic expression. However, how these cues jointly contribute to the recognition of sarcasm remains poorly understood. We propose a computational framework that models sarcasm as the integration of semantic interpretation and prosodic realization. Semantic cues are derived from an LLaMA 3 model fine-tuned to capture discourse-level markers of sarcastic intent, while prosodic cues are extracted through semantically aligned utterances drawn from a database of sarcastic speech, providing prosodic exemplars of sarcastic delivery. Using a speech synthesis testbed, perceptual evaluations show that semantic and prosodic cues enhance perceived sarcasm, with the combined system achieving the best downstream F1 while maintaining high subjective sarcasm ratings. These findings highlight the complementary roles of semantics and prosody in pragmatic interpretation and illustrate how modeling can shed light on the mechanisms underlying sarcastic communication.


[198] 2511.14071

Deep-Learning Based Super-Resolution Functional Ultrasound Imaging of Transient Brain-Wide Neurovascular Activity on a Microscopic Scale

Transient brain-wide neuroimaging on a microscopic scale is pivotal for brain research, yet existing imaging modalities face challenges in meeting such spatiotemporal requirements. Functional ultrasound (fUS) enables transient neurovascular imaging through red blood cell backscattering, but suffers from diffraction-limited spatial resolution. Functional ultrasound localization microscopy (fULM) has addressed this limitation by integrating ULM with fUS, but this approach requires repeated stimulation and data accumulation. Here, we introduce super-resolution functional ultrasound (SR-fUS), a deep learning-based framework that reconstructs super-resolution ULM images from contrast-free ultrafast Doppler data. By incorporating red blood cell radial gradient fluctuation priors with uncertainty-driven loss, SR-fUS enables microscopic scale hemodynamic imaging with 25-$\mu$m spatial resolution. In rat brains, SR-fUS visualized transient pain-evoked hemodynamic responses, distinguished stimulus-specific microvascular activation patterns during single-whisker stimulation, and dynamically tracked isoflurane anesthesia-induced microvascular dilation. The accuracy of SR-fUS was further preliminarily assessed through a comparative study with two-photon microscopy.


[199] 2601.13565

Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. To systematically eliminate background interference, FiCoP first employs an object-centric disentanglement step to isolate the target from macro-level environmental noise. Building upon this localized region, our core methodological innovations are twofold. Firstly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning and text-guided semantic injection. Secondly, we design a Patch Correlation Predictor (PCP) that leverages a patch-to-patch correlation matrix as a structural prior. This generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at this https URL.


[200] 2602.13344

FireRed-Image-Edit-1.0 Technical Report

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. To support future research, our code, models, and benchmark suite are publicly available at this https URL .


[201] 2602.14780

ROSA: Roundabout Optimized Speed Advisory with Multi-Agent Trajectory Prediction in Multimodal Traffic

We present ROSA -- Roundabout Optimized Speed Advisory -- a system that combines multi-agent trajectory prediction with coordinated speed guidance for multimodal, mixed traffic at roundabouts. Using a Transformer-based model, ROSA jointly predicts the future trajectories of vehicles and Vulnerable Road Users (VRUs) at roundabouts. Trained for single-step prediction and deployed autoregressively, it generates deterministic outputs, enabling actionable speed advisories. Incorporating motion dynamics, the model achieves high accuracy (ADE: 1.29m, FDE: 2.99m at a five-second prediction horizon), surpassing prior work. Adding route intention further improves performance (ADE: 1.10m, FDE: 2.36m), demonstrating the value of connected vehicle data. Based on predicted conflicts with VRUs and circulating vehicles, ROSA provides real-time, proactive speed advisories for approaching and entering the roundabout. Despite prediction uncertainty, ROSA significantly improves vehicle efficiency and safety, with positive effects even on perceived safety from a VRU perspective. The source code of this work is available under: this http URL.
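For readers unfamiliar with the two reported metrics, the sketch below computes the average displacement error (ADE) and final displacement error (FDE) under their usual definitions for a single agent. The function name and the toy trajectories are illustrative only, and how ROSA aggregates these errors over agents and scenes is not stated in the abstract.

```python
import numpy as np

def ade_fde(pred, truth):
    """ADE and FDE for one trajectory.

    pred, truth: arrays of shape (T, 2) holding x/y positions in metres.
    ADE is the mean Euclidean error over all T steps; FDE is the error
    at the final step of the prediction horizon.
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    dist = np.linalg.norm(pred - truth, axis=-1)  # per-step Euclidean error
    return dist.mean(), dist[-1]

# Toy example: the prediction drifts sideways by 1 m per step.
truth = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
pred = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
ade, fde = ade_fde(pred, truth)
```

Because errors tend to accumulate under autoregressive rollout, FDE is typically larger than ADE, as in the figures quoted above.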


[202] 2602.24012

InfoNCE Induces Gaussian Distribution

Contrastive learning has become a cornerstone of modern representation learning, allowing training with massive unlabeled data for both task-specific and general (foundation) models. A prototypical loss in contrastive training is InfoNCE and its variants. In this work, we show that the InfoNCE objective induces Gaussian structure in representations that emerge from contrastive training. We establish this result in two complementary regimes. First, we show that under certain alignment and concentration assumptions, projections of the high-dimensional representation asymptotically approach a multivariate Gaussian distribution. Next, under less strict assumptions, we show that adding a small asymptotically vanishing regularization term that promotes low feature norm and high feature entropy leads to similar asymptotic results. We support our analysis with experiments on synthetic and CIFAR-10 datasets across multiple encoder architectures and sizes, demonstrating consistent Gaussian behavior. This perspective provides a principled explanation for commonly observed Gaussianity in contrastive representations. The resulting Gaussian model enables principled analytical treatment of learned representations and is expected to support a wide range of applications in contrastive learning.
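A minimal NumPy sketch of one common form of the InfoNCE loss may help fix ideas. It assumes cosine similarity, in-batch negatives, and a temperature of 0.1; the paper may analyze a different variant, and all names here are illustrative.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE with in-batch negatives: row i of `positives` is the
    positive for row i of `anchors`; every other row acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                   # cross-entropy on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = info_nce(z, z)         # each positive matches its anchor exactly
shuffled = info_nce(z, z[::-1])  # positives deliberately mismatched
```

The loss is low when each anchor is most similar to its own positive and high when the pairing is scrambled.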


[203] 2603.01016

Implementation of Licensed Plate Detection and Noise Removal in Image Processing

A car license plate recognition system is an image processing technology used to identify vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, or optical character recognition for cars. In Malaysia, the rapidly increasing number of vehicles on the road has created considerable demand for car license plate recognition systems. Such systems can be implemented in electronic parking payment systems, highway toll-fee systems, traffic surveillance systems, and as police enforcement tools. Additionally, car license plate recognition technology has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.


[204] 2603.05373

MSpoofTTS: Multi-Resolution Spoof-Guided Inference for Discrete Speech Synthesis

Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and high-quality codec-based speech generation. Audio samples and code are available.


[205] 2603.07676

A Primer on Evolutionary Optimization Frameworks for Near-Field Multi-Source Localization

This paper introduces evolutionary optimization as a grid-free, training-free, continuous-domain search mechanism for near-field multi-source localization, addressing the major limitations of grid-based subspace methods such as MUSIC and data-driven deep learning approaches. To this end, we develop two complementary evolutionary localization frameworks that operate directly on the continuous spherical-wave signal model and support arbitrary array geometries without requiring labeled data, discretized angle-range grids, or architectural constraints. The first framework, termed NEar-field MultimOdal DE (NEMO-DE), associates each individual in the evolutionary population with a single source and optimizes a residual least-squares objective in a sequential manner, updating the data residual and enforcing spatial separation to estimate multiple source locations. To overcome the limitation of NEMO-DE under large power imbalances among the sources, we propose the second framework, named NEar-field Eigen-subspace Fitting DE (NEEF-DE), which jointly encodes all source locations and minimizes a subspace-fitting criterion that aligns a model-based array response subspace with the received signal subspace. The proposed formulations are not intrinsically tied to a specific optimizer; however, this work adopts differential evolution (DE) as a representative evolutionary search strategy because of its simple implementation, small number of control parameters, and strong empirical performance in continuous nonconvex optimization problems. Numerical results show that the proposed frameworks provide competitive accuracy compared with MUSIC-type baselines while avoiding pre-defined grid construction and labeled training data. This work establishes evolutionary computation as a powerful and flexible paradigm for model-based near-field localization, paving the way for future innovations in this domain.


[206] 2603.10562

Quantization Robustness of Monotone Operator Equilibrium Networks

Monotone operator equilibrium networks are implicit-layer models whose output is the unique equilibrium of a monotone operator, guaranteeing existence, uniqueness, and convergence. When deployed on low-precision hardware, weights are quantized, potentially destroying these guarantees. We analyze weight quantization as a spectral perturbation of the underlying monotone inclusion. Convergence of the quantized solver is guaranteed whenever the spectral-norm weight perturbation is smaller than the monotonicity margin; the displacement between quantized and full-precision equilibria is bounded in terms of the perturbation size and margin; and a condition number characterizing the ratio of the operator norm to the margin links quantization precision to forward error. MNIST experiments confirm a phase transition at the predicted threshold: three- and four-bit post-training quantization diverge, while five-bit and above converge. The backward-pass guarantee enables quantization-aware training, which recovers provable convergence at four bits.
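The convergence condition stated above can be checked numerically on a toy weight matrix. The sketch assumes a commonly used monotone parameterization, $W = (1-m)I - A^\top A + B - B^\top$, and a uniform symmetric quantizer; both are assumptions for illustration rather than details confirmed by the abstract. The matrices are random, so the bit widths at which the check passes will not match the reported MNIST thresholds.

```python
import numpy as np

def quantize_symmetric(W, bits):
    """Uniform symmetric quantizer (an assumed scheme, for illustration)."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

def monotonicity_margin(W):
    """Smallest eigenvalue of the symmetric part of I - W."""
    M = np.eye(W.shape[0]) - W
    return np.linalg.eigvalsh((M + M.T) / 2).min()

rng = np.random.default_rng(0)
n, m = 8, 0.5  # hypothetical layer width and margin parameter
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
# With this parameterization, I - W has symmetric part m*I + A^T A.
W = (1 - m) * np.eye(n) - A.T @ A + B - B.T

margin = monotonicity_margin(W)
report = {}
for bits in (3, 4, 5, 8, 16):
    # Spectral norm of the weight perturbation introduced by quantization.
    perturbation = np.linalg.norm(quantize_symmetric(W, bits) - W, ord=2)
    report[bits] = (perturbation, perturbation < margin)
```

Each entry of `report` pairs the spectral-norm perturbation with whether it falls below the margin, which is the sufficient condition for convergence described in the abstract.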


[207] 2603.15606

Saddle Point Evasion via Curvature-Regularized Gradient Dynamics

Nonconvex optimization underlies many modern machine learning and control tasks, where saddle points pose the dominant obstacle to reliable convergence in high-dimensional settings. Escaping these saddle points deterministically using continuous-time optimization remains an open challenge: gradient descent is blind to curvature, stochastic perturbation methods lack deterministic guarantees, and Newton-type approaches suffer from Hessian singularity. Adopting the perspective of viewing optimization algorithms as dynamical systems, we present Curvature-Regularized Gradient Dynamics (CRGD), which augments the objective with a smooth penalty on the negative Hessian eigenvalues, yielding an augmented cost that serves as an optimization Lyapunov function with user-selectable convergence rates to second-order stationary points. Numerical experiments confirm that CRGD converges to second-order stationary points, even in regimes where gradient descent fails.


[208] 2605.01101

Virtual Speech Therapist: A Clinician-in-the-Loop AI Speech Therapy Agent for Personalized and Supervised Therapy

This paper develops Virtual Speech Therapist (VST), an intelligent agent-based platform that streamlines stuttering assessment and delivers customized therapy planning through automated and adaptive AI-driven workflows. VST integrates state-of-the-art deep learning-based stuttering classification with multi-agent large language model (LLM) reasoning to support evidence-based clinical decision-making. VST begins with the acquisition and feature extraction of patient speech samples, followed by robust classification of stuttering types. Building on these outputs, VST initiates an agentic reasoning process in which specialized LLM agents autonomously generate, critique, and iteratively refine individualized therapy plans. A dedicated critic agent evaluates all generated therapy plans to ensure clinical safety, methodological soundness, and alignment with peer-reviewed evidence and established professional guidelines. The resulting output is a comprehensive, patient-specific therapy draft intended for clinician review. Incorporating clinician feedback, the system then produces a finalized therapy plan suitable for patient delivery, thereby maintaining a clinician-in-the-loop paradigm. Experimental evaluation by expert speech therapists confirms that VST consistently generates high-quality, evidence-based therapy recommendations. These findings demonstrate the system's potential to augment clinical workflows, reduce clinician burden, and improve therapeutic outcomes for individuals with speech impairments. An interactive user interface for the proposed system is available online at this https URL, facilitating real-time stuttering assessment and personalized therapy planning.


[209] 2605.18909

Descriptive versus Regulatory Uncertainty in Bounded Predictive Systems

Any system that models the world under finite representational capacity must compress; any compression entails a prior; and the prior is the system's bias. What has not been established is whether uncertainty participates in the dynamics governing future behavior, or merely describes the output distribution without consequence. We introduce a structural distinction between descriptive uncertainty, which does not recursively modulate the system's policy, and regulatory uncertainty, which directly enters the optimization landscape and drives persistent adaptive restructuring. We prove formally that current transformer architectures are confined to descriptive uncertainty at inference. We ground this in thermodynamics via Landauer's principle: for uncertainty to be regulatory, epistemic error must cost real energy; in a decoupled system, hallucinations and correct derivations dissipate identical energy. We test this empirically across three locally-deployed language models (3B, 8B, 70B parameters). Token-level Shannon entropy is statistically invariant across tasks spanning pattern retrieval, causal operator application, and out-of-distribution causal generalization in all three models (all pairwise p >= 0.568; within-model ranges 0.011-0.028 nats), while task accuracy varies substantially across the same conditions (0%-100%). Entropy and accuracy are orthogonal. The decoupling is scale-invariant: larger models achieve higher accuracy but identical entropy flatness. This structural incapacity is not resolvable by additional parameters or training data. Genuine epistemic grounding requires physical coupling between thermodynamic substrate state and information processing cost.
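
For readers unfamiliar with the measured quantity, the sketch below computes token-level Shannon entropy in nats from next-token logits. It is a minimal illustration, not the authors' evaluation code. The function names are hypothetical, and averaging the per-token entropies over a sequence is an assumption, since the abstract does not say how entropies were aggregated.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities, shifting by the max for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shannon_entropy_nats(probs):
    """H(p) = -sum_i p_i ln p_i, skipping zero-probability entries."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_token_entropy(logits_per_token):
    """Average entropy over generated tokens (the aggregation is an assumption)."""
    entropies = [shannon_entropy_nats(softmax(l)) for l in logits_per_token]
    return sum(entropies) / len(entropies)
```

A uniform distribution over V tokens gives the maximum value ln V, and a one-hot distribution gives 0, so the reported within-model ranges of 0.011-0.028 nats are small on that scale for any realistic vocabulary size.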


[210] 2606.08583

A spectral audit framework reveals task-dependent aperiodic reliance across EEG and ECG deep learning

Deep learning on physiological time series is interpreted through domain-specific features (oscillatory rhythms in EEG, morphological complexes in ECG), yet these signals sit atop a broadband aperiodic 1/f-like envelope that covaries with arousal, age, and pathology. We introduce a spectral audit framework combining aperiodic/periodic decomposition, phase-preserving Fourier interventions, sham controls, and simulation validation. Aperiodic reliance was task-dependent and architecture-general: across six neural architectures, flattening drops exceeded 0.42 balanced-accuracy points for sleep-wake classification, reached 0.07-0.13 for clinical abnormality detection, and remained minimal for motor imagery. Six of seven EEG foundation models showed FDR-significant aperiodic reliance on clinical EEG; age/sex and recording-era controls reduced but did not eliminate the effect. Applying the audit to PTB-XL ECG revealed neural drops of 0.32-0.36 persisting after demographic matching, confirming this confound class extends beyond EEG. Aperiodic controls should become standard for interpretable physiological time-series deep learning.


[211] 2606.08594

How Much Capacity Does EEG Denoising Need? Ultra-Compact Networks reveal Benchmark Saturation and Metric-Utility Gap

Deep learning EEG denoising architectures have scaled from tens of thousands to tens of millions of parameters, yet no prior study has isolated model capacity as the experimental variable or tested whether reconstruction metrics predict downstream neural-signal utility. We address both gaps by fixing architecture, loss, data split, and training recipe while sweeping only channel width from 1.05K to 40.26K parameters in a minimal depthwise-separable convolutional U-Net. Models were evaluated on the EEGDenoiseNet benchmark, cross-dataset BCI transfer tests, controlled baseline retraining, and downstream motor-imagery classification with five decoder families across all nine BCI Competition IV-2a subjects. Reconstruction performance saturated by 3-6.5K parameters, with post-elbow gains of at most 0.015 correlation coefficient per log10-parameter unit. An 8.46M-parameter baseline retrained under the same pipeline matched the 40.26K compact variant on EOG (a 200x parameter gap yielding no advantage), while a Patch-Transformer control reproduced the same diminishing-return shape. Downstream evaluation exposed a classifier-dependent metric-utility gap: reconstruction-optimized denoising significantly degraded CSP+LDA classification across all nine subjects and three artifact types (best denoised accuracy 0.547 vs. 0.612 noisy baseline; Bonferroni p=0.0488), persisting on naturally recorded trials (Delta=-0.047; BH-FDR q=0.0049). End-to-end neural decoders showed variable or neutral effects. Standard EEG denoising benchmarks are saturated far below current model capacity, and reconstruction metrics do not predict BCI utility. Ultra-compact models at 33-46 KB and 1.27-2.61M FLOPs/segment are practical for edge deployment. These findings argue for capacity-controlled evaluation, harder task-aware benchmarks, and mandatory downstream validation.


[212] 2606.09717

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

Prosody plays an important role in sarcasm perception, yet previous studies have relied on naturally produced speech that lacks fine-grained control over individual acoustic dimensions. As prosodic cues co-vary in natural data, isolating their independent contributions remains challenging. We introduce a controlled framework using neural text-to-speech (TTS) with prompt-based prosodic conditioning to manipulate speech rate, pitch variation, and loudness. An orthogonal stimulus set was constructed to enable causal testing of prosodic cue effects. Human listeners rated sarcasm and naturalness, and their judgments were compared with predictions from a foundation model capable of processing audio input. Results show that loudness primarily drives human sarcasm perception, whereas the model assigns greater weight to speech rate, leading to distinct cue-weighting patterns. This study shows how controllable neural TTS enables investigation of prosodic cue weighting in speech perception.


[213] 2606.11474

Mahalanobis-Guided Latent OOD Detection for Hybrid ES-DRL Control in Time-Varying Systems

In this paper, we study Mahalanobis-guided latent out-of-distribution (OOD) detection for test-time RL controller switching in nonlinear time-varying systems. RL controllers can quickly control high-dimensional systems within the training distribution, but their performance can degrade when time-varying dynamics produce unseen observations. We consider a combined ES-DRL controller, where RL provides fast in-distribution actions and bounded extremum seeking (ES) provides robust model-independent control under OOD operation. The key challenge is deciding when to switch. We train a variational autoencoder (VAE) on in-distribution beam-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time. This OOD decision sets a binary switch that selects either the RL controller or the ES controller. We evaluate the approach in safety-critical particle accelerator control. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller.
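
The switching rule described above can be sketched as follows. This is a simplified illustration, not the authors' implementation. It assumes the latent vectors have already been produced by a trained VAE encoder, approximates the latent covariance as diagonal to stay dependency-free (a full covariance matrix would be the more general choice), and uses a hypothetical `threshold` whose value the abstract does not give.

```python
import math
from statistics import fmean, pvariance

def fit_latent_stats(latents):
    """Per-dimension mean and variance of in-distribution latent vectors."""
    dims = list(zip(*latents))
    means = [fmean(d) for d in dims]
    # Small floor keeps the distance finite if a dimension is constant.
    variances = [pvariance(d, mu) + 1e-8 for d, mu in zip(dims, means)]
    return means, variances

def mahalanobis_diag(z, means, variances):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((x - mu) ** 2 / v for x, mu, v in zip(z, means, variances)))

def select_controller(z, means, variances, threshold):
    """Binary switch: ES when the latent looks OOD, RL otherwise."""
    return "ES" if mahalanobis_diag(z, means, variances) > threshold else "RL"
```

In use, `fit_latent_stats` would be called once on latents from the training distribution, and `select_controller` once per control step on the latent of the current observation.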


[214] 2606.12978

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still $\textit{appears}$ to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as $\textit{command-preserving trajectory redirection}$, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: this https URL


[215] 2606.14027

Same-Origin Policy for Agentic Browsers

Agentic browsers integrate autonomous AI agents into web browsers, enabling users to accomplish web tasks through natural-language instructions. The same-origin policy (SOP) is a fundamental browser security mechanism that prevents unauthorized automated cross-origin data flows induced by scripts. However, whether SOP remains effective in agentic browsers is an open question that has not been systematically studied. In this work, we bridge this gap. We first observe that an agentic browser can itself serve as an automated channel for cross-origin data flows, potentially leading to SOP violations. To investigate this phenomenon, we construct SOPBench, a benchmark for evaluating SOP violations in agentic browsers. Our evaluation shows that existing agentic browsers frequently violate SOP, both in benign settings and under attacks. To address this problem, we propose SOPGuard, an SOP enforcement mechanism tailored to agentic browsers. We implement SOPGuard in BrowserOS, an open-source agentic browser. Extensive evaluations demonstrate that SOPGuard effectively enforces SOP while preserving utility and incurring only a small runtime overhead. Our code and data are available at this https URL.
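
As background for the policy itself, the sketch below shows the standard origin comparison that SOP rests on: two URLs share an origin only if scheme, host, and port all match. It is not SOPGuard, whose mechanism the abstract does not describe. The helper names are hypothetical, and only the default ports for `http` and `https` are filled in.

```python
from urllib.parse import urlsplit

# Default ports for the two common web schemes (other schemes are left as None).
DEFAULT_PORTS = {"http": 80, "https": 443}

def origin(url):
    """Return the (scheme, host, port) tuple that defines a web origin."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(scheme)
    return (scheme, host, port)

def same_origin(url_a, url_b):
    """True only when scheme, host, and port all match."""
    return origin(url_a) == origin(url_b)
```

Under this check a subdomain, a different scheme, or a non-default port each counts as a different origin, which is why an agent that reads from one site and writes to another can act as a cross-origin channel.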