LLM-based automatic speech recognition (ASR), a well-established approach, connects speech foundation models to large language models (LLMs) through a speech-to-LLM projector, yielding promising results. A common design choice in these architectures is the use of a fixed, manually defined prompt during both training and inference. This setup is intended not only to enable applicability across a range of practical scenarios, but also to help maximize model performance. However, the impact of prompt design remains underexplored. This paper presents a comprehensive analysis of commonly used prompts across diverse datasets, showing that prompt choice significantly affects ASR performance and introduces instability, with no single prompt performing best across all cases. Inspired by the speech-to-LLM projector, we propose a prompt projector module, a simple, model-agnostic extension that learns to project prompt embeddings to more effective regions of the LLM input space, without modifying the underlying LLM-based ASR model. Experiments on four datasets show that the addition of a prompt projector consistently improves performance, reduces variability, and outperforms the best manually selected prompts.
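To make the idea concrete, here is a minimal sketch of what such a prompt projector might look like, assuming a residual two-layer MLP over the prompt token embeddings; the abstract does not fix the architecture, so all sizes and names below are illustrative:

```python
import torch
import torch.nn as nn

class PromptProjector(nn.Module):
    """Maps prompt token embeddings to a (hopefully) more effective
    region of the LLM input space. Illustrative sketch only; the paper
    does not specify this exact architecture."""
    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, prompt_emb: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the projected prompt close to the
        # original embedding, so training starts near the identity map.
        return prompt_emb + self.net(prompt_emb)

# Usage: project the prompt embeddings before concatenating them with
# the speech features produced by the speech-to-LLM projector.
proj = PromptProjector(dim=4096)
prompt = torch.randn(1, 12, 4096)   # (batch, prompt tokens, LLM dim)
projected = proj(prompt)
```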
Cardiac Magnetic Resonance (CMR) imaging provides a comprehensive assessment of cardiac structure and function but remains constrained by high acquisition costs and reliance on expert annotations, limiting the availability of large-scale labeled datasets. In contrast, electrocardiograms (ECGs) are inexpensive, widely accessible, and offer a promising modality for conditioning the generative synthesis of cine CMR. To this end, we propose ECGFlowCMR, a novel ECG-to-CMR generative framework that integrates a Phase-Aware Masked Autoencoder (PA-MAE) and an Anatomy-Motion Disentangled Flow (AMDF) to address two fundamental challenges: (1) the cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and (2) the anatomical observability gap due to the limited structural information inherent in ECGs. Extensive experiments on the UK Biobank and a proprietary clinical dataset demonstrate that ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.
High-quality Fourier Transform Infrared (FTIR) imaging usually needs extensive signal averaging to reduce noise and drift, which severely limits clinical speed. Deep learning can accelerate imaging by reconstructing spectra from rapid, single-scan inputs. However, separating noise and baseline drift simultaneously without ground truth is an ill-posed inverse problem. Standard black-box architectures often rely on statistical approximations that introduce spectral hallucinations or fail to generalize to unstable atmospheric conditions. To solve these issues, we propose a physics-informed cascade Unet that separates denoising and baseline correction tasks using a new, deterministic Physics Bridge. This architecture forces the network to separate random noise from chemical signals using an embedded SNIP layer to enforce spectroscopic constraints instead of learning statistical approximations. We benchmarked this approach against a standard single Unet and a traditional Savitzky-Golay/SNIP workflow, using a dataset of human hypopharyngeal carcinoma cells (FaDu). The cascade model outperformed all other methods, achieving a 51.3% reduction in RMSE compared to raw single-scan inputs, surpassing both the single Unet (40.2%) and the traditional workflow (33.7%). Peak-aware metrics show that the cascade architecture eliminates spectral hallucinations found in standard deep learning and preserves peak intensity with much higher fidelity than traditional smoothing. These results show that the cascade Unet is a robust solution for diagnostic-grade FTIR imaging, enabling imaging speeds 32 times faster than current methods.
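For readers unfamiliar with SNIP, the following sketch shows the standard clipping algorithm that an embedded SNIP layer builds on, assuming a nonnegative input spectrum; the iteration count and test signal are illustrative choices, not the paper's settings:

```python
import numpy as np

def snip_baseline(spectrum: np.ndarray, iterations: int = 40) -> np.ndarray:
    """Estimate a baseline with the SNIP clipping algorithm.
    Standalone sketch; assumes a nonnegative spectrum."""
    # Log-log-square-root (LLS) operator compresses the dynamic range.
    v = np.log(np.log(np.sqrt(spectrum + 1.0) + 1.0) + 1.0)
    n = len(v)
    for m in range(1, iterations + 1):
        # Clip each point to the mean of its neighbours at distance m.
        clipped = v.copy()
        clipped[m:n - m] = np.minimum(
            v[m:n - m], 0.5 * (v[:n - 2 * m] + v[2 * m:])
        )
        v = clipped
    # Invert the LLS operator to return to the original scale.
    return (np.exp(np.exp(v) - 1.0) - 1.0) ** 2 - 1.0

x = np.linspace(0.0, 1.0, 1000)
y = 2 * x + 0.5 * np.exp(-((x - 0.4) / 0.01) ** 2) + 0.01 * np.random.randn(1000)
y = np.clip(y, 0.0, None)
baseline = snip_baseline(y)
corrected = y - baseline          # the peak survives, the drift is removed
```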
In this paper, we propose a method for designing sparse Grassmannian codes for noncoherent multiple-input multiple-output systems. Conventional pairwise error probability formulations under uncorrelated Rayleigh fading channels fail to account for rank deficiency induced by sparse configurations; we revise these formulations to handle such cases in a unified manner. Furthermore, we derive a closed-form metric that effectively maximizes the noncoherent average mutual information (AMI) at a given signal-to-noise ratio. Exploiting the fact that the Schubert cell decomposition of the Grassmann manifold yields an inherently sparse structure, we establish design criteria for sparse noncoherent codes based on our analyses. Numerical results show that the proposed sparse noncoherent codes outperform conventional methods in terms of both symbol error rate and AMI, and asymptotically approach the performance of the optimal Grassmannian constellations in the high-signal-to-noise-ratio regime. Moreover, they reduce time and space complexity, which no longer scales with the number of transmit antennas.
Distributed energy storage devices can be pooled and coordinated by aggregators to participate in power system operations and market clearings. This requires representing a massive device population as a single, tractable surrogate that is computationally efficient, accurate, and compatible with market participation requirements. However, surrogate identification is challenging due to the heterogeneity, nonconvexity, and high dimensionality of storage devices. To address these challenges, this paper develops a mean-field learning framework for storage aggregation. We interpret aggregation as the average behavior of a large storage population and show that, as the population grows, aggregate performance converges to a unique, convex mean-field limit, enabling tractable population-level modeling. This convexity further yields a price-responsive characterization of aggregate storage behavior and allows us to bound the mean-field approximation error. Leveraging these results, we construct a convex surrogate model that approximates the aggregate behavior of large storage populations and can be embedded directly into power system operations and market clearing. Surrogate parameter identification is formulated as an optimization problem using historical market price-response data, and we adopt a gradient-based algorithm for efficient learning. Case studies validate the theoretical findings and demonstrate the effectiveness of the proposed framework in approximation accuracy, data efficiency, and profit outcomes.
Microtransit systems offer a promising solution to the first- and last-mile problem, combining traditional rail and bus networks with on-demand shuttles into a flexible, integrated system. This type of demand responsive transport provides greater accessibility and higher quality levels of service compared to conventional fixed-route transit services. Advances in technology offer further opportunities to enhance microtransit performance. In particular, shared autonomous vehicles (SAVs) have the potential to transform the mobility landscape by enabling more sustainable operations, enhanced user convenience, and greater system reliability. This paper investigates the integration of SAVs in microtransit systems, advancing the technological capabilities of on-demand shuttles. A shuttle dispatching optimization model is enhanced to accommodate driver behavior and SAV functionalities. A model predictive control approach is proposed that dynamically rebalances on-demand shuttles towards areas of higher demand without relying on vast historical data. Scenario-driven experiments are conducted using data from the MARTA Reach microtransit pilot. The results demonstrate that SAVs can elevate both service quality and user experience compared to traditional on-demand shuttles in microtransit systems.
Model compression has become an important tool for making image super resolution models more efficient. However, the gap between the best compressed models and the full precision model remains large, and a deeper understanding of compression on more performant models is still needed. Prior research on quantization of LLMs has shown that Hadamard transformations lead to weights and activations with reduced outliers, which leads to improved performance. We argue that while the Hadamard transform does reduce the effect of outliers, an empirical analysis of how the transform functions is still needed. By studying the distributions of weights and activations of SwinIR-light, we show with statistical analysis that lower errors are caused by the Hadamard transform's ability to reduce value ranges and increase the proportion of values around $0$. Based on these findings, we introduce CompSRT, a more performant way to compress the image super resolution transformer network SwinIR-light. We perform Hadamard-based quantization, and we also perform scalar decomposition to introduce two additional trainable parameters. Our quantization performance statistically significantly surpasses the SOTA in metrics with gains as large as 1.53 dB, and visibly improves visual quality by reducing blurriness at all bitwidths. At $3$-$4$ bits, to show our method is compatible with pruning for increased compression, we also prune $40\%$ of weights and show that we can achieve a $6.67$-$15\%$ reduction in bits per parameter with comparable performance to SOTA.
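A minimal illustration of Hadamard-domain quantization, the mechanism analyzed above; the rotation size, bit width, and injected outlier are illustrative, and CompSRT's learned scalar-decomposition parameters are omitted:

```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Uniformly quantize a weight matrix in the Hadamard domain and map
    it back. Illustrates the general mechanism only."""
    n = w.shape[-1]                          # must be a power of two
    h = hadamard(n) / np.sqrt(n)             # orthonormal Hadamard matrix
    w_rot = w @ h                            # rotated weights: tighter range
    scale = np.abs(w_rot).max() / (2 ** (bits - 1) - 1)
    w_q = np.round(w_rot / scale).clip(-2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return (w_q * scale) @ h.T               # dequantize, rotate back

w = np.random.randn(64, 64)
w[0, 0] = 20.0                               # inject an outlier
scale = np.abs(w).max() / 7
direct = np.round(w / scale).clip(-8, 7) * scale
# The outlier inflates the direct quantization scale; the rotated
# version typically shows a lower mean-squared error.
print(np.mean((w - direct) ** 2), np.mean((w - hadamard_quantize(w)) ** 2))
```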
We introduce Dataset Concealment (DSC), a rigorous new procedure for evaluating and interpreting objective speech quality estimation models. DSC quantifies and decomposes the performance gap between research results and real-world application requirements, while offering context and additional insights into model behavior and dataset characteristics. We also show the benefits of addressing the corpus effect by using the dataset Aligner from AlignNet when training models with multiple datasets. We demonstrate DSC and the improvements from the Aligner using nine training datasets and nine unseen datasets with three well-studied models: MOSNet, NISQA, and a Wav2Vec2.0-based model. DSC provides interpretable views of the generalization capabilities and limitations of models, while allowing all available data to be used at training. An additional result is that adding the 1000-parameter dataset Aligner to the 94-million-parameter Wav2Vec2.0 model during training significantly improves the resulting model's ability to estimate speech quality for unseen data.
The number of active sound sources is a key parameter in many acoustic signal processing tasks, such as source localization, source separation, and multi-microphone speech enhancement. This paper proposes a novel method for online source counting by detecting changes in the number of active sources based on spatial coherence. The proposed method exploits the fact that a single coherent source in spatially white background noise yields high spatial coherence, whereas only noise results in low spatial coherence. By applying a spatial whitening operation, the source counting problem is reformulated as a change detection task, aiming to identify the time frames when the number of active sources changes. The method leverages the generalized magnitude-squared coherence as a measure to quantify spatial coherence, providing features for a compact neural network trained to detect source count changes framewise. Simulation results with binaural hearing aids in reverberant acoustic scenes with up to 4 speakers and background noise demonstrate the effectiveness of the proposed method for online source counting.
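The core spatial-coherence feature can be sketched as follows for two microphones, using recursively averaged auto- and cross-spectra; the paper's generalized multichannel MSC and the change-detection network on top are omitted, and the STFT parameters are illustrative:

```python
import numpy as np

def framewise_msc(x1, x2, nfft=512, hop=256, alpha=0.9):
    """Per-frame magnitude-squared coherence between two microphones,
    from recursively averaged (cross-)power spectral densities.
    Simplified two-channel stand-in for the generalized MSC."""
    win = np.hanning(nfft)
    n_frames = (len(x1) - nfft) // hop + 1
    p11 = p22 = np.full(nfft // 2 + 1, 1e-12)
    p12 = np.zeros(nfft // 2 + 1, dtype=complex)
    msc = np.zeros((n_frames, nfft // 2 + 1))
    for t in range(n_frames):
        s = slice(t * hop, t * hop + nfft)
        X1 = np.fft.rfft(win * x1[s])
        X2 = np.fft.rfft(win * x2[s])
        # Exponential smoothing of auto- and cross-spectra.
        p11 = alpha * p11 + (1 - alpha) * np.abs(X1) ** 2
        p22 = alpha * p22 + (1 - alpha) * np.abs(X2) ** 2
        p12 = alpha * p12 + (1 - alpha) * X1 * np.conj(X2)
        msc[t] = np.abs(p12) ** 2 / (p11 * p22 + 1e-12)
    return msc  # per-frame features for a change-detection network

fs = 16000
x1, x2 = np.random.randn(fs), np.random.randn(fs)  # stand-in microphone signals
features = framewise_msc(x1, x2)
```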
Vehicular Ad-hoc Networks (VANETs) serve as a critical enabler for intelligent transportation systems. However, their practical deployment faces a core governance dilemma: the network topology requires a dynamic trade-off between robustness against targeted attacks and low-latency information transmission. Most existing models generate fixed degree distributions, lacking the ability to adapt autonomously to the demands of diverse traffic scenarios. To address this challenge, this paper proposes a schedulable degree distribution model for localized VANETs. The core of this model lies in a hybrid connection mechanism: when establishing connections, newly joining nodes do not follow a single rule but instead combine random attachment and preferential attachment. Through theoretical derivation and simulation validation, this study demonstrates that by adjusting the cooperative weighting between these two mechanisms, the overall network degree distribution can achieve a continuous and controllable transition between a uniform distribution and a power-law distribution. The former effectively disperses attack risks and enhances robustness, while the latter facilitates the formation of hub nodes, shortening transmission paths to reduce latency. Experimental results based on the real-world road network of Beijing indicate that this model can precisely regulate node connection heterogeneity, attack resistance, and average transmission path length through the reshaping of the underlying topology. This provides a forward-looking and practical governance paradigm for constructing next-generation VANETs capable of dynamically adapting to complex environments.
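A minimal sketch of such a hybrid connection mechanism, assuming each new node makes m connections and `p_random` plays the role of the cooperative weighting between the two rules (the names and seed construction are ours, not the paper's):

```python
import random

def hybrid_graph(n_nodes: int, m: int, p_random: float):
    """Grow a graph where each new node makes m connections, each chosen
    uniformly at random with probability p_random and degree-proportionally
    (preferential attachment) otherwise. Sweeping p_random moves the
    degree distribution between uniform-like and power-law regimes."""
    edges, degree_pool = [], []   # degree_pool lists node i once per edge
    seed = list(range(m + 1))
    for u in seed:                # small fully connected seed graph
        for v in seed[u + 1:]:
            edges.append((u, v))
            degree_pool += [u, v]
    for new in range(m + 1, n_nodes):
        targets = set()
        while len(targets) < m:
            if random.random() < p_random:
                targets.add(random.randrange(new))       # random attachment
            else:
                targets.add(random.choice(degree_pool))  # preferential
        for t in targets:
            edges.append((new, t))
            degree_pool += [new, t]
    return edges

edges = hybrid_graph(n_nodes=1000, m=3, p_random=0.5)
```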
This paper proposes a Koopman-based framework for modeling, prediction, and control of unknown nonlinear time-varying systems. We present a novel Koopman-based learning method for predicting the state of such systems, upon which a robust controller is designed to ensure that the resulting closed-loop system is input-to-state stable with respect to the Koopman approximation error. The error of the lifted system model learned through the Koopman-based method grows over time due to the system's time-varying nature. To address this issue, an online iterative update scheme is incorporated into the learning process, aligning the lifted model more precisely with the time-varying nonlinear system by integrating newly acquired data and discarding outdated data. A necessary condition for the feasibility of the proposed iterative learning method is derived. To reduce unnecessary system updates while ensuring the prediction accuracy of the lifted system, the update mechanism is enhanced to decide whether an update is warranted and to avoid updates that would deteriorate the fitting performance. Furthermore, based on the online-updated lifted system, a controller is designed to ensure that the closed-loop system is input-to-state stable with respect to the Koopman approximation error. Numerical simulations on the Duffing oscillator, a serial manipulator, and a synthetic biological network are presented to demonstrate the effectiveness of the proposed method for the approximation and control of unknown nonlinear time-varying systems. The results show that the proposed approach outperforms existing methods in terms of approximation accuracy and computational efficiency, even under significant system variations.
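As a reference point for the lifted linear model described above, here is a plain extended-DMD (EDMD) fit under an illustrative polynomial dictionary; the paper's online update scheme and ISS controller are not shown:

```python
import numpy as np

def edmd(X, Y, lift):
    """Extended dynamic mode decomposition: least-squares fit of the
    lifted linear model  z_{k+1} = A z_k  from state snapshot pairs.
    Minimal sketch of the baseline on which online updates operate."""
    Zx = np.stack([lift(x) for x in X.T], axis=1)   # lifted snapshots
    Zy = np.stack([lift(y) for y in Y.T], axis=1)
    return Zy @ np.linalg.pinv(Zx)                  # lifted dynamics matrix

# Example dictionary: the state, its pairwise products, and a constant.
lift = lambda x: np.concatenate(
    [x, np.outer(x, x)[np.triu_indices(len(x))], [1.0]]
)

# X holds states x_0..x_{N-1} columnwise, Y the successors x_1..x_N.
X = np.random.randn(2, 200)
Y = X + 0.01 * np.random.randn(2, 200)   # stand-in dynamics data
A = edmd(X, Y, lift)
z_next = A @ lift(X[:, 0])               # one-step prediction in lifted space
```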
In this paper, we address the problem of circumnavigation of a stationary target by a heterogeneous group of $n$ autonomous agents with unicycle kinematics. The agents are assumed to have constant linear speeds; only their angular speeds are controlled. Given the agents' limited sensing capabilities, in the proposed framework only a subset of agents, termed \textit{leaders}, know the target location; the rest, termed \textit{followers}, do not. We propose a distributed guidance law which drives all the agents towards the desired objective; global asymptotic stability (GAS) is ensured using Zubov's theorem. The efficacy of the approach is demonstrated through both numerical simulations and hardware experiments on TurtleBots using an OptiTrack motion capture system.
In this paper, we develop a tractable analytical framework for a three-dimensional (3D) indoor terahertz (THz) communication system to theoretically assess the impact of the pointing error on its coverage performance. Specifically, we model the locations of access points (APs) using a Poisson point process, human blockages as random cylinder processes, and wall blockages through a Boolean straight line process. A pointing error refers to the beamforming gain and direction mismatch between the transmitter and receiver; we characterize it based on the inaccuracy of the location estimate. We then analyze the impact of this pointing error on the received signal power and derive a tractable expression for the coverage probability, incorporating the multi-cluster fluctuating two-ray distribution to accurately model small-scale fading in THz communications. Aided by simulation results, we corroborate our analysis and demonstrate that the pointing error has a pronounced impact on the coverage probability. Specifically, we find that merely increasing the antenna array size is insufficient to improve the coverage probability and mitigate the detrimental impact of the pointing error, highlighting the necessity of advanced estimation techniques in THz communication systems.
Time-domain ADCs are attractive for high-speed wireline receivers, as time resolution scales favorably with advanced CMOS technologies, enabling multi-GS/s single-channel sampling rates. However, conventional time-domain ADCs require explicit reset of voltage-to-time and time-domain signal paths between samples, introducing dead time that fundamentally limits resolution, speed, and energy efficiency. This paper introduces a dual-edge reset-free quantization concept for asynchronous pipelined SAR time-domain ADCs, in which both rising and falling signal edges are exploited to enable reset-free quantization within a single conversion period. By eliminating explicit reset phases, the proposed approach expands the effective conversion window and relaxes the resolution-speed tradeoff at high sampling rates. An 8-bit dual-edge asynchronous pipelined SAR time-domain ADC is implemented in 22-nm FD-SOI, incorporating a linearity-compensated dual-edge voltage-to-time converter and a dual-edge time-to-digital converter with independently tunable rising- and falling-edge delays. The prototype occupies a core area of 0.0089 mm^2 and achieves continuous single-channel operation at 3.5 GS/s, with architectural scalability demonstrated through intermittent operation at 10.5 GS/s and higher. At 3.5 GS/s, the ADC achieves 21.6 dB SNDR and 32.2 dB SFDR. The measured performance is primarily limited by identifiable implementation-level factors rather than by architectural constraints, demonstrating the feasibility of dual-edge reset-free quantization for high-speed time-domain ADCs.
While Automatic Speech Recognition (ASR) is typically benchmarked by word error rate (WER), real-world applications ultimately hinge on semantic fidelity. This mismatch is particularly problematic for dysarthric speech, where articulatory imprecision and disfluencies can cause severe semantic distortions. To bridge this gap, we introduce a Large Language Model (LLM)-based agent for post-ASR correction: a Judge-Editor over the top-k ASR hypotheses that keeps high-confidence spans, rewrites uncertain segments, and operates in both zero-shot and fine-tuned modes. In parallel, we release SAP-Hypo5, the largest benchmark for dysarthric speech correction, to enable reproducibility and future exploration. Under multi-perspective evaluation, our agent achieves a 14.51% WER reduction alongside substantial semantic gains, including a +7.59 pp improvement in MENLI and +7.66 pp in Slot Micro F1 on challenging samples. Our analysis further reveals that WER is highly sensitive to domain shift, whereas semantic metrics correlate more closely with downstream task performance.
The growing prevalence of extreme weather events driven by climate change poses significant challenges to power system resilience. Infrastructure damage and prolonged power outages highlight the urgent need for effective grid-hardening strategies. While some measures provide long-term protection against specific hazards, they can become counterproductive under conflicting threats. In this work, we develop an adaptive two-stage stochastic optimization framework to support dynamic decision-making for hardening critical grid components under multiple hazard exposures. Unlike traditional approaches, our model adapts to evolving climate conditions, enabling more resilient investment strategies. Furthermore, we integrate long-term (undergrounding) and short-term (vegetation management) hardening actions to jointly minimize total system costs. Extensive simulation results validate the effectiveness of the proposed framework in reducing outage and repair costs while enhancing the adaptability and robustness of grid infrastructure planning.
In the last few years, energy efficiency has become a pressing challenge: beyond mitigating environmental impact, reducing energy waste also brings financial advantages. Buildings play an important role here, as they are among the biggest energy consumers, so reducing their consumption is a way to minimise energy waste, and one technique for doing so is creating Demand Response (DR) strategies. This paper proposes a novel way to decrease the computational effort of simulating the behaviour of a building using surrogate models based on active learning. Before tackling the building problem, which is complex and computationally costly, the paper first applies active learning to a smaller problem: regressing the voltage-versus-current curve of a thermo-resistor from a reduced number of simulations. The paper then implements a surrogate model of the energy consumption of a building, with the goal of learning the consumption pattern from a limited number of simulations. The result given by the surrogate can be used to set the reference temperature, maximising the PV self-consumption and reducing energy usage from the grid. Thanks to the surrogate, the total time spent mapping all possible consumption scenarios is reduced by a factor of about 7.
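An illustrative active-learning loop in the spirit described above, assuming a Gaussian-process surrogate with uncertainty-based sampling (the abstract does not commit to this exact surrogate or acquisition rule; the toy `simulate` function stands in for one expensive simulation):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def simulate(x):
    """Stand-in for one expensive building / thermo-resistor simulation."""
    return np.sin(3 * x) + 0.1 * x ** 2

# Active learning: always query the candidate where the surrogate is
# most uncertain, so a few simulations map the whole response curve.
candidates = np.linspace(0, 5, 500).reshape(-1, 1)
X = candidates[[0, -1]]                 # start from the domain endpoints
y = simulate(X).ravel()
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
for _ in range(15):                     # 15 simulations instead of 500
    gp.fit(X, y)
    _, std = gp.predict(candidates, return_std=True)
    x_new = candidates[[np.argmax(std)]]
    X = np.vstack([X, x_new])
    y = np.append(y, simulate(x_new).ravel())
```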
The presence of interharmonics in power systems can lead to asynchronous sampling, a phenomenon further aggravated by shifts in the fundamental frequency, which significantly degrades the accuracy of power measurements. Under such asynchronous conditions, interharmonics lose orthogonality with the fundamental and harmonic components, giving rise to additional power components. To address these challenges, this paper introduces a linearization algorithm based on DFT spectrum analysis for precise power measurement in systems containing interharmonics. The proposed approach constructs a system of linear equations from the DFT spectrum and solves it through efficient matrix operations, enabling accurate extraction of interharmonic components near the fundamental and harmonic frequencies (with a frequency interval $\geq$1 Hz). This allows for precise measurement of power across the fundamental, harmonic, interharmonic, and cross-power bands, as well as total power. Test results demonstrate that the proposed method accurately computes various power components under diverse conditions--including varying interharmonic/fundamental/harmonic intervals, fundamental frequency deviations, and noise. Compared to existing methods such as the fast Fourier transform (FFT), windowed interpolation FFT, and matrix pencil-singular value decomposition, the proposed technique reduces estimation error severalfold, exhibits improved robustness, and maintains a computational time of only 7 ms for processing 10-power-line-cycle (200 ms) data.
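A simplified stand-in for the linearization idea: once the candidate frequencies are known, component amplitudes and phases follow from a linear least-squares fit. The sketch below works in the time domain for clarity, whereas the paper builds its linear system directly from the DFT spectrum; the signal parameters are illustrative:

```python
import numpy as np

def estimate_components(x, fs, freqs):
    """Least-squares amplitude/phase estimation for a known set of
    fundamental, harmonic, and interharmonic frequencies; avoids
    leakage errors under asynchronous sampling."""
    t = np.arange(len(x)) / fs
    # One cosine and one sine column per candidate frequency.
    A = np.hstack([np.column_stack([np.cos(2 * np.pi * f * t),
                                    np.sin(2 * np.pi * f * t)])
                   for f in freqs])
    coef, *_ = np.linalg.lstsq(A, x, rcond=None)
    c, s = coef[0::2], coef[1::2]
    return np.hypot(c, s), np.arctan2(-s, c)   # amplitudes, phases

fs = 5000
t = np.arange(int(fs * 10 / 49.8)) / fs        # ~10 cycles, drifted fundamental
x = 100 * np.cos(2 * np.pi * 49.8 * t) + 5 * np.cos(2 * np.pi * 132 * t + 0.3)
amp, ph = estimate_components(x, fs, freqs=[49.8, 132.0])  # ~[100, 5]
```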
In recent years, Text-to-Audio Generation has achieved remarkable progress, offering sound creators powerful tools to transform textual inspirations into vivid audio. However, existing models predominantly operate directly in the acoustic latent space of a Variational Autoencoder (VAE), often leading to suboptimal alignment between generated audio and textual descriptions. In this paper, we introduce SemanticAudio, a novel framework that conducts both audio generation and editing directly in a high-level semantic space. We define this semantic space as a compact representation capturing the global identity and temporal sequence of sound events, distinct from fine-grained acoustic details. SemanticAudio employs a two-stage Flow Matching architecture: the Semantic Planner first generates these compact semantic features to sketch the global semantic layout, and the Acoustic Synthesizer subsequently produces high-fidelity acoustic latents conditioned on this semantic plan. Leveraging this decoupled design, we further introduce a training-free text-guided editing mechanism that enables precise attribute-level modifications on general audio without retraining. Specifically, this is achieved by steering the semantic generation trajectory via the difference of velocity fields derived from source and target text prompts. Extensive experiments demonstrate that SemanticAudio surpasses existing mainstream approaches in semantic alignment. Demo available at: this https URL
Integrated sensing and communication is a key feature in next-generation wireless networks, enabling joint data transmission and environmental radar sensing on shared spectrum. In multi-user scenarios, simultaneous transmissions cause mutual interference on overlapping frequencies, leading to spurious target detections and degraded sensing accuracy. This paper proposes an interference detection and exploitation algorithm for sensing using spectrally interleaved orthogonal frequency division multiplexing. A statistically rigorous procedure is introduced to detect interference while controlling the familywise error rate. We propose an algorithm that estimates the angle by exploiting interference, while estimating the delay by avoiding the interference. Numerical experiments demonstrate that the proposed method reliably detects interference, and that the delay and angle estimation error approaches the Cramér-Rao lower bound.
Near-field localization for ISAC requires large-aperture arrays, making fully-digital implementations prohibitively complex and costly. While sparse subarray architectures can reduce cost, they introduce severe estimation ambiguity from grating lobes. To address both issues, we propose SHARE (Sparse Hierarchical Angle-Range Estimation), a novel two-stage sparse recovery algorithm. SHARE first performs coarse, unambiguous angle estimation using individual subarrays to resolve the grating-lobe ambiguity; it then leverages the full sparse aperture to perform a localized joint angle-range search. This hierarchical approach avoids an exhaustive and computationally intensive two-dimensional grid search while preserving the high resolution of the large aperture. Simulation results show that SHARE significantly outperforms conventional one-shot sparse recovery methods, such as Orthogonal Matching Pursuit (OMP), in both localization accuracy and robustness. Furthermore, we show that SHARE's overall localization accuracy is comparable to or even surpasses that of the fully-digital 2D-MUSIC algorithm, despite MUSIC having access to the complete, uncompressed data from every antenna element. SHARE therefore provides a practical path for high-resolution near-field ISAC systems.
Acquiring channel state information (CSI) through traditional methods, such as channel estimation, is increasingly challenging for emerging sixth generation (6G) mobile networks due to high overhead. To address this issue, channel extrapolation techniques have been proposed to acquire complete CSI from a limited number of known CSIs. To improve extrapolation accuracy, environmental information, such as visual images or radar data, has been utilized, which poses challenges including additional hardware, privacy, and multi-modal alignment concerns. Motivated by this, this paper proposes a novel channel extrapolation framework that leverages environment-related multi-path characteristics inferred directly from CSI, without integrating additional modalities. Specifically, we propose utilizing the multi-path characteristics in the form of a power-delay profile (PDP), which is acquired using a CSI-to-PDP module. This module is trained in an autoencoder (AE)-based framework by reconstructing the PDPs while constraining the latent low-dimensional features to represent the CSI. We further extract the total power and power-weighted delay of all identified paths in the PDP as the multi-path information. Building on this, we propose a masked autoencoder (MAE) architecture trained in a self-supervised manner to perform channel extrapolation. Unlike standard MAE approaches, our method employs separate encoders to extract features from the masked CSI and the multi-path information, which are then fused by a cross-attention module. Extensive simulations demonstrate that this framework improves extrapolation performance dramatically, with a minor increase in inference time (around 0.1 ms). Furthermore, our model shows strong generalization capabilities, particularly when only a small portion of the CSI is known, outperforming existing benchmarks.
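For intuition, a PDP and the two multi-path features can be derived from CSI with a plain IDFT as below; the paper instead learns this mapping with the AE-based CSI-to-PDP module, and the detection threshold and sizes here are illustrative:

```python
import numpy as np

def pdp_features(csi: np.ndarray):
    """Derive multi-path features from frequency-domain CSI via an IDFT;
    a simplified stand-in for a learned CSI-to-PDP mapping."""
    h = np.fft.ifft(csi)                     # delay-domain channel
    pdp = np.abs(h) ** 2                     # power-delay profile
    taps = pdp > 0.05 * pdp.max()            # crude path detection threshold
    total_power = pdp[taps].sum()
    delays = np.arange(len(pdp))[taps]
    weighted_delay = (pdp[taps] * delays).sum() / total_power
    return pdp, total_power, weighted_delay  # side inputs for extrapolation

# 3-path channel observed over 64 subcarriers.
csi = np.fft.fft(np.pad([1.0, 0.0, 0.5, 0.0, 0.2], (0, 59)))
pdp, p_tot, d_mean = pdp_features(csi)
```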
In the evolving landscape of sixth-generation (6G) mobile communication, multiple-input multiple-output (MIMO) systems are incorporating an unprecedented number of antenna elements, advancing towards extremely large-scale MIMO (XL-MIMO) systems. This enhancement significantly increases the spatial degrees of freedom, offering substantial benefits for wireless positioning. However, the expansion of the near-field range in XL-MIMO challenges the traditional far-field assumptions used in previous MIMO models. Among various configurations, uniform circular arrays (UCAs) demonstrate superior performance by maintaining constant angular resolution, unlike linear and planar arrays. How to leverage the expanded aperture and harness the near-field effects in XL-MIMO systems remains an area requiring further investigation. In this paper, we introduce an attention-enhanced deep learning approach for precise positioning. We employ a dual-path channel attention mechanism and a spatial attention mechanism to effectively integrate channel-level and spatial-level features. Our comprehensive simulations show that this model surpasses existing benchmarks such as attention-based positioning networks (ABPN), near-field positioning networks (NFLnet), convolutional neural networks (CNN), and multilayer perceptrons (MLP). The proposed model achieves superior positioning accuracy by utilizing covariance metrics of the input signal. Simulation results further reveal that the covariance metric is advantageous over channel state information (CSI) in terms of positioning accuracy and model efficiency.
This letter proposes a decentralized local gain condition (LGC) to guarantee oscillation damping in inverter-based resource (IBR)-dominated power systems. The LGC constrains the dynamic gain between each IBR and the network at its point of connection. By satisfying the LGC locally, the closed-loop poles are confined to a desired region, thereby yielding system-wide oscillation damping without requiring global information. Notably, the LGC is agnostic to different IBR dynamics, well-suited for systems with heterogeneous IBRs, and flexible to various damping requirements. Moreover, a low-complexity algorithm is proposed to parameterize LGC, providing scalable and damping-constrained parameter tuning guidance for IBRs.
Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning its predictions with high-quality semantic representations from frozen, pre-trained external encoders. Experimental results demonstrate that CAT significantly outperforms baselines on audio understanding benchmarks. Notably, it achieves competitive performance on the AudioSet 20k dataset with 5 times faster convergence than existing methods. Codes and checkpoints will be released soon at this https URL.
Speculative decoding accelerates autoregressive generation by separating token proposal from verification, but most existing approaches are designed for single-node execution and do not scale well to multi-accelerator clusters used for serving modern Large Language Models (LLMs). We present StarSD, a one-for-many speculative decoding framework that uses a single draft model to serve multiple target models across distributed nodes via a star topology. StarSD decouples drafting and verification, enabling effective sharing of draft computation, and preventing distributed accelerators from remaining idle under bursty workloads. We provide a system-level analysis that characterizes when and why a single draft model can remain fully utilized by multiple verifiers, yielding predictable latency and utilization gains. Extensive experiments in real-world distributed inference settings demonstrate that StarSD simplifies deployment and supports flexible resource allocation across heterogeneous accelerators, while maintaining output quality. These results indicate that StarSD is a practical and scalable framework for bringing speculative decoding to modern cloud and edge inference infrastructures.
The growing electrification of transportation and heating through Electric Vehicles (EVs) and Heat Pumps (HPs) introduces both flexibility and complexity to Active Distribution Networks (ADNs). These resources provide substantial operational flexibility but also create tightly coupled thermal-electrical dynamics that challenge conventional network management. This paper proposes a unified co-optimization framework that integrates a calibrated 3R2C grey-box building thermal model into a network-constrained Optimal Power Flow (OPF). The framework jointly optimizes EVs, HPs, and photovoltaic systems while explicitly enforcing thermal comfort, Distributed Energy Resource (DER) limits, and full power flow physics. To maintain computational tractability, Second-Order Cone Programming (SOCP) relaxations are evaluated on a realistic low-voltage feeder. The analysis shows that, despite network heterogeneity violating some theoretical exactness conditions, the relaxation remains exact in practice. Comparative assessments of convex DistFlow, bus injection, and branch flow formulations reveal that convex DistFlow achieves sub-second runtimes and near-optimal performance even at high DER penetration levels. Simulations confirm the effectiveness of coordinated scheduling, yielding reductions of 41% in transformer aging, 54% in losses, and complete elimination of voltage violations, demonstrating the value of integrated thermal-electrical coordination in future smart grids.
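A minimal forward simulation of a 3R2C-style grey-box model, with two thermal states (indoor air and envelope) and illustrative, not calibrated, parameter values:

```python
import numpy as np

def simulate_3r2c(T_out, Q_hp, dt=900.0):
    """Forward-simulate a 3R2C grey-box building thermal model.
    Parameters are illustrative placeholders, not calibrated values."""
    R_ie, R_ea, R_win = 0.005, 0.02, 0.05  # K/W: air-envelope, envelope-ambient, windows
    C_air, C_env = 5e6, 5e7                # J/K: thermal capacitances
    T_air, T_env = 20.0, 18.0              # initial states (deg C)
    out = []
    for T_o, q in zip(T_out, Q_hp):        # q: heat-pump thermal output (W)
        dT_air = ((T_env - T_air) / R_ie + (T_o - T_air) / R_win + q) / C_air
        dT_env = ((T_air - T_env) / R_ie + (T_o - T_env) / R_ea) / C_env
        T_air += dt * dT_air               # explicit Euler step
        T_env += dt * dT_env
        out.append(T_air)
    return np.array(out)

# 24 h at 15-min resolution, constant 3 kW heat-pump thermal output.
T_in = simulate_3r2c(T_out=np.full(96, 5.0), Q_hp=np.full(96, 3000.0))
```

In a co-optimization like the one described above, these linear dynamics become equality constraints of the OPF, with thermal comfort enforced as bounds on `T_air`.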
Navigating urban intersections, especially when interacting with heterogeneous traffic participants, presents a formidable challenge for autonomous vehicles (AVs). In such environments, safety risks arise simultaneously from multiple sources, each carrying distinct priority levels and sensitivities that necessitate differential protection preferences. While safe reinforcement learning (RL) offers a robust paradigm for constrained decision-making, existing methods typically model safety as a single constraint or employ static, heuristic weighting schemes for multiple constraints. These approaches often fail to address the dynamic nature of multi-source risks, leading to gradient cancellation that hampers learning, and suboptimal trade-offs in critical dilemma zones. To address this, we propose a Bayesian adaptive priority safe reinforcement learning (BAP-SRL) based motion planning framework. Unlike heuristic weighting schemes, BAP formulates constraint prioritization as a probabilistic inference task. By modeling historical optimization difficulty as a Bayesian prior and instantaneous risk evidence as a likelihood, BAP dynamically gates gradient updates using a Bayesian inference mechanism on latent constraint criticality. Extensive experiments demonstrate that our approach outperforms state-of-the-art baselines in handling interactions with stochastic, heterogeneous agents, achieving lower collision rates and smoother conflict resolution.
Coverage control algorithms have traditionally focused on static target densities, where agents are deployed to optimally cover a fixed spatial distribution. However, many applications involve time-varying densities, including environmental monitoring, surveillance, and adaptive sensor deployment. Although time-varying coverage strategies have been studied within Voronoi-based frameworks, recent works have reformulated static coverage control as a semi-discrete optimal transport problem. Extending this optimal transport perspective to time-varying scenarios has remained an open challenge. This paper presents a rigorous optimal transport formulation for time-varying coverage control, in which agents minimize the instantaneous Wasserstein distance to a continuously evolving target density. The proposed solution relies on a coupled system of differential equations governing agent positions and the dual variables that define Laguerre regions. In one-dimensional domains, the resulting system admits a closed-form analytical solution, offering both computational benefits and theoretical insight into the structure of optimal time-varying coverage. Numerical simulations demonstrate improved tracking performance compared to quasi-static and Voronoi-based methods, validating the proposed framework.
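In one dimension, the structure of the solution can be illustrated directly: equal-mass cells with each agent at its cell's center of mass. The sketch below is a discretized snapshot computation, not the paper's coupled differential equations, and the drifting Gaussian target is an illustrative choice:

```python
import numpy as np

def ot_coverage_1d(density, xs, n_agents):
    """Agent positions minimizing the Wasserstein distance to a 1-D
    target density: split the density into n equal-mass cells and put
    each agent at its cell's barycenter."""
    dx = xs[1] - xs[0]
    mass = density / (density.sum() * dx)           # normalize to a pdf
    cdf = np.cumsum(mass) * dx
    cuts = np.interp(np.linspace(0.0, 1.0, n_agents + 1), cdf / cdf[-1], xs)
    positions = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        cell = (xs >= a) & (xs <= b)
        positions.append((xs[cell] * mass[cell]).sum() / mass[cell].sum())
    return np.array(positions)

xs = np.linspace(0.0, 10.0, 2000)
t = 0.3                                             # snapshot of the moving target
rho = np.exp(-((xs - 3.0 - 2.0 * t) ** 2))          # drifting Gaussian density
print(ot_coverage_1d(rho, xs, n_agents=4))
```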
Ultrasound (US) interpretation is hampered by multiplicative speckle, acquisition blur from the point-spread function (PSF), and scanner- and operator-dependent artifacts. Supervised enhancement methods assume access to clean targets or known degradations, conditions rarely met in practice. We present a blind, self-supervised enhancement framework that jointly deconvolves and denoises B-mode images using a Swin Convolutional U-Net trained with a \emph{physics-guided} degradation model. From each training frame, we extract rotated/cropped patches and synthesize inputs by (i) convolving with a Gaussian PSF surrogate and (ii) injecting noise via either spatial additive Gaussian noise or complex Fourier-domain perturbations that emulate phase/magnitude distortions. For US scans, clean-like targets are obtained via non-local low-rank (NLLR) denoising, removing the need for ground truth; for natural images, the originals serve as targets. Trained and validated on UDIAT~B, JNU-IFM, and XPIE Set-P, and evaluated additionally on a 700-image PSFHS test set, the method achieves the highest PSNR/SSIM across Gaussian and speckle noise levels, with margins that widen under stronger corruption. Relative to MSANN, Restormer, and DnCNN, it typically preserves an extra $\sim$1--4\,dB PSNR and 0.05--0.15 SSIM in heavy Gaussian noise, and $\sim$2--5\,dB PSNR and 0.05--0.20 SSIM under severe speckle. Controlled PSF studies show reduced FWHM and higher peak gradients, evidence of resolution recovery without edge erosion. Used as a plug-and-play preprocessor, it consistently boosts Dice for fetal head and pubic symphysis segmentation. Overall, the approach offers a practical, assumption-light path to robust US enhancement that generalizes across datasets, scanners, and degradation types.
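The physics-guided degradation pipeline can be sketched as follows, with a Gaussian PSF surrogate and either noise route; the parameter values are illustrative, not the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def degrade(patch: np.ndarray, sigma_psf=1.5, noise_std=0.1, fourier=False):
    """Synthesize a degraded training input: blur with a Gaussian PSF
    surrogate, then inject either spatial Gaussian noise or a complex
    Fourier-domain perturbation."""
    blurred = gaussian_filter(patch, sigma=sigma_psf)   # PSF surrogate
    if not fourier:
        return blurred + noise_std * np.random.randn(*patch.shape)
    # Fourier route: perturb magnitude and phase of the spectrum to
    # emulate scanner-dependent phase/magnitude distortions.
    spec = np.fft.fft2(blurred)
    spec *= (1 + noise_std * np.random.randn(*spec.shape)) \
            * np.exp(1j * noise_std * np.random.randn(*spec.shape))
    return np.real(np.fft.ifft2(spec))

clean_like = np.random.rand(128, 128)   # in practice, an NLLR-denoised target
net_input = degrade(clean_like, fourier=True)
```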
A large number of works view the automatic assessment of speech from an utterance- or system-level perspective. While such approaches are good at judging overall quality, they cannot adequately explain why a certain score was assigned to an utterance. Frame-level scores can provide better interpretability, but models predicting them are harder to tune and regularize since no strong targets are available during training. In this work, we show that utterance-level speech quality predictors can be regularized with a segment-based consistency constraint which notably reduces frame-level stochasticity. We then demonstrate two applications involving frame-level scores: the partial spoof scenario and the detection of synthesis artefacts in two state-of-the-art text-to-speech systems. For the latter, we perform listening tests and confirm that listeners rate segments to be of poor quality more often in the set defined by low frame-level scores than in a random control set.
We design a variational state estimation (VSE) method that provides a closed-form Gaussian posterior of an underlying complex dynamical process from (noisy) nonlinear measurements. The complex process is model-free; that is, we do not have a suitable physics-based model characterizing the temporal evolution of the process state. The closed-form Gaussian posterior is provided by a recurrent neural network (RNN), which is computationally simple in the inference phase. For training this RNN, an additional RNN is used in the learning phase; both RNNs help each other learn better based on variational inference principles. The VSE is demonstrated for a tracking application: state estimation of a stochastic Lorenz system (a benchmark process) using a 2-D camera measurement model. The VSE is shown to be competitive against a particle filter that knows the Lorenz system model and a recently proposed data-driven state estimation method that does not.
Low Earth orbit (LEO) mega-constellations greatly extend the coverage and resilience of future wireless systems. Within the mega-constellations, laser inter-satellite links (LISLs) enable high-capacity, long-range connectivity. Existing LISL schemes often overlook mechanical limitations of laser communication terminals (LCTs) and non-uniform global traffic profiles caused by uneven user and gateway distributions, leading to suboptimal throughput and underused LCTs/LISLs -- especially when each satellite carries only a few LCTs. This paper investigates the joint optimization of LCT connections and traffic routing to maximize the constellation throughput, considering the realistic LCT mechanics and the global traffic profile. The problem is formulated as an NP-hard mixed-integer program coupling LCT connections with flow-rate variables under link capacity constraints. Due to its intractability, we resort to relaxing the coupling constraints via Lagrangian duality, decomposing the problem into a weighted graph-matching for LCT connections, weighted shortest-path routing tasks, and a linear program for rate allocation. Here, Lagrange multipliers reflect congestion weights between satellites, jointly guiding the matching, routing, and rate allocation. Subgradient descent optimizes the multipliers, with provable convergence. Simulations using real-world constellation and terrestrial data show that our methods substantially improve network throughput by $35\%$ to $145\%$ over existing non-joint approaches.
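The congestion-price interpretation of the multipliers can be illustrated with a toy subgradient loop over a three-link network (all data hypothetical); the actual method additionally solves a weighted graph matching for LCT connections and a linear program to recover a feasible rate allocation:

```python
import numpy as np

capacity = np.array([8.0, 8.0, 3.0])                 # link capacities
demands = [4.0, 5.0, 6.0]                            # traffic demands
paths = [                                            # candidate paths per demand,
    [np.array([1, 0, 0]), np.array([0, 1, 1])],      # as link-incidence vectors
    [np.array([0, 1, 0]), np.array([1, 0, 1])],
    [np.array([0, 0, 1]), np.array([1, 1, 0])],
]

lam = np.zeros(3)                                    # congestion prices (multipliers)
for it in range(200):
    load = np.zeros(3)
    for d, opts in zip(demands, paths):
        best = min(opts, key=lambda p: lam @ p)      # cheapest path at current prices
        load = load + d * best
    # Subgradient step with diminishing step size: raise the price of
    # overloaded links, relax it elsewhere (projected onto lam >= 0).
    lam = np.maximum(0.0, lam + (load - capacity) / (it + 1))
print("congestion prices:", lam)
```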
Laser inter-satellite links (LISLs) of low Earth orbit (LEO) mega-constellations enable high-capacity backbone connectivity in non-terrestrial networks, but their management is challenged by limited laser communication terminals, mechanical pointing constraints, and rapidly time-varying network topologies. This paper studies the joint problem of LISL connection establishment, traffic routing, and flow-rate allocation under heterogeneous global traffic demand and gateway availability. We formulate the problem as a mixed-integer optimization over large-scale, time-varying constellation graphs and develop a Lagrangian dual decomposition that interprets per-link dual variables as congestion prices coordinating connectivity and routing decisions. To overcome the prohibitive latency of iterative dual updates, we propose DeepLaDu, a Lagrangian duality-guided deep learning framework that trains a graph neural network (GNN) to directly infer per-link (edge-level) congestion prices from the constellation state in a single forward pass. We enable scalable and stable training using a subgradient-based edge-level loss in DeepLaDu. We analyze the convergence and computational complexity of the proposed approach and evaluate it using realistic Starlink-like constellations with optical and traffic constraints. Simulation results show that DeepLaDu achieves up to 20\% higher network throughput than non-joint or heuristic baselines, while matching the performance of iterative dual optimization with orders-of-magnitude lower computation time, suitable for real-time operation in dynamic LEO networks.
Diffusion-based speech enhancement on discrete audio codec features has gained immense attention due to its improved speech component reconstruction capability. However, such methods usually suffer from high inference computational complexity due to multiple reverse process iterations. Furthermore, they generally achieve promising results on non-intrusive metrics but show poor performance on intrusive metrics, as they may struggle to reconstruct the correct phones. In this paper, we propose DisContSE, an efficient diffusion-based speech enhancement model operating on joint discrete codec tokens and continuous embeddings. Our contributions are three-fold. First, we formulate both a discrete and a continuous enhancement module operating on discrete audio codec tokens and continuous embeddings, respectively, to achieve improved fidelity and intelligibility simultaneously. Second, a semantic enhancement module is further adopted to achieve optimal phonetic accuracy. Third, we achieve a single-step efficient reverse process in inference with a novel quantization error mask initialization strategy, which, to our knowledge, is the first successful single-step diffusion speech enhancement based on an audio codec. Trained and evaluated on the URGENT 2024 Speech Enhancement Challenge data splits, the proposed DisContSE outperforms top-reported time- and frequency-domain diffusion baselines in PESQ, POLQA, UTMOS, and in a subjective ITU-T P.808 listening test, clearly achieving an overall top rank.
The performance of speaker verification systems degrades significantly under language mismatch, a critical challenge exacerbated by the field's reliance on English-centric data. To address this, we propose the TidyVoice Challenge for cross-lingual speaker verification. The challenge leverages the TidyVoiceX dataset from the novel TidyVoice benchmark, a large-scale, multilingual corpus derived from Mozilla Common Voice, and specifically curated to isolate the effect of language switching across approximately 40 languages. Participants will be tasked with building systems robust to this mismatch, with performance primarily evaluated using the Equal Error Rate on cross-language trials. By providing standardized data, open-source baselines, and a rigorous evaluation protocol, this challenge aims to drive research towards fairer, more inclusive, and language-independent speaker recognition technologies, directly aligning with the Interspeech 2026 theme, "Speaking Together."
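Since the Equal Error Rate on cross-language trials is the primary metric, a reference implementation of the metric itself may be useful (this is a generic threshold sweep, not the challenge's official scoring code; the scores below are synthetic):

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-acceptance and false-rejection
    rates coincide, found by sweeping the decision threshold."""
    thresholds = np.sort(np.unique(scores))
    tar, non = scores[labels == 1], scores[labels == 0]
    frr = np.array([(tar < t).mean() for t in thresholds])   # false rejections
    far = np.array([(non >= t).mean() for t in thresholds])  # false acceptances
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i])

scores = np.concatenate([np.random.randn(1000) + 1.5,   # target trials
                         np.random.randn(1000) - 1.5])  # non-target trials
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
print(f"EER = {100 * equal_error_rate(scores, labels):.2f}%")
```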
Movable antennas (MA) have gained significant attention in recent years to overcome the limitations of extremely large antenna arrays in terms of cost and power consumption. In this paper, we investigate the use of MA arrays at the base station (BS) for angle-of-departure (AoD) estimation under uncertainty in the user equipment (UE) location. Specifically, we (i) derive the theoretical performance limits through the Cramér-Rao bound (CRB) and (ii) optimize the antenna positions to ensure robust performance within the UE's uncertainty region. Numerical results show that dynamically optimizing antenna placement by explicitly considering the uncertainty region yields superior performance compared to fixed arrays, demonstrating the ability of MA systems to adapt and outperform conventional arrays.
Urban mobility systems are transitioning toward electric, on-demand services, creating operational challenges for fleet management under energy and service-quality constraints. The Electric Dial-a-Ride Problem (E-DARP) extends the classical dial-a-ride problem by incorporating limited battery capacity and nonlinear charging dynamics, increasing computational complexity and limiting the scalability of exact methods for real-time use. This paper proposes a deep reinforcement learning approach based on a graph neural network encoder and an attention-driven route construction policy. By operating directly on edge attributes such as travel time and energy consumption, the method captures non-Euclidean, asymmetric, and energy-dependent routing costs in real road networks. The learned policy jointly optimizes routing, charging, and service quality without relying on Euclidean assumptions or handcrafted heuristics. The approach is evaluated on two case studies using ride-sharing data from San Francisco. On benchmark instances, the method achieves solutions within 0.4% of best-known results while reducing computation times by orders of magnitude. A second case study considers large-scale instances with up to 250 request pairs, realistic energy models, and nonlinear charging. On these instances, the learned policy outperforms Adaptive Large Neighborhood Search (ALNS) by 9.5% in solution quality while achieving 100% service completion, with sub-second inference times compared to hours for the metaheuristic. Finally, sensitivity analyses quantify the impact of battery capacity, fleet size, ride-sharing capacity, and reward weights, while robustness experiments show that deterministically trained policies generalize effectively under stochastic conditions.
Feature coding for machines (FCM) is a lossy compression paradigm for split-inference. The transmitter encodes the outputs of the first part of a neural network before sending them to the receiver for completing the inference. Practical FCM methods ``sandwich'' a traditional codec between pre- and post-processing neural networks, called wrappers, to make features easier to compress using video codecs. Since traditional codecs are non-differentiable, the wrappers are trained using a proxy codec, which is later replaced by a standard codec after training. These codecs perform rate-distortion optimization (RDO) based on the sum of squared errors (SSE). Because the RDO does not consider the post-processing wrapper, the inner codec can invest bits in preserving information that the post-processing later discards. In this paper, we modify the bit-allocation in the inner codec via a wrapper-aware weighted SSE metric. To make wrapper-aware RDO (WA-RDO) practical for FCM, we propose: 1) temporal reuse of weights across a group of pictures and 2) fixed, architecture- and task-dependent weights trained offline. Under MPEG test conditions, our methods implemented on HEVC match the VVC-based FCM state-of-the-art, effectively bridging a codec generation gap with minimal runtime overhead relative to SSE-RDO HEVC.
Probabilistic resource adequacy assessment is a cornerstone of modern capacity accreditation. This paper develops a gradient-based framework, in which capacity accreditation is interpreted as the directional derivative of a probabilistic resource adequacy metric with respect to resource capacity, that unifies two widely used accreditation approaches: Effective Load Carrying Capability (ELCC) and Marginal Reliability Impact (MRI). Under mild regularity conditions, we show that marginal ELCC and MRI yield equivalent accreditation factors, while their numerical implementations exhibit markedly different computational characteristics. Building on this framework, we demonstrate how infinitesimal perturbation analysis enables up to a $1000\times$ speedup in gradient estimation for capacity accreditation, and we implement gradient-informed search algorithms that significantly accelerate ELCC computations relative to standard bisection methods. Large-scale Monte Carlo experiments show that MRI achieves substantial runtime reductions compared to ELCC and exhibits greater robustness to perturbation step-size selection. These results provide practical guidance for implementing efficient and scalable capacity accreditation in large-scale power systems.
To enhance the efficiency of capacity markets, many electricity markets in the U.S. are adopting or planning to implement marginal capacity accreditation reforms. This paper provides new insights into energy storage capacity accreditation using Marginal Reliability Impact (MRI). We reformulate the commonly used reliability-based storage dispatch model as an optimization problem, enabling direct calculation of the MRI from the Lagrange multipliers, rather than using brute-force perturbation analysis. The analysis demonstrates that the EUE is a piecewise linear function and the storage MRI retains a non-negative property across various system scenarios. We further explore the influence of qualified capacity (QC), storage dispatch rules, and other key factors on storage accreditation, providing practical insights for system operators. Additionally, comparisons of storage capacity accreditation under different reliability criteria offer valuable guidance for policymakers in setting future standards. Numerical results from a modified California system validate our findings and highlight several important phenomena associated with the MRI-based accreditation scheme.
Unmanned aerial vehicles (UAVs) integrated into cellular networks face significant challenges from air-to-ground interference. To address this, we propose a downlink UAV communication system that leverages a fluid antenna system (FAS)-assisted reconfigurable intelligent surface (RIS) to enhance signal quality. By jointly optimizing the FAS port positions and RIS phase shifts, we maximize the achievable rate. The resulting nonconvex optimization problem is solved using successive convex approximation (SCA) based on second-order cone programming (SOCP), which reformulates the constraints into a tractable form. Simulation results show that the proposed algorithm significantly improves both outage probability and achievable rate over conventional fixed-position antenna (FPA) schemes, with particularly large gains in large-scale RIS configurations. Moreover, the algorithm converges rapidly, making it suitable for real-time applications.
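For intuition on the RIS side of the problem, the classic closed-form phase alignment for a single reflected link is sketched below; it is a common baseline, not the paper's joint SCA/SOCP design, which additionally optimizes the FAS port positions, and the channel model here is a generic Rayleigh placeholder:

```python
import numpy as np

# Closed-form phase alignment: choose each RIS element's phase so all
# reflected paths add coherently at the receiver.
N = 256
h_bs_ris = (np.random.randn(N) + 1j * np.random.randn(N)) / np.sqrt(2)
h_ris_uav = (np.random.randn(N) + 1j * np.random.randn(N)) / np.sqrt(2)

theta = -np.angle(h_bs_ris * h_ris_uav)          # per-element phase shifts
g = np.sum(h_ris_uav * np.exp(1j * theta) * h_bs_ris)   # effective channel
rate = np.log2(1 + np.abs(g) ** 2)               # unit Tx power and noise
```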
High renewable penetration increases the frequency and magnitude of net-load ramps, stressing real-time flexibility. Two commonly deployed remedies are look-ahead economic dispatch (LAED) and ramp products (RPs), yet their operational equivalence under the industry-standard rolling-window dispatch implementation is not well understood. This paper develops linear optimization models for multi-interval LAED and RP-based co-optimization, and proves that an enhanced RP formulation can match LAED's dispatch feasible region at a single time step when additional intertemporal deliverability constraints are enforced. We then show that this equivalence does not generally persist under rolling-window operation because LAED and RP formulations optimize different intertemporal objectives, leading to divergent end-of-window states. Using different test systems under stressed ramping conditions and multiple load levels, we show LAED achieves similar or lower load shedding than RP implementations with the same look-ahead horizon, with the most pronounced differences under high-load, ramp-limited conditions. The study highlights the limitations of current ramp product implementations and suggests enhancements, such as introducing more mid-duration RPs.
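A toy multi-interval LAED of the kind compared above, written as a linear program with ramp limits and a load-shedding penalty; all data are illustrative, and the ramp-product co-optimization is omitted:

```python
import numpy as np
import cvxpy as cp

T = 4
net_load = np.array([100.0, 130.0, 170.0, 180.0])   # MW over the look-ahead window
cost = np.array([20.0, 50.0])                       # generator costs ($/MWh)
p_max = np.array([120.0, 80.0])                     # capacity limits (MW)
ramp = np.array([15.0, 40.0])                       # ramp limits (MW/interval)

p = cp.Variable((2, T), nonneg=True)                # dispatch
shed = cp.Variable(T, nonneg=True)                  # load shedding
cons = [cp.sum(p, axis=0) + shed == net_load,
        p <= np.tile(p_max[:, None], (1, T))]
for t in range(1, T):
    cons.append(cp.abs(p[:, t] - p[:, t - 1]) <= ramp)
obj = cost @ cp.sum(p, axis=1) + 5000.0 * cp.sum(shed)   # VOLL-style penalty
cp.Problem(cp.Minimize(obj), cons).solve()
print(p.value.round(1), shed.value.round(1))
```

Under rolling-window operation, only the first interval of each solve is implemented before the window advances, which is precisely where the end-of-window state divergence discussed above arises.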
Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and is recently being adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines, without increasing computational cost during inference. Code is available at this https URL.
The color of skin lesions is an important diagnostic feature for identifying malignant melanoma and other skin diseases. Typical colors associated with melanocytic lesions include tan, brown, black, red, white, and blue gray. This study introduces a novel feature: the number of colors present in a lesion, which can indicate the severity of disease and help distinguish melanomas from benign lesions. We propose a color histogram analysis method to examine lesion pixel values from three publicly available datasets: PH2, ISIC2016, and Med Node. The PH2 dataset contains ground truth annotations of lesion colors, while ISIC2016 and Med Node do not; our algorithm estimates the ground truth using color histogram analysis based on PH2. We then design and train a 19-layer Convolutional Neural Network (CNN) with residual skip connections to classify lesions into three categories based on the number of colors present. DeepDream visualization is used to interpret features learned by the network, and multiple CNN configurations are tested. The best model achieves a weighted F1 score of 75 percent. LIME is applied to identify important regions influencing model decisions. The results show that the number of colors in a lesion is a significant feature for describing skin conditions, and the proposed CNN with three skip connections demonstrates strong potential for clinical diagnostic support.
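The core feature can be illustrated with a short histogram sketch (an illustration of the idea, not the paper's exact algorithm): quantize lesion-mask pixels into coarse RGB bins and count the bins that hold a non-trivial share of the lesion. The bin count and the 5% prevalence threshold are assumptions.

```python
import numpy as np

def count_lesion_colors(pixels_rgb, bins=8, min_share=0.05):
    """pixels_rgb: (N, 3) uint8 array of lesion-mask pixels."""
    q = (pixels_rgb // (256 // bins)).astype(int)          # quantize channels
    codes = q[:, 0] * bins * bins + q[:, 1] * bins + q[:, 2]
    counts = np.bincount(codes, minlength=bins ** 3)
    share = counts / counts.sum()
    return int((share >= min_share).sum())  # colors covering >= 5% of lesion

rng = np.random.default_rng(0)
lesion = rng.integers(0, 256, size=(5000, 3), dtype=np.uint8)
print(count_lesion_colors(lesion))
```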
Morphing techniques generate artificial biometric samples that combine features from multiple individuals, allowing each contributor to be verified against a single enrolled template. While extensively studied in face recognition, this vulnerability remains largely unexplored in voice biometrics. Prior work on voice morphing is computationally expensive, non-scalable, and limited to acoustically similar identity pairs, constraining practical deployment. Moreover, existing sound-morphing methods target audio textures, music, or environmental sounds and are not transferable to voice identity manipulation. We propose VoxMorph, a zero-shot framework that produces high-fidelity voice morphs from as little as five seconds of audio per subject without model retraining. Our method disentangles vocal traits into prosody and timbre embeddings, enabling fine-grained interpolation of speaking style and identity. These embeddings are fused via Spherical Linear Interpolation (Slerp) and synthesized using an autoregressive language model coupled with a Conditional Flow Matching network. VoxMorph achieves state-of-the-art performance, delivering a 2.6x gain in audio quality, a 73% reduction in intelligibility errors, and a 67.8% morphing attack success rate on automated speaker verification systems under strict security thresholds. This work establishes a practical and scalable paradigm for voice morphing with significant implications for biometric security. The code and dataset are available on our project page: this https URL
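Slerp itself is standard and easy to state; below is a minimal NumPy version for two speaker embeddings, as a sketch of the interpolation step rather than the authors' implementation.

```python
# Spherical linear interpolation between two identity/prosody embeddings.
import numpy as np

def slerp(a, b, t):
    """Interpolate unit-normalized embeddings a, b at mix ratio t in [0, 1]."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    cos = np.clip(a @ b, -1.0, 1.0)
    theta = np.arccos(cos)
    if theta < 1e-6:                      # nearly parallel: fall back to lerp
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

emb_a, emb_b = np.random.randn(256), np.random.randn(256)
morph = slerp(emb_a, emb_b, 0.5)          # 50/50 identity morph
```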
Single-word Automatic Speech Recognition (ASR) is a challenging task due to the lack of linguistic context and sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. Results show that while the hybrid ASR front end performs well on clean audio, the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing latency required for real-time telephony applications.
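The simplest of the verification strategies named above, edit-distance matching against a domain vocabulary, fits in a few lines; the acceptance threshold and word list below are illustrative assumptions.

```python
# Vocabulary-verification step: accept the ASR hypothesis if some
# in-vocabulary word is within a normalized edit-distance threshold.
def edit_distance(a: str, b: str) -> int:
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

def verify(hypothesis: str, vocabulary: list[str], max_ratio=0.34):
    best = min(vocabulary, key=lambda w: edit_distance(hypothesis, w))
    ratio = edit_distance(hypothesis, best) / max(len(best), 1)
    return (best, ratio) if ratio <= max_ratio else (None, ratio)

print(verify("epinefrin", ["epinephrine", "atropine", "naloxone"]))
```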
Self-supervised learning (SSL) has transformed speech processing, yet its reliance on massive pre-training datasets remains a bottleneck. While robustness is often attributed to scale and diversity, the role of the data distribution is less understood. We systematically examine how curated subsets of pre-training data influence Automatic Speech Recognition (ASR) performance. Surprisingly, optimizing for acoustic, speaker, or linguistic diversity yields no clear improvements over random sampling. Instead, we find that prioritizing the longest utterances achieves superior ASR results while using only half the original dataset, reducing pre-training time by 24% on a large corpus. These findings suggest that for pre-training speech SSL models, data length is a more critical factor than either data diversity or overall data quantity for performance and efficiency, offering a new perspective for data selection strategies in SSL speech processing.
Adapting automatic speech recognition (ASR) systems based on large language models (LLMs) to new domains using text-only data is a significant yet underexplored challenge. Standard fine-tuning of the LLM on target-domain text often disrupts the critical alignment between speech and text modalities learned by the projector, degrading performance. We introduce a novel text-only adaptation method that emulates the audio projection task by treating it as a text denoising task. Our approach thus trains the LLM to recover clean transcripts from noisy inputs. This process effectively adapts the model to a target domain while preserving cross-modal alignment. Our solution is lightweight, requiring no architectural changes or additional parameters. Extensive evaluation on two datasets demonstrates up to 22.1% relative improvement, outperforming recent state-of-the-art text-only adaptation methods.
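A hedged sketch of the data side of this idea: construct (noisy, clean) pairs from target-domain text so the LLM learns to recover clean transcripts under its usual language-modeling loss. The specific corruption operations and rates below are assumptions, not the paper's recipe.

```python
import random

def corrupt(words, p_drop=0.1, p_swap=0.1, p_dup=0.05, seed=0):
    rng = random.Random(seed)
    noisy = []
    for w in words:
        r = rng.random()
        if r < p_drop:
            continue                      # simulate deletions
        noisy.append(w)
        if r < p_drop + p_dup:
            noisy.append(w)               # simulate stutters/duplications
    for i in range(len(noisy) - 1):       # occasional local swaps
        if rng.random() < p_swap:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
    return noisy

clean = "the patient was administered fifty milligrams of atenolol".split()
pair = {"input": " ".join(corrupt(clean)), "target": " ".join(clean)}
print(pair)  # fine-tune the LLM on such pairs with its usual LM loss
```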
We propose several improvements to speech recognition evaluation. First, we propose a string alignment algorithm that supports multi-reference labeling, arbitrary-length insertions, and better word alignment. This is especially useful for non-Latin languages and languages with rich word formation, and for labeling cluttered or longform speech. Second, we collect a novel test set, DiverseSpeech-Ru, of longform in-the-wild Russian speech with careful multi-reference labeling. We also perform multi-reference relabeling of a popular Russian test set and study fine-tuning dynamics on its corresponding train set. We demonstrate that the model often adapts to dataset-specific labeling, causing an illusion of metric improvement. Based on the improved word alignment, we develop tools to evaluate streaming speech recognition and to align multiple transcriptions to compare them visually. Additionally, we provide uniform wrappers for many offline and streaming speech recognition models. Our code will be made publicly available.
Matching mechanisms play a central role in operations management across diverse fields including education, healthcare, and online platforms. However, experimentally comparing a new matching algorithm against a status quo presents some fundamental challenges due to matching interference, where assigning a unit in one matching may preclude its assignment in the other. In this work, we take a design-based perspective to study the design of randomized experiments to compare two predetermined matching plans on a finite population, without imposing outcome or behavioral models. We introduce the notion of a disagreement set, which captures the difference between the two matching plans, and show that it admits a unique decomposition into disjoint alternating paths and cycles with useful structural properties. Based on these properties, we propose the Alternating Path Randomized Design, which sequentially randomizes along these paths and cycles to effectively manage interference. Within a minimax framework, we optimize the conditional randomization probability and show that, for long paths, the optimal choice converges to $\sqrt{2}-1$, minimizing worst-case variance. We establish the unbiasedness of the Horvitz-Thompson estimator and derive a finite-population Central Limit Theorem that accommodates complex and unstable path and cycle structures as the population grows. Furthermore, we extend the design to many-to-one matchings, where capacity constraints fundamentally alter the structure of the disagreement set. Using graph-theoretic tools, including finding augmenting paths and Euler-tour decomposition on an auxiliary unbalanced directed graph, we construct feasible alternating path and cycle decompositions that allow the design and inference results to carry over.
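In schematic form (notation assumed here rather than copied from the paper), the Horvitz-Thompson estimator and the limiting conditional randomization probability read
\[
\hat{\tau}_{\mathrm{HT}} \;=\; \frac{1}{N} \sum_{i=1}^{N} \left( \frac{Z_i\,Y_i}{\pi_i} \;-\; \frac{(1-Z_i)\,Y_i}{1-\pi_i} \right),
\qquad
p^{*} \longrightarrow \sqrt{2}-1 \approx 0.414 \quad \text{for long paths},
\]
where $Z_i$ indicates whether unit $i$ is assigned under the new matching plan, $\pi_i = \Pr(Z_i = 1)$ is induced by sequentially randomizing along each alternating path or cycle with conditional probability $p$, and $p^{*}$ is the choice minimizing the worst-case variance of $\hat{\tau}_{\mathrm{HT}}$.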
Integrating front-end speech enhancement (SE) models with self-supervised learning (SSL)-based speech models is effective for downstream tasks in noisy conditions. SE models are commonly fine-tuned using SSL representations with mean squared error (MSE) loss between enhanced and clean speech. However, MSE is prone to exploiting positional embeddings in SSL models, allowing the objective to be minimised through positional correlations instead of content-related information. This work frames the problem as a general limitation of self-supervised representation fine-tuning and investigates it through representation-guided SE. Two strategies are considered: (1) zero-padding, previously explored in SSL pre-training but here examined in the fine-tuning setting, and (2) speed perturbations with a soft-DTW loss. Experiments show that the soft-DTW-based approach achieves faster convergence and improved downstream performance, underscoring the importance of position-invariant fine-tuning in SSL-based speech modelling.
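For readers unfamiliar with soft-DTW, the sketch below gives a self-contained NumPy version of the alignment cost: unlike frame-wise MSE, it compares sequences under a soft monotonic alignment, so temporal offsets introduced by speed perturbation cannot be shortcut through positional correlations. The gamma value and squared-Euclidean ground cost are standard choices assumed here.

```python
import numpy as np

def soft_dtw(x, y, gamma=1.0):
    """x: (n, d), y: (m, d) feature sequences; returns the soft-DTW cost."""
    d = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)   # pairwise sq. dist.
    n, m = d.shape
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            r = np.array([R[i - 1, j], R[i, j - 1], R[i - 1, j - 1]])
            rmin = r.min()                                # stabilized soft-min
            softmin = rmin - gamma * np.log(np.exp(-(r - rmin) / gamma).sum())
            R[i, j] = d[i - 1, j - 1] + softmin
    return R[n, m]

a = np.random.randn(50, 16)
print(soft_dtw(a, a[3:]))   # alignment cost tolerates offset/length mismatch
```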
Safe Reinforcement Learning (RL) algorithms are typically evaluated under fixed training conditions. We investigate whether training-time safety guarantees transfer to deployment under distribution shift, using diabetes management as a safety-critical testbed. We benchmark safe RL algorithms on a unified clinical simulator and reveal a safety generalization gap: policies satisfying constraints during training frequently violate safety requirements on unseen patients. We demonstrate that test-time shielding, which filters unsafe actions using learned dynamics models, effectively restores safety across algorithms and patient populations. Across eight safe RL algorithms, three diabetes types, and three age groups, shielding achieves Time-in-Range gains of 13--14\% for strong baselines such as PPO-Lag and CPO while reducing clinical risk index and glucose variability. Our simulator and benchmark provide a platform for studying safety under distribution shift in safety-critical control domains. Code is available at this https URL and this https URL.
Current multimodal LLMs process audio as a mono stream, ignoring the rich spatial information essential for embodied AI. Existing spatial audio models, conversely, are constrained to fixed microphone geometries, preventing deployment across diverse devices. We present PhaseCoder, a transformer-only spatial audio encoder that is agnostic to microphone geometry. PhaseCoder takes raw multichannel audio and microphone coordinates as inputs to perform localization and produces robust spatial embeddings. We demonstrate that Gemma 3n LLM can be fine-tuned to reason over "Spatial Audio Tokens" produced by PhaseCoder. We show our encoder achieves state-of-the-art results on microphone-invariant localization benchmarks and, for the first time, enables an LLM to perform complex spatial reasoning and targeted transcription tasks from an arbitrary microphone array.
The growing prevalence of neurological disorders associated with dysarthria motivates the need for automated intelligibility assessment methods that are applicable across languages. However, most existing approaches are either limited to a single language or fail to capture language-specific factors shaping intelligibility. We present a multilingual phoneme-production assessment framework that integrates universal phone recognition with language-specific phoneme interpretation using contrastive phonological feature distances for phone-to-phoneme mapping and sequence alignment. The framework yields three metrics: phoneme error rate (PER), phonological feature error rate (PFER), and a newly proposed alignment-free measure, phoneme coverage (PhonCov). Analyses on English, Spanish, Italian, and Tamil show that PER benefits from the combination of mapping and alignment, PFER from alignment alone, and PhonCov from mapping. Further analyses demonstrate that the proposed framework captures clinically meaningful patterns of intelligibility degradation consistent with established observations of dysarthric speech.
We study how model architecture and temporal context interact in naturalistic EEG decoding. Using the HBN movie-watching dataset, we benchmark five architectures, CNN, LSTM, a stabilized Transformer (EEGXF), S4, and S5, on a 4-class task across segment lengths from 8s to 128s. Accuracy improves with longer context: at 64s, S5 reaches 98.7%+/-0.6 and CNN 98.3%+/-0.3, while S5 uses ~20x fewer parameters than CNN. To probe real-world robustness, we evaluate zero-shot cross-frequency shifts, cross-task OOD inputs, and leave-one-subject-out generalization. S5 achieves stronger cross-subject accuracy but makes over-confident errors on OOD tasks; EEGXF is more conservative and stable under frequency shifts, though less calibrated in-distribution. These results reveal a practical efficiency-robustness trade-off: S5 for parameter-efficient peak accuracy; EEGXF when robustness and conservative uncertainty are critical.
Recently, the problem of music plagiarism has emerged as an increasingly pressing social issue. As music information retrieval (MIR) research advances, there is a growing effort to address issues related to music plagiarism. However, many studies, including our previous work, have conducted research without clearly defining what the music plagiarism detection task actually involves. This lack of a clear definition has slowed research progress and made it hard to apply results to real-world scenarios. To address this, we define how music plagiarism detection differs from other MIR tasks and identify the problems that need to be solved. We introduce the Similar Music Pair dataset to support this newly defined task. In addition, we propose a method based on segment transcription as one way to solve the task. Our demo and dataset are available at this https URL.
In time-critical eXtended reality (XR) scenarios where users must rapidly reorient their attention to hazards, alerts, or instructions while engaged in a primary task, spatial audio can provide an immediate directional cue without occupying visual bandwidth. However, such scenarios can afford only a brief auditory exposure, requiring users to interpret sound direction quickly and without extended listening or head-driven refinement. This paper reports a controlled exploratory study of rapid spatial-audio localization in XR. Using HRTF-rendered broadband stimuli presented from a semi-dense set of directions around the listener, we quantify how accurately users can infer coarse direction from brief audio alone. We further examine the effects of short-term visuo-auditory feedback training as a lightweight calibration mechanism. Our findings show that brief spatial cues can convey coarse directional information, and that even short calibration can improve users' perception of aural signals. While these results highlight the potential of spatial audio for rapid attention guidance, they also show that auditory cues alone may not provide sufficient precision for complex or high-stakes tasks, and that spatial audio may be most effective when complemented by other sensory modalities or visual cues, without relying on head-driven refinement. We position this study as a preliminary investigation into a first-stage attention-guidance channel for wearable XR (e.g., VR head-mounted displays and AR smart glasses), and provide design insights on stimulus selection and calibration for time-critical use.
We introduce Deep QP Safety Filter, a fully data-driven safety layer for black-box dynamical systems. Our method learns a Quadratic-Program (QP) safety filter without model knowledge by combining Hamilton-Jacobi (HJ) reachability with model-free learning. We construct contraction-based losses for both the safety value and its derivatives, and train two neural networks accordingly. In the exact setting, the learned critic converges to the viscosity solution (and its derivative), even for non-smooth values. Across diverse dynamical systems -- even including a hybrid system -- and multiple RL tasks, Deep QP Safety Filter substantially reduces pre-convergence failures while accelerating learning toward higher returns than strong baselines, offering a principled and practical route to safe, model-free control.
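The shape of such a filter can be sketched as a small cvxpy QP (a schematic under strong assumptions, not the paper's exact formulation): stay close to the nominal RL action while forcing a learned safety value not to decay along a linearization of the learned dynamics. The toy value function, dynamics, and Jacobian below are hand-written stand-ins for the trained networks.

```python
import cvxpy as cp
import numpy as np

def qp_safety_filter(u_nom, V, gradV, xdot_nom, B, alpha=0.1):
    """u_nom: nominal RL action; V, gradV: learned safety value and gradient;
       xdot_nom: learned dynamics at (x, u_nom); B: input Jacobian of f."""
    u = cp.Variable(len(u_nom))
    # First-order condition Vdot >= -alpha * V along the learned dynamics.
    safety = gradV @ (xdot_nom + B @ (u - u_nom)) >= -alpha * V
    prob = cp.Problem(cp.Minimize(cp.sum_squares(u - u_nom)), [safety])
    prob.solve()
    return u.value

# Toy double integrator with a hand-written stand-in for the learned value.
x = np.array([0.5, 0.4])                     # position, velocity
V = 1.0 - x[0] ** 2 - x[0] * x[1]            # safe set: V(x) >= 0
gradV = np.array([-2 * x[0] - x[1], -x[0]])
f = lambda x, u: np.array([x[1], u[0]])      # xdot = (velocity, acceleration)
B = np.array([[0.0], [1.0]])                 # d(xdot)/du
u_nom = np.array([1.0])                      # accelerates toward the boundary
print(qp_safety_filter(u_nom, V, gradV, f(x, u_nom), B))  # brakes instead
```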
Complex networks are powerful representations of complex systems across scales and domains, and the field is experiencing unprecedented growth in data availability. However, real-world network data often suffer from noise, biases, and missing data in the edge weights, which undermine the reliability of downstream network analyses. Standard noise filtering approaches, whether treating individual edges one-by-one or assuming a uniform global noise level, are suboptimal, because in reality both signal and noise can be heterogeneous and correlated across multiple edges. As a solution, we introduce the Network Wiener Filter, a principled method for collective edge-level noise filtering that leverages both network topology and noise characteristics, to reduce error in the observed edge weights and to infer missing edge weights. We demonstrate the broad practical efficacy of the Network Wiener Filter in two distinct settings, the genetic interaction network of the yeast S. cerevisiae and the Enron Corpus email network, noting compelling evidence of successful noise suppression in both applications. With the Network Wiener Filter, we advocate for a shift toward error-aware network science, one that embraces data imperfection as an inherent feature and learns to navigate it effectively.
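The underlying linear-MMSE mechanics can be sketched generically; the paper's substantive part is constructing the covariances from topology and noise characteristics, which this toy example replaces with a random prior, so treat it only as an illustration of the filtering step.

```python
# Wiener-style estimator on edge weights: w_hat = mu + C_w (C_w + C_n)^{-1} (w_obs - mu).
import numpy as np

rng = np.random.default_rng(0)
E = 50                                       # number of edges
A = rng.normal(size=(E, E))
C_w = A @ A.T / E + np.eye(E)                # assumed prior edge-weight covariance
C_n = 0.5 * np.eye(E)                        # noise covariance (heteroscedastic in general)
mu = np.zeros(E)

w_true = rng.multivariate_normal(mu, C_w)
w_obs = w_true + rng.multivariate_normal(np.zeros(E), C_n)

gain = C_w @ np.linalg.inv(C_w + C_n)        # Wiener gain
w_hat = mu + gain @ (w_obs - mu)
print(np.linalg.norm(w_obs - w_true), np.linalg.norm(w_hat - w_true))
```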
In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding ability of their foundation model Qwen3-Omni. We conduct comprehensive internal evaluations in addition to the open-source benchmarks, since ASR models may differ little in benchmark scores yet exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B achieves an average TTFT as low as 92 ms and transcribes 2000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based NAR timestamp predictor able to align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest forced alignment models while offering advantages in efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.
Indoor mobile robot navigation requires fast responsiveness and robust semantic understanding, yet existing methods struggle to provide both. Classical geometric approaches such as SLAM offer reliable localization but depend on detailed maps and cannot interpret human-targeted cues (e.g., signs, room numbers) essential for indoor reasoning. Vision-Language-Action (VLA) models introduce semantic grounding but remain strictly reactive, basing decisions only on visible frames and failing to anticipate unseen intersections or reason about distant textual cues. Vision-Language Models (VLMs) provide richer contextual inference but suffer from high computational latency, making them unsuitable for real-time operation on embedded platforms. In this work, we present IROS, a real-time navigation framework that combines VLM-level contextual reasoning with the efficiency of lightweight perceptual modules on low-cost, on-device hardware. Inspired by Dual Process Theory, IROS separates fast reflexive decisions (System One) from slow deliberative reasoning (System Two), invoking the VLM only when necessary. Furthermore, by augmenting compact VLMs with spatial and textual cues, IROS delivers robust, human-like navigation with minimal latency. Across five real-world buildings, IROS improves decision accuracy and reduces latency by 66% compared to continuous VLM-based navigation.
Can classical consensus models predict the group behavior of large language models (LLMs)? We examine multi-round interactions among LLM agents through the DeGroot framework, where agents exchange text-based messages over diverse communication graphs. To track opinion evolution, we map each message to an opinion score via sentiment analysis. We find that agents typically reach consensus and the disagreement between the agents decays exponentially. However, the limiting opinion departs from DeGroot's network-centrality-weighted forecast. The consensus between LLM agents turns out to be largely insensitive to initial conditions and instead depends strongly on the discussion subject and inherent biases. Nevertheless, transient dynamics align with classical graph theory and the convergence rate of opinions is closely related to the second-largest eigenvalue of the graph's combination matrix. Together, these findings can be useful for LLM-driven social-network simulations and the design of resource-efficient multi-agent LLM applications.
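For reference, the DeGroot baseline the agents are compared against is a one-line update per round; the sketch below simulates it and reads off the convergence rate from the second-largest eigenvalue magnitude of the combination matrix.

```python
# DeGroot dynamics: x_{t+1} = W x_t with row-stochastic W; disagreement
# decays geometrically at a rate set by the second-largest |eigenvalue|.
import numpy as np

rng = np.random.default_rng(1)
n = 6
W = rng.random((n, n))
W /= W.sum(axis=1, keepdims=True)            # row-stochastic weights

x = rng.random(n)                            # initial opinion scores
for _ in range(200):
    x = W @ x

eigvals = np.sort(np.abs(np.linalg.eigvals(W)))
print("consensus:", x.round(4))
print("convergence rate ~ |lambda_2| =", eigvals[-2].round(4))
```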
We explore how command stack protection requirements outlined in NASA-STD-1006A can be satisfied within the context of emergency space telemetry. The proposed implementation of lightweight authenticated encryption offers strong security without sacrificing performance in resource-constrained environments. It produces fixed-length messages, maintaining compatibility with the underlying data transport protocols. By focusing on predictable properties and robust authentication, we create a scheme that protects the confidentiality, integrity, and authenticity of telemetry data in emergency communications while balancing security requirements with operational constraints.
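Since the scheme itself is not spelled out here, the sketch below uses ChaCha20-Poly1305, a widely deployed AEAD, purely to illustrate the fixed-length and authenticated-header properties the abstract relies on; key management, nonce coordination, and the framing fields are assumptions, not the proposed design.

```python
# AEAD demo: ciphertext length = plaintext length + 16-byte tag, so frame
# sizes stay predictable for the underlying transport protocol.
import os
from cryptography.hazmat.primitives.ciphers.aead import ChaCha20Poly1305

key = ChaCha20Poly1305.generate_key()
aead = ChaCha20Poly1305(key)

telemetry = b"EMERG;BATT=11.2V;MODE=SAFE"    # hypothetical telemetry payload
header = b"SCID=0x42;VCID=7"                 # authenticated but not encrypted
nonce = os.urandom(12)                       # must never repeat under one key

ct = aead.encrypt(nonce, telemetry, header)
assert len(ct) == len(telemetry) + 16        # fixed overhead, fixed length
assert aead.decrypt(nonce, ct, header) == telemetry
```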
Smart meter data are the foundation for planning and operating the distribution network. Unfortunately, such data are not always available due to privacy regulations. Meanwhile, the collected data may be corrupted due to sensor or transmission failure, or may lack sufficient resolution for downstream tasks. A wide range of generative tasks has been formulated to address these issues, including synthetic data generation, missing data imputation, and super-resolution. Despite the success of machine learning models on these tasks, dedicated models need to be designed and trained for each task, leading to redundancy and inefficiency. In this paper, recognizing the powerful modeling capability of flow matching models, we propose a new approach that unifies diverse smart meter data generative tasks with a single model trained for conditional generation. The proposed flow matching models are trained to generate challenging, high-dimensional time series data, specifically monthly smart meter data at 15-minute resolution. By viewing different generative tasks as distinct forms of partial data observations and injecting them into the generation process, we unify tasks such as imputation and super-resolution with a single model, eliminating the need for re-training. The data generated by our model are not only consistent with the given observations but also realistic, outperforming interpolation and other machine-learning baselines dedicated to these tasks.
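The unification can be pictured as one conditioning interface: every task becomes a binary observation mask over the same load-profile tensor, as in the sketch below. Shapes, mask conventions, and the subsampling stand-in for super-resolution are assumptions, not the paper's exact conditioning scheme.

```python
import numpy as np

T = 96 * 30                                  # one month at 15-min resolution
profile = np.random.rand(T)                  # stand-in for a meter profile

def task_mask(task: str) -> np.ndarray:
    m = np.zeros(T, dtype=bool)
    if task == "imputation":                 # observed except an outage gap
        m[:] = True
        m[1000:1400] = False
    elif task == "super_resolution":         # coarse observations only
        m[::4] = True                        # (hourly averages, approximated
                                             #  here by subsampling)
    # "generation": m stays all-False (nothing observed)
    return m

for task in ["generation", "imputation", "super_resolution"]:
    m = task_mask(task)
    condition = np.where(m, profile, 0.0)    # injected into the flow model
    print(task, int(m.sum()), "observed points")
```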
This paper addresses the critical challenge of coordinating mobile edge UAV networks to maintain robust service in highly dynamic spatiotemporal environments. Conventional Deep Reinforcement Learning (DRL) approaches often suffer from catastrophic forgetting when transitioning between distinct task scenarios, such as moving from dense urban clusters to sparse rural areas. These transitions typically necessitate computationally expensive retraining or model resets to adapt to new user distributions, leading to service interruptions. To overcome these limitations, we propose a computationally efficient Spatiotemporal Continual Learning (STCL) framework realized through a Group-Decoupled Multi-Agent Proximal Policy Optimization (G-MAPPO) algorithm. Our approach integrates a novel Group-Decoupled Policy Optimization (GDPO) mechanism that utilizes dynamic $z$-score normalization to autonomously balance heterogeneous objectives, including energy efficiency, user fairness, and coverage. This mechanism effectively mitigates gradient conflicts induced by concept drifts without requiring offline retraining. Furthermore, the framework leverages the 3D mobility of UAVs as a spatial compensation layer, enabling the swarm to autonomously adjust altitudes to accommodate extreme density fluctuations. Extensive simulations demonstrate that the proposed STCL framework achieves superior resilience, characterized by an elastic recovery of service reliability to approximately 0.95 during phase transitions. Compared to the MADDPG baseline, G-MAPPO not only prevents knowledge forgetting but also delivers an effective capacity gain of 20\% under extreme traffic loads, validating its potential as a scalable solution for edge-enabled aerial swarms.
Massive multiple-input multiple-output (mMIMO) downlink precoding offers high spectral efficiency but remains challenging to deploy in practice because near-optimal algorithms such as the weighted minimum mean squared error (WMMSE) are computationally expensive, and sensitive to SNR and channel-estimation quality, while existing deep learning (DL)-based solutions often lack robustness and require retraining for each deployment site. This paper proposes a plug-and-play precoder (PaPP), a DL framework with a backbone that can be trained for either fully digital (FDP) or hybrid beamforming (HBF) precoding and reused across sites, transmit-power levels, and with varying amounts of channel estimation error, avoiding the need to train a new model from scratch at each deployment. PaPP combines a high-capacity teacher and a compact student with a self-supervised loss that balances teacher imitation and normalized sum-rate, trained using meta-learning domain-generalization and transmit-power-aware input normalization. Numerical results on ray-tracing data from three unseen sites show that the PaPP FDP and HBF models both outperform conventional and deep learning baselines, after fine-tuning with a small set of local unlabeled samples. Across both architectures, PaPP achieves more than 21$\times$ reduction in modeled computation energy and maintains good performance under channel-estimation errors, making it a practical solution for energy-efficient mMIMO precoding.
An agent operating in an unknown dynamical system must learn its dynamics from observations. Active information gathering accelerates this learning, but existing methods derive bespoke costs for specific modeling choices: dynamics models, belief update procedures, observation models, and planners. We present a unifying framework that decouples these choices from the information-gathering cost by explicitly exposing the causal dependencies between parameters, beliefs, and controls. Using this framework, we derive a general information-gathering cost based on Massey's directed information that assumes only Markov dynamics with additive noise and is otherwise agnostic to modeling choices. We prove that the mutual information cost used in existing literature is a special case of our cost. Then, we leverage our framework to establish an explicit connection between the mutual information cost and information gain in linearized Bayesian estimation, thereby providing theoretical justification for mutual information-based active learning approaches. Finally, we illustrate the practical utility of our framework through experiments spanning linear, nonlinear, and multi-agent systems.
Voltage (Volt) and reactive-power (VAR) control in transmission networks is critical for reliability and increasingly needs fast, implementable decisions. This paper presents a transmission Volt/VAR Optimization (VVO) framework that co-optimizes discrete control of on-load tap-changing transformers (OLTCs) and capacitor banks (CBs) with AC power flow (ACPF) physics to improve voltage stability and minimize VAR generation. The framework follows a relax-round-resolve pipeline: a continuous relaxation proposes targets, a rounding step selects feasible discrete settings, and a final solve enforces AC power flow physics. Extensive experiments on IEEE, PEGASE, and RTE systems show consistent improvements in voltage and VAR quality metrics with modest generator redispatch while preserving economic operation and achieving runtimes compatible with real-time transmission operations.
This paper investigates information freshness in a remote estimation system in which the remote information source is a continuous-time Markov chain (CTMC). For such systems, estimators have been mainly restricted to the class of martingale estimators in which the remote estimate at any time is equal to the value of the most recently received update. This is mainly due to the simplicity and ease of analysis of martingale estimators, which however are far from optimal, especially in query-based (i.e., pull-based) update systems. In such systems, maximum a-posteriori probability (MAP) estimators are optimal. However, MAP estimators can be challenging to analyze in continuous-time settings. In this paper, we introduce a new class of estimators, called structured estimators, which can seamlessly shift from a martingale estimator to a MAP estimator, enabling them to retain useful characteristics of the MAP estimate, while still being analytically tractable. Particularly, we introduce a new estimator termed the $p$-MAP estimator, which is a piecewise-constant approximation of the MAP estimator with finitely many discontinuities, bringing us closer to a full characterization of MAP estimators when modeling information freshness. In fact, we show that for time-reversible CTMCs, the MAP estimator reduces to a $p$-MAP estimator. Using the binary freshness (BF) process for the characterization of information freshness, we derive the freshness expressions and provide optimal state-dependent sampling policies (i.e., querying policies) for maximizing the mean BF (MBF) for pull-based remote estimation of a single CTMC information source, when structured estimators are used. Moreover, we provide optimal query rate allocation policies when a monitor pulls information from multiple heterogeneous CTMCs with a constraint on the overall query rate.
Accurate reconstruction of atmospheric wind fields is essential for applications such as weather forecasting, hazard prediction, and wind energy assessment, yet conventional instruments leave spatio-temporal gaps within the lower atmospheric boundary layer. Unmanned aircraft systems (UAS) provide flexible in situ measurements, but individual platforms sample wind only along their flight trajectories, limiting full wind-field recovery. This study presents a framework for reconstructing four-dimensional atmospheric wind fields using measurements obtained from a coordinated UAS swarm. A synthetic turbulence environment and high-fidelity multirotor simulation are used to generate training and evaluation data. Local wind components are estimated from UAS dynamics using a bidirectional long short-term memory network (Bi-LSTM) and assimilated into a physics-informed neural network (PINN) to reconstruct a continuous wind field in space and time. For local wind estimation, the bidirectional LSTM achieves root-mean-square errors (RMSE) of 0.064 and 0.062 m/s for the north and east components in low-wind conditions, increasing to 0.122 to 0.129 m/s under moderate winds and 0.271 to 0.273 m/s in high-wind conditions, while the vertical component exhibits higher error, with RMSE values of 0.029 to 0.091 m/s. The physics-informed reconstruction recovers the dominant spatial and temporal structure of the wind field up to 1000 m altitude while preserving mean flow direction and vertical shear. Under moderate wind conditions, the reconstructed mean wind field achieves an overall RMSE between 0.118 and 0.154 m/s across evaluated UAS configurations, with the lowest error obtained using a five-UAS swarm. These results demonstrate that coordinated UAS measurements enable accurate and scalable four-dimensional wind-field reconstruction without dedicated wind sensors or fixed infrastructure.
Current methods for converting circuit schematic images into machine-readable netlists struggle with component recognition and connectivity inference. In this paper, we present SINA, an open-source, fully automated circuit schematic image-to-netlist generator. SINA integrates deep learning for accurate component detection, Connected-Component Labeling (CCL) for precise connectivity extraction, and Optical Character Recognition (OCR) for component reference designator retrieval, while employing a Vision-Language Model (VLM) for reliable reference designator assignments. In our experiments, SINA achieves 96.47% overall netlist-generation accuracy, which is 2.72x higher than state-of-the-art approaches.
Neural networks have been successfully applied in various resource-constrained edge devices, which usually contain central processing units (CPUs) rather than graphics processing units due to limited power availability. State-of-the-art research still focuses on efficiently executing enormous numbers of multiply-accumulate (MAC) operations. However, CPUs themselves are not good at executing such mathematical operations on a large scale, since they are better suited to executing control flow logic, i.e., computer algorithms. To enhance the computation efficiency of neural networks on CPUs, in this paper, we propose to convert them into logic flows for execution. Specifically, neural networks are first converted into equivalent decision trees, from which decision paths with constant leaves are then selected and compressed into logic flows. Such logic flows consist of if and else structures and a reduced number of MAC operations. Experimental results demonstrate that the latency can be reduced by up to 14.9% on a simulated RISC-V CPU without any accuracy degradation. The code is open source at this https URL
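A minimal demonstration of the tree-to-logic-flow step (an illustrative sketch, not the paper's converter): train a small sklearn decision tree, then emit nested if/else source whose leaves are constants, trading MAC-heavy inference for the branchy control flow CPUs execute well.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
import numpy as np

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
t = clf.tree_

def emit(node=0, depth=1):
    pad = "    " * depth
    if t.children_left[node] == -1:                      # leaf: constant
        return f"{pad}return {int(np.argmax(t.value[node]))}\n"
    f_, thr = t.feature[node], t.threshold[node]         # 4 decimals suffice here
    return (f"{pad}if x[{f_}] <= {thr:.4f}:\n"
            + emit(t.children_left[node], depth + 1)
            + f"{pad}else:\n"
            + emit(t.children_right[node], depth + 1))

src = "def predict(x):\n" + emit()
exec(src)                                                # defines predict()
assert all(predict(x) == p for x, p in zip(X, clf.predict(X)))
print(src)                                               # pure logic flow
```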
Self-supervised methods have recently proved to be nearly as effective as supervised ones in various imaging inverse problems, paving the way for learning-based approaches in scientific and medical imaging applications where ground truth data is hard or expensive to obtain. These methods critically rely on invariance to translations and/or rotations of the image distribution to learn from incomplete measurement data alone. However, existing approaches fail to obtain competitive performances in the problems of image super-resolution and deblurring, which play a key role in most imaging systems. In this work, we show that invariance to roto-translations is insufficient to learn from measurements that only contain low-frequency information. Instead, we propose scale-equivariant imaging, a new self-supervised approach that leverages the fact that many image distributions are approximately scale-invariant, enabling the recovery of high-frequency information lost in the measurement process. We demonstrate through a series of experiments on real datasets that the proposed method outperforms other self-supervised approaches, and obtains performance on par with fully supervised learning.
Satisfying safety constraints is a priority concern when solving optimal control problems (OCPs). Due to the infeasibility phenomenon, where a constraint-satisfying solution cannot be found, it is necessary to identify a feasible region before implementing a policy. Existing feasibility theories built for model predictive control (MPC) only consider the feasibility of the optimal policy. However, reinforcement learning (RL), as another important control method, solves the optimal policy in an iterative manner, which comes with a series of non-optimal intermediate policies. Feasibility analysis of these non-optimal policies is also necessary for iteratively improving constraint satisfaction; but that is not available under existing MPC feasibility theories. This paper proposes a feasibility theory that applies to both MPC and RL by filling in the missing part of feasibility analysis for an arbitrary policy. The basis of our theory is to decouple policy solving and implementation into two temporal domains: virtual-time domain and real-time domain. This allows us to separately define initial and endless, state and policy feasibility, and their corresponding feasible regions. Based on these definitions, we analyze the containment relationships between different feasible regions, which enables us to describe the feasible region of an arbitrary policy. We further provide virtual-time constraint design rules along with a practical design tool called feasibility function that helps to achieve the maximum feasible region. We review most existing constraint formulations and point out that they are essentially applications of feasibility functions in different forms. We demonstrate our feasibility theory by visualizing different feasible regions under both MPC and RL policies in an emergency braking control task.
Variable renewable energy droughts, so-called Dunkelflaute events, emerge as a challenge for climate-neutral energy systems based on variable renewables. Here we characterize European drought events for on- and offshore wind power, solar photovoltaics, and renewable technology portfolios, using 38 historic weather years and an advanced identification method. Their characteristics heavily depend on the chosen drought threshold, questioning the usefulness of single-threshold analyses. Applying a multi-threshold framework, we quantify how the complementarity of wind and solar power temporally and spatially alleviates drought frequency, return periods, duration, and severity within (portfolio effect) and across countries (balancing effect). We identify the most extreme droughts, which drive major discharging periods of long-duration storage in a fully renewable European energy system, based on a policy-relevant decarbonization scenario. Such events comprise sequences of shorter droughts of varying severity. The most extreme event occurred in winter 1996/97 and lasted 55 days in an idealized, perfectly interconnected setting. The average renewable availability during this period was still 47% of its long-run mean. System planners must consider such events when planning for storage and other flexibility technologies. Methodologically, we conclude that using arbitrary single calendar years is not suitable for modeling weather-resilient energy scenarios.
Hybrid radar fusion (HRF), which combines monostatic and bistatic sensing in a common spectrum, offers enhanced spatial diversity, but is particularly vulnerable to quantization error effects due to the large power imbalance between the direct and reflected uplink signals. Although finite-resolution analog-to-digital converters (ADCs) have been considered in the existing literature on integrated sensing and communication (ISAC), their role in HRF architectures has not yet been characterized. This paper develops a finite-resolution quantized sensing-communication framework for HRF systems by deriving a Cramér-Rao bound (CRB) and achievable uplink rate. Tight lower bounds on the Fisher information matrix and the communication rate are obtained, enabling a tractable characterization of finite-resolution quantized HRF. The fundamental sensing-communication trade-off is then characterized through two complementary constrained formulations: CRB minimization subject to per-user uplink rate requirements, and sum-rate maximization subject to a CRB constraint, whose solutions trace the CRB-rate trade-offs in HRF. Numerical results reveal how ADC resolution, dynamic range, and system configuration jointly shape this boundary and show that HRF performance can degrade sharply under coarse quantization due to the weak bistatic component, providing design guidelines for selecting ADC architectures and operating regimes in future HRF-enabled ISAC systems.
Passive sonar signals contain complex characteristics often arising from environmental noise, vessel machinery, and propagation effects. While convolutional neural networks (CNNs) perform well on passive sonar classification tasks, they can struggle with statistical variations that occur in the data. To investigate this limitation, synthetic underwater acoustic datasets are generated that center on amplitude and period variations. Two metrics are proposed to quantify and validate these characteristics in the context of statistical and structural texture for passive sonar. These measures are applied to real-world passive sonar datasets to assess texture information in the signals and to correlate it with model performance. Results show that CNNs underperform on statistically textured signals, but incorporating explicit statistical texture modeling yields consistent improvements. These findings highlight the importance of quantifying texture information for passive sonar classification.
We study diffusion-based speech enhancement using a Schrödinger bridge formulation and extend the EDM2 framework to this setting. We employ time-dependent preconditioning of network inputs and outputs to stabilize training and explore two skip-connection configurations that allow the network to predict either environmental noise or clean speech. To control activation and weight magnitudes, we adopt a magnitude-preserving architecture and learn the contribution of the noisy input within each network block for improved conditioning. We further analyze the impact of exponential moving average (EMA) parameter smoothing by approximating different EMA profiles post training, finding that, unlike in image generation, short or absent EMA consistently yields better speech enhancement performance. Experiments on VoiceBank-DEMAND and EARS-WHAM demonstrate competitive signal-to-distortion ratios and perceptual scores, with the two skip-connection variants exhibiting complementary strengths. These findings provide new insights into EMA behavior, magnitude preservation, and skip-connection design for diffusion-based speech enhancement.
Deployment complexity and specialized hardware requirements hinder the adoption of deep learning models in neuroimaging. We present MindGrab, a lightweight, fully convolutional model for volumetric skull stripping across all imaging modalities. MindGrab's architecture is designed from first principles using a spectral interpretation of dilated convolutions, and demonstrates state-of-the-art performance (mean Dice score across datasets and modalities: 95.9 with SD 1.6), with up to 40-fold speedups and substantially lower memory demands compared to established methods. Its minimal footprint allows for fast, full-volume processing in resource-constrained environments, including direct in-browser execution. MindGrab is delivered via the BrainChop platform as both a simple command-line tool (pip install brainchop) and a zero-installation web application (this http URL). By removing traditional deployment barriers without sacrificing accuracy, MindGrab makes state-of-the-art neuroimaging analysis broadly accessible.
The growing volume of medical imaging data has increased the need for automated diagnostic tools, especially for musculoskeletal injuries like rib fractures, commonly detected via CT scans. Manual interpretation is time-consuming and error-prone. We propose OrthoInsight, a multi-modal deep learning framework for rib fracture diagnosis and report generation. It integrates a YOLOv9 model for fracture detection, a medical knowledge graph for retrieving clinical context, and a fine-tuned LLaVA language model for generating diagnostic reports. OrthoInsight combines visual features from CT images with expert textual data to deliver clinically useful outputs. Evaluated on 28,675 annotated CT images and expert reports, it achieves high performance across Diagnostic Accuracy, Content Completeness, Logical Coherence, and Clinical Guidance Value, with an average score of 4.28, outperforming models like GPT-4 and Claude-3. This study demonstrates the potential of multi-modal learning in transforming medical image analysis and providing effective support for radiologists.
We introduce a unified framework for analyzing utility regions of wireless networks, with a focus on signal-to-interference-plus-noise-ratio (SINR) and achievable rate regions. The framework provides valuable insights into interference patterns of modern network architectures, including extremely large MIMO and cell-less networks. A central contribution is a simple characterization of feasible utility regions using the concept of spectral radius of nonlinear mappings. This characterization provides a powerful mathematical tool for wireless system design and analysis. For example, it allows us to generalize existing characterizations of the weak Pareto boundary using compact notation. It also allows us to derive tractable sufficient conditions for the identification of convex utility regions. This property is particularly important because, on the weak Pareto boundary, it guarantees that time sharing (or user grouping) cannot simultaneously improve the utilities of all users. Beyond geometrical insights, these sufficient conditions have two key implications. First, they identify a family of (weighted) sum-rate maximization problems that are inherently convex, thus paving the way for the development of efficient, provably optimal solvers for this family. Second, they provide justification for formulating sum-rate maximization problems directly in terms of achievable rates, rather than SINR levels. Our theoretical insights also motivate an alternative to the concept of favorable propagation in the massive MIMO literature -- one that explicitly accounts for self-interference and the beamforming strategy.
This paper proposes a new technique for the computer modeling of linear filters based on the spectral form of mathematical description of linear systems. It represents the input and output signals of the filter as orthogonal expansions, while the filters themselves are described by two-dimensional non-stationary transfer functions. This technique allows one to model the output signal in continuous time, and it is successfully tested on Butterworth, Linkwitz-Riley, and Chebyshev filters of different orders.
This paper proposes a spatiotemporal (ST) fusion framework robust against diverse noise for satellite images, named Temporally-Similar Structure-Aware ST fusion (TSSTF). ST fusion is a promising approach to address the trade-off between the spatial and temporal resolution of satellite images. In real-world scenarios, observed satellite images are severely degraded by noise due to measurement equipment and environmental conditions. Consequently, some recent studies have focused on enhancing the robustness of ST fusion methods against noise. However, existing noise-robust ST fusion approaches often fail to capture fine spatial structure, leading to oversmoothing and artifacts. To address this issue, TSSTF introduces two key mechanisms: Temporally-Guided Total Variation (TGTV) and Temporally-Guided Edge Constraint (TGEC). TGTV is a weighted total variation-based regularization that promotes spatial piecewise smoothness while preserving structural details, guided by a reference high spatial resolution image acquired on a nearby date. TGEC enforces consistency in edge locations between two temporally adjacent images, while allowing for spectral variations. We formulate the ST fusion task as a constrained optimization problem incorporating TGTV and TGEC, and develop an efficient algorithm based on a preconditioned primal-dual splitting method. Experimental results demonstrate that TSSTF performs comparably to state-of-the-art methods under noise-free conditions and outperforms them under noisy conditions.
The cochlear implant (CI) is a successful biomedical device that enables individuals with severe-to-profound hearing loss to perceive sound through electrical stimulation, yet listening in noise remains challenging. Recent deep learning advances offer promising potential for CI sound coding by integrating visual cues. In this study, an audio-visual speech enhancement (AVSE) module is integrated with the ElectrodeNet-CS (ECS) model to form the end-to-end CI system, AVSE-ECS. Simulations show that the AVSE-ECS system with joint training achieves high objective speech intelligibility and improves the signal-to-error ratio (SER) by 7.4666 dB compared to the advanced combination encoder (ACE) strategy. These findings underscore the potential of AVSE-based CI sound coding.
We propose a novel diffusion model, termed the non-identical diffusion model, and investigate its application to wireless orthogonal frequency division multiplexing (OFDM) channel generation. Unlike the standard diffusion model that uses a scalar-valued time index to represent the global noise level, we extend this notion to an element-wise time indicator to capture local error variations more accurately. Non-identical diffusion enables us to characterize the reliability of each element (e.g., subcarriers in OFDM) within the noisy input, leading to improved generation results when the initialization is biased. Specifically, we focus on the recovery of wireless multi-input multi-output (MIMO) OFDM channel matrices, where the initial channel estimates exhibit highly uneven reliability across elements due to the pilot scheme. Conventional time embeddings, which assume uniform noise progression, fail to capture such variability across pilot schemes and noise levels. We introduce a matrix that matches the input size to control element-wise noise progression. Following a similar diffusion procedure to existing methods, we show the correctness and effectiveness of the proposed non-identical diffusion scheme both theoretically and numerically. For MIMO-OFDM channel generation, we propose a dimension-wise time embedding strategy. We also develop and evaluate multiple training and generation methods and compare them through numerical experiments.
In this work, we consider the problem of identifying an unknown linear dynamical system given a finite hypothesis class. In particular, we analyze the effect of the excitation input on the sample complexity of identifying the true system with high probability. To this end, we present sample complexity lower bounds that capture the choice of the selected excitation input. The sample complexity lower bound gives rise to a system theoretic condition to determine the potential benefit of experiment design. Informed by the analysis of the sample complexity lower bound, we propose a persistent excitation (PE) condition tailored to the considered setting, which we then use to establish sample complexity upper bounds. Notably, the PE condition is weaker than in the case of an infinite hypothesis class and allows analyzing different excitation inputs modularly. Crucially, the lower and upper bounds share the same dependency on key problem parameters. Finally, we leverage these insights to propose an active learning algorithm that sequentially excites the system optimally with respect to the current estimate, and provide sample complexity guarantees for the presented algorithm. Concluding simulations showcase the effectiveness of the proposed algorithm.
A persistent challenge in generative audio models is data replication, where the model unintentionally generates parts of its training data during inference. In this work, we address this issue in text-to-audio diffusion models by exploring the use of anti-memorization strategies. We adopt Anti-Memorization Guidance (AMG), a technique that modifies the sampling process of pre-trained diffusion models to discourage memorization. Our study explores three types of guidance within AMG, each designed to reduce replication while preserving generation quality. We use Stable Audio Open as our backbone, leveraging its fully open-source architecture and training dataset. Our comprehensive experimental analysis suggests that AMG significantly mitigates memorization in diffusion-based text-to-audio generation without compromising audio fidelity or semantic alignment.
Real-time Magnetic Resonance Imaging (rtMRI) visualizes vocal tract action, offering a comprehensive window into speech articulation. However, its signals are high dimensional and noisy, hindering interpretation. We investigate compact representations of spatiotemporal articulatory dynamics for phoneme recognition from midsagittal vocal tract rtMRI videos. We compare three feature types: (1) raw video, (2) optical flow, and (3) six linguistically-relevant regions of interest (ROIs) for articulator movements. We evaluate models trained independently on each representation, as well as multi-feature combinations. Results show that multi-feature models consistently outperform single-feature baselines, with the lowest phoneme error rate (PER) of 0.34 obtained by combining ROI and raw video. Temporal fidelity experiments demonstrate a reliance on fine-grained articulatory dynamics, while ROI ablation studies reveal strong contributions from tongue and lips. Our findings highlight how rtMRI-derived features provide accuracy and interpretability, and establish strategies for leveraging articulatory data in speech processing.
Recent work reports gains in neural text-to-speech (TTS) with Group Relative Policy Optimization (GRPO). However, in the absence of a verifiable reward for prosody, GRPO trained on transcription-oriented signals (CER/NLL) lowers error rates yet collapses prosody into monotone, unnatural speech; adding speaker similarity further destabilizes training and degrades CER. We address this with an iterative Direct Preference Optimization (DPO) scheme that uses only a few hundred human-labeled preference pairs per round to directly optimize prosodic naturalness while regularizing to the current model. On KoCC-TTS, a curated dataset of authentic Korean call center interactions capturing task-oriented dialogues, our method attains the highest human preference (ELO) with competitive CER, outperforming GRPO and strong commercial baselines. These results suggest that when prosody cannot be rewarded automatically, human preference optimization offers a practical and data-efficient path to natural and robust TTS. The demo page is available at this https URL.
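The optimization target here is the standard DPO objective (Rafailov et al.); a reference PyTorch form is sketched below, with batch shapes and the beta value as assumptions and prosody-preference data collection out of scope.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Sequence log-probs of preferred (w) and rejected (l) samples under
    the policy being tuned and the frozen reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy batch of 4 preference pairs
policy_w = torch.tensor([-10.2, -9.8, -11.0, -10.5], requires_grad=True)
policy_l = torch.tensor([-10.0, -10.1, -10.8, -11.2], requires_grad=True)
ref_w = torch.tensor([-10.1, -10.0, -11.1, -10.9])
ref_l = torch.tensor([-10.0, -10.0, -10.9, -11.0])
loss = dpo_loss(policy_w, policy_l, ref_w, ref_l)
loss.backward()   # pushes up preferred samples, down rejected ones
print(float(loss))
```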
The goal of this paper is to introduce SPADE, a framework for Structured Pruning and Adaptive Distillation for Efficient Large Language Model-based text-to-speech (LLM-TTS). Recent LLM-TTS systems achieve strong controllability and zero-shot generalization, but their large parameter counts and high latency limit real-world deployment. SPADE addresses this by combining (i) a pruning step guided by a word-error-rate-based layer importance index to remove non-essential Transformer layers, with (ii) multi-level knowledge distillation to restore autoregressive coherence. On zero-shot benchmarks, SPADE preserves near-parity perceptual quality while halving Transformer depth, reducing VRAM usage by up to 20%, and achieving up to 1.7x faster real-time factor with less than 5% of the original training data. These results show that compact LLM-TTS models can maintain naturalness and speaker similarity while enabling practical real-time speech generation. Audio samples are available at this https URL.
Audiovisual instance segmentation (AVIS) requires accurately localizing and tracking sounding objects throughout video sequences. Existing methods suffer from visual bias stemming from two fundamental issues: uniform additive fusion prevents queries from specializing to different sound sources, while visual-only training objectives allow queries to converge to arbitrary salient objects. We propose Audio-Centric Query Generation using cross-attention, enabling each query to selectively attend to distinct sound sources and carry sound-specific priors into visual decoding. Additionally, we introduce a Sound-Aware Ordinal Counting (SAOC) loss that explicitly supervises the number of sounding objects through ordinal regression with monotonic consistency constraints, preventing visual-only convergence during training. Experiments on the AVISeg benchmark demonstrate consistent improvements: +1.64 mAP, +0.6 HOTA, and +2.06 FSLA, validating that query specialization and explicit counting supervision are crucial for accurate audiovisual instance segmentation.
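SAOC's details are in the paper, but the general shape of an ordinal counting loss with monotonic consistency can be sketched: predict P(count > k) for each threshold k, make the sequence non-increasing by construction, and supervise with binary targets. The names and the cumulative-product construction below are assumptions, not the paper's exact loss.

```python
# Hedged sketch of an ordinal counting loss with monotonicity by construction.
import torch
import torch.nn.functional as F

def ordinal_count_loss(logits, counts):
    # logits: (B, K) raw scores; counts: (B,) ground-truth numbers of sounding objects.
    p_gt = torch.sigmoid(logits).cumprod(dim=1)   # P(count > k), non-increasing in k
    ks = torch.arange(logits.shape[1], device=counts.device)
    targets = (counts.unsqueeze(1) > ks).float()  # 1 while k < count, else 0
    return F.binary_cross_entropy(p_gt, targets)

# The implied count estimate is the number of thresholds with P(count > k) > 0.5.
```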
Recent advances in text-to-speech (TTS) technology have enabled systems to generate speech that is often indistinguishable from human speech, bringing benefits to accessibility, content creation, and human-computer interaction. However, current evaluation practices are increasingly inadequate for capturing the full range of capabilities, limitations, and societal impacts of modern TTS systems. This position paper introduces the concept of Responsible Evaluation and argues that it is essential and urgent for the next phase of TTS development, structured through three progressive levels: (1) ensuring the faithful and accurate reflection of a model's true capabilities and limitations, with more robust, discriminative, and comprehensive objective and subjective scoring methodologies; (2) enabling comparability, standardization, and transferability through standardized benchmarks, transparent reporting, and transferable evaluation metrics; and (3) assessing and mitigating ethical risks associated with forgery, misuse, privacy violations, and security vulnerabilities. Through this concept, we critically examine current evaluation practices, identify systemic shortcomings, and propose actionable recommendations. We hope this concept will not only foster more reliable and trustworthy TTS technology but also guide its development toward ethically sound and societally beneficial applications.
We introduce mathematical tools and fixed point algorithms for optimal statistical max-min power control in cellular and cell-less massive MIMO systems. Unlike previous studies that rely on the use-and-then-forget (UatF) lower bound on Shannon achievable (ergodic) rates, our proposed framework can deal with alternative bounds that explicitly consider perfect or imperfect channel state information (CSI) at the decoder. In doing so, we address limitations of UatF-based power control algorithms, which inherit the shortcomings of the UatF bound. For example, the UatF bound can be overly conservative: in extreme cases, under fully statistical (nonadaptive) beamforming in zero-mean channels, the UatF bound produces trivial (zero) rate bounds. It also lacks scale invariance: merely scaling the beamformers can change the bound drastically. In contrast, our framework is compatible with information-theoretic bounds that do not suffer from the above drawbacks. We illustrate the framework by solving a max-min power control problem considering a standard bound that exploits instantaneous CSI at the decoder.
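As a point of reference for the algorithmic template, max-min power control with standard interference mappings is typically solved with a normalized fixed-point iteration of the following shape; the mapping `I` below is a generic toy placeholder, not the rate-bound-specific mapping derived in the paper.

```python
# Generic normalized fixed-point iteration for max-min power control
# (nonlinear Perron-Frobenius style); illustrative, not the paper's algorithm.
import numpy as np

def max_min_power(I, p0, p_max, iters=200):
    # I(p): per-user "required power" mapping (a standard interference function);
    # p_max: total power budget.
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        q = I(p)
        p = p_max * q / q.sum()   # renormalize so the budget is always met
    return p

# Toy 3-user example: affine interference-plus-noise mapping.
G = np.array([[1.0, 0.2, 0.1], [0.15, 1.0, 0.2], [0.1, 0.25, 1.0]])
I = lambda p: (G - np.diag(np.diag(G))) @ p + 0.1
print(max_min_power(I, np.ones(3), p_max=10.0))
```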
We introduce CASTELLA, a human-annotated audio benchmark for the task of audio moment retrieval (AMR). Although AMR has various promising applications, there is still no established benchmark with real-world data. The initial study of AMR trained models solely on synthetic datasets, and its evaluation relied on an annotated set of fewer than 100 samples, making the reported performance less reliable. To ensure performance in real-world environments, we present CASTELLA, a large-scale manually annotated AMR dataset. CASTELLA consists of 1,009, 213, and 640 audio recordings for the training, validation, and test splits, respectively, 24 times larger than the previous dataset. We also establish a baseline model for AMR using CASTELLA. Our experiments demonstrate that a model fine-tuned on CASTELLA after pre-training on synthetic data outperformed a model trained solely on synthetic data by 10.4 points in Recall1@0.7. CASTELLA is publicly available at this https URL.
This letter describes how to improve the performance of cellular systems by combining non-equiprobable signaling with low-density parity check (LDPC) coding for an orthogonal frequency division multiplexing system. We focus on improving the performance of wireless transmission with rate $3/4$ LDPC codes employing $4$-QAM at the cell edge. We double the size of the $4$-QAM constellation by introducing a second shell of signal points, and we implement non-equiprobable signaling through a shaping code which selects the high-energy shell less frequently than the low-energy shell. We emphasize that the shaping code is also used to transmit information. We employ rate $1/2$ LDPC coding and select the rate of the shaping code to match that of rate $3/4$ LDPC coding using $4$-QAM. We describe how to combine coding and shaping by integrating shaping into the calculation of the log-likelihood ratios (LLRs) needed to decode the LDPC codes. We present simulation results for a representative Veh-A channel showing gains of $4$ dB at a bit error rate (BER) of $10^{-3}$. Our results suggest that it is better to increase rate at the cell edge by introducing shaping, rather than by increasing the rate of the LDPC code.
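The key integration step is that the shaping prior enters the LLRs directly: each constellation point is weighted by its non-uniform probability before marginalizing over bit labels. A schematic numpy version for a two-shell constellation follows; the labeling, prior values, and channel numbers are illustrative, not the letter's exact parameters.

```python
# Schematic LLR computation with non-equiprobable constellation points.
import numpy as np

def llrs(y, h, points, bits, priors, noise_var):
    # points: (M,) complex constellation; bits: (M, nbits) labels;
    # priors: (M,) point probabilities induced by the shaping code.
    w = priors * np.exp(-np.abs(y - h * points) ** 2 / noise_var)  # P(x) * p(y|x)
    out = []
    for b in range(bits.shape[1]):
        out.append(np.log(w[bits[:, b] == 0].sum() / w[bits[:, b] == 1].sum()))
    return np.array(out)

# Two-shell example: the low-energy inner 4-QAM shell is used more often.
inner = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
points = np.concatenate([inner, 2 * inner])
bits = np.array([[s >> k & 1 for k in (2, 1, 0)] for s in range(8)])
priors = np.concatenate([np.full(4, 0.2), np.full(4, 0.05)])  # sums to 1
print(llrs(0.9 + 0.9j, 1.0, points, bits, priors, noise_var=0.5))
```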
This paper presents preliminary results toward a conceptual foundation for Next Generation Grid Codes (NGGCs) based on decentralized stability and performance certification for dynamic ancillary services. The proposed NGGC framework targets two core outcomes: (i) guaranteed closed-loop stability and (ii) explicit performance assurances for power-system frequency and voltage dynamics. Stability is addressed using loop-shifting and passivity-based methods that yield local frequency-domain certificates for individual devices, enabling fully decentralized verification of the interconnected system. Performance is characterized by deriving quantitative bounds on key time-domain metrics (e.g., nadirs, rate-of-change-of-frequency (RoCoF), steady-state deviations, and oscillation damping) through frequency-domain constraints on local device behavior. The framework is non-parametric and model-agnostic, accommodating a broad class of device dynamics under mild assumptions, and provides an initial unified approach to stability and performance certification without explicit device-model parameterization. As such, these results offer a principled starting point for the development of future grid codes and control design methodologies in modern power systems.
Vision Language Action (VLA) models promise an open-vocabulary interface that can translate perceptual ambiguity into semantically grounded driving decisions, yet they still treat language as a static prior fixed at inference time. As a result, the model must infer continuously shifting objectives from pixels alone, yielding delayed or overly conservative maneuvers. We argue that effective VLAs for autonomous driving need an online channel through which users can influence driving with specific intentions. To this end, we present EchoVLA, a user-aware VLA that couples camera streams with in situ audio instructions. We augment the nuScenes dataset with temporally aligned, intent-specific speech commands generated by converting ego-motion descriptions into synthetic audio. Further, we compose emotional speech-trajectory pairs into a multimodal Chain-of-Thought (CoT) for fine-tuning a Multimodal Large Model (MLM) based on Qwen2.5-Omni. Specifically, we synthesize the audio-augmented dataset with different emotion types paired with corresponding driving behaviors, leveraging the emotional cues embedded in tone, pitch, and speech tempo to reflect varying user states such as urgent or hesitant intentions. This enables EchoVLA to interpret not only the semantic content but also the emotional context of audio commands, yielding more nuanced and emotionally adaptive driving behavior. In open-loop benchmarks, our approach reduces the average L2 error by $59.4\%$ and the collision rate by $74.4\%$ compared to a vision-only perception baseline. Further experiments on the nuScenes dataset validate that EchoVLA not only steers the trajectory through audio instructions, but also modulates driving behavior in response to the emotions detected in the user's speech.
This paper studies the capability of a Reconfigurable Intelligent Surface (RIS), when transparently covering a User Equipment (UE), to deceive an adversary monostatic radar system. A compact RIS kernel model that explicitly links the radar's angular response to the RIS phase profile is introduced, enabling an analytical investigation of the Angle of Arrival (AoA) estimation accuracy with respect to the kernel's power. This model is also leveraged to formulate an RIS-based spoofing design with the dual objective of enforcing strict nulls around the UE's true reflection AoA and maximizing the channel gain towards a decoy direction. The RIS's deception capability is quantified using point-wise and angle-range robust criteria, and a configuration-independent placement score guiding decoy selection is proposed. Selected numerical results confirm deep nulls at the true reflection AoA together with a pronounced decoy peak, rendering maximum-likelihood sensing at the adversary radar unreliable.
Stabilizing a dynamical system is a fundamental problem that serves as a cornerstone for many complex tasks in the field of control systems. The problem becomes challenging when the system model is unknown. Among the Reinforcement Learning (RL) algorithms that have been successfully applied to problems involving unknown linear dynamical systems, the policy gradient (PG) method stands out for its ease of implementation and its ability to solve the problem in a model-free manner. However, most existing work on PG methods for unknown linear dynamical systems assumes full-state feedback. In this paper, we take a step towards model-free learning for partially observable linear dynamical systems with output feedback and focus on the fundamental stabilization problem. We propose an algorithmic framework that stretches the boundary of PG methods to this problem, which lacks global convergence guarantees. We show that by leveraging zeroth-order PG updates based on system trajectories and their convergence to stationary points, the proposed algorithms return a stabilizing output feedback policy for discrete-time linear dynamical systems. We also explicitly characterize the sample complexity of our algorithm and verify its effectiveness using numerical examples.
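The zeroth-order ingredient is a standard two-point gradient estimate from trajectory costs. A minimal sketch for an output-feedback gain K, where `rollout_cost` (an assumed name) returns an empirical cost from simulated or measured trajectories:

```python
# Two-point zeroth-order policy-gradient step; generic estimator, not the
# paper's exact scheme.
import numpy as np

def zo_pg_step(K, rollout_cost, lr=1e-3, delta=1e-2):
    U = np.random.randn(*K.shape)
    U /= np.linalg.norm(U)                      # random direction on the sphere
    # Finite-difference estimate of the gradient along direction U.
    g = (rollout_cost(K + delta * U) - rollout_cost(K - delta * U)) / (2 * delta) * U
    return K - lr * g
```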
Speaker diarization aims to segment audio recordings into regions corresponding to individual speakers. Although unsupervised speaker diarization is inherently challenging, the prospect of identifying speaker regions without pretraining or weak supervision motivates research on clustering techniques. In this work, we share the notable observation that measuring multiple kernel similarities between speaker embeddings and then crafting a sparse graph for spectral clustering in a principled manner is sufficient to achieve state-of-the-art performance in a fully unsupervised setting. Specifically, we consider four polynomial kernels and a degree-one arccosine kernel to measure similarities between speaker embeddings, from which sparse graphs are constructed in a principled manner to emphasize local similarities. Experiments show the proposed approach excels at unsupervised speaker diarization across a variety of challenging environments in the DIHARD-III, AMI, and VoxConverse corpora. To encourage further research, our implementations are available at this https URL.
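A compact sketch of the recipe as described: compute several kernel similarities on length-normalized embeddings, sparsify to each segment's strongest neighbours, and run spectral clustering. The kernels follow the abstract; the uniform averaging and kNN sparsification details are assumptions.

```python
# Hedged sketch of multi-kernel sparse-graph spectral clustering for diarization.
import numpy as np
from sklearn.cluster import SpectralClustering

def diarize(emb, n_speakers, degrees=(1, 2, 3, 4), knn=20):
    X = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    cos = np.clip(X @ X.T, -1.0, 1.0)
    kernels = [cos ** d for d in degrees]                  # polynomial kernels of the cosine
    theta = np.arccos(cos)
    kernels.append((np.sin(theta) + (np.pi - theta) * cos) / np.pi)  # degree-one arccosine kernel
    A = np.clip(np.mean(kernels, axis=0), 0.0, None)       # non-negative averaged affinity
    drop = np.argsort(A, axis=1)[:, :-(knn + 1)]           # all but each row's top-knn entries (+self)
    np.put_along_axis(A, drop, 0.0, axis=1)
    A = np.maximum(A, A.T)                                 # symmetric sparse graph
    return SpectralClustering(n_clusters=n_speakers,
                              affinity="precomputed").fit_predict(A)
```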
This paper presents an end-to-end deep learning framework for electromagnetically reconfigurable antenna (ERA)-aided user equipment (UE) localization with active sensing, where ERAs provide additional electromagnetic reconfigurability to diversify the received measurements and enhance localization informativeness. To balance sensing flexibility and overhead, we adopt a two-timescale design: the digital combiner is updated at each stage, while the ERA patterns are reconfigured at each substage via a spherical-harmonic representation. The proposed mechanism integrates attention-based feature extraction and LSTM-based temporal learning, enabling the system to learn an optimized sensing strategy and progressively refine the UE position estimate from sequential observations. Simulation results show that the proposed approach consistently outperforms conventional digital beamforming-only and single-stage sensing baselines in terms of localization accuracy. These results highlight the effectiveness of ERA-enabled active sensing for user localization in future wireless systems.
Recent advances in reconstructing speech envelopes from Electroencephalogram (EEG) signals have enabled continuous auditory attention decoding (AAD) in multi-speaker environments. Most Deep Neural Network (DNN)-based envelope reconstruction models are trained to maximize the Pearson correlation coefficients (PCC) between the attended envelope and the reconstructed envelope (attended PCC). While the difference between the attended PCC and the unattended PCC plays an essential role in auditory attention decoding, existing methods often focus on maximizing the attended PCC. We therefore propose a contrastive PCC loss which represents the difference between the attended PCC and the unattended PCC. The proposed approach is evaluated on three public EEG AAD datasets using four DNN architectures. Across many settings, the proposed objective improves envelope separability and AAD accuracy, while also revealing dataset- and architecture-dependent failure cases.
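The proposed objective is simple to state. A minimal sketch follows (function names illustrative); standard training would use only the first term of the contrastive loss.

```python
# Contrastive PCC loss: maximize the gap between attended and unattended
# correlations of the reconstructed envelope.
import torch

def pcc(x, y, eps=1e-8):
    x = x - x.mean(dim=-1, keepdim=True)
    y = y - y.mean(dim=-1, keepdim=True)
    return (x * y).sum(-1) / (x.norm(dim=-1) * y.norm(dim=-1) + eps)

def contrastive_pcc_loss(recon, attended, unattended):
    # Pull the reconstruction toward the attended envelope and away from
    # the unattended one, widening the decision margin for AAD.
    return -(pcc(recon, attended) - pcc(recon, unattended)).mean()
```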
Training learned image compression (LIC) models entails navigating a challenging optimization landscape defined by the fundamental trade-off between rate and distortion. Standard first-order optimizers, such as SGD and Adam, struggle with \emph{gradient conflicts} arising from competing objectives, leading to slow convergence and suboptimal rate-distortion performance. In this work, we demonstrate that simply adopting a second-order quasi-Newton optimizer, \textbf{SOAP}, dramatically improves both training efficiency and final performance across diverse LICs. Our theoretical and empirical analyses reveal that Newton preconditioning inherently resolves the intra-step and inter-step update conflicts intrinsic to the R-D objective, facilitating faster, more stable convergence. Beyond acceleration, we uncover a critical deployability benefit: second-order trained models exhibit significantly fewer activation and latent outliers, which substantially enhances robustness to post-training quantization. Together, these results establish second-order optimization, achievable as a seamless drop-in replacement for the existing optimizer, as a powerful, practical tool for advancing the efficiency and real-world readiness of LICs.
Despite the rapid development of AI models in medical image analysis, their validation in real-world clinical settings remains limited. To address this, we introduce a generic framework designed for deploying image-based AI models in such settings. Using this framework, we deployed a trained model for fetal ultrasound standard plane detection, and evaluated it in real-time sessions with both novice and expert users. Feedback from these sessions revealed that while the model offers potential benefits to medical practitioners, the need for navigational guidance was identified as a key area for improvement. These findings underscore the importance of early deployment of AI models in real-world settings, leading to insights that can guide the refinement of the model and system based on actual user feedback.
Emerging universal Computational Aberration Correction (CAC) paradigms provide an inspiring solution to lightweight, high-quality imaging: a universal model trained on a lens library (LensLib) blindly corrects the optical aberrations of arbitrary lenses. However, the limited coverage of existing LensLibs leads to poor generalization of the trained models to unseen lenses, and existing fine-tuning pipelines are confined to the case where lens descriptions are known. In this work, we introduce OmniLens, a flexible solution to universal CAC via (i) establishing a convincing LensLib with comprehensive coverage for pre-training a robust base model, and (ii) adapting the model to any specific lens design with unknown lens descriptions via fast LensLib-to-specific domain adaptation. To achieve these, an Evolution-based Automatic Optical Design (EAOD) pipeline is proposed to generate a rich variety of lens samples with realistic aberration behaviors. We then design an unsupervised regularization term for efficient domain adaptation on a few easily accessible real-captured images, based on the statistical observation of dark channel priors in degradation induced by lens aberrations. Extensive experiments demonstrate that the LensLib generated by EAOD yields a universal CAC model with strong generalization capabilities, which can also improve non-blind lens-specific methods by 0.35-1.81 dB in PSNR. Additionally, the proposed domain adaptation method significantly improves the base model, especially in severe aberration cases (by up to 2.59 dB in PSNR). The code and data will be available at this https URL.
Markov chain Monte Carlo (MCMC) methods are a powerful but computationally expensive way of performing non-parametric Bayesian inference. MCMC proposals which utilise gradients, such as Hamiltonian Monte Carlo (HMC), can better explore the parameter space of interest if the additional hyper-parameters are chosen well. The No-U-Turn Sampler (NUTS) is a variant of HMC which is extremely effective at selecting these hyper-parameters but is slow to run and is not suited to GPU architectures. An alternative to NUTS, Change in the Estimator of the Expected Square HMC (ChEES-HMC), was shown not only to run faster than NUTS on GPU but also to sample from posteriors more efficiently. Sequential Monte Carlo (SMC) samplers are another sampling method which instead output weighted samples from the posterior. They are very amenable to parallelisation, and therefore to being run on GPUs, while having additional flexibility in their choice of proposal over MCMC. We incorporate ChEES-HMC as a proposal into SMC samplers and demonstrate competitive but faster performance than NUTS on a number of tasks.
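The structure of an SMC sampler makes the "pluggable proposal" point concrete: reweight, resample, then apply one MCMC move per particle targeting the current tempered density. In the self-contained toy below the move is a random walk; the paper substitutes a ChEES-HMC step at the marked line. The tempering schedule and Gaussian prior/target are illustrative.

```python
# Skeleton of a tempered SMC sampler with a pluggable MCMC move.
import numpy as np

def log_prior(x):   return -0.5 * np.sum(x ** 2, axis=-1) / 9.0   # N(0, 9I)
def log_target(x):  return -0.5 * np.sum(x ** 2, axis=-1)         # N(0, I)

def smc(n=1000, d=2, n_temps=20, step=0.5, rng=np.random.default_rng(0)):
    betas = np.linspace(0.0, 1.0, n_temps)
    x = rng.normal(size=(n, d)) * 3.0                              # samples from the prior
    for b0, b1 in zip(betas[:-1], betas[1:]):
        logw = (b1 - b0) * (log_target(x) - log_prior(x))          # incremental weights
        w = np.exp(logw - logw.max()); w /= w.sum()
        x = x[rng.choice(n, size=n, p=w)]                          # multinomial resampling
        # One MCMC move per particle targeting the tempered density
        # (this is where ChEES-HMC would be plugged in).
        prop = x + step * rng.normal(size=x.shape)
        logr = ((1 - b1) * (log_prior(prop) - log_prior(x))
                + b1 * (log_target(prop) - log_target(x)))
        acc = np.log(rng.uniform(size=n)) < logr
        x[acc] = prop[acc]
    return x
```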
Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals: text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices, marking a step toward fully multimodal, emotionally aware digital humans.
Soft everting robots present significant advantages over traditional rigid robots, including enhanced dexterity, improved environmental interaction, and safe navigation in unpredictable environments. While soft everting robots have been widely demonstrated for exploration-type tasks, their potential to move and deploy payloads in such tasks has been less investigated, with previous work focusing on sensors and tools for the robot. Leveraging the navigation capabilities, and deployed body, of the soft everting robot to deliver payloads in hazardous areas, e.g., carrying a water bottle to a person stuck under debris, would represent a significant capability in many applications. In this work, we present an analysis of how soft everting robots can be used to deploy larger, heavier payloads through the inside of the robot. We analyze both what objects can be deployed and what terrain features they can be carried through. Building on existing models, we present methods to quantify the effects of payloads on robot growth and self-support, and develop a model to predict payload slip. We then experimentally quantify payload transport using a soft everting robot with a variety of payload shapes, sizes, and weights, through a series of tasks: steering, vertical transport, movement through holes, and movement across gaps. Overall, the results show that we can transport payloads of various shapes and up to 1.5 kg in weight, move through circular apertures with as little as 0.01 cm of clearance around payloads, carry out discrete turns of up to 135 degrees, and move across unsupported gaps of 1.15 m in length.
This paper studies the adversarial robustness of conformal novelty detection. In particular, we focus on two powerful learning-based frameworks that come with finite-sample false discovery rate (FDR) control: one is AdaDetect (by Marandon et al., 2024) that is based on the positive-unlabeled classifier, and the other is a one-class classifier-based approach (by Bates et al., 2023). While they provide rigorous statistical guarantees under benign conditions, their behavior under adversarial perturbations remains underexplored. We first formulate an oracle attack setup, under the AdaDetect formulation, that quantifies the worst-case degradation of FDR, deriving an upper bound that characterizes the statistical cost of attacks. This idealized formulation directly motivates a practical and effective attack scheme that only requires query access to the output labels of both frameworks. Coupling these formulations with two popular and complementary black-box adversarial algorithms, we systematically evaluate the vulnerability of both frameworks on synthetic and real-world datasets. Our results show that adversarial perturbations can significantly increase the FDR while maintaining high detection power, exposing fundamental limitations of current error-controlled novelty detection methods and motivating the development of more robust alternatives.
Real-world deployment often exposes models to distribution shifts, making test-time adaptation (TTA) critical for robustness. Yet most TTA methods are unfriendly to edge deployment, as they rely on backpropagation, activation buffering, or test-time mini-batches, leading to high latency and memory overhead. We propose $\textbf{ELaTTA}$ ($\textit{Efficient Latent Test-Time Adaptation}$), a gradient-free framework for single-instance TTA under strict on-device constraints. ELaTTA freezes model weights and adapts each test sample by optimizing a low-dimensional coefficient vector in a source-induced principal latent subspace, pre-computed offline via truncated SVD and stored with negligible overhead. At inference, ELaTTA encourages prediction confidence by optimizing the $k$-D coefficients with CMA-ES, effectively optimizing a Gaussian-smoothed objective and improving stability near decision boundaries. Across six benchmarks and multiple architectures, ELaTTA achieves state-of-the-art accuracy under both strict and continual single-instance protocols, while reducing compute by up to $63\times$ and peak memory by $11\times$. We further demonstrate on-device deployment on a ZYNQ-7020 platform. Code will be released upon acceptance.
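A sketch of the two phases under stated assumptions: the subspace U comes from an offline truncated SVD of source features, and at test time a gradient-free search over the k coefficients minimizes prediction entropy. A plain greedy Gaussian evolution strategy stands in for CMA-ES here to keep the sketch dependency-free; `predict_fn` is an assumed name for the frozen classifier head.

```python
# Hedged sketch of latent test-time adaptation with a gradient-free search.
import numpy as np

def adapt_latent(z, U, predict_fn, iters=30, pop=16, sigma=0.1):
    # z: (D,) frozen-encoder feature; U: (D, k) precomputed subspace basis.
    c = np.zeros(U.shape[1])
    def entropy(c_):
        p = predict_fn(z + U @ c_)                 # class probabilities on shifted latent
        return -(p * np.log(p + 1e-12)).sum()      # confidence objective
    for _ in range(iters):
        cand = c + sigma * np.random.randn(pop, U.shape[1])
        scores = np.array([entropy(ci) for ci in cand])
        c = cand[scores.argmin()]                  # greedy evolution-strategy step
    return z + U @ c
```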
Text-to-audio (TTA) generation is advancing rapidly, but evaluation remains challenging because human listening studies are expensive and existing automatic metrics capture only limited aspects of perceptual quality. We introduce AudioEval, a large-scale TTA evaluation dataset with 4,200 generated audio samples (11.7 hours) from 24 systems and 126,000 ratings collected from both experts and non-experts across five dimensions: enjoyment, usefulness, complexity, quality, and text alignment. Using AudioEval, we benchmark diverse automatic evaluators to compare perspective- and dimension-level differences across model families. We also propose Qwen-DisQA as a strong reference baseline: it jointly processes prompts and generated audio to predict multi-dimensional ratings for both annotator groups, modeling rater disagreement via distributional prediction and achieving strong performance. We will release AudioEval to support future research in TTA evaluation.
The ability to compose acquired skills to plan and execute behaviors is a hallmark of natural intelligence. Yet, despite remarkable cross-disciplinary efforts, a principled account of how task structure shapes gating, and of how such computations could be delivered in neural circuits, remains elusive. Here we introduce GateMod, an interpretable, theoretically grounded computational model linking the emergence of gating to the underlying decision-making task and to a neural circuit architecture. We first develop GateFrame, a normative framework casting policy gating as the minimization of free energy. This framework, relating gating rules to the task, applies broadly across neuroscience and the cognitive and computational sciences. We then derive GateFlow, a continuous-time energy-based dynamics that provably converges to the optimal solution of GateFrame. Convergence, exponential and global, follows from a contractivity property that also yields robustness and other desirable properties. Finally, we derive a neural circuit from GateFlow, GateNet. This is a soft-competitive recurrent circuit whose components perform local and contextual computations consistent with known dendritic and neural processing motifs. We evaluate GateMod across two different settings: collective behaviors in multi-agent systems and human decision-making in multi-armed bandits. In all settings, GateMod provides interpretable mechanistic explanations of gating and quantitatively matches or outperforms established models. GateMod offers a unifying framework for neural policy gating, linking task objectives, dynamical computation, and circuit-level mechanisms, and it provides a basis for understanding gating in natural agents beyond current explanations and for equipping machines with this ability.
Expectation Propagation (EP) is a widely used message-passing algorithm that decomposes a global inference problem into multiple local ones. It approximates marginal distributions (beliefs) using intermediate functions (messages). While beliefs must be proper probability distributions that integrate to one, messages may have infinite integral values. In Gaussian-projected EP, such messages take a Gaussian form and appear as if they have "negative" variances. Although allowed within the EP framework, these negative-variance messages can impede algorithmic progress. In this paper, we investigate EP in linear models and analyze the relationship between the corresponding beliefs. Based on the analysis, we propose both non-persistent and persistent approaches that prevent the algorithm from being blocked by messages with infinite integral values. Furthermore, by examining the relationship between the EP messages in linear models, we develop an additional approach that avoids the occurrence of messages with infinite integral values.
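A three-line numeric example of the phenomenon: in natural parameters, the message is the belief divided by the cavity, so precisions subtract; when the cavity is sharper than the belief, the message's precision, and hence its "variance", goes negative. The values below are arbitrary.

```python
# Gaussian division in natural parameters: message = belief / cavity.
belief_prec = 1.0 / 2.0     # belief variance 2.0
cavity_prec = 1.0 / 1.5     # cavity variance 1.5 (sharper than the belief)
msg_prec = belief_prec - cavity_prec
print(msg_prec, 1.0 / msg_prec)   # -0.1667, variance -6.0: improper yet allowed in EP
```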
In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored: only a small subset of FAEs has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in FAE and MSA.
This paper presents a large language model (LLM)-based framework that adapts and fine-tunes compact LLMs for detecting cyberattacks on transformer current differential relays (TCDRs), which can otherwise cause false tripping of critical power transformers. The core idea is to textualize multivariate time-series current measurements from TCDRs, across phases and input/output sides, into structured natural-language prompts that are then processed by compact, locally deployable LLMs. Using this representation, we fine-tune DistilBERT, GPT-2, and DistilBERT+LoRA to distinguish cyberattacks from genuine fault-induced disturbances while preserving relay dependability. The proposed framework is evaluated against a broad set of state-of-the-art machine learning and deep learning baselines under nominal conditions, complex cyberattack scenarios, and measurement noise. Our results show that LLM-based detectors achieve competitive or superior cyberattack detection performance, with DistilBERT detecting up to 97.62% of attacks while maintaining perfect fault detection accuracy. Additional evaluations demonstrate robustness to prompt formulation variations, resilience under combined time-synchronization and false-data injection attacks, and stable performance under realistic measurement noise levels. The attention mechanisms of LLMs further enable intrinsic interpretability by highlighting the most influential time-phase regions of relay measurements. These results demonstrate that compact LLMs provide a practical, interpretable, and robust solution for enhancing cyberattack detection in modern digital substations. We provide the full dataset used in this study for reproducibility.
Quantum hardware suffers from intrinsic device heterogeneity and environmental drift, forcing practitioners to choose between suboptimal non-adaptive controllers or costly per-device recalibration. We derive a scaling law lower bound for meta-learning showing that the adaptation gain (expected fidelity improvement from task-specific gradient steps) saturates exponentially with gradient steps and scales linearly with task variance, providing a quantitative criterion for when adaptation justifies its overhead. Validation on quantum gate calibration shows negligible benefits for low-variance tasks but $>40\%$ fidelity gains on two-qubit gates under extreme out-of-distribution conditions (10$\times$ the training noise), with implications for reducing per-device calibration time on cloud quantum processors. Further validation on classical linear-quadratic control confirms these laws emerge from general optimization geometry rather than quantum-specific physics. Together, these results offer a transferable framework for decision-making in adaptive control.
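A schematic form consistent with the stated behavior (illustrative, not the paper's bound): with $n$ task-specific gradient steps and across-task variance $\sigma_{\mathrm{task}}^{2}$,

```latex
% Schematic adaptation-gain law: exponential saturation in gradient steps,
% linear scaling in task variance (constants c, kappa are placeholders).
\mathrm{Gain}(n) \;\approx\; c\,\sigma_{\mathrm{task}}^{2}\left(1 - e^{-\kappa n}\right),
\qquad c,\kappa > 0.
```

Under such a law the marginal value of additional adaptation steps decays exponentially, so adaptation pays off only when $\sigma_{\mathrm{task}}^{2}$ is large relative to its overhead, matching the reported low-variance versus out-of-distribution results.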