New articles on Electrical Engineering and Systems Science


[1] 2601.07956

Human as an Actuator Dynamic Model Identification

This paper presents a method for estimating parameters that form a general model for human pilot response for specific tasks. The human model is essential for the dynamic analysis of piloted vehicles. Data are generated on a simulator with multiple trials being incorporated to find the single model that best describes the data. The model is found entirely in the time domain by constructing a constrained optimization problem. This optimization problem implicitly represents the state of the underlying system, making it robust to natural variation in human responses. It is demonstrated by estimating the human response model for a position control task with a quadcopter drone.


[2] 2601.07969

Tuberculosis Screening from Cough Audio: Baseline Models, Clinical Variables, and Uncertainty Quantification

In this paper, we propose a standardized framework for automatic tuberculosis (TB) detection from cough audio and routinely collected clinical data using machine learning. While TB screening from audio has attracted growing interest, progress is difficult to measure because existing studies vary substantially in datasets, cohort definitions, feature representations, model families, validation protocols, and reported metrics. Consequently, reported gains are often not directly comparable, and it remains unclear whether improvements stem from modeling advances or from differences in data and evaluation. We address this gap by establishing a strong, well-documented baseline for TB prediction using cough recordings and accompanying clinical metadata from a recently compiled dataset from several countries. Our pipeline is reproducible end-to-end, covering feature extraction, multimodal fusion, cougher-independent evaluation, and uncertainty quantification, and it reports a consistent suite of clinically relevant metrics to enable fair comparison. We further quantify performance for cough audio-only and fused (audio + clinical metadata) models, and release the full experimental protocol to facilitate benchmarking. This baseline is intended to serve as a common reference point and to reduce methodological variance that currently holds back progress in the field.


[3] 2601.07976

Application of Ideal Observer for Thresholded Data in Search Task

This study advances task-based image quality assessment by developing an anthropomorphic thresholded visual-search model observer. The model is an ideal observer for thresholded data inspired by the human visual system, allowing selective processing of high-salience features to improve discrimination performance. By filtering out irrelevant variability, the model enhances diagnostic accuracy and computational efficiency. The observer employs a two-stage framework: candidate selection and decision-making. Using thresholded data during candidate selection refines regions of interest, while stage-specific feature processing optimizes performance. Simulations were conducted to evaluate the effects of thresholding on feature maps, candidate localization, and multi-feature scenarios. Results demonstrate that thresholding improves observer performance by excluding low-salience features, particularly in noisy environments. Intermediate thresholds often outperform no thresholding, indicating that retaining only relevant features is more effective than keeping all features. Additionally, the model demonstrates effective training with fewer images while maintaining alignment with human performance. These findings suggest that the proposed novel framework can predict human visual search performance in clinically realistic tasks and provide solutions for model observer training with limited resources. Our novel approach has applications in other areas where human visual search and detection tasks are modeled such as in computer vision, machine learning, defense and security image analysis.


[4] 2601.07997

Can Inherent Communication Noise Guarantee Privacy in Distributed Cooperative Control ?

This paper investigates privacy-preserving distributed cooperative control for multi-agent systems within the framework of differential privacy. In cooperative control, communication noise is inevitable and is usually regarded as a disturbance that impairs coordination. This work revisits such noise as a potential privacy-enhancing factor. A linear quadratic regulator (LQR)-based framework is proposed for agents communicating over noisy channels, \textcolor{black}{where the noise variance depends on the relative state differences between neighbouring agents.} The resulting controller achieves formation while protecting the reference signals from inference attacks. It is analytically proven that the inherent communication noise can guarantee bounded $(\epsilon,\delta)$-differential privacy without adding dedicated privacy noise, while the \textcolor{black}{system cooperative tracking error} remains bounded and convergent in both the mean-square and almost-sure sense.


[5] 2601.08050

Formalizing the Relationship between Hamilton-Jacobi Reachability and Reinforcement Learning

We unify Hamilton-Jacobi (HJ) reachability and Reinforcement Learning (RL) through a proposed running cost formulation. We prove that the resultant travel-cost value function is the unique bounded viscosity solution of a time-dependent Hamilton-Jacobi Bellman (HJB) Partial Differential Equation (PDE) with zero terminal data, whose negative sublevel set equals the strict backward-reachable tube. Using a forward reparameterization and a contraction inducing Bellman update, we show that fixed points of small-step RL value iteration converge to the viscosity solution of the forward discounted HJB. Experiments on a classical benchmark compare learned values to semi-Lagrangian HJB ground truth and quantify error.


[6] 2601.08060

DRL-based Power Allocation in LiDAL-Assisted RLNC-NOMA OWC Systems

Non-orthogonal multiple access (NOMA) is a promising technique for optical wireless communication (OWC), enabling multiple users to share the optical spectrum simultaneously through the power domain. However, the imperfection of channel state information (CSI) and residual errors in decoding process deteriorate the performance of NOMA, especially when multi-parameteric and realistic dense-user indoor scenarios are considered. In this work, we model a LiDAL-assisted RLNC-NOMA OWC system, where the light detection and localization (LiDAL) technique exploits spatio-temporal information to improve user CSI, while random linear network coding (RLNC) enhances data resilience in the NOMA successive decoding process. Power allocation (PA) is a crucial issue in communication systems, particularly in the modeled system, due to the complex interactions between multiple users and the coding and detection processes. However, optimizing continuous PA dynamically requires advanced techniques to avoid excessive computational complexity. Therefore, we adopt a deep reinforcement learning (DRL) framework to efficiently learn near-optimal power allocation strategies, enabling enhanced system performance. In particular, a DRL-based normalized advantage function (NAF) algorithm is proposed to maximize the average sum rate of the system, and its performance is analyzed and compared to other widely used DRL-based and conventional PA schemes, such as deep deterministic policy gradient (DDPG), gain ratio PA (GRPA), and exhaustive search.


[7] 2601.08113

Coordinated Cooling and Compute Management for AI Datacenters

The AI datacenters are currently being deployed on a large scale to support the training and deployment of power-intensive large-language models (LLMs). Extensive amount of computation and cooling required in datacenters increase concerns about the energy use and carbon emissions of AI datacenters. Although current state-of-the-art has examined the energy efficiency of LLM inference, most prior research focused on optimizing compute-side scheduling without considering thermal objectives or constraints. Since GPU-intensive inference generates substantial heat that can degrade datacenter performance, ignoring thermal effects can increase total energy consumption and reduce the efficiency of LLM serving. To fill this gap, we profile the characteristics of GPU servers under varying cooling and AI jobs, and develop a joint cooling and computing modeling approach for AI datacenters. Built upon such workload and thermal dynamics models, a novel hierarchical control framework is proposed to co-optimize computing and thermal management by identifying the optimal GPU parallelism, frequency (DVFS), and cooling control knobs. Using real Azure inference traces and detailed GPU profiling, our model balances serving latency and thermal constraints in AI datacenters while significantly improving AI datacenters' energy efficiency.


[8] 2601.08123

Modal Parameter Extraction via Propeller-Driven Vibration Testing

Ground Vibration Testing (GVT) supports aircraft certification but often requires lengthy and costly campaigns. Propeller-driven Vibration Testing (PVT) is assessed here as an output-only alternative, in line with Operational Modal Analysis approaches such as Taxi Vibration Testing and Flight Vibration Testing. A cantilever Aluminium 7075-T6 wing spar is instrumented with seven accelerometers and excited by an outboard electric motor and propeller. Seven runs are carried out: a motor-off baseline, five constant-throttle cases, and a manual up-down throttle sweep. The acquired spectra indicate that the dominant resonances remain observable under propeller excitation, while low-throttle conditions introduce narrowband harmonics that may mask structural peaks; the sweep reduces persistent overlap. Modal parameters are identified for the baseline and sweep cases using the Natural Excitation Technique with the Loewner Framework (NExT-LF). The first two modes remain closely matched (Modal Assurance Criterion (MAC) > 0.99), whereas the third mode shows reduced repeatability (MAC = 0.827) and a larger frequency shift, consistent with propeller-induced bending--torsion coupling and non-ideal sweep control. Overall, PVT provides a viable complement to GVT for extracting low-frequency modal information and motivates pursuing future work on automated throttle scheduling and coupling-aware test planning.


[9] 2601.08168

Memetic Covariance Matrix Adaptation Evolution Strategy for Bilinear Matrix Inequality Problems in Control System Design

Bilinear Matrix Inequalities (BMIs) are fundamental to control system design but are notoriously difficult to solve due to their nonconvexity. This study addresses BMI-based control optimization problems by adapting and integrating advanced evolutionary strategies. Specifically, a memetic Covariance Matrix Adaptation Evolution Strategy (memetic CMA-ES) is proposed, which incorporates a local refinement phase via a (1+1)-CMA-ES within the global search process. While these algorithmic components are established in evolutionary computing, their tailored integration and specific tuning for control design tasks represent a novel application in this context. Experimental evaluations on $H_{\infty}$ controller synthesis and spectral abscissa optimization demonstrate that the proposed method achieves superior performance compared to existing BMI solvers in terms of both solution quality and robustness. This work bridges the gap between evolutionary computation and control theory, providing a practical and effective approach to tackling challenging BMI-constrained problems.


[10] 2601.08177

Research on Mechanical Properties and Deformation-Fracture Energy Consumption Characteristics of Plateau Frozen Rocks

The exploitation of mineral resources in plateau regions is confronted with critical challenges including low blasting efficiency, excessive energy consumption,and compromised operational safety when dealing with low-temperature water-bearing frozen rock this http URL study systematically investigates the dynamic-static mechanical properties,deformation-fracture behaviors,and energy consumption characteristics of plateau frozen sandstone under the coupled effects of temperature and moisture content (5%-15%).The research methodology integrates field sampling, low-pressure low-temperature simulation tests, graded impact loading tests, and numerical inversion analysis. Results demonstrate that freezing significantly enhances the dynamic strength and brittleness of saturated this http URL pore structure undergoes substantial evolution with decreasing temperature, with the porosity increasing by 63.15%.Based on PFC3D microscopic simulations, the mechanism of frost heave damage and the regulatory effect of water-ice phase transition on rock mechanical behaviors are elucidated.A quantitative analysis method for energy dissipation is proposed, revealing that the energy absorption increment of frozen rocks is higher than that of room-temperature this http URL findings provide a theoretical basis and technical support for optimizing blasting parameters, realizing directional energy release,and promoting green construction of frozen rock masses in high-altitude areas.


[11] 2601.08214

Hybrid Centralized Distributed Control for Lifelong MAPF over Wireless Connections

In lifelong multi-agent path finding (MAPF) with many robots, unreliable wireless links and stochastic executions are the norm. Existing approaches typically either rely on centralized planning under idealized communication, or run fully distributed local controllers with fixed communication patterns; they rarely couple communication scheduling with policy learning, and thus struggle when bandwidth is scarce or packets are frequently dropped. We address this joint control--communication problem and propose a hybrid centralized--distributed scheme: a centralized cloud policy sends small residual corrections only when selected, while a lightweight on-board Gated recurrent unit (GRU) policy provides a safe default fallback when wireless connection is not available.


[12] 2601.08240

Temporal-Enhanced Interpretable Multi-Modal Prognosis and Risk Stratification Framework for Diabetic Retinopathy (TIMM-ProRS)

Diabetic retinopathy (DR), affecting millions globally with projections indicating a significant rise, poses a severe blindness risk and strains healthcare systems. Diagnostic complexity arises from visual symptom overlap with conditions like age-related macular degeneration and hypertensive retinopathy, exacerbated by high misdiagnosis rates in underserved regions. This study introduces TIMM-ProRS, a novel deep learning framework integrating Vision Transformer (ViT), Convolutional Neural Network (CNN), and Graph Neural Network (GNN) with multi-modal fusion. TIMM-ProRS uniquely leverages both retinal images and temporal biomarkers (HbA1c, retinal thickness) to capture multi-modal and temporal dynamics. Evaluated comprehensively across diverse datasets including APTOS 2019 (trained), Messidor-2, RFMiD, EyePACS, and Messidor-1 (validated), the model achieves 97.8\% accuracy and an F1-score of 0.96, demonstrating state-of-the-art performance and outperforming existing methods like RSG-Net and DeepDR. This approach enables early, precise, and interpretable diagnosis, supporting scalable telemedical management and enhancing global eye health sustainability.


[13] 2601.08300

Variable-Length Wideband CSI Feedback via Loewner Interpolation and Deep Learning

In this paper, we propose a variable-length wideband channel state information (CSI) feedback scheme for Frequency Division Duplex (FDD) massive multiple-input multipleoutput (MIMO) systems in U6G band (6425MHz-7125MHz). Existing compressive sensing (CS)-based and deep learning (DL)- based schemes preprocess the channel by truncating it in the angular-delay domain. However, the energy leakage effect caused by the Discrete Fourier Transform (DFT) basis will be more serious and leads to a bottleneck in recovery accuracy when applied to wideband channels such as those in U6G. To solve this problem, we introduce the Loewner Interpolation (LI) framework which generates a set of dynamic bases based on the current CSI matrix, enabling highly efficient compression in the frequency domain. Then, the LI basis is further compressed in the spatial domain through a neural network. To achieve a flexible trade-off between feedback overhead and recovery accuracy, we design a rateless auto-encoder trained with tail dropout and a multi-objective learning schedule, supporting variable-length feedback with a singular model. Meanwhile, the codewords are ranked by importance, ensuring that the base station (BS) can still maintain acceptable reconstruction performance under limited feedback with tail erasures. Furthermore, an adaptive quantization strategy is developed for the feedback framework to enhance robustness. Simulation results demonstrate that the proposed scheme could achieve higher CSI feedback accuracy with less or equal feedback overhead, and improve spectral efficiency compared with baseline schemes.


[14] 2601.08307

Meta-Backscatter: Long-Distance Battery-Free Metamaterial-Backscatter Sensing and Communication

Battery-free Internet of Things (BF-IoT) enabled by backscatter communication is a rapidly evolving technology offering advantages of low cost, ultra-low power consumption, and robustness. However, the practical deployment of BF-IoT is significantly constrained by the limited communication range of common backscatter tags, which typically operate with a range of merely a few meters due to inherent round-trip path loss. Meta-backscatter systems that utilize metamaterial tags present a promising solution, retaining the inherent advantages of BF-IoT while breaking the critical communication range barrier. By leveraging densely paved sub-wavelength units to concentrate the reflected signal power, metamaterial tags enable a significant communication range extension over existing BF-IoT tags that employ omni-directional antennas. In this paper, we synthesize the principles and paradigms of metamaterial sensing to establish a unified design framework and a forward-looking research roadmap. Specifically, we first provide an overview of backscatter communication, encompassing its development history, working principles, and tag classification. We then introduce the design methodology for both metamaterial tags and their compatible transceivers. Moreover, we present the implementation of a meta-backscatter system prototype and report the experimental results based on it. Finally, we conclude by highlighting key challenges and outlining potential avenues for future research.


[15] 2601.08335

On Robust Fixed-Time Stabilization of the Cauchy Problem in Hilbert Spaces

This paper presents finite-time and fixed-time stabilization results for inhomogeneous abstract evolution problems, extending existing theories. We prove well-posedness for strong and weak solutions, and estimate upper bounds for settling times for both homogeneous and inhomogeneous systems. We generalize finite-dimensional results to infinite-dimensional systems and demonstrate partial state stabilization with actuation on a subset of the domain. The interest of these results are illustrated through an application of a heat equation with memory term.


[16] 2601.08338

Minimal Actuator Selection

Selecting a few available actuators to ensure the controllability of a linear system is a fundamental problem in control theory. Previous works either focus on optimal performance, simplifying the controllability issue, or make the system controllable under structural assumptions, such as in graphs or when the input matrix is a design parameter. We generalize these approaches to offer a precise characterization of the general minimal actuator selection problem where a set of actuators is given, described by a fixed input matrix, and goal is to choose the fewest actuators that make the system controllable. We show that this problem can be equivalently cast as an integer linear program and, if actuation channels are sufficiently independent, as a set multicover problem under multiplicity constraints. The latter equivalence is always true if the state matrix has all distinct eigenvalues, in which case it simplifies to the set cover problem. Such characterizations hold even when a robust selection that tolerates a given number of faulty actuators is desired. Our established connection legitimates a designer to use algorithms from the rich literature on the set multicover problem to select the smallest subset of actuators, including exact solutions that do not require brute-force search.


[17] 2601.08339

Blockchain-Enabled Renewable Energy Certificate Trading: A Secure and Privacy-Preserving Approach

In the 21st century, transitioning to renewable energy sources is imperative, with fossil fuel reserves depleting rapidly and recognizing critical environmental issues such as climate change, air pollution, water pollution, and habitat destruction. Embracing renewable energy is not only an environmental necessity but also a strategic move with multiple benefits. By shifting to renewable energy sources and supporting their production through the acquisition of renewable energy certificates, we foster innovation and drive economic growth in the renewable energy sector. This, in turn, reduces greenhouse gas emissions, aligning with global efforts to mitigate climate change. Additionally, renewable energy certificates ensure compliance with regulations that mandate the use of renewable energy, enhancing legal adherence while promoting transparency and trust in energy sourcing. To monitor the uptake of renewable energy, governments have implemented Renewable Energy Certificates (RECs) as a tracking mechanism for the production and consumption of renewable energy. However, there are two main challenges to the existing REC schema: 1) The RECs have not been globally adopted due to inconsistent design; 2) The consumer privacy has not been well incorporated in the design of blockchain. In this study, we investigate the trading of RECs between suppliers and consumers using the directed acyclic graph (DAG) blockchain system and introduce a trading schema to help protect consumer information. Our results demonstrate lower transaction time by 41\% and energy consumption by 65\% compared to proof-of-stake.


[18] 2601.08372

Data-Driven Time-Limited h2 Optimal Model Reduction for Linear Discrete-Time Systems

This paper develops a data-driven h2 model reduction method for discrete-time linear time-invariant systems. Specifically, we solve the h2 model reduction problem defined over a finite horizon using only impulse response data. Furthermore, we show that the proposed data-driven algorithm converges to a stationary point under certain assumptions. Numerical experiments demonstrate that the proposed method constructs a good reduced-order model in terms of the h2 norm defined over the finite horizon using a SLICOT benchmark (the CD player model).


[19] 2601.08428

Bio-RV: Low-Power Resource-Efficient RISC-V Processor for Biomedical Applications

This work presents Bio-RV, a compact and resource-efficient RISC-V processor intended for biomedical control applications, such as accelerator-based biomedical SoCs and implantable pacemaker systems. The proposed Bio-RV is a multi-cycle RV32I core that provides explicit execution control and external instruction loading with capabilities that enable controlled firmware deployment, ASIC bring-up, and post-silicon testing. In addition to coordinating accelerator configuration and data transmission in heterogeneous systems, Bio-RV is designed to function as a lightweight host controller, handling interfaces with pacing, sensing, electrogram (EGM), telemetry, and battery management modules. With 708 LUTs and 235 flip-flops on FPGA prototypes, Bio-RV, implemented in a 180 nm CMOS technology, operate at 50 MHz and feature a compact hardware footprint. According to post-layout results, the proposed architectural decisions align with minimal energy use. Ultimately, Bio-RV prioritises deterministic execution, minimal hardware complexity, and integration flexibility over peak computing speed to meet the demands of ultra-low-power, safety-critical biomedical systems.


[20] 2601.08436

Effective outdoor pathloss prediction: A multi-layer segmentation approach with weighting map

Predicting pathloss by considering the physical environment is crucial for effective wireless network planning. Traditional methods, such as ray tracing and model-based approaches, often face challenges due to high computational complexity and discrepancies between models and real-world environments. In contrast, deep learning has emerged as a promising alternative, offering accurate path loss predictions with reduced computational complexity. In our research, we introduce a ResNet-based model designed to enhance path loss prediction. We employ innovative techniques to capture key features of the environment by generating transmission (Tx) and reception (Rx) depth maps, as well as a distance map from the geographic data. Recognizing the significant attenuation caused by signal reflection and diffraction, particularly at high frequencies, we have developed a weighting map that emphasizes the areas adjacent to the direct path between Tx and Rx for path loss prediction. {Extensive simulations demonstrate that our model outperforms PPNet, RPNet, and Vision Transformer (ViT) by 1.2-3.0 dB using dataset of ITU challenge 2024 and ICASSP 2023. In addition, the floating point operations (FLOPs) of the proposed model is 60\% less than those of benchmarks.} Additionally, ablation studies confirm that the inclusion of the weighting map significantly enhances prediction performance.


[21] 2601.08445

Multiobjective Model Predictive Control for Residential Demand Response Management Under Uncertainty

Residential users in demand response programs must balance electricity costs and user dissatisfaction under real-time pricing. This study proposes a multiobjective model predictive control approach for home energy management systems with battery storage, aiming to minimize both objectives while mitigating uncertainties. Laguerre functions parameterize control signals, transforming the optimization problem into one with linear inequalities for efficient exploration. A constrained multiobjective evolutionary algorithm, incorporating convex sampler-based crossover and mutation, is developed to ensure feasible solutions. Simulations show that the proposed method outperforms existing approaches, limiting cost increases to 0.52\% under uncertainties, compared to at least 2.3\% with other methods.


[22] 2601.08459

Current and temperature imbalances in parallel-connected grid storage battery modules

A key challenge with large battery systems is heterogeneous currents and temperatures in modules with parallel-connected cells. Although extreme currents and temperatures are detrimental to the performance and lifetime of battery cells, there is not a consensus on the scale of typical imbalances within grid storage modules. Here, we quantify these imbalances through simulations and experiments on an industrially representative grid storage battery module consisting of prismatic lithium iron phosphate cells, elucidating the evolution of current and temperature imbalances and their dependence on individual cell and module parameter variations. Using a sensitivity analysis, we find that varying contact resistances and cell resistances contribute strongly to temperature differences between cells, from which we define safety thresholds on cell-to-cell variability. Finally, we investigate how these thresholds change for different applications, to outline a set of robustness metrics that show how cycling at lower C-rates and narrower SOC ranges can mitigate failures.


[23] 2601.08463

SDP: A Unified Protocol and Benchmarking Framework for Reproducible Wireless Sensing

Learning-based wireless sensing has made rapid progress, yet the field still lacks a unified and reproducible experimental foundation. Unlike computer vision, wireless sensing relies on hardware-dependent channel measurements whose representations, preprocessing pipelines, and evaluation protocols vary significantly across devices and datasets, hindering fair comparison and reproducibility. This paper proposes the Sensing Data Protocol (SDP), a protocol-level abstraction and unified benchmark for scalable wireless sensing. SDP acts as a standardization layer that decouples learning tasks from hardware heterogeneity. To this end, SDP enforces deterministic physical-layer sanitization, canonical tensor construction, and standardized training and evaluation procedures, decoupling learning performance from hardware-specific artifacts. Rather than introducing task-specific models, SDP establishes a principled protocol foundation for fair evaluation across diverse sensing tasks and platforms. Extensive experiments demonstrate that SDP achieves competitive accuracy while substantially improving stability, reducing inter-seed performance variance by orders of magnitude on complex activity recognition tasks. A real-world experiment using commercial off-the-shelf Wi-Fi hardware further illustrating the protocol's interoperability across heterogeneous hardware. By providing a unified protocol and benchmark, SDP enables reproducible and comparable wireless sensing research and supports the transition from ad hoc experimentation toward reliable engineering practice.


[24] 2601.08480

Quantitative Analysis of Proxy Tasks for Anomalous Sound Detection

Anomalous sound detection (ASD) typically involves self-supervised proxy tasks to learn feature representations from normal sound data, owing to the scarcity of anomalous samples. In ASD research, proxy tasks such as AutoEncoders operate under the explicit assumption that models trained on normal data will increase the reconstruction errors related to anomalies. A natural extension suggests that improved proxy task performance should improve ASD capability; however, this relationship has received little systematic attention. This study addresses this research gap by quantitatively analyzing the relationship between proxy task metrics and ASD performance across five configurations, namely, AutoEncoders, classification, source separation, contrastive learning, and pre-trained models. We evaluate the learned representations using linear probe (linear separability) and Mahalanobis distance (distributional compactness). Our experiments reveal that strong proxy performance does not necessarily improve anomalous sound detection performance. Specifically, classification tasks experience performance saturation owing to insufficient task difficulty, whereas contrastive learning fails to learn meaningful features owing to limited data diversity. Notably, source separation is the only task demonstrating a strong positive correlation, such that improved separation consistently improves anomaly detection. Based on these findings, we highlight the critical importance of task difficulty and objective alignment. Finally, we propose a three-stage alignment verification protocol to guide the design of highly effective proxy tasks for ASD systems.


[25] 2601.08483

Drone Surveillance via Coordinated Beam Sweeping in MIMO-ISAC Networks

This paper introduces a scheme for drone surveillance coordinated with the fifth generation (5G) synchronization signal block (SSB) cell-search procedure to simultaneously detect low-altitude drones within a volumetric surveillance grid. Herein, we consider a multistatic configuration where multiple access points (APs) collaboratively illuminate the volume while independently transmitting SSB broadcast signals. Both tasks are performed through a beam sweeping. In the proposed scheme, coordinated APs send sensing beams toward a grid of voxels within the volumetric surveillance region simultaneously with the 5G SSB burst. To prevent interference between communication and sensing signals, we propose a precoder design that guarantees orthogonality of the sensing beam and the SSB in order to maximize the sensing signal-to-interference-plus-noise ratio (SINR) while ensuring a specified SINR for users, as well as minimizing the impact of the direct link. The results demonstrate that the proposed precoder outperforms the non-coordinated precoder and is minimally affected by variations in drone altitude.


[26] 2601.08488

Disturbance observer-based tracking control for roll-to-roll slot die coating systems under gap and pump rate disturbances

Roll-to-roll slot die coating is a widely used industrial manufacturing technique applied in a diverse range of applications such as the production of lithium-ion batteries, solar cells and optical films. The efficiency of roll-to-roll slot die coating depends on the precise control of various input parameters such as pump rate, substrate velocity and coating gap. However, these inputs are sensitive to disturbances in process conditions, leading to inconsistencies in the various characteristics of the produced film. To address this challenge, a \gls{DO} is utilized for detecting disturbances, which may occur in the same or different channels as the control signal within the system. A generalized compensator is then implemented to mitigate the impact of these disturbances on the output, thereby enhancing uncertainty suppression. Additionally, integrating the disturbance rejection system with an output tracking controller enables the coating system to maintain the desired thickness under varying input conditions and disturbances. The effectiveness of this approach is then validated using a test rig equipped with a camera system, which facilitates the development of a data-driven model of the dynamic process, represented by state-space equations. The simulation results were demonstrated to showcase the effectiveness of the DOBOTC system, which provides a resilient solution for the output tracking issue in a data-driven model with generalized disturbances.


[27] 2601.08518

Improving the GMAW process through current control

A control strategy for the electrical current in GMAW processes is proposed. The control is in closed-loop, designed by formal methods, based on a mathematical model of the electrical behavior of the GMAW process, and implemented in C+ language in a microcontroller. The model consists of a switched equivalent electrical circuit whose parameters are obtained in a data-driven manner. The strategy is tested in numerous experiments with both manual and robot welding, showing improvements in the overall welding process.


[28] 2601.08534

Airborne Particle Communication Through Time-varying Diffusion-Advection Channels

Particle based communication using diffusion and advection has emerged as an alternative signaling paradigm recently. While most existing studies assume constant flow conditions, real macro scale environments such as atmospheric winds exhibit time varying behavior. In this work, airborne particle communication under time varying advection is modeled as a linear time varying (LTV) channel, and a closed form, time dependent channel impulse response is derived using the method of moving frames. Based on this formulation, the channel is characterized through its power delay profile, leading to the definition of channel dispersion time as a physically meaningful measure of channel memory and a guideline for symbol duration selection. System level simulations under directed, time varying wind conditions show that waveform design is critical for performance, enabling multi symbol modulation using a single particle type when dispersion is sufficiently controlled. The results demonstrate that time varying diffusion advection channels can be systematically modeled and engineered using communication theoretic tools, providing a realistic foundation for particle based communication in complex flow environments.


[29] 2601.08537

Weakly Supervised Tabla Stroke Transcription via TI-SDRM: A Rhythm-Aware Lattice Rescoring Framework

Tabla Stroke Transcription (TST) is central to the analysis of rhythmic structure in Hindustani classical music, yet remains challenging due to complex rhythmic organization and the scarcity of strongly annotated data. Existing approaches largely rely on fully supervised learning with onset-level annotations, which are costly and impractical at scale. This work addresses TST in a weakly supervised setting, using only symbolic stroke sequences without temporal alignment. We propose a framework that combines a CTC-based acoustic model with sequence-level rhythmic rescoring. The acoustic model produces a decoding lattice, which is refined using a \textbf{$T\bar{a}la$}-Independent Static--Dynamic Rhythmic Model (TI-SDRM) that integrates long-term rhythmic structure with short-term adaptive dynamics through an adaptive interpolation mechanism. We curate a new real-world tabla solo dataset and a complementary synthetic dataset, establishing the first benchmark for weakly supervised TST in Hindustani classical music. Experiments demonstrate consistent and substantial reductions in stroke error rate over acoustic-only decoding, confirming the importance of explicit rhythmic structure for accurate transcription.


[30] 2601.08683

Region of interest detection for efficient aortic segmentation

Thoracic aortic dissection and aneurysms are the most lethal diseases of the aorta. The major hindrance to treatment lies in the accurate analysis of the medical images. More particularly, aortic segmentation of the 3D image is often tedious and difficult. Deep-learning-based segmentation models are an ideal solution, but their inability to deliver usable outputs in difficult cases and their computational cost cause their clinical adoption to stay limited. This study presents an innovative approach for efficient aortic segmentation using targeted region of interest (ROI) detection. In contrast to classical detection models, we propose a simple and efficient detection model that can be widely applied to detect a single ROI. Our detection model is trained as a multi-task model, using an encoder-decoder architecture for segmentation and a fully connected network attached to the bottleneck for detection. We compare the performance of a one-step segmentation model applied to a complete image, nnU-Net and our cascade model composed of a detection and a segmentation step. We achieve a mean Dice similarity coefficient of 0.944 with over 0.9 for all cases using a third of the computing power. This simple solution achieves state-of-the-art performance while being compact and robust, making it an ideal solution for clinical applications.


[31] 2601.08685

Stable Filtering for Efficient Dimensionality Reduction of Streaming Manifold Data

Many areas in science and engineering now have access to technologies that enable the rapid collection of overwhelming data volumes. While these datasets are vital for understanding phenomena from physical to biological and social systems, the sheer magnitude of the data makes even simple storage, transmission, and basic processing highly challenging. To enable efficient and accurate execution of these data processing tasks, we require new dimensionality reduction tools that 1) do not need expensive, time-consuming training, and 2) preserve the underlying geometry of the data that has the information required to understand the measured system. Specifically, the geometry to be preserved is that induced by the fact that in many applications, streaming high-dimensional data evolves on a low-dimensional attractor manifold. Importantly, we may not know the exact structure of this manifold a priori. To solve these challenges, we present randomized filtering (RF), which leverages a specific instantiation of randomized dimensionality reduction to provably preserve non-linear manifold structure in the embedded space while remaining data-independent and computationally efficient. In this work we build on the rich theoretical promise of randomized dimensionality reduction to develop RF as a real, practical approach. We introduce novel methods, analysis, and experimental verification to illuminate the practicality of RF in diverse scientific applications, including several simulated and real-data examples that showcase the tangible benefits of RF.


[32] 2601.08749

A Single-Parameter Factor-Graph Image Prior

We propose a novel piecewise smooth image model with piecewise constant local parameters that are automatically adapted to each image. Technically, the model is formulated in terms of factor graphs with NUP (normal with unknown parameters) priors, and the pertinent computations amount to iterations of conjugate-gradient steps and Gaussian message passing. The proposed model and algorithms are demonstrated with applications to denoising and contrast enhancement.


[33] 2601.08758

M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding

Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning, and recent advances have extended this paradigm to Multimodal Large Language Models (MLLMs). In the medical domain, where diagnostic decisions depend on nuanced visual cues and sequential reasoning, CoT aligns naturally with clinical thinking processes. However, Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path. An opaque process lacks reliable bases for judgment, making it difficult to assist doctors in diagnosis. To address this gap, we introduce a new M3CoTBench benchmark specifically designed to evaluate the correctness, efficiency, impact, and consistency of CoT reasoning in medical image understanding. M3CoTBench features 1) a diverse, multi-level difficulty dataset covering 24 examination types, 2) 13 varying-difficulty tasks, 3) a suite of CoT-specific evaluation metrics (correctness, efficiency, impact, and consistency) tailored to clinical reasoning, and 4) a performance analysis of multiple MLLMs. M3CoTBench systematically evaluates CoT reasoning across diverse medical imaging tasks, revealing current limitations of MLLMs in generating reliable and clinically interpretable reasoning, and aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare. Project page at this https URL.


[34] 2601.07958

LJ-Spoof: A Generatively Varied Corpus for Audio Anti-Spoofing and Synthesis Source Tracing

Speaker-specific anti-spoofing and synthesis-source tracing are central challenges in audio anti-spoofing. Progress has been hampered by the lack of datasets that systematically vary model architectures, synthesis pipelines, and generative parameters. To address this gap, we introduce LJ-Spoof, a speaker-specific, generatively diverse corpus that systematically varies prosody, vocoders, generative hyperparameters, bona fide prompt sources, training regimes, and neural post-processing. The corpus spans one speakers-including studio-quality recordings-30 TTS families, 500 generatively variant subsets, 10 bona fide neural-processing variants, and more than 3 million utterances. This variation-dense design enables robust speaker-conditioned anti-spoofing and fine-grained synthesis-source tracing. We further position this dataset as both a practical reference training resource and a benchmark evaluation suite for anti-spoofing and source tracing.


[35] 2601.07998

Predicting Region of Interest in Human Visual Search Based on Statistical Texture and Gabor Features

Understanding human visual search behavior is a fundamental problem in vision science and computer vision, with direct implications for modeling how observers allocate attention in location-unknown search tasks. In this study, we investigate the relationship between Gabor-based features and gray-level co-occurrence matrix (GLCM) based texture features in modeling early-stage visual search behavior. Two feature-combination pipelines are proposed to integrate Gabor and GLCM features for narrowing the region of possible human fixations. The pipelines are evaluated using simulated digital breast tomosynthesis images. Results show qualitative agreement among fixation candidates predicted by the proposed pipelines and a threshold-based model observer. A strong correlation is observed between GLCM mean and Gabor feature responses, indicating that these features encode related image information despite their different formulations. Eye-tracking data from human observers further suggest consistency between predicted fixation regions and early-stage gaze behavior. These findings highlight the value of combining structural and texture-based features for modeling visual search and support the development of perceptually informed observer models.


[36] 2601.07999

VoxCog: Towards End-to-End Multilingual Cognitive Impairment Classification through Dialectal Knowledge

In this work, we present a novel perspective on cognitive impairment classification from speech by integrating speech foundation models that explicitly recognize speech dialects. Our motivation is based on the observation that individuals with Alzheimer's Disease (AD) or mild cognitive impairment (MCI) often produce measurable speech characteristics, such as slower articulation rate and lengthened sounds, in a manner similar to dialectal phonetic variations seen in speech. Building on this idea, we introduce VoxCog, an end-to-end framework that uses pre-trained dialect models to detect AD or MCI without relying on additional modalities such as text or images. Through experiments on multiple multilingual datasets for AD and MCI detection, we demonstrate that model initialization with a dialect classifier on top of speech foundation models consistently improves the predictive performance of AD or MCI. Our trained models yield similar or often better performance compared to previous approaches that ensembled several computational methods using different signal modalities. Particularly, our end-to-end speech-based model achieves 87.5% and 85.9% accuracy on the ADReSS 2020 challenge and ADReSSo 2021 challenge test sets, outperforming existing solutions that use multimodal ensemble-based computation or LLMs.


[37] 2601.08074

Elastic overtones: an equal temperament 12 tone music system with "perfect" fifths

The impossibility of a transposable 12 semitone tuning of the octave arises from the mathematical fact that $2 \times 2^{7/12} \neq 3$ i.e., the second harmonic of the fifth can not exactly match the third harmonic of the fundamental. This in turn, stems from the whole number harmonic structure of western music, and the subsequent fundamental character of the octave interval as multiples of 2 in frequency, a property inherited by our music system from the physics of instruments with vibrating elements being to a good approximation one dimensional. In the current era of electronic music, one can relax the above assumptions to construct an analogous music system where all the structural properties of the standard music system are preserved, but where harmonics are not whole number multiples of the fundamental frequency, and the octave is no longer a factor of 2 in frequency. This now allows to construct a transposable 12 semitone music system where the second harmonic of the fifth exactly matches the third harmonic of the fundamental. The enhanced harmonic qualities of this system recover to a good approximation the musical qualities of Just Intonation, whilst retaining by construction all the versatility and modulating ability of 12TET.


[38] 2601.08107

STO-RL: Offline RL under Sparse Rewards via LLM-Guided Subgoal Temporal Order

Offline reinforcement learning (RL) enables policy learning from pre-collected datasets, avoiding costly and risky online interactions, but it often struggles with long-horizon tasks involving sparse rewards. Existing goal-conditioned and hierarchical offline RL methods decompose such tasks and generate intermediate rewards to mitigate limitations of traditional offline RL, but usually overlook temporal dependencies among subgoals and rely on imprecise reward shaping, leading to suboptimal policies. To address these issues, we propose STO-RL (Offline RL using LLM-Guided Subgoal Temporal Order), an offline RL framework that leverages large language models (LLMs) to generate temporally ordered subgoal sequences and corresponding state-to-subgoal-stage mappings. Using this temporal structure, STO-RL applies potential-based reward shaping to transform sparse terminal rewards into dense, temporally consistent signals, promoting subgoal progress while avoiding suboptimal solutions. The resulting augmented dataset with shaped rewards enables efficient offline training of high-performing policies. Evaluations on four discrete and continuous sparse-reward benchmarks demonstrate that STO-RL consistently outperforms state-of-the-art offline goal-conditioned and hierarchical RL baselines, achieving faster convergence, higher success rates, and shorter trajectories. Ablation studies further confirm STO-RL's robustness to imperfect or noisy LLM-generated subgoal sequences, demonstrating that LLM-guided subgoal temporal structures combined with theoretically grounded reward shaping provide a practical and scalable solution for long-horizon offline RL.


[39] 2601.08110

Efficient Incremental SLAM via Information-Guided and Selective Optimization

We present an efficient incremental SLAM back-end that achieves the accuracy of full batch optimization while substantially reducing computational cost. The proposed approach combines two complementary ideas: information-guided gating (IGG) and selective partial optimization (SPO). IGG employs an information-theoretic criterion based on the log-determinant of the information matrix to quantify the contribution of new measurements, triggering global optimization only when a significant information gain is observed. This avoids unnecessary relinearization and factorization when incoming data provide little additional information. SPO executes multi-iteration Gauss-Newton (GN) updates but restricts each iteration to the subset of variables most affected by the new measurements, dynamically refining this active set until convergence. Together, these mechanisms retain all measurements to preserve global consistency while focusing computation on parts of the graph where it yields the greatest benefit. We provide theoretical analysis showing that the proposed approach maintains the convergence guarantees of full GN. Extensive experiments on benchmark SLAM datasets show that our approach consistently matches the estimation accuracy of batch solvers, while achieving significant computational savings compared to conventional incremental approaches. The results indicate that the proposed approach offers a principled balance between accuracy and efficiency, making it a robust and scalable solution for real-time operation in dynamic data-rich environments.


[40] 2601.08136

Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies

Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty in online RL is the lack of direct samples from the target distribution; instead, the target is an unnormalized Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which utilizes a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. Yet, it remains unclear how these objectives relate formally or if they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that effectively reduce importance sampling variance. We show that existing noise-expectation and gradient-expectation methods are two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and enables the principled combination of Q-value and Q-gradient information to derive an optimal, minimum-variance estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL, and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.


[41] 2601.08254

Large Artificial Intelligence Model Guided Deep Reinforcement Learning for Resource Allocation in Non Terrestrial Networks

Large AI Model (LAM) have been proposed to applications of Non-Terrestrial Networks (NTN), that offer better performance with its great generalization and reduced task specific trainings. In this paper, we propose a Deep Reinforcement Learning (DRL) agent that is guided by a Large Language Model (LLM). The LLM operates as a high level coordinator that generates textual guidance that shape the reward of the DRL agent during training. The results show that the LAM-DRL outperforms the traditional DRL by 40% in nominal weather scenarios and 64% in extreme weather scenarios compared to heuristics in terms of throughput, fairness, and outage probability.


[42] 2601.08326

From Antenna Abundance to Antenna Intelligence in 6G Gigantic MIMO Systems

Current cellular systems achieve high spectral efficiency through Massive MIMO, which leverages an abundance of antennas to create favorable propagation conditions for multiuser spatial multiplexing. Looking towards future networks, the extrapolation of this paradigm leads to systems with many hundreds of antennas per base station, raising concerns regarding hardware complexity, cost, and power consumption. This article suggests more intelligent array designs that reduce the need for excessive antenna numbers. We revisit classical uniform array design principles and explain how their uniform spatial sampling leads to unnecessary redundancies in practical deployment scenarios. By exploiting non-uniform sparse arrays with site-specific antenna placements -- based on either pre-optimized irregular arrays or real-time movable antennas -- we demonstrate how superior multiuser MIMO performance can be achieved with far fewer antennas. These principles are inspired by previous works on wireless localization. We explain and demonstrate numerically how these concepts can be adapted for communications to improve the average sum rate and similar metrics. The results suggest a paradigm shift for future antenna array design, where antenna intelligence replaces sheer antenna count. This opens new opportunities for efficient, adaptable, and sustainable Gigantic MIMO systems.


[43] 2601.08357

Movable Antenna for Integrating Near-field Channel Estimation and Localization

Movable antenna (MA) introduces a new degree of freedom for future wireless communication systems by enabling the adaptive adjustment of antenna positions. Its large-range movement renders wireless channels transmission into the near-field region, which brings new performance enhancement for integrated sensing and communication (ISAC). This paper proposes a novel multi-stage design framework for broadband near-field ISAC assisted by MA. The framework first divides the MA movement area into multiple subregions, and employs the Newtonized orthogonal matching pursuit algorithm (NOMP) to achieve high-precision angle estimation in each subregion. Subsequently, a method called near-field localization via subregion ray clustering (LSRC) is proposed for identifying the positions of scatterers. This method finds the coordinates of each scatterer by jointly processing the angle estimates across all subregions. Finally, according to the estimated locations of the scatterers, the near-field channel estimation (CE) is refined for improving communication performance. Simulation results demonstrate that the proposed scheme can significantly enhance MA sensing accuracy and CE, providing an efficient solution for MA-aided near-field ISAC.


[44] 2601.08358

Decodable but not structured: linear probing enables Underwater Acoustic Target Recognition with pretrained audio embeddings

Increasing levels of anthropogenic noise from ships contribute significantly to underwater sound pollution, posing risks to marine ecosystems. This makes monitoring crucial to understand and quantify the impact of the ship radiated noise. Passive Acoustic Monitoring (PAM) systems are widely deployed for this purpose, generating years of underwater recordings across diverse soundscapes. Manual analysis of such large-scale data is impractical, motivating the need for automated approaches based on machine learning. Recent advances in automatic Underwater Acoustic Target Recognition (UATR) have largely relied on supervised learning, which is constrained by the scarcity of labeled data. Transfer Learning (TL) offers a promising alternative to mitigate this limitation. In this work, we conduct the first empirical comparative study of transfer learning for UATR, evaluating multiple pretrained audio models originating from diverse audio domains. The pretrained model weights are frozen, and the resulting embeddings are analyzed through classification, clustering, and similarity-based evaluations. The analysis shows that the geometrical structure of the embedding space is largely dominated by recording-specific characteristics. However, a simple linear probe can effectively suppress this recording-specific information and isolate ship-type features from these embeddings. As a result, linear probing enables effective automatic UATR using pretrained audio models at low computational cost, significantly reducing the need for a large amounts of high-quality labeled ship recordings.


[45] 2601.08405

Large Language Models to Enhance Multi-task Drone Operations in Simulated Environments

Benefiting from the rapid advancements in large language models (LLMs), human-drone interaction has reached unprecedented opportunities. In this paper, we propose a method that integrates a fine-tuned CodeT5 model with the Unreal Engine-based AirSim drone simulator to efficiently execute multi-task operations using natural language commands. This approach enables users to interact with simulated drones through prompts or command descriptions, allowing them to easily access and control the drone's status, significantly lowering the operational threshold. In the AirSim simulator, we can flexibly construct visually realistic dynamic environments to simulate drone applications in complex scenarios. By combining a large dataset of (natural language, program code) command-execution pairs generated by ChatGPT with developer-written drone code as training data, we fine-tune the CodeT5 to achieve automated translation from natural language to executable code for drone tasks. Experimental results demonstrate that the proposed method exhibits superior task execution efficiency and command understanding capabilities in simulated environments. In the future, we plan to extend the model functionality in a modular manner, enhancing its adaptability to complex scenarios and driving the application of drone technologies in real-world environments.


[46] 2601.08450

Decoding Order Matters in Autoregressive Speech Synthesis

Autoregressive speech synthesis often adopts a left-to-right order, yet generation order is a modelling choice. We investigate decoding order through masked diffusion framework, which progressively unmasks positions and allows arbitrary decoding orders during training and inference. By interpolating between identity and random permutations, we show that randomness in decoding order affects speech quality. We further compare fixed strategies, such as \texttt{l2r} and \texttt{r2l} with adaptive ones, such as Top-$K$, finding that fixed-order decoding, including the dominating left-to-right approach, is suboptimal, while adaptive decoding yields better performance. Finally, since masked diffusion requires discrete inputs, we quantise acoustic representations and find that even 1-bit quantisation can support reasonably high-quality speech.


[47] 2601.08467

Zero-Shot Distracted Driver Detection via Vision Language Models with Double Decoupling

Distracted driving is a major cause of traffic collisions, calling for robust and scalable detection methods. Vision-language models (VLMs) enable strong zero-shot image classification, but existing VLM-based distracted driver detectors often underperform in real-world conditions. We identify subject-specific appearance variations (e.g., clothing, age, and gender) as a key bottleneck: VLMs entangle these factors with behavior cues, leading to decisions driven by who the driver is rather than what the driver is doing. To address this, we propose a subject decoupling framework that extracts a driver appearance embedding and removes its influence from the image embedding prior to zero-shot classification, thereby emphasizing distraction-relevant evidence. We further orthogonalize text embeddings via metric projection onto Stiefel manifold to improve separability while staying close to the original semantics. Experiments demonstrate consistent gains over prior baselines, indicating the promise of our approach for practical road-safety applications.


[48] 2601.08491

AUV Trajectory Learning for Underwater Acoustic Energy Transfer and Age Minimization

Internet of underwater things (IoUT) is increasingly gathering attention with the aim of monitoring sea life and deep ocean environment, underwater surveillance as well as maintenance of underwater installments. However, conventional IoUT devices, reliant on battery power, face limitations in lifespan and pose environmental hazards upon disposal. This paper introduces a sustainable approach for simultaneous information uplink from the IoUT devices and acoustic energy transfer (AET) to the devices via an autonomous underwater vehicle (AUV), potentially enabling them to operate indefinitely. To tackle the time-sensitivity, we adopt age of information (AoI), and Jain's fairness index. We develop two deep-reinforcement learning (DRL) algorithms, offering a high-complexity, high-performance frequency division duplex (FDD) solution and a low-complexity, medium-performance time division duplex (TDD) approach. The results elucidate that the proposed FDD and TDD solutions significantly reduce the average AoI and boost the harvested energy as well as data collection fairness compared to baseline approaches.


[49] 2601.08516

Robust CAPTCHA Using Audio Illusions in the Era of Large Language Models: from Evaluation to Advances

CAPTCHAs are widely used by websites to block bots and spam by presenting challenges that are easy for humans but difficult for automated programs to solve. To improve accessibility, audio CAPTCHAs are designed to complement visual ones. However, the robustness of audio CAPTCHAs against advanced Large Audio Language Models (LALMs) and Automatic Speech Recognition (ASR) models remains unclear. In this paper, we introduce AI-CAPTCHA, a unified framework that offers (i) an evaluation framework, ACEval, which includes advanced LALM- and ASR-based solvers, and (ii) a novel audio CAPTCHA approach, IllusionAudio, leveraging audio illusions. Through extensive evaluations of seven widely deployed audio CAPTCHAs, we show that most existing methods can be solved with high success rates by advanced LALMs and ASR models, exposing critical security weaknesses. To address these vulnerabilities, we design a new audio CAPTCHA approach, IllusionAudio, which exploits perceptual illusion cues rooted in human auditory mechanisms. Extensive experiments demonstrate that our method defeats all tested LALM- and ASR-based attacks while achieving a 100% human pass rate, significantly outperforming existing audio CAPTCHA methods.


[50] 2601.08753

Grid-Aware Charging and Operational Optimization for Mixed-Fleet Public Transit

The rapid growth of urban populations and the increasing need for sustainable transportation solutions have prompted a shift towards electric buses in public transit systems. However, the effective management of mixed fleets consisting of both electric and diesel buses poses significant operational challenges. One major challenge is coping with dynamic electricity pricing, where charging costs vary throughout the day. Transit agencies must optimize charging assignments in response to such dynamism while accounting for secondary considerations such as seating constraints. This paper presents a comprehensive mixed-integer linear programming (MILP) model to address these challenges by jointly optimizing charging schedules and trip assignments for mixed (electric and diesel bus) fleets while considering factors such as dynamic electricity pricing, vehicle capacity, and route constraints. We address the potential computational intractability of the MILP formulation, which can arise even with relatively small fleets, by employing a hierarchical approach tailored to the fleet composition. By using real-world data from the city of Chattanooga, Tennessee, USA, we show that our approach can result in significant savings in the operating costs of the mixed transit fleets.


[51] 2601.08764

FusID: Modality-Fused Semantic IDs for Generative Music Recommendation

Generative recommendation systems have achieved significant advances by leveraging semantic IDs to represent items. However, existing approaches that tokenize each modality independently face two critical limitations: (1) redundancy across modalities that reduces efficiency, and (2) failure to capture inter-modal interactions that limits item representation. We introduce FusID, a modality-fused semantic ID framework that addresses these limitations through three key components: (i) multimodal fusion that learns unified representations by jointly encoding information across modalities, (ii) representation learning that brings frequently co-occurring item embeddings closer while maintaining distinctiveness and preventing feature redundancy, and (iii) product quantization that converts the fused continuous embeddings into multiple discrete tokens to mitigate ID conflict. Evaluated on a multimodal next-song recommendation (i.e., playlist continuation) benchmark, FusID achieves zero ID conflicts, ensuring that each token sequence maps to exactly one song, mitigates codebook underutilization, and outperforms baselines in terms of MRR and Recall@k (k = 1, 5, 10, 20).


[52] 2601.08780

LWM-Spectro: A Foundation Model for Wireless Baseband Signal Spectrograms

The received in-phase and quadrature (I/Q) baseband signals inherently encode physical-layer and channel characteristics of wireless links. Learning robust and transferable representations directly from such raw signals, however, remains challenging due to heterogeneous communication systems, diverse propagation environments, and limited labeled data. To address this, we present LWM-Spectro, a transformer-based foundation model pretrained on large-scale I/Q data represented as time-frequency spectrograms. The model leverages self-supervised masked modeling, contrastive learning, and a mixture-of-experts (MoE) architecture to learn general-purpose wireless representations. These representations transfer effectively to downstream tasks such as modulation classification and joint SNR/mobility recognition, even with minimal supervision. Across tasks, LWM-Spectro consistently outperforms state-of-the-art deep learning baselines in both few-shot and data-rich regimes, providing a unified foundation for wireless learning.


[53] 2204.13620

Generative Adversarial Networks for Image Super-Resolution: A Survey

Single image super-resolution (SISR) has played an important role in the field of image processing. Recent generative adversarial networks (GANs) can achieve excellent results on low-resolution images. However, there are little literatures summarizing different GANs in SISR. In this paper, we conduct a comparative study of GANs from different perspectives. We begin by surveying the development of GANs and popular GAN variants for image-related applications, and then analyze motivations, implementations and differences of GANs based optimization methods and discriminative learning for image super-resolution in terms of supervised, semi-supervised and unsupervised manners, where these GANs are analyzed via integrating different network architectures, prior knowledge, loss functions and multiple tasks. Secondly, we compare the performances of these popular GANs on public datasets via quantitative and qualitative analysis in SISR. Finally, we highlight challenges of GANs and potential research points for SISR.


[54] 2410.11339

EEG-based 90-Degree Turn Intention Detection for Brain-Computer Interface

Electroencephalography (EEG)--based turn intention prediction for lower limb movement is important to build an efficient brain-computer interface (BCI) system. This study investigates the feasibility of intention detection of left-turn, right-turn, and straight walk by utilizing EEG signals obtained before the event occurrence. Synchronous data was collected using 31-channel EEG and IMU-based motion capture systems for nine healthy participants while performing left-turn, right-turn, and straight walk movements. EEG data was preprocessed with steps including Artifact Subspace Reconstruction (ASR), re-referencing, and Independent Component Analysis (ICA) to remove data noise. Feature extraction from the preprocessed EEG data involved computing various statistical measures (mean, median, standard deviation, skew, and kurtosis), and Hjorth parameters (activity, mobility, and complexity). Further, the feature selection was performed using the Random forest algorithm for the dimensionality reduction. The feature set obtained was utilized for 3-class classification using XG boost, gradient boosting, and support vector machine (SVM) with RBF kernel classifiers in a five-fold cross-validation scheme. Using the proposed intention detection methodology, the SVM classifier using an EEG window of 1.5 s and 0 s time-lag has the best decoding performance with mean accuracy, precision, and recall of 81.23%, 85.35%, and 83.92%, respectively, across the nine participants. The decoding analysis shows the feasibility of turn intention prediction for lower limb movement using the EEG signal before the event onset.


[55] 2505.18058

A Pre-trained Foundation Model Framework for Multiplanar MRI Classification of Extramural Vascular Invasion and Mesorectal Fascia Invasion in Rectal Cancer

Objectives Accurate MRI-based identification of extramural vascular invasion (EVI) and mesorectal fascia invasion (MFI) is crucial for risk-stratified rectal cancer treatment. However, subjective visual assessment and inter-institutional variability limit diagnostic consistency. This study developed and externally evaluated a multi-centre, foundation model-driven framework that automatically classifies EVI and MFI on axial and sagittal MRI. Methods A total of 331 pre-treatment rectal cancer T2-weighted MRI scans from three European hospitals were retrospectively recruited. A self-supervised frequency domain harmonization strategy was applied to reduce scanner variability. Three classifiers, SeResNet, the universal biomedical pretrained model (UMedPT) with a multilayer perceptron head, and a logistic-regression variant using frozen UMedPT features (UMedPT_LR), were trained (n=265) and tested (n=66). Gradient-weighted class activation mapping (Grad-CAM) visualized model predictions. Results UMedPT_LR achieved the best EVI performance with multiplanar fusion (AUC=0.82, test set). For MFI, UMedPT trained on axial harmonized images yielded the highest performance (AUC = 0.77). Both tasks outperformed the CHAIMELEON 2024 benchmark (EVI: 0.82 vs 0.74; MFI: 0.77 vs 0.75). Harmonization enhanced MFI classification, and multiplanar fusion further boosted EVI performance. Grad-CAM confirmed biologically plausible attention on peritumoral regions (EVI) and mesorectal fascia margins (MFI). Conclusion The proposed foundation model-driven framework, leveraging frequency domain harmonization and multiplanar fusion, achieves state-of-the-art performance for automated EVI and MFI classification on MRI, demonstrating strong generalizability across multiple centers.


[56] 2506.10407

Semi-Tensor-Product Based Convolutional Neural Networks

The semi-tensor product of vectors generalizes the conventional inner product, enabling algebraic operations between vectors of different dimensions. Building upon this foundation, we introduce a domain-based convolutional product and integrate it with the STP to formulate a padding-free convolutional operation. This new operation inherently avoids zero or other artificial padding, thereby eliminating redundant information and boundary artifacts commonly present in conventional convolutional neural networks. Based on this operation, we further develop an STP-based CNN framework that extends convolutional computation to irregular and cross-dimensional data domains. Applications to image processing and third-order signal identification demonstrate the proposed method's effectiveness in handling irregular, incomplete, and high-dimensional data without the distortions caused by padding.


[57] 2507.15958

Quantization-Aware Neuromorphic Architecture for Skin Disease Classification on Resource-Constrained Devices

On-device skin lesion analysis is constrained by the compute and energy cost of conventional CNN inference and by the need to update models as new patient data become available. We propose QANA, a quantization-aware CNN backbone embedded in an end-to-end pipeline engineered for conversion-stable neuromorphic execution. QANA replaces conversion-fragile components with spike-compatible transformations by bounding intermediate activations and aligning normalization with low-bit quantization, reducing conversion-induced distortion that disproportionately impacts rare classes. Efficiency is achieved through Ghost-based feature generation under tight FLOP budgets, while spatially-aware efficient channel attention and squeeze-and-excitation recalibrate channels without heavy global operators that are difficult to map to spiking cores. The resulting quantized projection head produces SNN-ready logits and enables incremental updates on edge hardware without full retraining or data offloading. On HAM10000, QANA achieves 91.6% Top-1 accuracy and 91.0% macro F1, improving the strongest converted SNN baseline by 3.5 percentage points in Top-1 accuracy, corresponding to a 4.0% relative gain, and by 12.0 points in macro F1, corresponding to a 15.2% relative gain. On a clinical dataset, QANA achieves 90.8% Top-1 accuracy and 81.7% macro F1, improving the strongest converted SNN baseline by 3.2 points in Top-1 accuracy, which corresponds to a 3.7% relative gain, and by 3.6 points in macro F1, corresponding to a 4.6% relative gain. When deployed on BrainChip Akida, QANA runs in 1.5 ms per image with 1.7 mJ per image, corresponding to 94.6% lower latency and 99.0% lower energy than its GPU-based CNN implementation.


[58] 2508.09179

HiFi-Mamba: Dual-Stream W-Laplacian Enhanced Mamba for High-Fidelity MRI Reconstruction

Reconstructing high-fidelity MR images from undersampled k-space data remains a challenging problem in MRI. While Mamba variants for vision tasks offer promising long-range modeling capabilities with linear-time complexity, their direct application to MRI reconstruction inherits two key limitations: (1) insensitivity to high-frequency anatomical details; and (2) reliance on redundant multi-directional scanning. To address these limitations, we introduce High-Fidelity Mamba (HiFi-Mamba), a novel dual-stream Mamba-based architecture comprising stacked W-Laplacian (WL) and HiFi-Mamba blocks. Specifically, the WL block performs fidelity-preserving spectral decoupling, producing complementary low- and high-frequency streams. This separation enables the HiFi-Mamba block to focus on low-frequency structures, enhancing global feature modeling. Concurrently, the HiFi-Mamba block selectively integrates high-frequency features through adaptive state-space modulation, preserving comprehensive spectral details. To eliminate the scanning redundancy, the HiFi-Mamba block adopts a streamlined unidirectional traversal strategy that preserves long-range modeling capability with improved computational efficiency. Extensive experiments on standard MRI reconstruction benchmarks demonstrate that HiFi-Mamba consistently outperforms state-of-the-art CNN-based, Transformer-based, and other Mamba-based models in reconstruction accuracy while maintaining a compact and efficient model design.


[59] 2508.16803

A predictive modular approach to constraint satisfaction under uncertainty -- with application to glycosylation in continuous monoclonal antibody biosimilar production

The paper proposes a modular-based approach to constraint handling in process optimization and control. This is partly motivated by the recent interest in learning-based methods, e.g., within bioproduction, for which constraint handling under uncertainty is a challenge. The proposed constraint handler, called predictive filter, is combined with an adaptive constraint margin and a constraint violation cost monitor to minimize the cost of violating soft constraints due to model uncertainty and disturbances. The module can be combined with any controller and is based on minimally modifying the controller output, in a least squares sense, such that constraints are satisfied within the considered horizon. The proposed method is computationally efficient and suitable for real-time applications. The effectiveness of the method is illustrated through a realistic case study of glycosylation constraint satisfaction in continuous monoclonal antibody biosimilar production using Chinese hamster ovary cells, employing a metabolic network model consisting of 23 extracellular metabolites and 126 reactions. In the case study, the average constraint-violation cost is reduced by more than 60% compared to the case without the proposed constraint-handling method.


[60] 2509.20034

Joint Reproduction Number and Spatial Connectivity Structure Estimation via Graph Sparsity-Promoting Penalized Functional

During an epidemic outbreak, decision makers crucially need accurate and robust tools to monitor the pathogen propagation. The effective reproduction number, defined as the expected number of secondary infections stemming from one contaminated individual, is a state-of-the-art indicator quantifying the epidemic intensity. Numerous estimators have been developed to precisely track the reproduction number temporal evolution. Yet, COVID-19 pandemic surveillance raised unprecedented challenges due to the poor quality of worldwide reported infection counts. When monitoring the epidemic in different territories simultaneously, leveraging the spatial structure of data significantly enhances both the accuracy and robustness of reproduction number estimates. However, this requires a good estimate of the spatial structure. To tackle this major limitation, the present work proposes a joint estimator of the reproduction number and connectivity structure. The procedure is assessed through intensive numerical simulations on carefully designed synthetic data and illustrated on real COVID-19 spatiotemporal infection counts.


[61] 2512.04914

Analytical and Cross-Sectional Clinical Validity of a Smartphone-Based U-Turn Test in Multiple Sclerosis

Background: Gait and balance impairment can profoundly impact people with multiple sclerosis (PwMS). Objectives: To evaluate the analytical and clinical validity of the U-Turn Test (UTT), a smartphone-based assessment of dynamic balance in PwMS. Methods: The GaitLab study (ISRCTN15993728) enrolled adult PwMS (EDSS 0.0-6.5). PwMS performed the UTT in a gait laboratory (supervised) using 6 smartphones at different wear locations and daily during a two-week remote period (unsupervised) using one smartphone (belt front). Median turn speed was computed per UTT. In the supervised setting, turn detection accuracy of smartphones was compared to motion capture (mocap) via F1 scores. Agreement between smartphone- and mocap-derived turn speed was assessed by Bland-Altman and ICC(3,1). In the unsupervised setting, test-retest reliability (ICC[2,1]) and correlations with Timed 25-Foot Walk (T25FW), EDSS, Ambulation Score, 12-item Multiple Sclerosis Walking Scale (MSWS-12), and Activities-specific Balance Confidence scale (ABC) were evaluated. Results: Ninety-six PwMS were included. Turn speed was comparable across supervised (1.44 rad/s) and unsupervised settings (1.47 rad/s). In the supervised setting, turn detection was highly accurate (F1 >95% across wear locations). Turn speed agreement with mocap was high (ICC[3,1]: 0.87-0.92), with minimal bias (-0.04 to 0.11 rad/s). Unsupervised test-retest reliability (ICC[2,1]) was >0.90 when aggregating >=2 tests. Turn speed correlated with T25FW (rho=-0.79), EDSS (rho=-0.75), Ambulation score (rho=-0.73), MSWS-12 (rho=-0.65), and ABC (rho=-0.61). Conclusion: The UTT accurately and reproducibly measures turn speed across wear locations and settings, providing complementary dynamic balance insights to clinical measures and showing potential for use in multiple sclerosis trials.


[62] 2512.21018

LEO Constellations as a Decentralized GNSS Network: Optimizing PNT Corrections in Space

With the rapid expansion of low Earth orbit (LEO) constellations, thousands of satellites are now in operation, many equipped with onboard GNSS receivers capable of continuous orbit determination and time synchronization. This development is creating an unprecedented spaceborne GNSS network, offering new opportunities for network-driven precise LEO orbit and clock estimation. Yet, current onboard GNSS processing is largely standalone and often insufficient for high-precision applications, while centralized fusion is challenging due to computational bottlenecks and the lack of in-orbit infrastructure. In this work, we report a decentralized GNSS network over large-scale LEO constellations, where each satellite processes its own measurements while exchanging compact information with neighboring nodes to enable precise orbit and time determination. We model the moving constellation as a dynamic graph and tailor a momentum-accelerated gradient tracking (GT) method to ensure steady convergence despite topology changes. Numerical simulations with constellations containing hundreds of satellites show that the proposed method matches the accuracy of an ideal centralized benchmark, while substantially reducing communication burdens. Ultimately, this framework supports the development of autonomous and self-organizing space systems, enabling high-precision navigation with reduced dependence on continuous ground contact.


[63] 2601.01179

Adaptive Scheduling: A Reinforcement Learning Whittle Index Approach for Wireless Sensor Networks

We propose a reinforcement learning based scheduling framework for Restless Multi-Armed Bandit (RMAB) problems, centred on a Whittle Index Q-Learning policy with Upper Confidence Bound (UCB) exploration, referred to as WIQL-UCB. Unlike existing approaches that rely on fixed or adaptive epsilon-greedy strategies and require careful hyperparameter tuning, the proposed method removes problem-specific tuning and is therefore more generalisable across diverse RMAB settings. We evaluate WIQL-UCB on standard RMAB benchmarks and on a practical sensor scheduling application based on the Age of Incorrect Information (AoII), using an edge-based state estimation scheme that requires no prior knowledge of system dynamics. Experimental results show that WIQL-UCB achieves near-optimal performance while significantly improving computational and memory efficiency. For a representative problem size of N = 15 and M = 3, the proposed method requires only around 600 bytes of memory, compared with several kilobytes for tabular Q-learning and hundreds of kilobytes to megabytes for deep reinforcement learning baselines. In addition, WIQL-UCB achieves sub-millisecond per-decision runtimes and is several times faster than deep reinforcement learning approaches, while maintaining competitive performance. Overall, these results demonstrate that WIQL-UCB consistently outperforms both non-Whittle-based and Whittle-index learning baselines across a wide range of RMAB settings.


[64] 2601.03486

Adaptive Model-Based Reinforcement Learning for Orbit Feedback Control in NSLS-II Storage Ring

The National Synchrotron Light Source II (NSLS-II) uses highly stable electron beam to produce high-quality X-ray beams with high brightness and low-emittance synchrotron radiation. The traditional algorithm to stabilize the beam applies singular value decomposition (SVD) on the orbit response matrix to remove noise and extract actions. Supervised learning has been studied on NSLS-II storage ring stabilization and other accelerator facilities recently. Several problems, for example, machine status drifting, environment noise, and non-linear accelerator dynamics, remain unresolved in the SVD-based and supervised learning algorithms. To address these problems, we propose an adaptive training framework based on model-based reinforcement learning. This framework consists of two types of optimizations: trajectory optimization attempts to minimize the expected total reward in a differentiable environment, and online model optimization learns non-linear machine dynamics through the agent-environment interaction. Through online training, this framework tracks the internal status drifting in the electron beam ring. Simulation and real in-facility experiments on NSLS-II reveal that our method stabilizes the beam position and minimizes the alignment error, defined as the root mean square (RMS) error between adjusted beam positions and the reference position, down to ~1$\mu$m.


[65] 2601.07150

Analysis, detection and control of secure and safe cyber-physical control systems in a unified framework

This paper deals with analysis, simultaneous detection of faults and attacks, fault-tolerant control and attack-resilient of cyber-physical control systems. In our recent work, it has been observed that an attack detector driven by an input residual signal is capable of reliably detecting attacks. In particular, observing system dynamics from the perspective of the system input-output signal space reveals that attacks and system uncertainties act on different system subspaces. These results motivate our exploration of secure and safe cyber-physical control systems in the unified framework of control and detection. The unified framework is proposed to handle control and detection issues uniformly and in subspaces of system input-output data. Its mathematical and control-theoretic basis is system coprime factorizations with Bezout identity at its core. We firstly explore those methods and schemes of the unified framework, which serve as the major control-theoretic tool in our work. It is followed by re-visiting and examining established attack detection and resilient control schemes. The major part of our work is the endeavours to develop a control-theoretic paradigm, in which analysis, simultaneous detection of faults and attacks, fault-tolerant and attack-resilient control of cyber-physical control systems are addressed in a unified manner.


[66] 2408.15969

Stability of Primal-Dual Gradient Flow Dynamics for Multi-Block Convex Optimization Problems

We examine stability properties of primal-dual gradient flow dynamics for composite convex optimization problems with multiple, possibly nonsmooth, terms in the objective function under the generalized consensus constraint. The proposed dynamics are based on the proximal augmented Lagrangian and they provide a viable alternative to ADMM which faces significant challenges from both analysis and implementation viewpoints in large-scale multi-block scenarios. In contrast to customized algorithms with individualized convergence guarantees, we develop a systematic approach for solving a broad class of challenging composite optimization problems. We leverage various structural properties to establish global (exponential) convergence guarantees for the proposed dynamics. Our assumptions are much weaker than those required to prove (exponential) stability of primal-dual dynamics as well as (linear) convergence of discrete-time methods such as standard two-block and multi-block ADMM and EXTRA algorithms. Finally, we show necessity of some of our structural assumptions for exponential stability and provide computational experiments to demonstrate the convenience of the proposed approach for parallel and distributed computing applications.


[67] 2409.19071

Analog fast Fourier transforms for scalable and efficient signal processing

Edge devices are being deployed at increasing volumes to sense and act on information from the physical world. The discrete Fourier transform (DFT) is often necessary to make this sensed data suitable for further processing -- such as by artificial intelligence (AI) algorithms -- and for transmission over communication networks. Analog in-memory computing has been shown to be a fast, energy-efficient, and scalable solution for processing edge AI workloads, but not for Fourier transforms. This is because of the existence of the fast Fourier transform (FFT) algorithm, which enormously reduces the complexity of the DFT but has so far belonged only to digital processors. Here, we show that the FFT can be mapped to analog in-memory computing systems, enabling them to efficiently scale to arbitrarily large Fourier transforms without requiring large sizes or large numbers of non-volatile memory arrays. We experimentally demonstrate analog FFTs on 1D audio and 2D image signals, performing analog computations on up to 524K charge-trapping memory devices simultaneously, where each device has precisely tunable, low-conductance analog states. The scalability of both the new analog FFT approach and the charge-trapping memory device is leveraged to compute a 65,536-point analog DFT, a scale that is otherwise inaccessible by analog systems and which is $>$500$\times$ larger than any previous analog DFT demonstration. Analog FFT cores can provide higher energy efficiency and performance per area than specialized digital FFT processors at all FFT sizes, while also functioning as efficient matrix multiplication engines for AI workloads.


[68] 2503.09252

Large-scale Regional Traffic Signal Control Based on Single-Agent Reinforcement Learning

In the context of global urbanization and motorization, traffic congestion has become a significant issue, severely affecting the quality of life, environment, and economy. This paper puts forward a single-agent reinforcement learning (RL)-based regional traffic signal control (TSC) model. Different from multi - agent systems, this model can coordinate traffic signals across a large area, with the goals of alleviating regional traffic congestion and minimizing the total travel time. The TSC environment is precisely defined through specific state space, action space, and reward functions. The state space consists of the current congestion state, which is represented by the queue lengths of each link, and the current signal phase scheme of intersections. The action space is designed to select an intersection first and then adjust its phase split. Two reward functions are meticulously crafted. One focuses on alleviating congestion and the other aims to minimize the total travel time while considering the congestion level. The experiments are carried out with the SUMO traffic simulation software. The performance of the TSC model is evaluated by comparing it with a base case where no signal-timing adjustments are made. The results show that the model can effectively control congestion. For example, the queuing length is significantly reduced in the scenarios tested. Moreover, when the reward is set to both alleviate congestion and minimize the total travel time, the average travel time is remarkably decreased, which indicates that the model can effectively improve traffic conditions. This research provides a new approach for large-scale regional traffic signal control and offers valuable insights for future urban traffic management.


[69] 2504.11692

Beyond ISAC: Toward Integrated Heterogeneous Service Provisioning via Elastic Multi-Dimensional Multiple Access

Due to the growing complexity of vertical applications, current integrated sensing and communications (ISAC) in wireless networks remains insufficient for supporting all required beyond communication services. To this end, future networks are evolving toward an integrated heterogeneous service provisioning (IHSP) platform, which seeks to integrate a broad range of heterogeneous services beyond the dual-function scope of ISAC. Nevertheless, this trend intensifies conflicts among concurrent heterogeneous service requirements under constrained resource sharing. In this paper, we overcome this challenge by the joint use of two novel elastic design strategies: compromised service value assessment and flexible multi-dimensional resource multiplexing. Consequently, we propose a value-prioritized elastic multi-dimensional multiple access (MDMA) mechanism for IHSP systems. First, we modify the Value-of-Service (VoS) metric by incorporating elastic parameters to characterize user-specific tolerance and compromise in response to various performance degradations under constrained resources. This VoS metric serves as the foundation for prioritizing services and enabling effective fairness service scheduling among concurrent competing demands. Next, we adapt the MDMA to elastically multiplex services using appropriate multiple access schemes across different resource domains. This protocol leverages user-specific interference tolerances and cancellation capabilities across different domains to reduce resource-demanding conflicts and co-channel interference within the same domain. Then, we maximize the system's VoS by jointly optimizing MDMA and power allocation. Since this problem is non-convex, we develop a monotonic optimization-assisted dynamic programming algorithm for the optimal solution and a VoS-prioritized successive convex approximation algorithm for efficient suboptimal computation.


[70] 2506.18124

Bayesian Multiobject Tracking With Neural-Enhanced Motion and Measurement Models

Multiobject tracking (MOT) is an important task in applications including autonomous driving, ocean sciences, and aerospace surveillance. Traditional MOT methods are model-based and combine sequential Bayesian estimation with data association and an object birth model. More recent methods are fully data-driven and rely on the training of neural networks. Both approaches offer distinct advantages in specific settings. In particular, model-based methods are generally applicable across a wide range of scenarios, whereas data-driven MOT achieves superior performance in scenarios where abundant labeled data for training is available. A natural thought is whether a general framework can integrate the two approaches. This paper introduces a hybrid method that utilizes neural networks to enhance specific aspects of the statistical model in Bayesian MOT that have been identified as overly simplistic. By doing so, the performance of the prediction and update steps of Bayesian MOT is improved. To ensure tractable computation, our framework uses belief propagation to avoid high-dimensional operations combined with sequential Monte Carlo methods to perform low-dimensional operations efficiently. The resulting method combines the flexibility and robustness of model-based approaches with the capability to learn complex information from data of neural networks. We evaluate the performance of the proposed method based on the nuScenes autonomous driving dataset and demonstrate that it has state-of-the-art performance.


[71] 2508.05385

A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding

Non-verbal Vocalizations (NVs), such as laughter and sighs, are vital for conveying emotion and intention in human speech, yet most existing speech systems neglect them, which severely compromises communicative richness and emotional intelligence. Existing methods for NVs acquisition are either costly and unscalable (relying on manual annotation/recording) or unnatural (relying on rule-based synthesis). To address these limitations, we propose a highly scalable automatic annotation framework to label non-verbal phenomena from natural speech, which is low-cost, easily extendable, and inherently diverse and natural. This framework leverages a unified detection model to accurately identify NVs in natural speech and integrates them with transcripts via temporal-semantic alignment method. Using this framework, we created and released \textbf{NonVerbalSpeech-38K}, a diverse, real-world dataset featuring 38,718 samples across 10 NV categories collected from in-the-wild media. Experimental results demonstrate that our dataset provides superior controllability for NVs generation and achieves comparable performance for NVs understanding.


[72] 2508.06538

Symbolic Learning of Interpretable Reduced-Order Models for Jumping Quadruped Robots

Reduced-order models are central to motion planning and control of quadruped robots, yet existing templates are often hand-crafted for a specific locomotion modality. This motivates the need for automatic methods that extract task-specific, interpretable low-dimensional dynamics directly from data. We propose a methodology that combines a linear autoencoder with symbolic regression to derive such models. The linear autoencoder provides a consistent latent embedding for configurations, velocities, accelerations, and inputs, enabling the sparse identification of nonlinear dynamics (SINDy) to operate in a compact, physics-aligned space. A multi-phase, hybrid-aware training scheme ensures coherent latent coordinates across contact transitions. We focus our validation on quadruped jumping-a representative, challenging, yet contained scenario in which a principled template model is especially valuable. The resulting symbolic dynamics outperform the state-of-the-art handcrafted actuated spring-loaded inverted pendulum (aSLIP) baseline in simulation and hardware across multiple robots and jumping modalities.


[73] 2508.10360

A dataset and model for auditory scene recognition for hearing devices: AHEAD-DS and OpenYAMNet

Scene recognition is important for hearing devices, however; this is challenging, in part because of the limitations of existing datasets. Datasets often lack public accessibility, completeness, or audiologically relevant labels, hindering systematic comparison of machine learning models. Deploying such models on resource-constrained edge devices presents another this http URL proposed solution is two-fold, a repack and refinement of several open source datasets to create AHEAD-DS, a dataset designed for auditory scene recognition for hearing devices, and introduce OpenYAMNet, a sound recognition model. AHEAD-DS aims to provide a standardised, publicly available dataset with consistent labels relevant to hearing aids, facilitating model comparison. OpenYAMNet is designed for deployment on edge devices like smartphones connected to hearing devices, such as hearing aids and wireless earphones with hearing aid functionality, serving as a baseline model for sound-based scene recognition. OpenYAMNet achieved a mean average precision of 0.86 and accuracy of 0.93 on the testing set of AHEAD-DS across fourteen categories relevant to auditory scene recognition. Real-time sound-based scene recognition capabilities were demonstrated on edge devices by deploying OpenYAMNet to an Android smartphone. Even with a 2018 Google Pixel 3, a phone with modest specifications, the model processes audio with approximately 50ms of latency to load the model, and an approximate linear increase of 30ms per 1 second of audio. The project website with links to code, data, and models. this https URL


[74] 2509.06926

Continuous Audio Language Models

Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are invertible, audio tokens are extracted from lossy codecs with a limited bitrate. As a consequence, increasing audio quality requires generating more tokens, which imposes a trade-off between fidelity and computational cost. We address this issue by studying Continuous Audio Language Models (CALM). These models instantiate a large Transformer backbone that produces a contextual embedding at every timestep. This sequential information then conditions an MLP that generates the next continuous frame of an audio VAE through consistency modeling. By avoiding lossy compression, CALM achieves higher quality at lower computational cost than their discrete counterpart. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. Samples are available at this http URL. Finally, we release Pocket TTS, an open-source 100M-parameter text-to-speech model that can run faster than real time on a laptop CPU: this http URL.


[75] 2509.24613

HiKE: Hierarchical Evaluation Framework for Korean-English Code-Switching Speech Recognition

Despite advances in multilingual automatic speech recognition (ASR), code-switching (CS), the mixing of languages within an utterance common in daily speech, remains a severely underexplored challenge. In this paper, we introduce HiKE: the Hierarchical Korean-English code-switching benchmark, the first globally accessible non-synthetic evaluation framework for Korean-English CS, aiming to provide a means for the precise evaluation of multilingual ASR models and to foster research in the field. The proposed framework not only consists of high-quality, natural CS data across various topics, but also provides meticulous loanword labels and a hierarchical CS-level labeling scheme (word, phrase, and sentence) that together enable a systematic evaluation of a model's ability to handle each distinct level of code-switching. Through evaluations of diverse multilingual ASR models and fine-tuning experiments, this paper demonstrates that although most multilingual ASR models initially exhibit inadequate CS-ASR performance, this capability can be enabled through fine-tuning with synthetic CS data. HiKE is available at this https URL.


[76] 2510.15947

WaveNet's Precision in EEG Classification

This study introduces a WaveNet-based deep learning model designed to automate the classification of intracranial electroencephalography (iEEG) signals into physiological activity, pathological (epileptic) activity, power-line noise, and other non-cerebral artifacts. Traditional methods for iEEG signal classification, which rely on expert visual review, are becoming increasingly impractical due to the growing complexity and volume of iEEG recordings. Leveraging a publicly available annotated dataset from Mayo Clinic and St. Anne's University Hospital, the WaveNet model was trained, validated, and tested on 209,231 samples using a 70/20/10 split. The model achieved a classification accuracy exceeding previous non-specialized CNN- and LSTM-based approaches and was benchmarked against a Temporal Convolutional Network (TCN) baseline. Notably, the model achieves high discrimination of noise and artifact classes, with precisions of 0.98 and approximately 1, respectively. Classification between physiological and pathological signals exhibits a modest but clinically interpretable overlap, with F1-scores of 0.96 and 0.90 and 175 and 272 cross-class false positives, respectively, reflecting inherent clinical overlap. WaveNet's architecture, originally developed for raw audio synthesis, is well-suited for iEEG data due to its use of dilated causal convolutions and residual connections, enabling the capture of both fine-grained and long-range temporal dependencies. The study also details the preprocessing pipeline, including dynamic dataset partitioning, the use of focal loss to address class imbalance, and normalization steps that support high model performance. While the results demonstrate strong in-distribution performance, generalizability across datasets and clinical settings has yet to be established.


[77] 2510.19068

An Adaptive Neuro-Controller Developed for a Prosthetic Hand Wrist

The significance of employing a controller in prosthetic hands cannot be overstated, as it plays a crucial role in enhancing the functionality and usability of these systems. This paper introduces an adaptive neuro-controller specifically developed for a tendon-driven soft continuum wrist of a prosthetic hand. Kinematic and dynamic modeling of the wrist is carried out using the Timoshenko beam theory. A Neural Network (NN) based strategy is adopted to predict the required motor currents to manipulate the wrist tendons from the errors in the deflection of the wrist section. The Timoshenko beam theory is used to compute the required tendon tension from the input motor current. A comparison of the adaptive neuro-controller with other similar controllers is conducted to analyze the performance of the proposed approach. Simulation studies and experimental validations of the fabricated wrist are included to demonstrate the effectiveness of the controller.