New articles on Electrical Engineering and Systems Science


[1] 2510.21748

Automated Tinnitus Detection Through Dual-Modality Neuroimaging: EEG Microstate Analysis and Resting-State fMRI Classification Using Deep Learning

Objective: Tinnitus affects 10-15% of the population yet lacks objective diagnostic biomarkers. This study applied machine learning to EEG and fMRI data to identify neural signatures distinguishing tinnitus patients from healthy controls. Methods: Two datasets were analyzed: 64-channel EEG recordings from 80 participants (40 tinnitus, 40 controls) and resting-state fMRI data from 38 participants (19 tinnitus, 19 controls). EEG analysis extracted microstate features across four to seven clustering states and five frequency bands, producing 440 features per subject. Global Field Power signals were also transformed into wavelet images for deep learning. fMRI data were analyzed using slice-wise convolutional neural networks and hybrid models combining pre-trained architectures (VGG16, ResNet50) with Decision Tree, Random Forest, and SVM classifiers. Model performance was evaluated using 5-fold cross-validation based on accuracy, precision, recall, F1-score, and ROC-AUC. Results: EEG microstate analysis revealed altered network dynamics in tinnitus, particularly reduced gamma-band microstate B occurrence (healthy: 56.56 vs tinnitus: 43.81, p < 0.001) and diminished alpha coverage. Tree-based classifiers achieved up to 98.8% accuracy, while VGG16 on wavelet-transformed EEG yielded 95.4% and 94.1% accuracy for delta and alpha bands, respectively. fMRI analysis identified 12 high-performing axial slices (>=90% accuracy), with slice 17 reaching 99.0%. The hybrid VGG16-Decision Tree model achieved 98.95% +/- 2.94% accuracy. Conclusion: EEG and fMRI provided effective neural biomarkers for tinnitus classification. Tree-based and hybrid models demonstrated superior performance, highlighting tinnitus as a multi-network disorder requiring multimodal analysis.
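
As a rough illustration of the Global Field Power (GFP) signal mentioned in the abstract above, the sketch below computes GFP as the standard deviation across channels at each time sample, which is the common textbook definition. The function names and the toy two-channel recording are assumptions made for illustration; the abstract does not state the authors' preprocessing or peak-selection rules.

```python
from statistics import pstdev


def global_field_power(eeg):
    """Return GFP per time sample.

    `eeg` is a list of channels, each a list of samples of equal length.
    pstdev subtracts the across-channel mean, which matches an average reference.
    """
    return [pstdev(sample) for sample in zip(*eeg)]


def gfp_peaks(gfp):
    """Indices of strict local maxima of the GFP curve."""
    return [i for i in range(1, len(gfp) - 1) if gfp[i - 1] < gfp[i] > gfp[i + 1]]


# Hypothetical toy recording: 2 channels, 5 samples.
toy_eeg = [
    [0.0, 1.0, 3.0, 1.0, 0.0],
    [0.0, -1.0, -3.0, -1.0, 0.0],
]
gfp = global_field_power(toy_eeg)
peaks = gfp_peaks(gfp)
```

In microstate pipelines, the scalp maps at GFP peaks are commonly the ones passed to clustering, since they tend to have the highest signal-to-noise ratio.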


[2] 2510.21789

Monitoring Real-Time ECG Signals on Mobile Systems

This study focuses on the connection of a development kit that enables real-time monitoring of electrocardiogram (ECG) signals using a mobile system. Software developed on the Visual Studio .NET platform reads real-time ECG signals from the human body through non-invasive methods and displays them graphically on the mobile system. ECG electrodes are placed on specific areas of the body using the method known as Einthoven's triangle. Subsequently, the software initiates data flow through the serial port, and these data are displayed as signal values on the mobile device's screen via a graphical interface. When the monitored ECG signals fall below a certain threshold or reach a critical value, the system provides feedback with an alert based on medical data. The developed system is fully portable. Additionally, the implemented system has the potential to form the basis for a multi-purpose system in the future, such as online patient monitoring, patient location tracking, and even initial intervention using the defibrillation method.


[3] 2510.21815

HDR Image Reconstruction using an Unsupervised Fusion Model

High Dynamic Range (HDR) imaging aims to reproduce the wide range of brightness levels present in natural scenes, which the human visual system can perceive but conventional digital cameras often fail to capture due to their limited dynamic range. To address this limitation, we propose a deep learning-based multi-exposure fusion approach for HDR image generation. The method takes a set of differently exposed Low Dynamic Range (LDR) images, typically an underexposed and an overexposed image, and learns to fuse their complementary information using a convolutional neural network (CNN). The underexposed image preserves details in bright regions, while the overexposed image retains information in dark regions; the network effectively combines these to reconstruct a high-quality HDR output. The model is trained in an unsupervised manner, without relying on ground-truth HDR images, making it practical for real-world applications where such data is unavailable. We evaluate our results using the Multi-Exposure Fusion Structural Similarity Index Measure (MEF-SSIM) and demonstrate that our approach achieves superior visual quality compared to existing fusion methods. A customized loss function is further introduced to improve reconstruction fidelity and optimize model performance.


[4] 2510.21911

A Perspective on the Algebra, Topology, and Logic of Electrical Networks

This paper presents a unified algebraic, topological, and logical framework for electrical one-port networks based on Šare's $m$-theory. Within this formalism, networks are represented by $m$-words (jorbs) over an ordered alphabet, where series and parallel composition induce an $m$-topology on $m$-graphs with a theta mapping $\vartheta$ that preserves one-port equivalence. The study formalizes quasi-orders, shells, and cores, showing their structural correspondence to network boundary conditions and impedance behavior. The $\lambda$--$\Delta$ metric, together with the valuation morphism $\Phi$, provides a concise descriptor of the impedance-degree structure. In the computational domain, the framework is extended with algorithmic procedures for generating and classifying non-isomorphic series-parallel topologies, accompanied by programmatic Cauer/Foster synthesis workflows and validation against canonical examples from Ladenheim's catalogue. The resulting approach enables symbolic-to-topological translation of impedance functions, offering a constructive bridge between algebraic representation and electrical realization. Overall, the paper outlines a self-consistent theoretical and computational foundation for automated network synthesis, classification, and formal verification within the emerging field of Jorbology.


[5] 2510.21924

Inverse Design of Metasurface for Spectral Imaging

Inverse design of metasurfaces for the joint optimization of optical modulation and algorithmic decoding in computational optics presents significant challenges, especially in applications such as hyperspectral imaging. We introduce a physics-data co-driven framework for designing reconfigurable metasurfaces fabricated from the phase-change material Ge2Sb2Se4Te1 to achieve compact, compressive spectral imaging in the shortwave infrared region. Central to our approach is a differentiable neural simulator, trained on over 320,000 simulated geometries, that accurately predicts spectral responses across 11 crystallization states. This differentiability enables end-to-end joint optimization of the metasurface geometry, its spectral encoding function, and a deep reconstruction network. We also propose a soft shape regularization technique that preserves manufacturability during gradient-based updates. Experiments show that our optimized system improves reconstruction fidelity by up to 7.6 dB in the peak-signal-to-noise ratio, with enhanced noise resilience and improved measurement matrix conditioning, underscoring the potential of our approach for high-performance hyperspectral imaging.


[6] 2510.21951

Pricing Problems in Adoption of New Technologies

We propose a generalization of the Bass diffusion model in discrete-time that explicitly models the effect of price in adoption. Our model is different from earlier price-incorporated models and fits well to adoption data for various products. We then utilize this model to study two decision-making problems. First, we provide a series of structural results on optimal pricing strategies to maximize profits from product sales by a monopolist over a finite horizon. We fully characterize the optimal pricing strategy in the single-period problem, and establish several structural properties of the same for the multi-period counterpart. Second, we study a Stackelberg game between a policy-maker and a monopolist, where the former seeks to maximize adoption through rebates, while the latter focuses on profits. For this problem, we analytically characterize crucial properties of the equilibrium path of the single-period game, and demonstrate how they carry over to the multi-period variant.
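
For orientation, the sketch below simulates the classical discrete-time Bass diffusion model that the abstract above generalizes. It deliberately omits price, because the abstract does not give the form of the price-dependent model; the function name and all parameter values (`market_size`, `p`, `q`) are illustrative assumptions.

```python
def bass_adoption(market_size, p, q, periods):
    """Cumulative adopters under the classical discrete-time Bass model.

    p: coefficient of innovation (external influence)
    q: coefficient of imitation (word of mouth)
    """
    cumulative = [0.0]
    for _ in range(periods):
        n = cumulative[-1]
        # Per-period adoption probability for a remaining non-adopter.
        hazard = p + q * n / market_size
        cumulative.append(n + hazard * (market_size - n))
    return cumulative


# Hypothetical parameters chosen only to produce an S-shaped curve.
path = bass_adoption(market_size=1000.0, p=0.03, q=0.38, periods=20)
```

A price-incorporated variant would typically make the per-period adoption probability depend on the price charged in that period, which is what turns the simulation into a pricing problem.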


[7] 2510.21969

Adaptive Split-MMD Training for Small-Sample Cross-Dataset P300 EEG Classification

Detecting single-trial P300 from EEG is difficult when only a few labeled trials are available. When attempting to boost a small target set with a large source dataset through transfer learning, cross-dataset shift arises. To address this challenge, we study transfer between two public visual-oddball ERP datasets using five shared electrodes (Fz, Pz, P3, P4, Oz) under a strict small-sample regime (target: 10 trials/subject; source: 80 trials/subject). We introduce Adaptive Split Maximum Mean Discrepancy Training (AS-MMD), which combines (i) a target-weighted loss with warm-up tied to the square root of the source/target size ratio, (ii) Split Batch Normalization (Split-BN) with shared affine parameters and per-domain running statistics, and (iii) a parameter-free logit-level Radial Basis Function kernel Maximum Mean Discrepancy (RBF-MMD) term using the median-bandwidth heuristic. Implemented on an EEG Conformer, AS-MMD is backbone-agnostic and leaves the inference-time model unchanged. Across both transfer directions, it outperforms target-only and pooled training (Active Visual Oddball: accuracy/AUC 0.66/0.74; ERP CORE P3: 0.61/0.65), with gains over pooling significant under corrected paired t-tests. Ablations attribute improvements to all three components.
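
The RBF-MMD term with the median-bandwidth heuristic described above can be sketched in plain Python as follows. This is a minimal illustration of the biased squared-MMD estimator on small batches of logits, not the authors' implementation; the function names and the toy logit values are assumptions.

```python
import math
from statistics import median


def sq_dist(a, b):
    """Squared Euclidean distance between two logit vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def median_bandwidth(points):
    """Median pairwise distance over the pooled sample (median heuristic)."""
    dists = [
        math.sqrt(sq_dist(points[i], points[j]))
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    sigma = median(dists)
    return sigma if sigma > 0 else 1.0  # guard against identical points


def rbf_mmd2(source, target):
    """Biased estimate of squared MMD with an RBF kernel."""
    sigma = median_bandwidth(source + target)

    def mean_kernel(xs, ys):
        total = sum(math.exp(-sq_dist(a, b) / (2 * sigma**2)) for a in xs for b in ys)
        return total / (len(xs) * len(ys))

    return mean_kernel(source, source) + mean_kernel(target, target) - 2 * mean_kernel(source, target)


# Hypothetical 2-class logits from a source batch and a shifted target batch.
source_logits = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
target_logits = [[2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
mmd2 = rbf_mmd2(source_logits, target_logits)
```

Because the bandwidth is recomputed from the pooled batch, the term introduces no extra hyperparameter, which is consistent with the "parameter-free" description in the abstract.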


[8] 2510.22015

Motion Planning with Precedence Specifications via Augmented Graphs of Convex Sets

We present an algorithm for planning trajectories that avoid obstacles and satisfy key-door precedence specifications expressed with a fragment of signal temporal logic. Our method includes a novel exact convex partitioning of the obstacle-free space that encodes connectivity among convex free-space sets, key sets, and door sets. We then construct an augmented graph of convex sets that exactly encodes the key-door precedence specifications. By solving a shortest path problem in this augmented graph of convex sets, our pipeline provides an exact solution up to a finite parameterization of the trajectory. To illustrate the effectiveness of our approach, we present a method to generate key-door mazes that provide challenging problem instances, and we perform numerical experiments to evaluate the proposed pipeline. Our pipeline is faster by several orders of magnitude than recent state-of-the-art methods that use general-purpose temporal logic tools.


[9] 2510.22020

A Hybrid GNN-LSE Method for Fast, Robust, and Physically-Consistent AC Power Flow

Conventional AC Power Flow (ACPF) solvers like Newton-Raphson (NR) face significant computational and convergence challenges in modern, large-scale power systems. This paper proposes a novel, two-stage hybrid method that integrates a Physics-Informed Graph Neural Network (GNN) with a robust, iterative Linear State Estimation (LSE) refinement step to produce fast and physically-consistent solutions. The GNN, trained with a physics-informed loss function featuring an efficient dynamic weighting scheme, rapidly predicts a high-quality initial system state. This prediction is then refined using an iterative, direct linear solver inspired by state estimation techniques. This LSE refinement step solves a series of linear equations to enforce physical laws, effectively bypassing the non-linearities and convergence issues of traditional solvers. The proposed GNN-LSE framework is comprehensively validated on systems ranging from small radial distribution networks (IEEE 33-bus, 69-bus) to a large, meshed transmission system (IEEE 118-bus). Results show that our GNN variants are up to $8.4 \times 10^3$ times faster than NR. The LSE refinement provides a fast route to a physically-consistent solution, while heavy-loading stress tests (120%-150% of nominal) and N-1 contingencies demonstrate the method's reliability and generalization. This work presents a powerful and flexible framework for bridging fast, data-driven models with the rigorous constraints of power system physics, offering a practical tool for real-time operations and analysis.


[10] 2510.22029

High-Performance Rotor Cooling with Ducted Liquid in Completely Cold-Formed Modular Motor Shaft

This paper proposes a novel rotor-cooling shaft concept for high-performance electric motors that increases cooling effectiveness yet remains simple and cost-effective to manufacture. We investigate the thermal performance of four shaft geometries for rotor cooling in automotive applications. The proposed tooth-guided liquid-cooling shaft design aims to solve the high churning loss of conventional cooled rotor shafts due to internal vortex formation and their still limited heat transfer. Therefore, we optimize heat transfer efficiency and pressure management by incorporating cold-formed internal channels that suppress vortex formation beyond the degree that improves heat transfer. We evaluated key performance metrics, including heat transfer rate, outlet temperature, pressure drop, and velocity profiles, under varying rotational speeds, inlet flow rates, and coolant temperatures. Computational fluid analysis demonstrates that the tooth-guided design outperforms conventional hollow shafts and achieves up to 110% higher cooling efficiency at low rotational speeds, while it maintains comparable pressure levels. These findings provide practical insight into geometry-driven thermal optimization and offer a path toward improving the performance and durability of electric motors.


[11] 2510.22104

TRASE-NODEs: Trajectory Sensitivity-aware Neural Ordinary Differential Equations for Efficient Dynamic Modeling

Modeling dynamical systems is crucial across the science and engineering fields for accurate prediction, control, and decision-making. Recently, machine learning (ML) approaches, particularly neural ordinary differential equations (NODEs), have emerged as a powerful tool for data-driven modeling of continuous-time dynamics. Nevertheless, standard NODEs require a large number of data samples to remain consistent under varying control inputs, posing challenges to generate sufficient simulated data and ensure the safety of control design. To address this gap, we propose trajectory-sensitivity-aware (TRASE-)NODEs, which construct an augmented system for both state and sensitivity, enabling simultaneous learning of their dynamics. This formulation allows the adjoint method to update gradients in a memory-efficient manner and ensures that control-input effects are captured in the learned dynamics. We evaluate TRASE-NODEs using a damped oscillator and inverter-based resources (IBRs). The results show that TRASE-NODEs generalize better from limited training data, yielding lower prediction errors than standard NODEs for both examples. The proposed framework offers a data-efficient, control-oriented modeling approach suitable for dynamic systems that require accurate trajectory sensitivity prediction.
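
To make the idea of an augmented state-and-sensitivity system concrete, the sketch below integrates a damped oscillator together with the sensitivity of its trajectory to a constant control input. It uses known analytic dynamics rather than a learned neural vector field, so it only illustrates the augmentation, not the TRASE-NODE training procedure; the oscillator parameters (`omega`, `zeta`), the RK4 integrator, and all names are assumptions.

```python
def augmented_rhs(z, u, omega=2.0, zeta=0.5):
    """Right-hand side for z = [x, v, sx, sv].

    State:       x' = v,   v'  = -omega^2 x  - 2 zeta omega v  + u
    Sensitivity: sx' = sv, sv' = -omega^2 sx - 2 zeta omega sv + 1
    where sx = dx/du and sv = dv/du for a constant input u.
    """
    x, v, sx, sv = z
    return [
        v,
        -omega**2 * x - 2 * zeta * omega * v + u,
        sv,
        -omega**2 * sx - 2 * zeta * omega * sv + 1.0,
    ]


def rk4_step(z, u, dt):
    """One classical Runge-Kutta step on the augmented system."""
    k1 = augmented_rhs(z, u)
    k2 = augmented_rhs([zi + 0.5 * dt * k for zi, k in zip(z, k1)], u)
    k3 = augmented_rhs([zi + 0.5 * dt * k for zi, k in zip(z, k2)], u)
    k4 = augmented_rhs([zi + dt * k for zi, k in zip(z, k3)], u)
    return [
        zi + dt / 6.0 * (a + 2 * b + 2 * c + d)
        for zi, a, b, c, d in zip(z, k1, k2, k3, k4)
    ]


def simulate(u, steps=2000, dt=0.01):
    """Integrate from rest and return the final augmented state."""
    z = [0.0, 0.0, 0.0, 0.0]
    for _ in range(steps):
        z = rk4_step(z, u, dt)
    return z


final = simulate(u=3.0)
```

Integrating the sensitivity alongside the state gives the derivative of the trajectory with respect to the input without finite differencing, which is the quantity a sensitivity-aware loss could compare against reference data.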


[12] 2510.22154

Frequency-Spatial Interaction Driven Network for Low-Light Image Enhancement

Low-light image enhancement (LLIE) aims at improving the perception or interpretability of an image captured in an environment with poor illumination. With the advent of deep learning, the LLIE technique has achieved significant breakthroughs. However, existing LLIE methods either ignore the important role of frequency-domain information or fail to effectively promote the propagation and flow of information, limiting LLIE performance. In this paper, we develop a novel frequency-spatial interaction-driven network (FSIDNet) for LLIE based on a two-stage architecture. Specifically, the first stage is designed to restore the amplitude of low-light images to improve the lightness, and the second stage is devoted to restoring phase information to refine fine-grained structures. Considering that frequency-domain and spatial-domain information are complementary and both favorable for LLIE, we further develop two frequency-spatial interaction blocks which mutually amalgamate the complementary spatial and frequency information to enhance the capability of the model. In addition, we construct the Information Exchange Module (IEM) to associate the two stages by adequately incorporating cross-stage and cross-scale features to effectively promote the propagation and flow of information in the two-stage network structure. Finally, we conduct experiments on several widely used benchmark datasets (e.g., LOL-Real and LSRW-Huawei), which demonstrate that our method achieves excellent performance in terms of visual results and quantitative metrics while preserving good model efficiency.


[13] 2510.22166

Expert Validation of Synthetic Cervical Spine Radiographs Generated with a Denoising Diffusion Probabilistic Model

Machine learning in neurosurgery is limited by challenges in assembling large, high-quality imaging datasets. Synthetic data offers a scalable, privacy-preserving solution. We evaluated the feasibility of generating realistic lateral cervical spine radiographs using a denoising diffusion probabilistic model (DDPM) trained on 4,963 images from the Cervical Spine X-ray Atlas. Model performance was monitored via training/validation loss and Frechet inception distance, and synthetic image quality was assessed in a blinded "clinical Turing test" with six neuroradiologists and two spine-fellowship trained neurosurgeons. Experts reviewed 50 quartets containing one real and three synthetic images, identifying the real image and rating realism on a 4-point Likert scale. Experts correctly identified the real image in 29% of trials (Fleiss' kappa=0.061). Mean realism scores were comparable between real (3.323) and synthetic images (3.228, 3.258, and 3.320; p=0.383, 0.471, 1.000). Nearest-neighbor analysis found no evidence of memorization. We also provide a dataset of 20,063 synthetic radiographs. These results demonstrate that DDPM-generated cervical spine X-rays are statistically indistinguishable in realism and quality from real clinical images, offering a novel approach to creating large-scale neuroimaging datasets for ML applications in landmarking, segmentation, and classification.
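
The inter-rater agreement statistic reported above, Fleiss' kappa, can be computed from a table of per-item rating counts. The sketch below follows the standard definition; the function name and the toy count table (four quartets, eight raters, four possible choices) are assumptions and are not the study's data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for counts[i][j] = number of raters assigning item i to category j.

    Every item must be rated by the same number of raters. The statistic is
    undefined (division by zero) if all ratings fall into a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must have the same number of ratings")

    total = n_items * n_raters
    n_categories = len(counts[0])
    # Overall proportion of ratings per category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Mean observed pairwise agreement across items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical table: which of 4 images each of 8 raters picked as "real".
toy_counts = [
    [2, 2, 2, 2],
    [3, 1, 2, 2],
    [1, 3, 2, 2],
    [2, 2, 3, 1],
]
kappa = fleiss_kappa(toy_counts)
```

Values near zero indicate agreement no better than chance, which is the interpretation the abstract attaches to its reported kappa.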


[14] 2510.22180

Experimental Demonstration of Multi-Object Tracking in Integrated Sensing and Communication

For a wide range of envisioned integrated sensing and communication (ISAC) use cases, it is necessary to incorporate tracking techniques into cellular communication systems. While numerous multi-object tracking algorithms exist, they have not yet been applied to real-world ISAC, with its challenges such as clutter and non-optimal hardware. In this work, we showcase multi-object tracking based on the probability hypothesis density (PHD) filter in the range and Doppler speed domain. The measurements are taken with a 5G compliant ISAC proof-of-concept in a real factory environment, where the pedestrian-like objects are generated by a radar object emulator. We detail the complete pipeline, from measurement acquisition to evaluation, with a focus on the post-processing of the raw captured data and the tracking itself. Our end-to-end evaluation and comparison to simulations show good multi-object tracking performance with mean absolute error <1.5m and detection rates >91% for realistic but challenging scenarios.


[15] 2510.22183

A Unified Framework for Direction and Diffuseness Estimation Using Tight-Frame Microphone Arrays

This work presents a unified framework for estimating both sound-field direction and diffuseness using practical microphone arrays with different spatial configurations. Building on covariance-based diffuseness models, we formulate a velocity-only covariance approach that enables consistent diffuseness evaluation across heterogeneous array geometries without requiring mode whitening or spherical-harmonic decomposition. Three array types -- an A-format array, a rigid-sphere array, and a newly proposed tight-frame array -- are modeled and compared through both simulations and measurement-based experiments. The results show that the tight-frame configuration achieves near-isotropic directional sampling and reproduces diffuseness characteristics comparable to those of higher-order spherical arrays, while maintaining a compact physical structure. We further examine the accuracy of direction-of-arrival estimation based on acoustic intensity within the same framework. These findings connect theoretical diffuseness analysis with implementable array designs and support the development of robust, broadband methods for spatial-sound-field characterization.


[16] 2510.22237

Bridging the Perceptual-Statistical Gap in Dysarthria Assessment: Why Machine Learning Still Falls Short

Automated dysarthria detection and severity assessment from speech have attracted significant research attention due to their potential clinical impact. Despite rapid progress in acoustic modeling and deep learning, models still fall short of human expert performance. This manuscript provides a comprehensive analysis of the reasons behind this gap, emphasizing a conceptual divergence we term the "perceptual-statistical gap". We detail human expert perceptual processes, survey machine learning representations and methods, review existing literature on feature sets and modeling strategies, and present a theoretical analysis of limits imposed by label noise and inter-rater variability. We further outline practical strategies to narrow the gap: perceptually motivated features, self-supervised pretraining, ASR-informed objectives, multimodal fusion, human-in-the-loop training, and explainability methods. Finally, we propose experimental protocols and evaluation metrics aligned with clinical goals to guide future research toward clinically reliable and interpretable dysarthria assessment tools.


[17] 2510.22239

Synthetic-to-Real Transfer Learning for Chromatin-Sensitive PWS Microscopy

Chromatin-sensitive partial wave spectroscopic (csPWS) microscopy enables label-free detection of nanoscale chromatin packing alterations that occur before visible cellular transformation. However, manual nuclear segmentation limits the population-scale analysis needed for biomarker discovery in early cancer detection. The lack of annotated csPWS imaging data prevents direct use of standard deep learning methods. We present CFU Net, a hierarchical segmentation architecture trained with a three-stage curriculum on synthetic multimodal data. CFU Net achieves near-perfect performance on held-out synthetic test data that represent diverse spectroscopic imaging conditions without manual annotations (Dice 0.9879, IoU 0.9895). Our approach uses physics-based rendering that incorporates empirically supported chromatin packing statistics, Mie scattering models, and modality-specific noise, combined with a curriculum that progresses from adversarial RGB pretraining to spectroscopic fine-tuning and histology validation. CFU Net integrates five architectural elements (ConvNeXt backbone, Feature Pyramid Network, UNet plus plus dense connections, dual attention, and deep supervision) that together improve Dice over a baseline UNet by 8.3 percent. We demonstrate deployment-ready INT8 quantization with 74.9 percent compression and 0.15 second inference, giving a 240 times throughput gain over manual analysis. Applied to more than ten thousand automatically segmented nuclei from synthetic test data, the pipeline extracts chromatin biomarkers that distinguish normal from pre-cancerous tissue with large effect sizes (Cohen's d between 1.31 and 2.98), reaching 94 percent classification accuracy. This work provides a general framework for synthetic-to-real transfer learning in specialized microscopy and open resources for community validation on clinical specimens.
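
The two segmentation metrics quoted above, Dice and IoU, are defined on the overlap between a predicted and a reference binary mask. The sketch below shows the standard definitions on flattened masks; the function names, the convention of returning 1.0 for two empty masks, and the toy masks are assumptions.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equal-length binary masks given as 0/1 sequences."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks empty: treat as perfect agreement (a convention, not a rule).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


# Hypothetical flattened masks for a 2x3 image.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred_mask, true_mask)
```

For real images the masks would first be flattened from 2-D arrays, and the scores are usually averaged over images or nuclei.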


[18] 2510.22258

Binaural Signal Matching with Wearable Arrays for Near-Field Sources and Directional Focus

This paper investigates the performance of Binaural Signal Matching (BSM) methods for near-field sound reproduction using a wearable glasses-mounted microphone array. BSM is a flexible, signal-independent approach for binaural rendering with arbitrary arrays, but its conventional formulation assumes far-field sources. In our previous work, we proposed a near-field extension of BSM (NF-BSM) that incorporates distance-dependent modeling and showed improved performance over far-field BSM using analytic data, though degradation persisted for sources very close to the array. In this study, we extend that analysis by using realistic simulated data of near-field Head-Related Transfer Functions (HRTFs) and Acoustic Transfer Functions (ATFs) of the array, accounting for listener head rotation and evaluating binaural cues such as interaural level and time differences (ILD and ITD). A key contribution is the introduction of a Field of View (FoV) weighting, designed to emphasize perceptually relevant directions and improve robustness under challenging conditions. Results from both simulation and a listening test confirm that NF-BSM outperforms traditional far-field BSM in near-field scenarios, and that the proposed NF-FoV-BSM method achieves the best perceptual and objective quality among all tested methods, particularly at close source distances and under head rotation. These findings highlight the limitations of far-field models for near-field sources and demonstrate that incorporating source distance and directional weighting can significantly improve binaural reproduction performance for wearable spatial audio systems.


[19] 2510.22263

Empowering Multimodal Respiratory Sound Classification with Counterfactual Adversarial Debiasing for Out-of-Distribution Robustness

Multimodal respiratory sound classification offers promise for early pulmonary disease detection by integrating bioacoustic signals with patient metadata. Nevertheless, current approaches remain vulnerable to spurious correlations from attributes such as age, sex, or acquisition device, which hinder their generalization, especially under distribution shifts across clinical sites. To this end, we propose a counterfactual adversarial debiasing framework. First, we employ a causal graph-based counterfactual debiasing strategy to suppress non-causal dependencies from patient metadata. Second, we introduce adversarial debiasing to learn metadata-insensitive representations and reduce metadata-specific biases. Third, we design counterfactual metadata augmentation to mitigate spurious correlations further and strengthen metadata-invariant representations. By doing so, our method consistently outperforms strong baselines in evaluations under both in-distribution and distribution shifts. The code is available at this https URL.


[20] 2510.22297

Angular Estimation Comparison with ISAC PoC

The introduction of Integrated Sensing and Communications (ISAC) in cellular systems is not expected to result in a shift away from the popular choice of cost- and energy-efficient analog or hybrid beamforming structures. However, this comes at the cost of limiting the angular capabilities to a confined space per acquisition. Thus, as a prerequisite for the successful implementation of numerous ISAC use cases, the need arises for an optimal angular estimation of targets and their separation based on a minimal number of angular samples. In this work, different approaches for angular estimation based on a minimal, DFT-based set of angular samples are evaluated. The samples are acquired through sweeping multiple beams of an ISAC proof of concept (PoC) in the industrial scenario of the ARENA2036. The study's findings indicate that interpolation approaches are more effective for generalizing across different types of angular scenarios. While the orthogonal matching pursuit (OMP) approach exhibits the most accurate estimation for a single, strong and clearly discriminable target, the DFT-based interpolation approach demonstrates the best overall estimation performance.


[21] 2510.22321

Fair Cost Allocation in Energy Communities: A DLMP-based Bilevel Optimization with a Shapley Value Approach

Energy communities (ECs) are emerging as a promising decentralized model for managing cooperative distributed energy resources (DERs). As these communities expand and their operations become increasingly integrated into the grid, ensuring fairness in allocating operating costs among participants becomes a challenge. In distribution networks, DER operations at the community level can influence Distribution Locational Marginal Prices (DLMPs), which in turn affect the system's operation cost. This interdependence between local decisions and system-level pricing introduces new challenges for fair and transparent cost allocation. Despite growing interest in fairness-aware methods, most methods do not account for the impact of DLMPs. To fill this gap, we propose a bilevel optimization model in which a Community Energy Aggregator (CEA) schedules DERs across multiple ECs while a Distribution System Operator (DSO) determines DLMPs through network-constrained dispatch. Leveraging the Karush-Kuhn-Tucker (KKT) conditions and strong duality, the bilevel model is reformulated into a tractable single-level problem. We achieve fairness in the cost allocation by applying the Shapley value to quantify each community's marginal contribution to system-wide cost savings. The effectiveness of the proposed method is validated through simulations on several benchmark distribution systems.
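
The Shapley-value allocation mentioned above can be illustrated with an exact computation over all orderings of a small set of communities. In the sketch below the cost savings of each coalition are hard-coded hypothetical numbers; in the paper's setting they would presumably come from solving the dispatch problem for each coalition, which is not reproduced here. The names `EC1` to `EC3` and `SAVINGS` are assumptions.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings.

    `value` maps a frozenset of players to that coalition's worth.
    The cost grows factorially, so this is only practical for a few players.
    """
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            grown = coalition | {player}
            totals[player] += value(grown) - value(coalition)
            coalition = grown
    return {p: total / len(orderings) for p, total in totals.items()}


# Hypothetical system-wide cost savings achieved by each coalition.
SAVINGS = {
    frozenset(): 0.0,
    frozenset({"EC1"}): 10.0,
    frozenset({"EC2"}): 20.0,
    frozenset({"EC3"}): 30.0,
    frozenset({"EC1", "EC2"}): 40.0,
    frozenset({"EC1", "EC3"}): 50.0,
    frozenset({"EC2", "EC3"}): 60.0,
    frozenset({"EC1", "EC2", "EC3"}): 90.0,
}

allocation = shapley_values(["EC1", "EC2", "EC3"], SAVINGS.__getitem__)
```

By construction the allocations sum to the savings of the grand coalition, which is the efficiency property that makes the Shapley value a natural candidate for splitting a shared cost or saving.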


[22] 2510.22324

Model-Free Power System Stability Enhancement with Dissipativity-Based Neural Control

The integration of converter-interfaced generation introduces new transient stability challenges to modern power systems. Classical Lyapunov- and scalable passivity-based approaches typically rely on restrictive assumptions, and finding storage functions for large grids is generally considered intractable. Furthermore, most methods require an accurate grid dynamics model. To address these challenges, we propose a model-free, nonlinear, and dissipativity-based controller which, when applied to grid-connected virtual synchronous generators (VSGs), enhances power system transient stability. Using input-state data, we train neural networks to learn dissipativity-characterizing matrices that yield stabilizing controllers. Furthermore, we incorporate cost function shaping to improve the performance with respect to the user-specified objectives. Numerical results on a modified, all-VSG Kundur two-area power system validate the effectiveness of the proposed approach.


[23] 2510.22374

Vector-Valued Native Space Embedding for Adaptive State Observation

This paper combines vector-valued reproducing kernel Hilbert space (vRKHS) embedding with robust adaptive observation, yielding an algorithm that is both non-parametric and robust. The main contribution of this paper lies in the ability of the proposed system to estimate the state of a plant model whose matched uncertainties are elements of an infinite-dimensional native space. The plant model considered in this paper also suffers from unmatched uncertainties. Finally, the measured output is affected by disturbances as well. Upper bounds on the state observation error are provided in an analytical form. The proposed theoretical results are applied to the problem of estimating the state of a rigid body.


[24] 2510.22379

TraceTrans: Translation and Spatial Tracing for Surgical Prediction

Image-to-image translation models have achieved notable success in converting images across visual domains and are increasingly used for medical tasks such as predicting post-operative outcomes and modeling disease progression. However, most existing methods primarily aim to match the target distribution and often neglect spatial correspondences between the source and translated images. This limitation can lead to structural inconsistencies and hallucinations, undermining the reliability and interpretability of the predictions. These challenges are accentuated in clinical applications by the stringent requirement for anatomical accuracy. In this work, we present TraceTrans, a novel deformable image translation model designed for post-operative prediction that generates images aligned with the target distribution while explicitly revealing spatial correspondences with the pre-operative input. The framework employs an encoder for feature extraction and dual decoders for predicting spatial deformations and synthesizing the translated image. The predicted deformation field imposes spatial constraints on the generated output, ensuring anatomical consistency with the source. Extensive experiments on medical cosmetology and brain MRI datasets demonstrate that TraceTrans delivers accurate and interpretable post-operative predictions, highlighting its potential for reliable clinical deployment.


[25] 2510.22406

Data-driven, Wavelet-based Identification and Reduced-order Modeling of Linear Systems with Closely Spaced Modes

This work presents a purely data-driven, wavelet-based framework for modal identification and reduced-order modeling of mechanical systems with assumed linear dynamics characterized by closely spaced modes with classical or non-classical damping distribution. Traditional Fourier-based methods often fail to reliably identify closely spaced modes or accurately capture modal interactions and complexities. To address these limitations, we propose a methodology leveraging the enhanced time-frequency resolution capabilities of the continuous wavelet transform (CWT). By selecting appropriate harmonic regions within the wavelet spectra, we effectively isolate modes, and then invert them back to the temporal domain by applying the inverse CWT (ICWT). In this way we reconstruct the corresponding modal dynamics in the time domain. Using the Hilbert transform, instantaneous phases are extracted for each identified mode, enabling the introduction of a complexified modal matrix which robustly characterizes the system's modal properties, even under challenging perturbations such as noise and uncertainties due to modal interference and unmodeled effects. The identified modal parameters are utilized to reconstruct the frequency response functions (FRFs) of the system and to develop a reduced-order model (ROM) that accurately captures the system's dominant dynamical behavior within a specified frequency range. Validation of the methodology is conducted both with a numerical non-classical damping and an experimental testbed representing a model of an airplane structure. Results demonstrate the effectiveness of the proposed approach in resolving intricate modal interactions and accurately reproducing the dynamic response of complex structural systems.


[26] 2510.22417

Genetic Optimization of a Software-Defined GNSS Receiver

Commercial off-the-shelf (COTS) Global Navigation Satellite System (GNSS) receivers face significant limitations under high-dynamic conditions, particularly in high-acceleration environments such as those experienced by launch vehicles. These performance degradations, often observed as discontinuities in the navigation solution, arise from the inability of traditional tracking loop bandwidths to cope with rapid variations in synchronization parameters. Software-Defined Radio (SDR) receivers overcome these constraints by enabling flexible reconfiguration of tracking loops; however, manual tuning involves a complex, multidimensional search and seldom ensures optimal performance. This work introduces a genetic algorithm-based optimization framework that autonomously explores the receiver configuration space to determine optimal loop parameters for phase, frequency, and delay tracking. The approach is validated within an SDR environment using realistically simulated GPS L1 signals for three representative dynamic regimes (guided rocket flight, Low Earth Orbit (LEO) satellite, and static receiver), processed with the open-source GNSS-SDR architecture. Results demonstrate that evolutionary optimization enables SDR receivers to maintain robust and accurate Position, Velocity, and Time (PVT) solutions across diverse dynamic conditions. The optimized configurations yielded maximum position and velocity errors of approximately 6 m and 0.08 m/s for the static case, 12 m and 2.5 m/s for the rocket case, and 5 m and 0.2 m/s for the LEO case.


[27] 2510.22429

Resilient Composite Control for Stability Enhancement in EV Integrated DC Microgrids

When electric vehicles (EVs) are integrated into standalone DC microgrids (DCMGs), stability issues arise due to their constant power load (CPL) behavior, which provides negative incremental impedance (NII). In addition, the microgrids suffer from an inherent low-inertia problem. Therefore, this study presents a composite controller incorporating a global integral terminal sliding mode controller with a backstepping controller. A virtual capacitor is employed to mitigate the low-inertia issue and strengthen the DC-bus response. An improved fractional power-based reaching law decreases chattering and accelerates convergence. Exact feedback linearization converts the nonlinear boost converter model into Brunovsky's canonical form, mitigating NII effects and non-minimum phase issues. The entire system stability is verified using Lyapunov control theory. Simulation outcomes confirm superior performance, with 34.4-53.3% reduction in overshoot, 52.9-74.9% in undershoot, and 12-47.4% in settling time compared to the existing controller.


[28] 2510.22472

Data-driven Exponential Framing for Pulsive Temporal Patterns without Repetition or Singularity

Extracting pulsive temporal patterns from a small dataset without relying on their repetition or singularity is important in manufacturing applications but has not attracted sufficient scientific attention. We propose to quantify how long temporal patterns appear without relying on their repetition or singularity, enabling the extraction of such temporal patterns from a small dataset. Inspired by the celebrated time delay embedding and data-driven Hankel matrix analysis, we introduce a linear dynamical system model on the time-delay coordinates behind the data to derive discrete-time bases, each of which has a distinct exponential decay constant. The derived bases are fitted onto subsequences that are extracted with a sliding window in order to quantify how long patterns are dominant in the set of subsequences. We call the quantification method Data-driven Exponential Framing (DEF). A toy model-based experiment shows that DEF can identify multiple patterns with distinct lengths. DEF is also applied to electric current measurement on a punching machine, showing its potential to extract multiple patterns from real-world oscillatory data.
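
The time-delay embedding mentioned above can be illustrated with a minimal sketch. It only builds the Hankel matrix whose rows are the sliding-window subsequences of a series; the function name `hankel_matrix` and the toy series are assumptions made for illustration, and the fitting of exponential bases described in the abstract is not reproduced here.

```python
def hankel_matrix(series, window):
    """Stack sliding-window subsequences of `series` as rows.

    Row i holds series[i : i + window], so entry (i, j) equals series[i + j]
    and the matrix is constant along its anti-diagonals (Hankel structure).
    """
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [list(series[i:i + window]) for i in range(len(series) - window + 1)]


# Toy series (hypothetical data, not from the paper).
rows = hankel_matrix([1, 2, 3, 4, 5], window=3)
for row in rows:
    print(row)
```

Each row is one delay-coordinate vector; a method such as DEF would then analyze how well candidate bases fit these rows.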


[29] 2510.22483

A Scenario-based Stochastic Model of using BESS-based Virtual Transmission Lines in Day-Ahead Unit Commitment

The rapid increase in renewable energy sources (RES) implementation in the power system creates more severe network congestion, which may reduce grid operation efficiency and cause renewable curtailment. Deterministic optimization for the unit commitment shows that battery energy storage system (BESS)-based Virtual Transmission Line (VTL), as an alternative to physical transmission lines, can offer a quick solution for congestion relief, reduced operational costs, and lowered renewable curtailment. This paper aims to evaluate the benefits of VTL when considering Renewable Energy Sources uncertainty. Particularly, this work proposes a scenario-based stochastic security-constrained unit commitment model considering VTL, referred to as SSCUC-VTL. It incorporates the forecast error of RES into the commitment decision for systems with VTL. The performance of applying the VTL strategy is compared to that of adding a new physical transmission line and a standalone BESS. A case study has been conducted on an enhanced IEEE 24-bus test system. The simulation results demonstrate that VTL provides 23% more operational cost reduction than the physical transmission line, and up to 67% more congestion relief than the standalone BESS in a power system with solar and wind generation.


[30] 2510.22496

Functional Uncertainty Classes for Nonparametric Adaptive Control: the Curse of Dimensionality

This paper derives a new class of vector-valued reproducing kernel Hilbert spaces (vRKHS) defined in terms of operator-valued kernels for the representation of functional uncertainty arising in nonparametric adaptive control methods. These are referred to as maneuver or trajectory vRKHS $K_M$ in the paper, and they are introduced to address the curse of dimensionality that can arise for some types of nonparametric adaptive control strategies. The maneuver vRKHSs are derived based on the structure of a compact, $l$-dimensional, smooth Riemannian manifold $M$ that is regularly embedded in the state space $X = \mathbb{R}^n$, where $M$ is assumed to approximately support the ultimate dynamics of the reference system to be tracked.


[31] 2510.22514

Robust Multi-Agent Safety via Tube-Based Tightened Exponential Barrier Functions

This paper presents a constructive framework for synthesizing provably safe controllers for nonlinear multi-agent systems subject to bounded disturbances. The methodology applies to systems representable in Brunovsky canonical form, accommodating arbitrary-order dynamics in multi-dimensional spaces. The central contribution is a method of constraint tightening that formally couples robust error feedback with nominal trajectory planning. The key insight is that the design of an ancillary feedback law, which confines state errors to a robust positively invariant (RPI) tube, simultaneously provides the exact information needed to ensure the safety of the nominal plan. Specifically, the geometry of the resulting RPI tube is leveraged via its support function to derive state-dependent safety margins. These margins are then used to systematically tighten the high relative-degree exponential control barrier function (eCBF) constraints imposed on the nominal planner. This integrated synthesis guarantees that any nominal trajectory satisfying the tightened constraints corresponds to a provably safe trajectory for the true, disturbed system. We demonstrate the practical utility of this formal synthesis method by implementing the planner within a distributed Model Predictive Control (MPC) scheme, which optimizes performance while inheriting the robust safety guarantees.


[32] 2510.22539

Approximate Gradient Coding for Distributed Learning with Heterogeneous Stragglers

In this paper, we propose an optimally structured gradient coding scheme to mitigate the straggler problem in distributed learning. Conventional gradient coding methods often assume homogeneous straggler models or rely on excessive data replication, limiting performance in real-world heterogeneous systems. To address these limitations, we formulate an optimization problem minimizing residual error while ensuring unbiased gradient estimation by explicitly considering individual straggler probabilities. We derive closed-form solutions for optimal encoding and decoding coefficients via Lagrangian duality and convex optimization, and propose data allocation strategies that reduce both redundancy and computation load. We also analyze convergence behavior for $\lambda$-strongly convex and $\mu$-smooth loss functions. Numerical results show that our approach significantly reduces the impact of stragglers and accelerates convergence compared to existing methods.
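
To make the unbiasedness requirement concrete, the sketch below uses plain inverse-probability weighting, which is a standard way to obtain an unbiased estimate when each worker straggles independently with a known probability. It is not the paper's closed-form optimal coding scheme; the scalar "gradients", the function names, and the independence assumption are all illustrative.

```python
from itertools import product


def decode(partial_gradients, returned, straggler_probs):
    """Sum the partial gradients that arrived, each scaled by 1 / (1 - p_i)."""
    return sum(
        g / (1.0 - p)
        for g, ok, p in zip(partial_gradients, returned, straggler_probs)
        if ok
    )


def expected_estimate(partial_gradients, straggler_probs):
    """Exact expectation of decode() over all straggler patterns.

    Assumes workers straggle independently with probabilities below 1.
    """
    total = 0.0
    for pattern in product([True, False], repeat=len(partial_gradients)):
        prob = 1.0
        for ok, p in zip(pattern, straggler_probs):
            prob *= (1.0 - p) if ok else p
        total += prob * decode(partial_gradients, pattern, straggler_probs)
    return total


# Hypothetical scalar partial gradients and per-worker straggler probabilities.
grads = [1.0, 2.0, 3.0]
probs = [0.1, 0.5, 0.2]
expectation = expected_estimate(grads, probs)
print(expectation, sum(grads))
```

The expectation matches the full gradient sum because each term `g_i / (1 - p_i)` is included with probability `1 - p_i`. The variance of this simple estimator grows as straggler probabilities approach 1, which is the kind of residual error an optimized scheme aims to reduce.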


[33] 2510.22547

Low-Light Image Enhancement Using Gamma Learning And Attention-Enabled Encoder-Decoder Networks

Images acquired in low-light environments present significant obstacles for computer vision systems and human perception, especially for applications requiring accurate object recognition and scene analysis. Such images typically manifest multiple quality issues: amplified noise, inadequate scene illumination, contrast reduction, color distortion, and loss of details. While recent deep learning methods have shown promise, developing simple and efficient frameworks that naturally integrate global illumination adjustment with local detail refinement continues to be an important objective. To this end, we introduce a dual-stage deep learning architecture that combines adaptive gamma correction with attention-enhanced refinement to address these fundamental limitations. The first stage uses an Adaptive Gamma Correction Module (AGCM) to learn suitable gamma values for each pixel based on both local and global cues, producing a brightened intermediate output. The second stage applies an encoder-decoder deep network with Convolutional Block Attention Modules (CBAM) to this brightened image, in order to restore finer details. We train the network using a composite loss that includes L1 reconstruction, SSIM, total variation, color constancy, and gamma regularization terms to balance pixel accuracy with visual quality. Experiments on LOL-v1, LOL-v2 real, and LOL-v2 synthetic datasets show our method reaches PSNR of up to 29.96 dB and up to 0.9458 SSIM, outperforming existing approaches. Additional tests on DICM, LIME, MEF, and NPE datasets using NIQE, BRISQUE, and UNIQUE metrics confirm better perceptual quality with fewer artifacts, achieving the best NIQE scores across all datasets. Our GAtED (Gamma learned and Attention-enabled Encoder-Decoder) method effectively handles both global illumination adjustment and local detail enhancement, offering a practical solution for low-light enhancement.
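
As a minimal sketch of what per-pixel gamma correction does, the snippet below raises each normalized intensity to its own exponent. The gamma map here is hand-picked for illustration; in the abstract the values are learned by the AGCM, and the function name `gamma_correct` is an assumption, not the authors' code.

```python
def gamma_correct(image, gamma_map):
    """Apply per-pixel gamma correction to intensities in [0, 1].

    Each output pixel is pixel ** gamma; a gamma below 1 brightens the pixel,
    a gamma of 1 leaves it unchanged.
    """
    return [
        [pixel ** g for pixel, g in zip(img_row, g_row)]
        for img_row, g_row in zip(image, gamma_map)
    ]


# Hypothetical 2x2 dark image and a hand-picked gamma map.
dark = [[0.04, 0.25], [0.09, 0.64]]
gammas = [[0.5, 0.5], [0.5, 1.0]]
bright = gamma_correct(dark, gammas)
print(bright)
```

With gamma 0.5 the three darkest pixels are replaced by their square roots, while the pixel with gamma 1.0 is left as is, which shows how a spatially varying map can brighten dark regions selectively.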


[34] 2510.22551

Structure Aware Image Downscaling

Image downscaling is one of the key operations in recent display technology and visualization tools. By this process, the dimension of an image is reduced, aiming to preserve structural integrity and visual fidelity. In this paper, we propose a new image downscaling method which is built on the core ideas of image filtering and edge detection. In particular, we present a structure-informed downscaling algorithm that maintains fine details through edge-aware processing. The proposed method comprises three steps: (i) edge map computation, (ii) edge-guided interpolation, and (iii) texture enhancement. To faithfully retain the strong structures in an image, we first compute the edge maps by applying an efficient edge detection operator. This is followed by an edge-guided interpolation to preserve fine details after resizing. Finally, we fuse a locally texture-enriched component of the original image with the interpolated one to restore high-frequency information. By integrating edge information with adaptive filtering, our approach effectively minimizes artifacts while retaining crucial image features. To demonstrate the effective downscaling capability of our proposed method, we validate on four datasets: DIV2K, BSD100, Urban100, and RealSR. For downscaling by 4x, our method achieves up to 39.07 dB PSNR on the DIV2K dataset and 38.71 dB on the RealSR dataset. Extensive experimental results confirm that the proposed image downscaling method is capable of achieving superior performance in terms of both visual quality and performance metrics with reference to recent methods. Most importantly, the downscaled images by our method do not suffer from edge blurring and texture loss, unlike many existing ones.


[35] 2510.22557

Large-Model AI for Near Field Beam Prediction: A CNN-GPT2 Framework for 6G XL-MIMO

The emergence of extremely large-scale antenna arrays (ELAA) in millimeter-wave (mmWave) communications, particularly in high-mobility scenarios, highlights the importance of near-field beam prediction. Unlike the conventional far-field assumption, near-field beam prediction requires codebooks that jointly sample the angular and distance domains, which leads to a dramatic increase in pilot overhead. Moreover, unlike the far-field case where the optimal beam evolution is temporally smooth, the optimal near-field beam index exhibits abrupt and nonlinear dynamics due to its joint dependence on user angle and distance, posing significant challenges for temporal modeling. To address these challenges, we propose a novel Convolutional Neural Network-Generative Pre-trained Transformer 2 (CNN-GPT2) based near-field beam prediction framework. Specifically, an uplink pilot transmission strategy is designed to enable efficient channel probing through widebeam analog precoding and frequency-varying digital precoding. The received pilot signals are preprocessed and passed through a CNN-based feature extractor, followed by a GPT-2 model that captures temporal dependencies across multiple frames and directly predicts the near-field beam index in an end-to-end manner.


[36] 2510.22565

Learning Event-guided Exposure-agnostic Video Frame Interpolation via Adaptive Feature Blending

Exposure-agnostic video frame interpolation (VFI) is a challenging task that aims to recover sharp, high-frame-rate videos from blurry, low-frame-rate inputs captured under unknown and dynamic exposure conditions. Event cameras are sensors with high temporal resolution, making them especially advantageous for this task. However, existing event-guided methods struggle to produce satisfactory results on severely low-frame-rate blurry videos due to the lack of temporal constraints. In this paper, we introduce a novel event-guided framework for exposure-agnostic VFI, addressing this limitation through two key components: a Target-adaptive Event Sampling (TES) and a Target-adaptive Importance Mapping (TIM). Specifically, TES samples events around the target timestamp and the unknown exposure time to better align them with the corresponding blurry frames. TIM then generates an importance map that considers the temporal proximity and spatial relevance of consecutive features to the target. Guided by this map, our framework adaptively blends consecutive features, allowing temporally aligned features to serve as the primary cues while spatially relevant ones offer complementary support. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our approach in exposure-agnostic VFI scenarios.


[37] 2510.22588

UltraVoice: Scaling Fine-Grained Style-Controlled Speech Conversations for Spoken Dialogue Models

Spoken dialogue models currently lack the ability for fine-grained speech style control, a critical capability for human-like interaction that is often overlooked in favor of purely functional capabilities like reasoning and question answering. To address this limitation, we introduce UltraVoice, the first large-scale speech dialogue dataset engineered for multiple fine-grained speech style control. Encompassing over 830 hours of speech dialogues, UltraVoice provides instructions across six key speech stylistic dimensions: emotion, speed, volume, accent, language, and composite styles. Fine-tuning leading models such as SLAM-Omni and VocalNet on UltraVoice significantly enhances their fine-grained speech stylistic controllability without degrading core conversational abilities. Specifically, our fine-tuned models achieve improvements of 29.12-42.33% in Mean Opinion Score (MOS) and 14.61-40.09 percentage points in Instruction Following Rate (IFR) on multi-dimensional control tasks designed in the UltraVoice. Moreover, on the URO-Bench benchmark, our fine-tuned models demonstrate substantial gains in core understanding, reasoning, and conversational abilities, with average improvements of +10.84% on the Basic setting and +7.87% on the Pro setting. Furthermore, the dataset's utility extends to training controllable Text-to-Speech (TTS) models, underscoring its high quality and broad applicability for expressive speech synthesis. The complete dataset and model checkpoints are available at: this https URL.


[38] 2510.22603

Mitigating Attention Sinks and Massive Activations in Audio-Visual Speech Recognition with LLMs

Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
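
The abstract does not give the exact form of the decorrelation loss, so the following is only a sketch of one plausible instance: the mean squared cosine similarity between the BOS hidden state and every other token's hidden state. The function names and the toy hidden states are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def decorrelation_loss(hidden_states):
    """Mean squared cosine similarity between token 0 (BOS) and the rest.

    Assumed form: the paper's loss may weight or aggregate terms differently.
    """
    bos, others = hidden_states[0], hidden_states[1:]
    sims = [cosine_similarity(bos, h) for h in others]
    return sum(s * s for s in sims) / len(sims)


# Toy hidden states: token 1 is aligned with BOS, token 2 is orthogonal to it.
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
loss = decorrelation_loss(states)
print(loss)
```

Minimizing a term like this pushes intermediate tokens away from the BOS direction, which is the mechanism the abstract credits with mitigating intermediate sinks.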


[39] 2510.22621

Parametric Channel Estimation and Design for Active-RIS-Assisted Communications

Reconfigurable Intelligent Surface (RIS) technology has emerged as a key enabler for future wireless communications. However, its potential is constrained by the difficulty of acquiring accurate user-to-RIS channel state information (CSI), due to the cascaded channel structure and the high pilot overhead of non-parametric methods. Unlike a passive RIS, where the reflected signal suffers from multiplicative path loss, an active RIS amplifies the signal, improving its practicality in real deployments. In this letter, we propose a parametric channel estimation method tailored for active RISs. The proposed approach integrates an active RIS model with an adaptive Maximum Likelihood Estimator (MLE) to recover the main channel parameters using a minimal number of pilots. To further enhance performance, an adaptive active RIS configuration strategy is employed, which refines the beam direction based on an initial user location estimate. Moreover, an orthogonal angle-pair codebook is used instead of the conventional Discrete Fourier Transform (DFT) codebook, significantly reducing the codebook size and ensuring reliable operation for both far-field and near-field users. Extensive simulations demonstrate that the proposed method achieves near-optimal performance with very few pilots compared to non-parametric approaches. Its performance is also benchmarked against that of a traditional passive RIS under the same total power budget to ensure fairness. Results show that active RIS yields higher spectral efficiency (SE) by eliminating the multiplicative fading inherent in passive RISs and allocating more resources to data transmission.


[40] 2510.22637

HyBeam: Hybrid Microphone-Beamforming Array-Agnostic Speech Enhancement for Wearables

Speech enhancement is a fundamental challenge in signal processing, particularly when robustness is required across diverse acoustic conditions and microphone setups. Deep learning methods have been successful for speech enhancement, but often assume fixed array geometries, limiting their use in mobile, embedded, and wearable devices. Existing array-agnostic approaches typically rely on either raw microphone signals or beamformer outputs, but both have drawbacks under changing geometries. We introduce HyBeam, a hybrid framework that uses raw microphone signals at low frequencies and beamformer signals at higher frequencies, exploiting their complementary strengths while remaining highly array-agnostic. Simulations across diverse rooms and wearable array configurations demonstrate that HyBeam consistently surpasses microphone-only and beamformer-only baselines in PESQ, STOI, and SI-SDR. A bandwise analysis shows that the hybrid approach leverages beamformer directivity at high frequencies and microphone cues at low frequencies, outperforming either method alone across all bands.


[41] 2510.22646

TVMC: Time-Varying Mesh Compression via Multi-Stage Anchor Mesh Generation

Time-varying meshes, characterized by dynamic connectivity and varying vertex counts, hold significant promise for applications such as augmented reality. However, their practical utilization remains challenging due to the substantial data volume required for high-fidelity representation. While various compression methods attempt to leverage temporal redundancy between consecutive mesh frames, most struggle with topological inconsistency and motion-induced artifacts. To address these issues, we propose Time-Varying Mesh Compression (TVMC), a novel framework built on multi-stage coarse-to-fine anchor mesh generation for inter-frame prediction. Specifically, the anchor mesh is progressively constructed in three stages: initial, coarse, and fine. The initial anchor mesh is obtained through fast topology alignment to exploit temporal coherence. A Kalman filter-based motion estimation module then generates a coarse anchor mesh by accurately compensating inter-frame motions. Subsequently, a Quadric Error Metric-based refinement step optimizes vertex positions to form a fine anchor mesh with improved geometric fidelity. Based on the refined anchor mesh, the inter-frame motions relative to the reference base mesh are encoded, while the residual displacements between the subdivided fine anchor mesh and the input mesh are adaptively quantized and compressed. This hierarchical strategy preserves consistent connectivity and high-quality surface approximation, while achieving an efficient and compact representation of dynamic geometry. Extensive experiments on standard MPEG dynamic mesh sequences demonstrate that TVMC achieves state-of-the-art compression performance. Compared to the latest V-DMC standard, it delivers a significant BD-rate gain of 10.2% ~ 16.9%, while preserving high reconstruction quality. The code is available at this https URL.


[42] 2510.22682

SRP-PHAT-NET: A Reliability-Driven DNN for Reverberant Speaker Localization

Accurate Direction-of-Arrival (DOA) estimation in reverberant environments remains a fundamental challenge for spatial audio applications. While deep learning methods have shown strong performance in such conditions, they typically lack a mechanism to assess the reliability of their predictions - an essential feature for real-world deployment. In this work, we present the SRP-PHAT-NET, a deep neural network framework that leverages SRP-PHAT directional maps as spatial features and introduces a built-in reliability estimation. To enable meaningful reliability scoring, the model is trained using Gaussian-weighted labels centered around the true direction. We systematically analyze the influence of label smoothing on accuracy and reliability, demonstrating that the choice of Gaussian kernel width can be tuned to application-specific requirements. Experimental results show that selectively using high-confidence predictions yields significantly improved localization accuracy, highlighting the practical benefits of integrating reliability into deep learning-based DOA estimation.
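
The Gaussian-weighted labels described above can be sketched as follows, assuming an azimuth-only grid in degrees with wrap-around at 360. The grid spacing, kernel width, and function name are illustrative assumptions; the abstract does not specify them.

```python
import math


def gaussian_labels(true_doa_deg, grid_deg, sigma_deg):
    """Soft labels over a circular azimuth grid, peaking at the true direction."""
    labels = []
    for angle in grid_deg:
        diff = abs(angle - true_doa_deg) % 360.0
        diff = min(diff, 360.0 - diff)  # shortest angular distance on the circle
        labels.append(math.exp(-0.5 * (diff / sigma_deg) ** 2))
    return labels


# Hypothetical 10-degree grid; the true source sits at 350 degrees.
grid = list(range(0, 360, 10))
labels = gaussian_labels(350.0, grid, sigma_deg=10.0)
print(labels[35], labels[0], labels[34])
```

Because the distance wraps around, the grid point at 0 degrees receives the same weight as the one at 340 degrees. A wider `sigma_deg` spreads the target over more directions, which is the kernel-width trade-off the abstract says can be tuned to the application.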


[43] 2510.22731

Enhancing WiFi CSI Fingerprinting: A Deep Auxiliary Learning Approach

Radio frequency (RF) fingerprinting techniques provide a promising supplement to cryptography-based approaches but rely on dedicated equipment to capture in-phase and quadrature (IQ) samples, hindering their wide adoption. Recent advances advocate easily obtainable channel state information (CSI) by commercial WiFi devices for lightweight RF fingerprinting, while falling short in addressing the challenges of coarse granularity of CSI measurements in an open-world setting. In this paper, we propose CSI2Q, a novel CSI fingerprinting system that achieves comparable performance to IQ-based approaches. Instead of extracting fingerprints directly from raw CSI measurements, CSI2Q first transforms frequency-domain CSI measurements into time-domain signals that share the same feature space with IQ samples. Then, we employ a deep auxiliary learning strategy to transfer useful knowledge from an IQ fingerprinting model to the CSI counterpart. Finally, the trained CSI model is combined with an OpenMax function to estimate the likelihood of unknown ones. We evaluate CSI2Q on one synthetic CSI dataset involving 85 devices and two real CSI datasets, including 10 and 25 WiFi routers, respectively. Our system achieves accuracy increases of at least 16% on the synthetic CSI dataset, 20% on the in-lab CSI dataset, and 17% on the in-the-wild CSI dataset.


[44] 2510.22760

Understanding What Is Not Said: Referring Remote Sensing Image Segmentation with Scarce Expressions

Referring Remote Sensing Image Segmentation (RRSIS) aims to segment instances in remote sensing images according to referring expressions. Unlike Referring Image Segmentation on general images, acquiring high-quality referring expressions in the remote sensing domain is particularly challenging due to the prevalence of small, densely distributed objects and complex backgrounds. This paper introduces a new learning paradigm, Weakly Referring Expression Learning (WREL) for RRSIS, which leverages abundant class names as weakly referring expressions together with a small set of accurate ones to enable efficient training under limited annotation conditions. Furthermore, we provide a theoretical analysis showing that mixed-referring training yields a provable upper bound on the performance gap relative to training with fully annotated referring expressions, thereby establishing the validity of this new setting. We also propose LRB-WREL, which integrates a Learnable Reference Bank (LRB) to refine weakly referring expressions through sample-specific prompt embeddings that enrich coarse class-name inputs. Combined with a teacher-student optimization framework using dynamically scheduled EMA updates, LRB-WREL stabilizes training and enhances cross-modal generalization under noisy weakly referring supervision. Extensive experiments on our newly constructed benchmark with varying weakly referring data ratios validate both the theoretical insights and the practical effectiveness of WREL and LRB-WREL, demonstrating that they can approach or even surpass models trained with fully annotated referring expressions.


[45] 2510.22772

Neural-HAR: A Dimension-Gated CNN Accelerator for Real-Time Radar Human Activity Recognition

Radar-based human activity recognition (HAR) is attractive for unobtrusive and privacy-preserving monitoring, yet many CNN/RNN solutions remain too heavy for edge deployment, and even lightweight ViT/SSM variants often exceed practical compute and memory budgets. We introduce Neural-HAR, a dimension-gated CNN accelerator tailored for real-time radar HAR on resource-constrained platforms. At its core is GateCNN, a parameter-efficient Doppler-temporal network that (i) embeds Doppler vectors to emphasize frequency evolution over time and (ii) applies dual-path gated convolutions that modulate Doppler-aware content features with temporal gates, complemented by a residual path for stable training. On the University of Glasgow UoG2020 continuous radar dataset, GateCNN attains 86.4% accuracy with only 2.7k parameters and 0.28M FLOPs per inference, comparable to CNN-BiGRU at a fraction of the complexity. Our FPGA prototype on Xilinx Zynq-7000 Z-7007S reaches 107.5 $\mu$s latency and 15 mW dynamic power using LUT-based ROM and distributed RAM only (zero DSP/BRAM), demonstrating real-time, energy-efficient edge inference. Code and HLS conversion scripts are available at this https URL.


[46] 2510.22790

Ellipsoidal Set-Theoretic Design of Robust Safety Filters for Constrained Linear Systems

This paper presents an ellipsoidal set-theoretic framework for robust safety filter synthesis in constrained linear systems subject to additive bounded disturbances and input constraints. We formulate the safety filter design as a convex linear matrix inequality (LMI) optimization problem that simultaneously computes a robust controlled invariant (RCI) ellipsoidal set and its associated state-feedback control law. The RCI set is characterized as an ellipsoidal set, enabling computational tractability for high-dimensional systems while providing formal safety guarantees. The safety filter employs a smooth mixing strategy between nominal and backup controllers based on distance to the invariant set boundary, facilitating minimal intervention when the system operates safely. The proposed method extends to nonlinear systems by treating nonlinear terms as bounded disturbances with rigorous approximation bounds. Numerical validation on a six-degree-of-freedom quadrotor system demonstrates the filter's effectiveness in maintaining stability under external disturbances and aggressive maneuvers while preserving nominal performance during safe operation. The approach provides a constructive and computationally efficient solution for safety-critical control applications requiring real-time implementation.
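
The smooth mixing idea can be sketched under stated assumptions: the invariant set is taken as the ellipsoid x^T P x <= 1, the blend starts at a hypothetical level `start`, and a smoothstep curve is used as the mixing profile. The abstract does not specify the mixing function or how the backup input is computed, so `safety_filter`, `start`, and the toy matrix `P` are illustrative only.

```python
def quadratic_form(P, x):
    """Return x^T P x for a square matrix P (list of rows) and a vector x."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))


def smoothstep(t):
    """Smooth ramp from 0 to 1, with t clamped to [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    return t * t * (3.0 - 2.0 * t)


def safety_filter(x, u_nominal, u_backup, P, start=0.5):
    """Blend nominal and backup inputs based on the ellipsoid level x^T P x.

    Below `start` the nominal input passes through unchanged; at the boundary
    (level 1) the backup input takes over completely.
    """
    level = quadratic_form(P, x)
    weight = smoothstep((level - start) / (1.0 - start))
    return [(1.0 - weight) * un + weight * ub
            for un, ub in zip(u_nominal, u_backup)]


# Toy 2-state example with a hypothetical ellipsoid matrix.
P = [[0.75, 0.0], [0.0, 1.0]]
print(safety_filter([0.1, 0.0], [2.0], [0.0], P))  # deep inside the set
print(safety_filter([1.0, 0.0], [2.0], [0.0], P))  # level 0.75, halfway blend
print(safety_filter([0.0, 1.0], [2.0], [0.0], P))  # on the boundary
```

In the paper's setting the backup input would come from the state-feedback law computed together with the RCI ellipsoid; here it is simply passed in as an argument.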


[47] 2510.22812

Region-Adaptive Learned Hierarchical Encoding for 3D Gaussian Splatting Data

We introduce Region-Adaptive Learned Hierarchical Encoding (RALHE) for 3D Gaussian Splatting (3DGS) data. While 3DGS has recently become popular for novel view synthesis, the size of trained models limits its deployment in bandwidth-constrained applications such as volumetric media streaming. To address this, we propose a learned hierarchical latent representation that builds upon the principles of "overfitted" learned image compression (e.g., Cool-Chic and C3) to efficiently encode 3DGS attributes. Unlike images, 3DGS data have irregular spatial distributions of Gaussians (geometry) and consist of multiple attributes (signals) defined on the irregular geometry. Our codec is designed to account for these differences between images and 3DGS. Specifically, we leverage the octree structure of the voxelized 3DGS geometry to obtain a hierarchical multi-resolution representation. Our approach overfits latents to each Gaussian attribute under a global rate constraint. These latents are decoded independently through a lightweight decoder network. To estimate the bitrate during training, we employ an autoregressive probability model that leverages octree-derived contexts from the 3D point structure. The multi-resolution latents, decoder, and autoregressive entropy coding networks are jointly optimized for each Gaussian attribute. Experiments demonstrate that the proposed RALHE compression framework achieves a rendering PSNR gain of up to 2 dB at low bitrates (less than 1 MB) compared to the baseline 3DGS compression methods.


[48] 2510.22813

Residual Bias Compensation Filter for Physics-Based SOC Estimation in Lithium Iron Phosphate Batteries

This paper addresses state of charge (SOC) estimation for lithium iron phosphate (LFP) batteries, where the relatively flat open-circuit voltage (OCV-SOC) characteristic reduces observability. A residual bias compensation dual extended Kalman filter (RBC-DEKF) is developed. Unlike conventional bias compensation methods that treat the bias as an augmented state within a single filter, the proposed dual-filter structure decouples residual bias estimation from electrochemical state estimation. One EKF estimates the system states of a control-oriented parameter-grouped single particle model with thermal effects, while the other EKF estimates a residual bias that continuously corrects the voltage observation equation, thereby refining the model-predicted voltage in real time. Unlike bias-augmented single-filter schemes that enlarge the covariance coupling, the decoupled bias estimator refines the voltage observation without perturbing electrochemical state dynamics. Validation is conducted on an LFP cell from a public dataset under three representative operating conditions: US06 at 0 degC, DST at 25 degC, and FUDS at 50 degC. Compared with a conventional EKF using the same model and identical state filter settings, the proposed method reduces the average SOC RMSE from 3.75% to 0.20% and the voltage RMSE between the filtered model voltage and the measured voltage from 32.8 mV to 0.8 mV. The improvement is most evident in the mid-SOC range where the OCV-SOC curve is flat, confirming that residual bias compensation significantly enhances accuracy for model-based SOC estimation of LFP batteries across a wide temperature range.


[49] 2510.22871

Transmission Neural Networks: Approximate Receding Horizon Control for Virus Spread on Networks

Transmission Neural Networks (TransNNs) proposed by Gao and Caines (2022) serve as both virus spread models over networks and neural network models with tuneable activation functions. This paper establishes that TransNNs provide upper bounds on the infection probability generated from the associated Markovian stochastic Susceptible-Infected-Susceptible (SIS) model with 2^n state configurations, where n is the number of nodes in the network, and can be employed as an approximate model for the latter. Based on this approximation, a TransNN-based receding horizon control approach for mitigating virus spread is proposed. We demonstrate that it allows significant computational savings compared to the dynamic programming solution to the Markovian SIS model with 2^n state configurations, and that it provides less conservative control actions than the TransNN-based optimal control. Finally, numerical comparisons among (a) dynamic programming solutions for the Markovian SIS model, (b) TransNN-based optimal control and (c) the proposed TransNN-based receding horizon control are presented.


[50] 2510.22895

RMD: Robust Modal Decomposition with Constrained Bandwidth

Modal decomposition techniques, such as Empirical Mode Decomposition (EMD), Variational Mode Decomposition (VMD), and Singular Spectrum Analysis (SSA), have advanced time-frequency signal analysis since the early 21st century. These methods are generally classified into two categories: numerical optimization-based methods (EMD, VMD) and spectral decomposition methods (SSA) that consider the physical meaning of signals. The former can produce spurious modes due to the lack of physical constraints, while the latter is more sensitive to noise and struggles with nonlinear signals. Despite continuous improvements in these methods, a modal decomposition approach that effectively combines the strengths of both categories remains elusive. This paper thus proposes a Robust Modal Decomposition (RMD) method with constrained bandwidth, which preserves the intrinsic structure of the signal by mapping the time series into its trajectory-GRAM matrix in phase space. Moreover, the method incorporates bandwidth constraints during the decomposition process, enhancing noise resistance. Extensive experiments on synthetic and real-world datasets, including millimeter-wave radar echoes, electrocardiogram (ECG), phonocardiogram (PCG), and bearing fault detection data, demonstrate the method's effectiveness and versatility. All code and dataset samples are available on GitHub: this https URL.
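
As a rough illustration of the phase-space mapping mentioned in this abstract, the sketch below builds an SSA-style (Hankel) trajectory matrix and its Gram matrix. The abstract does not define the "trajectory-GRAM matrix" precisely, so this construction, the window parameter, and the function names are assumptions rather than the authors' method.

```python
from typing import List, Sequence


def trajectory_matrix(x: Sequence[float], window: int) -> List[List[float]]:
    """Hankel embedding: row i holds the series shifted by i samples."""
    k = len(x) - window + 1  # number of lagged vectors
    if window < 1 or k < 1:
        raise ValueError("window must satisfy 1 <= window <= len(x)")
    return [list(x[i:i + k]) for i in range(window)]


def gram_matrix(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Gram matrix of the rows: G[i][j] is the inner product of rows i and j."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in rows] for r1 in rows]


X = trajectory_matrix([1.0, 2.0, 3.0, 4.0], window=2)
G = gram_matrix(X)
```

The Gram matrix is symmetric and positive semidefinite, so its eigendecomposition gives the usual SSA components; how bandwidth constraints enter the decomposition is not specified here.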


[51] 2510.22913

Clinic-Oriented Feasibility of a Sensor-Fused Wearable for Upper-Limb Function

Background: Upper-limb weakness and tremor (4--12 Hz) limit activities of daily living (ADL) and reduce adherence to home rehabilitation. Objective: To assess technical feasibility and clinician-relevant signals of a sensor-fused wearable targeting the triceps brachii and extensor pollicis brevis. Methods: A lightweight node integrates surface EMG (1 kHz), IMU (100--200 Hz), and flex/force sensors with on-device INT8 inference (Tiny 1D-CNN/Transformer) and a safety-bounded assist policy (angle/torque/jerk limits; stall/time-out). Healthy adults (n = 12) performed three ADL-like tasks. Primary outcomes: Tremor Index (TI), range of motion (ROM), repetitions (Reps min$^{-1}$). Secondary: EMG median-frequency slope (fatigue trend), closed-loop latency, session completion, and device-related adverse events. Analyses used subject-level paired medians with BCa 95\% CIs; exact Wilcoxon $p$-values are reported in the Results. Results: Assistance was associated with lower tremor prominence and improved task throughput: TI decreased by $-0.092$ (95\% CI [$-0.102$, $-0.079$]), ROM increased by $+12.65\%$ (95\% CI [$+8.43$, $+13.89$]), and Reps rose by $+2.99$ min$^{-1}$ (95\% CI [$+2.61$, $+3.35$]). Median on-device latency was 8.7 ms at a 100 Hz loop rate; all sessions were completed with no device-related adverse events. Conclusions: Multimodal sensing with low-latency, safety-bounded assistance produced improved movement quality (TI $\downarrow$) and throughput (ROM, Reps $\uparrow$) in a pilot technical-feasibility setting, supporting progression to IRB-approved patient studies. Trial registration: Not applicable (pilot non-clinical).


[52] 2510.22947

Intelligent Multimodal Multi-Sensor Fusion-Based UAV Identification, Localization, and Countermeasures for Safeguarding Low-Altitude Economy

The development of the low-altitude economy has led to a growing prominence of uncrewed aerial vehicle (UAV) safety management issues. Therefore, accurate identification, real-time localization, and effective countermeasures have become core challenges in airspace security assurance. This paper introduces an integrated UAV management and control system based on deep learning, which integrates multimodal multi-sensor fusion perception, precise positioning, and collaborative countermeasures. By incorporating deep learning methods, the system combines radio frequency (RF) spectral feature analysis, radar detection, electro-optical identification, and other methods at the detection level to achieve the identification and classification of UAVs. At the localization level, the system relies on multi-sensor data fusion and the air-space-ground integrated communication network to conduct real-time tracking and prediction of UAV flight status, providing support for early warning and decision-making. At the countermeasure level, it adopts comprehensive measures that integrate ``soft kill'' and ``hard kill'', including technologies such as electromagnetic signal jamming, navigation spoofing, and physical interception, to form a closed-loop management and control process from early warning to final disposal, which significantly enhances the response efficiency and disposal accuracy of low-altitude UAV management.


[53] 2510.22948

PASS-Enhanced MEC: Joint Optimization of Task Offloading and Uplink PASS Beamforming

A pinching-antenna system (PASS)-enhanced mobile edge computing (MEC) architecture is investigated to improve the task offloading efficiency and latency performance in dynamic wireless environments. By leveraging dielectric waveguides and flexibly adjustable pinching antennas, PASS establishes short-distance line-of-sight (LoS) links while effectively mitigating the significant path loss and potential signal blockage, making it a promising solution for high-frequency MEC systems. We formulate a network latency minimization problem to jointly optimize uplink PASS beamforming and task offloading. The resulting problem is modeled as a Markov decision process (MDP) and solved via the deep reinforcement learning (DRL) method. To address the instability introduced by the $\max$ operator in the objective function, we propose a load balancing-aware proximal policy optimization (LBPPO) algorithm. LBPPO incorporates both node-level and waveguide-level load balancing information into the policy design, maintaining computational and transmission delay equilibrium, respectively. Simulation results demonstrate that the proposed PASS-enhanced MEC with adaptive uplink PASS beamforming exhibits stronger convergence capability than fixed-PA baselines and conventional MIMO-assisted MEC, especially in scenarios with a large number of UEs or high transmit power.


[54] 2510.22950

DiffRhythm 2: Efficient and High Fidelity Song Generation via Block Flow Matching

Generating full-length, high-quality songs is challenging, as it requires maintaining long-term coherence both across text and music modalities and within the music modality itself. Existing non-autoregressive (NAR) frameworks, while capable of producing high-quality songs, often struggle with the alignment between lyrics and vocals. Concurrently, catering to diverse musical preferences necessitates reinforcement learning from human feedback (RLHF). However, existing methods often rely on merging multiple models during multi-preference optimization, which results in significant performance degradation. To address these challenges, we introduce DiffRhythm 2, an end-to-end framework designed for high-fidelity, controllable song generation. To tackle the lyric alignment problem, DiffRhythm 2 employs a semi-autoregressive architecture based on block flow matching. This design enables faithful alignment of lyrics to singing vocals without relying on external labels and constraints, all while preserving the high generation quality and efficiency of NAR models. To make this framework computationally tractable for long sequences, we implement a music variational autoencoder (VAE) that achieves a low frame rate of 5 Hz while still enabling high-fidelity audio reconstruction. In addition, to overcome the limitations of multi-preference optimization in RLHF, we propose cross-pair preference optimization. This method effectively mitigates the performance drop typically associated with model merging, allowing for more robust optimization across diverse human preferences. We further enhance musicality and structural coherence by introducing a stochastic block representation alignment loss.


[55] 2510.22961

Adapting Speech Foundation Models with Large Language Models for Unified Speech Recognition

Unified speech recognition aims to perform auditory, visual, and audiovisual speech recognition within a single model framework. While speech foundation models (SFMs) have demonstrated remarkable performance in auditory tasks, their adaptation to multimodal scenarios remains underexplored. This paper presents UASR-LLM, a novel framework that adapts frozen SFMs to unified VSR, ASR, and AVSR tasks by leveraging large language models (LLMs) as text decoders. Our approach introduces visual representations into multiple SFM layers through visual injection modules, enabling multimodal input processing and unified hidden representations. The augmented SFMs connect with decoder-only LLMs via a feed-forward adaptor, where concatenated representations and instruction prompts guide speech transcription. We implement a two-stage training strategy: visual injection pretraining followed by speech recognition finetuning. SFM parameters remain frozen throughout training, with only visual injection modules optimized initially, and LLMs finetuned using LoRA parameters subsequently. Experimental results demonstrate superior performance over state-of-the-art baselines across VSR, ASR, and AVSR tasks under both clean and noisy conditions. Ablation studies confirm generalization across various SFMs and LLMs, validating the proposed training strategy.


[56] 2510.22990

USF-MAE: Ultrasound Self-Supervised Foundation Model with Masked Autoencoding

Ultrasound imaging is one of the most widely used diagnostic modalities, offering real-time, radiation-free assessment across diverse clinical domains. However, interpretation of ultrasound images remains challenging due to high noise levels, operator dependence, and limited field of view, resulting in substantial inter-observer variability. Current Deep Learning approaches are hindered by the scarcity of large labeled datasets and the domain gap between general and sonographic images, which limits the transferability of models pretrained on non-medical data. To address these challenges, we introduce the Ultrasound Self-Supervised Foundation Model with Masked Autoencoding (USF-MAE), the first large-scale self-supervised MAE framework pretrained exclusively on ultrasound data. The model was pre-trained on 370,000 2D and 3D ultrasound images curated from 46 open-source datasets, collectively termed OpenUS-46, spanning over twenty anatomical regions. This curated dataset has been made publicly available to facilitate further research and reproducibility. Using a Vision Transformer encoder-decoder architecture, USF-MAE reconstructs masked image patches, enabling it to learn rich, modality-specific representations directly from unlabeled data. The pretrained encoder was fine-tuned on three public downstream classification benchmarks: BUS-BRA (breast cancer), MMOTU-2D (ovarian tumors), and GIST514-DB (gastrointestinal stromal tumors). Across all tasks, USF-MAE consistently outperformed conventional CNN and ViT baselines, achieving F1-scores of 81.6%, 79.6%, and 82.4%, respectively. Despite not using labels during pretraining, USF-MAE approached the performance of the supervised foundation model UltraSam on breast cancer classification and surpassed it on the other tasks, demonstrating strong cross-anatomical generalization.


[57] 2510.23021

Planning Oriented Integrated Sensing and Communication

Integrated sensing and communication (ISAC) enables simultaneous localization, environment perception, and data exchange for connected autonomous vehicles. However, most existing ISAC designs prioritize sensing accuracy and communication throughput, treating all targets uniformly and overlooking the impact of critical obstacles on motion efficiency. To overcome this limitation, we propose a planning-oriented ISAC (PISAC) framework that reduces the sensing uncertainty of planning-bottleneck obstacles and expands the safe navigable path for the ego-vehicle, thereby bridging the gap between physical-layer optimization and motion-level planning. The core of PISAC lies in deriving a closed-form safety bound that explicitly links ISAC transmit power to sensing uncertainty, based on the Cramér-Rao Bound and occupancy inflation principles. Using this model, we formulate a bilevel power allocation and motion planning (PAMP) problem, where the inner layer optimizes the ISAC beam power distribution and the outer layer computes a collision-free trajectory under uncertainty-aware safety constraints. Comprehensive simulations in high-fidelity urban driving environments demonstrate that PISAC achieves up to 40% higher success rates and over 5% shorter traversal times than existing ISAC-based and communication-oriented benchmarks, validating its effectiveness in enhancing both safety and efficiency.


[58] 2510.23067

NeuroDOB: A Deep Neural Observer-Based Controller for Vehicle Lateral Dynamics

This paper proposes NeuroDOB, a deep neural network based observer controller for vehicle lateral dynamics, which replaces the conventional disturbance observer (DOB) with a deep neural network (DNN) to enhance personalized lateral control. Unlike conventional DOBs that compensate for general disturbances such as road friction variation and crosswind, NeuroDOB explicitly addresses unmodeled vehicle dynamics and driver-specific behaviors by learning the steering compensation signal from driver-in-the-loop simulations using CarSim's embedded controller as a surrogate driver. The proposed architecture integrates NeuroDOB with a linear quadratic regulator (LQR), where the DNN outputs a delta error correction added to the baseline LQR steering input to produce the final control command. Input features to the DNN include lateral position and yaw angle errors, and the LQR control input. Experimental validation using a lateral dynamic bicycle model within CarSim demonstrates that NeuroDOB effectively adapts to individual driving habits, improving lateral control performance beyond what conventional LQR controllers achieve. The results indicate the potential of a deep neural network based observer to enable personalized and adaptive autonomous vehicle control. In cognitive terms, the proposed architecture can be viewed as a dual-system control structure. The baseline LQR corresponds to System 1, a model-based, fast, and analytic reasoning layer ensuring stability. The NeuroDOB acts as System 2, a reflective, data-driven layer that learns compensation from experience and corrects the analytical bias of System 1. Together, they form an integrated decision process analogous to human intuition-reflection interaction, enabling both stability and adaptability in lateral control.


[59] 2510.23125

Context-awareness for Dependable Low-Power IoT

Dependability is the ability to consistently deliver trusted and uninterrupted service in the face of operational uncertainties. Ensuring dependable operation in large-scale, energy-constrained Internet of Things (IoT) deployments is as crucial as it is challenging, and calls for context-aware protocols, where context refers to situational or state information. In this paper, we identify four critical context dimensions for IoT networks, namely energy status, information freshness, task relevance, and physical/medium conditions, and show how each one underpins core dependability attributes. Building on these insights, we propose a two-step protocol design framework that incorporates operation-specific context fields. Through three representative use cases, we demonstrate how context awareness can significantly enhance system dependability while imposing only minimal control-plane overhead.


[60] 2510.23141

Treble10: A high-quality dataset for far-field speech recognition, dereverberation, and enhancement

Accurate far-field speech datasets are critical for tasks such as automatic speech recognition (ASR), dereverberation, speech enhancement, and source separation. However, current datasets are limited by the trade-off between acoustic realism and scalability. Measured corpora provide faithful physics but are expensive, low-coverage, and rarely include paired clean and reverberant data. In contrast, most simulation-based datasets rely on simplified geometrical acoustics, thus failing to reproduce key physical phenomena like diffraction, scattering, and interference that govern sound propagation in complex environments. We introduce Treble10, a large-scale, physically accurate room-acoustic dataset. Treble10 contains over 3000 broadband room impulse responses (RIRs) simulated in 10 fully furnished real-world rooms, using a hybrid simulation paradigm implemented in the Treble SDK that combines a wave-based and geometrical acoustics solver. The dataset provides six complementary subsets, spanning mono, 8th-order Ambisonics, and 6-channel device RIRs, as well as pre-convolved reverberant speech scenes paired with LibriSpeech utterances. All signals are simulated at 32 kHz, accurately modelling low-frequency wave effects and high-frequency reflections. Treble10 bridges the realism gap between measurement and simulation, enabling reproducible, physically grounded evaluation and large-scale data augmentation for far-field speech tasks. The dataset is openly available via the Hugging Face Hub, and is intended as both a benchmark and a template for next-generation simulation-driven audio research.


[61] 2510.23147

HAPS-ISAC for 6G: Architecture, Design Trade-offs, and a Practical Roadmap

To meet the ambitious goals of next-generation 6G networks, including ultra-high data rates and ubiquitous coverage, we propose a novel high-altitude platform station (HAPS)-based integrated sensing and communication (ISAC) architecture. Operating in the stratosphere, the HAPS functions as both a powerful communication hub and an advanced environmental sensor. Combined with a fleet of cooperative uncrewed aerial vehicles (UAVs), this dual-purpose system forms a scalable and intelligent 3D network. Simulation results indicate that this approach significantly boosts network performance, improves sensing accuracy, and ensures a fairer service distribution across users, outperforming conventional UAV-only baselines. We conclude by outlining the prospective applications and a deployment roadmap for this technology for smart cities and other large-scale environments.


[62] 2510.23158

Matching Reverberant Speech Through Learned Acoustic Embeddings and Feedback Delay Networks

Reverberation conveys critical acoustic cues about the environment, supporting spatial awareness and immersion. For auditory augmented reality (AAR) systems, generating perceptually plausible reverberation in real time remains a key challenge, especially when explicit acoustic measurements are unavailable. We address this by formulating blind estimation of artificial reverberation parameters as a reverberant signal matching task, leveraging a learned room-acoustic prior. Furthermore, we propose a feedback delay network (FDN) structure that reproduces both frequency-dependent decay times and the direct-to-reverberation ratio of a target space. Experimental evaluation against a leading automatic FDN tuning method demonstrates improvements in estimated room-acoustic parameters and perceptual plausibility of artificial reverberant speech. These results highlight the potential of our approach for efficient, perceptually consistent reverberation rendering in AAR applications.
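
For readers unfamiliar with feedback delay networks, the following is a minimal broadband sketch, not the structure proposed in this abstract. It assumes a single frequency-independent decay time `t60`, a Householder feedback matrix, and unit input and output gains on every line; the proposed FDN additionally models frequency-dependent decay and the direct-to-reverberation ratio, which would require per-line filters and a direct path that are omitted here. All names are hypothetical.

```python
from typing import List


def decay_gains(delays: List[int], t60: float, fs: int) -> List[float]:
    """Per-line gain so that the response decays by 60 dB after t60 seconds."""
    return [10.0 ** (-3.0 * m / (fs * t60)) for m in delays]


def householder(n: int) -> List[List[float]]:
    """Orthogonal feedback matrix A = I - (2/n) * ones(n, n)."""
    return [[(1.0 if i == j else 0.0) - 2.0 / n for j in range(n)] for i in range(n)]


def fdn_impulse_response(delays: List[int], t60: float, fs: int,
                         length: int) -> List[float]:
    """Impulse response of a small FDN with broadband attenuation."""
    n = len(delays)
    gains = decay_gains(delays, t60, fs)
    A = householder(n)
    buffers = [[0.0] * m for m in delays]  # one circular buffer per delay line
    idx = [0] * n
    out = []
    for t in range(length):
        x = 1.0 if t == 0 else 0.0  # unit impulse input
        # Read the sample written delays[i] steps ago and attenuate it.
        taps = [gains[i] * buffers[i][idx[i]] for i in range(n)]
        out.append(sum(taps))
        # Mix the taps through the feedback matrix and write back.
        for i in range(n):
            feedback = sum(A[i][j] * taps[j] for j in range(n))
            buffers[i][idx[i]] = x + feedback
            idx[i] = (idx[i] + 1) % delays[i]
    return out
```

Because the feedback matrix is orthogonal, the decay envelope is set entirely by the per-line gains, which is why matching a target decay time reduces to choosing those gains (or, in the frequency-dependent case, the corresponding attenuation filters).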


[63] 2510.23186

Approaching Domain Generalization with Embeddings for Robust Discrimination and Recognition of RF Communication Signals

Radio frequency (RF) signal recognition plays a critical role in modern wireless communication and security applications. Deep learning-based approaches have achieved strong performance but typically rely heavily on extensive training data and often fail to generalize to unseen signals. In this paper, we propose a method to learn discriminative embeddings without relying on real-world RF signal recordings by training on signals of synthetic wireless protocols. We validate the approach on a dataset of real RF signals and show that the learned embeddings capture features enabling accurate discrimination of previously unseen real-world signals, highlighting its potential for robust RF signal classification and anomaly detection.


[64] 2510.23188

Embroidery Actuator Utilizing Embroidery Patterns to Generate Diverse Fabric Deformations

This paper presents a novel Embroidery Actuator, a fabric-integrated pneumatic actuator that enables diverse and controllable deformations through embroidery pattern design. Unlike conventional fabric actuators that rely on fiber- or thread-shaped actuators, the proposed actuator is fabricated by directly stitching an inflatable tube onto the fabric using a cord-embroidery technique. The embroidered thread and the fabric jointly form a sleeve that constrains the expansion of the inflatable tube, converting internal pressure into targeted bending or stretching deformations. By varying the embroidery pattern, such as zigzag or cross configurations, different geometric constraints can be realized, allowing for flexible control of deformation direction and magnitude. Analytical deformation models based on the Neo-Hookean model and Lagrange's equations were developed to predict the relationship between pneumatic pressure and bending angle, and were experimentally validated using motion-capture measurements. The results demonstrated that the actuator achieves strong agreement with the analytical deformation model.


[65] 2510.23196

Neural Networks for AC Optimal Power Flow: Improving Worst-Case Guarantees during Training

The AC Optimal Power Flow (AC-OPF) problem is central to power system operation but challenging to solve efficiently due to its nonconvex and nonlinear nature. Neural networks (NNs) offer fast surrogates, yet their black-box behavior raises concerns about constraint violations that can compromise safety. We propose a verification-informed NN framework that incorporates worst-case constraint violations directly into training, producing models that are both accurate and provably safer. Through post-hoc verification, we achieve substantial reductions in worst-case violations and, for the first time, verify all operational constraints of large-scale AC-OPF proxies. Practical feasibility is further enhanced via restoration and warm-start strategies for infeasible operating points. Experiments on systems ranging from 57 to 793 buses demonstrate scalability, speed, and reliability, bridging the gap between ML acceleration and safe, real-time deployment of AC-OPF solutions - and paving the way toward data-driven optimal control.


[66] 2510.23226

Inertia Partitioning Modular Control Framework for Reconfigurable Multibody Systems

A novel modular control framework for reconfigurable rigid multibody systems is proposed, motivated by the challenges of modular control of systems with closed kinematic chains. In the framework, modularity is defined in the sense of degrees of freedom, and the inertial properties of each body are partitioned with respect to how they are reflected in the kinetic energy of the system through the motion induced by each degree of freedom. This approach inherently handles closed chains in the same manner as tree-like structures, eliminating the need for explicit constraint force calculations or formulations based on differential-algebraic equations. The proposed framework is implemented via simulation on a three-degree-of-freedom series-parallel manipulator, with the results being consistent with the expected stability and tracking performance, and indicating the framework's potential for scalability in trajectory-tracking control of multibody systems.


[67] 2510.23296

Payload trajectory tracking control for aerial transportation systems with cable length online optimization

Cable-suspended aerial transportation systems are employed extensively across various industries. The capability to flexibly adjust the relative position between the multirotor and the payload has spurred growing interest in the system equipped with variable-length cable, promising broader application potential. Compared to systems with fixed-length cables, introducing the variable-length cable adds a new degree of freedom. However, it also results in increased nonlinearity and more complex dynamic coupling among the multirotor, the cable and the payload, posing significant challenges in control design. This paper introduces a backstepping control strategy tailored for aerial transportation systems with variable-length cable, designed to precisely track the payload trajectory while dynamically adjusting cable length. Then, a cable length generator has been developed that achieves online optimization of the cable length while satisfying state constraints, thus balancing the multirotor's motion and cable length changes without the need for manual trajectory planning. The asymptotic stability of the closed-loop system is guaranteed through Lyapunov techniques and the growth restriction condition. Finally, simulation results confirm the efficacy of the proposed method in managing trajectory tracking and cable length adjustments effectively.


[68] 2510.23317

Equivariance2Inverse: A Practical Self-Supervised CT Reconstruction Method Benchmarked on Real, Limited-Angle, and Blurred Data

Deep learning has shown impressive results in reducing noise and artifacts in X-ray computed tomography (CT) reconstruction. Self-supervised CT reconstruction methods are especially appealing for real-world applications because they require no ground truth training examples. However, these methods involve a simplified X-ray physics model during training, which may make inaccurate assumptions, for example, about scintillator blurring, the scanning geometry, or the distribution of the noise. As a result, they can be less robust to real-world imaging circumstances. In this paper, we review the model assumptions of six recent self-supervised CT reconstruction methods. Moreover, we benchmark these methods on the real-world 2DeteCT dataset and on synthetic data with and without scintillator blurring and a limited-angle scanning geometry. The results of our benchmark show that methods that assume that the noise is pixel-wise independent do not perform well on data with scintillator blurring, and that assuming rotation invariance improves results on limited-angle reconstructions. Based on these findings, we combined successful concepts of the Robust Equivariant Imaging and Sparse2Inverse methods in a new self-supervised CT reconstruction method called Equivariance2Inverse.


[69] 2510.23320

LibriConvo: Simulating Conversations from Read Literature for ASR and Diarization

We introduce LibriConvo, a simulated multi-speaker conversational dataset based on speaker-aware conversation simulation (SASC), designed to support training and evaluation of speaker diarization and automatic speech recognition (ASR) systems. Unlike prior resources that mostly rely on semantically disconnected utterances and implausible temporal gaps, LibriConvo ensures semantic coherence and realistic conversational timing. Our pipeline leverages CallHome with external VAD for reliable boundaries, applies compression to reduce unnaturally long silences, and organizes LibriTTS utterances by book to maintain contextual consistency. Acoustic realism is enhanced via a novel room impulse response selection procedure that ranks speaker-microphone configurations by spatial plausibility, balancing realism and diversity. The dataset comprises 240.1 hours across 1,496 dialogues with 830 unique speakers, split in a speaker-disjoint manner for robust evaluation. Baselines show that the sortformer model outperforms the pyannote pipeline in diarization, while a fine-tuned Fast Conformer-CTC XLarge with Serialized Output Training achieves 7.29\% WER for ASR, surpassing zero-shot Whisper-large-v3. LibriConvo provides a valuable resource for advancing multi-speaker speech processing research with realistic conversational dynamics and controlled experimental conditions.


[70] 2510.23355

Uplink SCMA-empowered Uncoordinated Random Access for Future mMTC

In this paper, a novel uncoordinated random access (URA) protocol is presented to address the pressing demand for massive connectivity with low access latency in future massive machine type communication (mMTC) scenarios. The proposed URA scheme integrates the classical slotted ALOHA (S-ALOHA) protocol with the sparse code multiple access (SCMA) technique, and is referred to as SCMA-empowered URA. Specifically, active users randomly choose an SCMA codebook to access the communication network in an arbitrary time slot whenever they want, without scheduling. However, due to the lack of central coordination in the proposed URA scheme, SCMA codebook collisions become inevitable, making decoding challenging and leading to increased access failures. To cope with the decoding issue, an interference-canceling (IC) first decoding strategy is proposed at the access point (AP), which can partially tackle collision problems, contributing to a higher system throughput. Taking the proposed IC-first decoding strategy into account, a closed-form theoretical expression of the throughput is derived. Moreover, to alleviate the throughput degradation under congested user traffic, a user barring mechanism is introduced to manage the traffic load. First, a closed-form expression of the idle codebook probability is developed to help indicate the system state, i.e., congested or not. Then, in addition to the estimated real-time load, the AP adaptively adjusts the access probability and redistributes the actual access load. Finally, simulation results demonstrate that the proposed SCMA-empowered URA scheme achieves a higher maximum throughput than the conventional orthogonal multiple access (OMA) based URA scheme. Moreover, the accuracy of the presented theoretical analysis and the effectiveness of the user barring mechanism are verified.
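The idle codebook probability mentioned in this abstract can be illustrated with a textbook simplification. The sketch below is not the paper's closed-form expression: it assumes that exactly `num_users` active users each pick one of `num_codebooks` codebooks uniformly and independently in a slot, with no barring and no IC-first decoding. All function and parameter names are hypothetical.

```python
import random


def idle_codebook_probability(num_users, num_codebooks):
    """Probability that one given codebook is chosen by nobody.

    Assumption (not from the paper): every user picks a codebook
    uniformly and independently, so each user misses a given codebook
    with probability 1 - 1/num_codebooks.
    """
    return (1.0 - 1.0 / num_codebooks) ** num_users


def expected_singleton_codebooks(num_users, num_codebooks):
    """Expected number of codebooks chosen by exactly one user (collision-free)."""
    return num_users * (1.0 - 1.0 / num_codebooks) ** (num_users - 1)


def simulate_idle_fraction(num_users, num_codebooks, trials, seed=0):
    """Monte Carlo estimate of the average fraction of idle codebooks per slot."""
    rng = random.Random(seed)
    idle = 0
    for _ in range(trials):
        # Set of codebooks picked by at least one user in this slot.
        chosen = {rng.randrange(num_codebooks) for _ in range(num_users)}
        idle += num_codebooks - len(chosen)
    return idle / (trials * num_codebooks)
```

Under these assumptions, a high observed idle fraction suggests light load and a low one suggests congestion, which is the kind of signal an AP could use to adjust an access probability.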


[71] 2510.23356

IoT-Driven Smart Management in Broiler Farming: Simulation of Remote Sensing and Control Systems

Parameter monitoring and control systems are crucial in the industry as they enable automation processes that improve productivity and resource optimization. These improvements also help to manage environmental factors and the complex interactions between multiple inputs and outputs required for production management. This paper proposes an automation system for broiler management based on a simulation scenario that involves sensor networks and embedded systems. The aim is to create a transmission network for monitoring and controlling broiler temperature and feeding using the Internet of Things (IoT), complemented by a dashboard and a cloud-based service database to track improvements in broiler management. We hope this work will serve as a guide for stakeholders and entrepreneurs in the animal production industry, fostering sustainable development through simple and cost-effective automation solutions. The goal is for them to scale and integrate these recommendations into their existing operations, leading to more efficient decision-making at the management level.


[72] 2510.23403

Evaluation of Spherical Wavelet Framework in Comparison with Ambisonics

Recently, the Spherical Wavelet Framework (SWF) was proposed to combine the benefits of Ambisonics and Object-Based Audio (OBA) by utilising highly localised basis functions. SWF can enhance the sweet-spot area and reduce localisation blur while still enabling a sparse representation of the complete sound field, making storage and transmission more efficient. Initial vector analysis and listening tests of SWF have shown promising results; however, these findings are limited to very specific conditions and do not include perceptual metrics. The present study investigates SWF in greater detail, comparing it with Ambisonics. The comparison was carried out using IACC, ITD, and ILD estimations, as well as listening tests with ecologically valid sound sources. Various reproduction layouts (regular polyhedron, t-design, and Lebedev grid), with their corresponding Ambisonics orders and channel counts, were evaluated. Results indicate that SWF is rated significantly more similar to the reference than Ambisonics is, in terms of overall spatial and timbral fidelity; however, it is considerably dependent on the subdivision of the sphere. Moreover, it cannot natively represent a wave arriving from a continuous direction. Possible solutions are proposed.


[73] 2510.23440

Randomized Space-Time Coded Stacked Intelligent Metasurfaces for Massive Multiuser Downlink Connectivity

Stacked intelligent metasurfaces (SIMs) represent a key enabler for next-generation wireless networks, offering beamforming gains while significantly reducing radio-frequency chain requirements. In conventional space-only SIM architectures, the rate of reconfigurability of the SIM is equal to the inverse of the channel coherence time. This paper investigates a novel beamforming strategy for massive downlink connectivity using a randomized space-time (ST) coded SIM. In addition to conventional space-only metasurface layers, the proposed design integrates a ST metasurface layer at the input stage of the SIM that introduces random time variations over each channel coherence time interval. These artificial time variations enable opportunistic user scheduling and exploitation of multiuser diversity under slow channel dynamics. To mitigate the prohibitive overhead associated with full channel state information at the transmitter (CSIT), we propose a partial-CSIT-based beamforming scheme that leverages randomized steering vectors and limited user-side feedback based on signal quality measurements. Numerical results demonstrate that the proposed ST-SIM architecture achieves satisfactory sum-rate performance while significantly reducing CSIT acquisition and feedback overhead, thereby enabling scalable downlink connectivity in dense networks.


[74] 2510.23467

Joint Uplink and Downlink Resource Allocation and Antenna Activation for Pinching Antenna Systems

In this paper, we explore a novel joint uplink and downlink framework utilizing a pinching antenna system (PASS). We consider two waveguides, one dedicated to transmission and one to reception, both connected to a base station (BS). Each type of waveguide consists of several pinching antennas (PAs) in some preconfigured positions. In this framework, we assume the BS can serve downlink and uplink user equipments (UEs) at the same time using the same spectrum resources through the presented PASS. To this end, we formulate a sum rate optimization problem that jointly optimizes the antenna activation factor, the BS transmit power, and the UEs' transmit power, subject to power budget constraints for the BS and the UEs, as well as minimum rate requirements for the UEs. The formulated problem is highly non-convex and difficult to solve directly. Hence, we divide the main problem into two sub-problems: the antenna activation sub-problem and the power allocation sub-problem. Then, we solve the antenna activation problem utilizing a distance and spatial correlation-based algorithm. Meanwhile, the resource allocation problem is solved using a successive convex approximation (SCA)-based algorithm. Numerical results show that our proposed framework can achieve around 60-90% performance gains over its time division duplex (TDD) counterpart, where the uplink and downlink transmissions are served in different orthogonal time slots.


[75] 2510.23491

An Error-Based Safety Buffer for Safe Adaptive Control (Extended Version)

We consider the problem of adaptive control of a class of feedback linearizable plants with matched parametric uncertainties whose states are accessible, subject to state constraints, which often arise due to safety considerations. In this paper, we combine adaptation and control barrier functions into a real-time control architecture that guarantees stability, ensures control performance, and remains safe even with the parametric uncertainties. Two problems are considered, differing in the nature of the parametric uncertainties. In both cases, the control barrier function is assumed to have an arbitrary relative degree. In addition to guaranteeing stability, it is proved that both the control objective and safety objective are met with near-zero conservatism. No excitation conditions are imposed on the command signal. Simulation results demonstrate the non-conservatism of all of the theoretical developments.


[76] 2510.23541

SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

Recent advances in text-to-speech (TTS) synthesis have significantly improved speech expressiveness and naturalness. However, most existing systems are tailored for single-speaker synthesis and fall short in generating coherent multi-speaker conversational speech. This technical report presents SoulX-Podcast, a system designed for podcast-style multi-turn, multi-speaker dialogic speech generation, while also achieving state-of-the-art performance in conventional TTS tasks. To meet the higher naturalness demands of multi-turn spoken dialogue, SoulX-Podcast integrates a range of paralinguistic controls and supports both Mandarin and English, as well as several Chinese dialects, including Sichuanese, Henanese, and Cantonese, enabling more personalized podcast-style speech generation. Experimental results demonstrate that SoulX-Podcast can continuously produce over 90 minutes of conversation with stable speaker timbre and smooth speaker transitions. Moreover, speakers exhibit contextually adaptive prosody, reflecting natural rhythm and intonation changes as dialogues progress. Across multiple evaluation metrics, SoulX-Podcast achieves state-of-the-art performance in both monologue TTS and multi-turn conversational speech synthesis.


[77] 2510.23551

Towards Stochastic (N-1)-Secure Redispatch

The intermittent nature of renewable power availability is one of the major sources of uncertainty in power systems. While markets can guarantee that the demand is covered by the available generation, transmission system operators often have to intervene via economic redispatch to ensure that the physical constraints of the network are satisfied. To account for uncertainty, the underlying optimal power flow (OPF) routines have to be modified. Recently, polynomial chaos expansion (PCE) has been suggested in the literature as a tool for stochastic OPF problems. However, the usage of PCE-based methods in security-constrained OPF for (N-1)-secure operations has not yet been explored. In this paper, we propose a procedure that iteratively solves a PCE-overloaded stochastic OPF problem by including line outage constraints until an (N-1)-secure solution is achieved. We demonstrate the efficacy of our method by comparing it with a Monte-Carlo simulation on a 118-bus example system.


[78] 2510.23559

KongNet: A Multi-headed Deep Learning Model for Detection and Classification of Nuclei in Histopathology Images

Accurate detection and classification of nuclei in histopathology images are critical for diagnostic and research applications. We present KongNet, a multi-headed deep learning architecture featuring a shared encoder and parallel, cell-type-specialised decoders. Through multi-task learning, each decoder jointly predicts nuclei centroids, segmentation masks, and contours, aided by Spatial and Channel Squeeze-and-Excitation (SCSE) attention modules and a composite loss function. We validate KongNet in three Grand Challenges. The proposed model achieved first place on track 1 and second place on track 2 during the MONKEY Challenge. Its lightweight variant (KongNet-Det) secured first place in the 2025 MIDOG Challenge. KongNet pre-trained on the MONKEY dataset and fine-tuned on the PUMA dataset ranked among the top three in the PUMA Challenge without further optimisation. Furthermore, KongNet established state-of-the-art performance on the publicly available PanNuke and CoNIC datasets. Our results demonstrate that the specialised multi-decoder design is highly effective for nuclei detection and classification across diverse tissue and stain types. The pre-trained model weights along with the inference code have been publicly released to support future research.


[79] 2510.23561

Revising Second Order Terms in Deep Animation Video Coding

The First Order Motion Model (FOMM) is a generative model that animates human heads based on very little motion information derived from keypoints. It is a promising solution for video communication because, first, it operates at a very low bitrate and, second, its computational complexity is moderate compared to other learning-based video codecs. However, it has strong limitations by design. Since it generates facial animations by warping source images, it fails to recreate videos with strong head movements. This work concentrates on one specific kind of head movement, namely head rotations. We show that replacing the Jacobian transformations in FOMM by a global rotation helps the system to perform better on items with head rotations while saving 40% to 80% of bitrate on P-frames. Moreover, we apply state-of-the-art normalization techniques to the discriminator to stabilize the adversarial training, which is essential for generating visually appealing videos. We evaluate the performance using the learned metrics LPIPS and DISTS to show the success of our optimizations.


[80] 2510.21715

Beyond IVR Touch-Tones: Customer Intent Routing using LLMs

Widespread frustration with rigid touch-tone Interactive Voice Response (IVR) systems for customer service underscores the need for more direct and intuitive language interaction. While speech technologies are necessary, the key challenge lies in routing intents from user phrasings to IVR menu paths, a task where Large Language Models (LLMs) show strong potential. Progress, however, is limited by data scarcity, as real IVR structures and interactions are often proprietary. We present a novel LLM-based methodology to address this gap. Using three distinct models, we synthesized a realistic 23-node IVR structure, generated 920 user intents (230 base and 690 augmented), and performed the routing task. We evaluate two prompt designs: descriptive hierarchical menus and flattened path representations, across both base and augmented datasets. Results show that flattened paths consistently yield higher accuracy, reaching 89.13% on the base dataset compared to 81.30% with the descriptive format, while augmentation introduces linguistic noise that slightly reduces performance. Confusion matrix analysis further suggests that low-performing routes may reflect not only model limitations but also redundancies in menu design. Overall, our findings demonstrate proof-of-concept that LLMs can enable IVR routing through a smoother, more seamless user experience -- moving customer service one step ahead of touch-tone menus.
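The difference between a hierarchical menu and a flattened path representation can be sketched as follows. This is only an illustration: the menu contents, the nested-dict encoding, and the `" > "` separator are assumptions, not the paper's 23-node structure or its prompt format.

```python
def flatten_paths(menu, prefix=()):
    """Return one ' > '-joined string per leaf of a nested menu dict.

    `menu` maps an option label to a dict of sub-options; an empty dict
    marks a leaf (a final routing destination).
    """
    paths = []
    for label, children in menu.items():
        path = prefix + (label,)
        if children:
            # Internal node: recurse into its sub-menu.
            paths.extend(flatten_paths(children, path))
        else:
            # Leaf node: emit the full root-to-leaf path.
            paths.append(" > ".join(path))
    return paths


# Hypothetical toy IVR menu, for illustration only.
EXAMPLE_MENU = {
    "Billing": {"Pay a bill": {}, "Dispute a charge": {}},
    "Technical support": {"Internet": {}, "Phone": {}},
    "Speak to an agent": {},
}
```

Calling `flatten_paths(EXAMPLE_MENU)` yields a flat list of candidate routes. A prompt could then ask a model to choose one route for a user utterance, rather than walking the menu level by level.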


[81] 2510.21734

Force-Displacement Profiling for Robot-Assisted Deployment of a Left Atrial Appendage Occluder Using FBG-EM Distal Sensing

Atrial fibrillation (AF) increases the risk of thromboembolic events due to impaired function of the left atrial appendage (LAA). Left atrial appendage closure (LAAC) is a minimally invasive intervention designed to reduce stroke risk by sealing the LAA with an expandable occluder device. Current deployment relies on manual catheter control and imaging modalities like fluoroscopy and transesophageal echocardiography, which carry limitations including radiation exposure and limited positioning precision. In this study, we leverage a previously developed force-sensing delivery sheath integrating fiber Bragg gratings (FBGs) at the interface between the catheter and the occluder. Combined with electromagnetic (EM) tracking, this setup enables real-time measurement of interaction forces and catheter tip position during robot-assisted LAAC deployment in an anatomical phantom. We present a novel force-displacement profiling method that characterizes occluder deployment dynamics and identifies key procedural steps without relying on ionizing radiation. The force profiles reveal low-magnitude interaction forces, suggesting minimal mechanical stress on the surrounding anatomy. This approach shows promise in providing clinicians with enhanced intraoperative feedback, improving deployment outcome. Future work will focus on automating deployment steps classification and validating the sensing strategy in dynamic, realistic environments.


[82] 2510.21735

A phase-aware AI car-following model for electric vehicles with adaptive cruise control: Development and validation using real-world data

Internal combustion engine (ICE) vehicles and electric vehicles (EVs) exhibit distinct vehicle dynamics. EVs provide rapid acceleration, with electric motors producing peak power across a wider speed range, and achieve swift deceleration through regenerative braking. While existing microscopic models effectively capture the driving behavior of ICE vehicles, a modeling framework that accurately describes the unique car-following dynamics of EVs is lacking. Developing such a model is essential given the increasing presence of EVs in traffic, yet creating an easy-to-use and accurate analytical model remains challenging. To address these gaps, this study develops and validates a Phase-Aware AI (PAAI) car-following model specifically for EVs. The proposed model enhances traditional physics-based frameworks with an AI component that recognizes and adapts to different driving phases, such as rapid acceleration and regenerative braking. Using real-world trajectory data from vehicles equipped with adaptive cruise control (ACC), we conduct comprehensive simulations to validate the model's performance. The numerical results demonstrate that the PAAI model significantly improves prediction accuracy over traditional car-following models, providing an effective tool for accurately representing EV behavior in traffic simulations.


[83] 2510.21736

Learn2Drive: A neural network-based framework for socially compliant automated vehicle control

This study introduces a novel control framework for adaptive cruise control (ACC) in automated driving, leveraging Long Short-Term Memory (LSTM) networks and physics-informed constraints. As automated vehicles (AVs) adopt advanced features like ACC, transportation systems are becoming increasingly intelligent and efficient. However, existing AV control strategies primarily focus on optimizing the performance of individual vehicles or platoons, often neglecting their interactions with human-driven vehicles (HVs) and the broader impact on traffic flow. This oversight can exacerbate congestion and reduce overall system efficiency. To address this critical research gap, we propose a neural network-based, socially compliant AV control framework that incorporates social value orientation (SVO). This framework enables AVs to account for their influence on HVs and traffic dynamics. By leveraging AVs as mobile traffic regulators, the proposed approach promotes adaptive driving behaviors that reduce congestion, improve traffic efficiency, and lower energy consumption. Within this framework, we define utility functions for both AVs and HVs, which are optimized based on the SVO of each AV to balance its own control objectives with broader traffic flow considerations. Numerical results demonstrate the effectiveness of the proposed method in adapting to varying traffic conditions, thereby enhancing system-wide efficiency. Specifically, when the AV's control mode shifts from prioritizing energy consumption to optimizing traffic flow efficiency, vehicles in the following platoon experience at least a 58.99% increase in individual energy consumption alongside at least a 38.39% improvement in individual average speed, indicating significant enhancements in traffic dynamics.


[84] 2510.21739

Next-Generation LLM for UAV: From Natural Language to Autonomous Flight

With the rapid advancement of Large Language Models (LLMs), their capabilities in various automation domains, particularly Unmanned Aerial Vehicle (UAV) operations, have garnered increasing attention. Current research remains predominantly constrained to small-scale UAV applications, with most studies focusing on isolated components such as path planning for toy drones, while lacking comprehensive investigation of medium- and long-range UAV systems in real-world operational contexts. Larger UAV platforms introduce distinct challenges, including stringent requirements for airport-based take-off and landing procedures, adherence to complex regulatory frameworks, and specialized operational capabilities with elevated mission expectations. This position paper presents the Next-Generation LLM for UAV (NeLV) system -- a comprehensive demonstration and automation roadmap for integrating LLMs into multi-scale UAV operations. The NeLV system processes natural language instructions to orchestrate short-, medium-, and long-range UAV missions through five key technical components: (i) LLM-as-Parser for instruction interpretation, (ii) Route Planner for Points of Interest (POI) determination, (iii) Path Planner for waypoint generation, (iv) Control Platform for executable trajectory implementation, and (v) UAV monitoring. We demonstrate the system's feasibility through three representative use cases spanning different operational scales: multi-UAV patrol, multi-POI delivery, and multi-hop relocation. Beyond the current implementation, we establish a five-level automation taxonomy that charts the evolution from current LLM-as-Parser capabilities (Level 1) to fully autonomous LLM-as-Autopilot systems (Level 5), identifying technical prerequisites and research challenges at each stage.


[85] 2510.21775

Face-MakeUpV2: Facial Consistency Learning for Controllable Text-to-Image Generation

In facial image generation, current text-to-image models often suffer from facial attribute leakage and insufficient physical consistency when responding to local semantic instructions. In this study, we propose Face-MakeUpV2, a facial image generation model that aims to maintain the consistency of face ID and physical characteristics with the reference image. First, we constructed a large-scale dataset, FaceCaptionMask-1M, comprising approximately one million image-text-mask triplets that provide precise spatial supervision for the local semantic instructions. Second, we employed a general text-to-image pretrained model as the backbone and introduced two complementary facial information injection channels: a 3D facial rendering channel to incorporate the physical characteristics of the image and a global facial feature channel. Third, we formulated two optimization objectives for the supervised learning of our model: semantic alignment in the model's embedding space to mitigate the attribute leakage problem and perceptual loss on facial images to preserve ID consistency. Extensive experiments demonstrated that Face-MakeUpV2 achieves the best overall performance in terms of preserving face ID and maintaining physical consistency of the reference images. These results highlight the practical potential of Face-MakeUpV2 for reliable and controllable facial editing in diverse applications.


[86] 2510.21793

2D_3D Feature Fusion via Cross-Modal Latent Synthesis and Attention Guided Restoration for Industrial Anomaly Detection

Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at this https URL


[87] 2510.21797

Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning

Current mainstream approaches to addressing multimodal imbalance primarily focus on architectural modifications and optimization-based strategies, often overlooking a quantitative analysis of the degree of imbalance between modalities. To address this gap, our work introduces a novel method for the quantitative analysis of multimodal imbalance, which in turn informs the design of a sample-level adaptive loss function. We begin by defining the "Modality Gap" as the difference between the Softmax scores of different modalities (e.g., audio and visual) for the ground-truth class prediction. Analysis of the Modality Gap distribution reveals that it can be effectively modeled by a bimodal Gaussian Mixture Model (GMM). These two components are found to correspond respectively to "modality-balanced" and "modality-imbalanced" data samples. Subsequently, we apply Bayes' theorem to compute the posterior probability of each sample belonging to these two distinct distributions. Guided by this quantitative analysis, we design a novel adaptive loss function with three objectives: (1) to minimize the overall Modality Gap; (2) to encourage the imbalanced sample distribution to shift towards the balanced one; and (3) to apply greater penalty weights to imbalanced samples. We employ a two-stage training strategy consisting of a warm-up phase followed by an adaptive training phase. Experimental results demonstrate that our approach achieves state-of-the-art (SOTA) performance on the public CREMA-D and AVE datasets, attaining accuracies of 80.65% and 70.90%, respectively. This validates the effectiveness of our proposed methodology.
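The Bayes step described in this abstract can be sketched for a two-component, one-dimensional GMM. This is a minimal illustration under stated assumptions, not the authors' implementation: the component parameters below are made up, whereas in practice they would be fitted to the observed gap values (for example by EM), and all function names are hypothetical.

```python
import math


def modality_gap(audio_probs, visual_probs, label):
    """Difference between the two modalities' softmax scores for the true class."""
    return audio_probs[label] - visual_probs[label]


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def imbalance_posterior(gap, balanced, imbalanced):
    """Posterior probability that a sample belongs to the imbalanced component.

    Each component is a (weight, mean, std) tuple. Bayes' theorem gives
    P(imbalanced | gap) = w_i * N_i(gap) / (w_b * N_b(gap) + w_i * N_i(gap)).
    """
    w_b, m_b, s_b = balanced
    w_i, m_i, s_i = imbalanced
    p_b = w_b * gaussian_pdf(gap, m_b, s_b)
    p_i = w_i * gaussian_pdf(gap, m_i, s_i)
    return p_i / (p_b + p_i)


# Hypothetical component parameters (weight, mean, std), for illustration only.
BALANCED = (0.5, 0.0, 0.1)
IMBALANCED = (0.5, 0.6, 0.1)
```

A posterior of this kind could serve as a per-sample weight in a loss, so that samples more likely to be imbalanced are penalised more heavily. How the paper actually combines it with its three objectives is not specified here.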


[88] 2510.21872

GuitarFlow: Realistic Electric Guitar Synthesis From Tablatures via Flow Matching and Style Transfer

Music generation in the audio domain using artificial intelligence (AI) has witnessed steady progress in recent years. However, for some instruments, particularly the guitar, controllable instrument synthesis remains limited in expressivity. We introduce GuitarFlow, a model designed specifically for electric guitar synthesis. The generative process is guided using tablatures, a ubiquitous and intuitive guitar-specific symbolic format. The tablature format easily represents guitar-specific playing techniques (e.g. bends, muted strings and legatos), which are more difficult to represent in other common music notation formats such as MIDI. Our model relies on an intermediary step of first rendering the tablature to audio using a simple sample-based virtual instrument, then performing style transfer using Flow Matching in order to transform the virtual instrument audio into more realistic sounding examples. This results in a model that is quick to train and to perform inference, requiring less than 6 hours of training data. We present the results of objective evaluation metrics, together with a listening test, in which we show significant improvement in the realism of the generated guitar audio from tablatures.


[89] 2510.21944

Fixed Horizon Linear Quadratic Covariance Steering in Continuous Time with Hilbert-Schmidt Terminal Cost

We formulate and solve the fixed horizon linear quadratic covariance steering problem in continuous time with a terminal cost measured in Hilbert-Schmidt (i.e., Frobenius) norm error between the desired and the controlled terminal covariances. For this problem, the necessary conditions of optimality become a coupled matrix ODE two-point boundary value problem. To solve this system of equations, we design a matricial recursive algorithm and prove its convergence. The proposed algorithm and its analysis make use of the linear fractional transforms parameterized by the state transition matrix of the associated Hamiltonian matrix. To illustrate the results, we provide two numerical examples: one with a two dimensional and another with a six dimensional state space.


[90] 2510.22010

FlowOpt: Fast Optimization Through Whole Flow Processes for Training-Free Editing

The remarkable success of diffusion and flow-matching models has ignited a surge of works on adapting them at test time for controlled generation tasks. Examples range from image editing to restoration, compression and personalization. However, due to the iterative nature of the sampling process in those models, it is computationally impractical to use gradient-based optimization to directly control the image generated at the end of the process. As a result, existing methods typically resort to manipulating each timestep separately. Here we introduce FlowOpt - a zero-order (gradient-free) optimization framework that treats the entire flow process as a black box, enabling optimization through the whole sampling path without backpropagation through the model. Our method is both highly efficient and allows users to monitor the intermediate optimization results and perform early stopping if desired. We prove a sufficient condition on FlowOpt's step-size, under which convergence to the global optimum is guaranteed. We further show how to empirically estimate this upper bound so as to choose an appropriate step-size. We demonstrate how FlowOpt can be used for image editing, showcasing two options: (i) inversion (determining the initial noise that generates a given image), and (ii) directly steering the edited image to be similar to the source image while conforming to a target text prompt. In both cases, FlowOpt achieves state-of-the-art results while using roughly the same number of neural function evaluations (NFEs) as existing methods. Code and examples are available on the project's webpage.


[91] 2510.22021

K-DAREK: Distance Aware Error for Kurkova Kolmogorov Networks

Neural networks are parametric and powerful tools for function approximation, and the choice of architecture heavily influences their interpretability, efficiency, and generalization. In contrast, Gaussian processes (GPs) are nonparametric probabilistic models that define distributions over functions using a kernel to capture correlations among data points. However, these models become computationally expensive for large-scale problems, as they require inverting a large covariance matrix. Kolmogorov-Arnold networks (KANs), semi-parametric neural architectures, have emerged as a prominent approach for modeling complex functions with structured and efficient representations through spline layers. Kurkova Kolmogorov-Arnold networks (KKANs) extend this idea by reducing the number of spline layers in KAN and replacing them with Chebyshev layers and multi-layer perceptrons, thereby mapping inputs into higher-dimensional spaces before applying spline-based transformations. Compared to KANs, KKANs exhibit more stable convergence during training, making them a strong architecture for estimating operators and system modeling in dynamical systems. By enhancing the KKAN architecture, we develop a novel learning algorithm, distance-aware error for Kurkova-Kolmogorov networks (K-DAREK), for efficient and interpretable function approximation with uncertainty quantification. Our approach establishes robust error bounds that are distance-aware; this means they reflect the proximity of a test point to its nearest training points. Through case studies on a safe control task, we demonstrate that K-DAREK is about four times faster and ten times more computationally efficient than an ensemble of KANs, 8.6 times more scalable than GPs as the data size increases, and 50% safer than our previous work, distance-aware error for Kolmogorov networks (DAREK).


[92] 2510.22022

Control of neural field equations with step-function inputs

Wilson-Cowan and Amari-type models capture nonlinear neural population dynamics, providing a fundamental framework for modeling how sensory and other exogenous inputs shape activity in neural tissue. We study the controllability properties of Amari-type neural fields subject to piecewise/constant-in-time inputs. The model describes the time evolution of the polarization of neural tissue within a spatial continuum, with synaptic interactions represented by a convolution kernel. We study the synthesis of piecewise/constant-in-time inputs to achieve two-point boundary-type control objectives, namely, steering neural activity from an initial state to a prescribed target state. This approach is particularly relevant for predicting the emergence of paradoxical neural representations, such as discordant visual illusions that occur in response to overt sensory stimuli. We first present a control synthesis based on the Banach fixed-point theorem, which yields an iterative construction of a constant-in-time input under minimal regularity assumptions on the kernel and transfer function; however, it exhibits practical limitations, even in the linear case. To overcome these challenges, we then develop a generic synthesis framework based on the flow of neural dynamics drift, enabling explicit piecewise constant and constant-in-time inputs. Extensive numerical results in one and two spatial dimensions confirm the effectiveness of the proposed syntheses and demonstrate their superior performance compared to inputs derived from naive linearization at the initial or target states when these states are not equilibria of the drift dynamics. By providing a mathematically rigorous framework for controlling Amari-type neural fields, this work advances our understanding of nonlinear neural population control with potential applications in computational neuroscience, psychophysics, and neurostimulation.


[93] 2510.22035

Caption-Driven Explainability: Probing CNNs for Bias via CLIP

Robustness has become one of the most critical problems in machine learning (ML). The science of interpreting ML models to understand their behavior and improve their robustness is referred to as explainable artificial intelligence (XAI). One of the state-of-the-art XAI methods for computer vision problems is to generate saliency maps. A saliency map highlights the pixel space of an image that excites the ML model the most. However, this property could be misleading if spurious and salient features are present in overlapping pixel spaces. In this paper, we propose a caption-based XAI method, which integrates a standalone model to be explained into the contrastive language-image pre-training (CLIP) model using a novel network surgery approach. The resulting caption-based XAI model identifies the dominant concept that contributes the most to the model's prediction. This explanation minimizes the risk of the standalone model falling for a covariate shift and contributes significantly towards developing robust ML models.


[94] 2510.22070

MAGIC-Flow: Multiscale Adaptive Conditional Flows for Generation and Interpretable Classification

Generative modeling has emerged as a powerful paradigm for representation learning, but its direct applicability to challenging fields like medical imaging remains limited: mere generation, without task alignment, fails to provide a robust foundation for clinical use. We propose MAGIC-Flow, a conditional multiscale normalizing flow architecture that performs generation and classification within a single modular framework. The model is built as a hierarchy of invertible and differentiable bijections, where the Jacobian determinant factorizes across sub-transformations. We show how this ensures exact likelihood computation and stable optimization, while invertibility enables explicit visualization of sample likelihoods, providing an interpretable lens into the model's reasoning. By conditioning on class labels, MAGIC-Flow supports controllable sample synthesis and principled class-probability estimation, effectively aiding both generative and discriminative objectives. We evaluate MAGIC-Flow against top baselines using metrics for similarity, fidelity, and diversity. Across multiple datasets, it addresses generation and classification under scanner noise, and modality-specific synthesis and identification. Results show MAGIC-Flow creates realistic, diverse samples and improves classification. MAGIC-Flow is an effective strategy for generation and classification in data-limited domains, with direct benefits for privacy-preserving augmentation, robust generalization, and trustworthy medical AI.


[95] 2510.22089

From Time Series to Affine Systems

The paper extends core results of behavioral systems theory from linear to affine time-invariant systems. We characterize the behavior of affine time-invariant systems via kernel, input-output, state-space, and finite-horizon data-driven representations, demonstrating a range of structural parallels with linear time-invariant systems. Building on these representations, we introduce a new persistence of excitation condition tailored to the model class of affine time-invariant systems. The condition yields a new fundamental lemma that parallels the classical result for linear systems while provably reducing data requirements. Our analysis highlights that excitation conditions must be adapted to the model class: overlooking structural differences may lead to unnecessarily conservative data requirements.


[96] 2510.22141

LOC: A General Language-Guided Framework for Open-Set 3D Occupancy Prediction

Vision-Language Models (VLMs) have shown significant progress in open-set challenges. However, the limited availability of 3D datasets hinders their effective application in 3D scene understanding. We propose LOC, a general language-guided framework adaptable to various occupancy networks, supporting both supervised and self-supervised learning paradigms. For self-supervised tasks, we employ a strategy that fuses multi-frame LiDAR points for dynamic/static scenes, using Poisson reconstruction to fill voids, and assigning semantics to voxels via K-Nearest Neighbor (KNN) to obtain comprehensive voxel representations. To mitigate feature over-homogenization caused by direct high-dimensional feature distillation, we introduce Densely Contrastive Learning (DCL). DCL leverages dense voxel semantic information and predefined textual prompts. This efficiently enhances open-set recognition without dense pixel-level supervision, and our framework can also leverage existing ground truth to further improve performance. Our model predicts dense voxel features embedded in the CLIP feature space, integrating textual and image pixel information, and classifies based on text and semantic similarity. Experiments on the nuScenes dataset demonstrate the method's superior performance, achieving high-precision predictions for known classes and distinguishing unknown classes without additional training data.


[97] 2510.22224

Taming Silent Failures: A Framework for Verifiable AI Reliability

The integration of Artificial Intelligence (AI) into safety-critical systems introduces a new reliability paradigm: silent failures, where AI produces confident but incorrect outputs that can be dangerous. This paper introduces the Formal Assurance and Monitoring Environment (FAME), a novel framework that confronts this challenge. FAME synergizes the mathematical rigor of offline formal synthesis with the vigilance of online runtime monitoring to create a verifiable safety net around opaque AI components. We demonstrate its efficacy in an autonomous vehicle perception system, where FAME successfully detected 93.5% of critical safety violations that were otherwise silent. By contextualizing our framework within the ISO 26262 and ISO/PAS 8800 standards, we provide reliability engineers with a practical, certifiable pathway for deploying trustworthy AI. FAME represents a crucial shift from accepting probabilistic performance to enforcing provable safety in next-generation systems.


[98] 2510.22230

Robust MIMO Channel Estimation Using Energy-Based Generative Diffusion Models

Channel estimation for massive multiple-input multiple-output (MIMO) systems is fundamentally constrained by excessive pilot overhead and high estimation latency. To overcome these obstacles, recent studies have leveraged deep generative networks to capture the prior distribution of wireless channels. In this paper, we propose a novel estimation framework that integrates an energy-based generative diffusion model (DM) with the Metropolis-Hastings (MH) principle. By reparameterizing the diffusion process with an incorporated energy function, the framework explicitly estimates the unnormalized log-prior, while MH corrections refine the sampling trajectory, mitigate deviations, and enhance robustness, ultimately enabling accurate posterior sampling for high-fidelity channel estimation. Numerical results reveal that the proposed approach significantly improves estimation accuracy compared with conventional parameterized DMs and other baseline methods, particularly in cases with limited pilot overhead.
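
As a rough illustration of how a Metropolis-Hastings correction uses an energy function, the sketch below runs a random-walk sampler whose acceptance test depends only on energy differences, so the normalizing constant of the density is never needed. This is a generic textbook sampler, not the paper's diffusion-based estimator; the function names, the one-dimensional Gaussian energy, and the parameter values are assumptions for illustration.

```python
import math
import random


def metropolis_hastings(energy, x0, n_steps=20000, proposal_std=1.0, seed=0):
    """Sample from a density proportional to exp(-energy(x)).

    Uses a symmetric Gaussian random-walk proposal, so the acceptance
    probability reduces to min(1, exp(E(x) - E(candidate))).
    """
    rng = random.Random(seed)
    x, e_x = x0, energy(x0)
    samples = []
    for _ in range(n_steps):
        candidate = x + rng.gauss(0.0, proposal_std)
        e_candidate = energy(candidate)
        # Always accept moves to lower energy; otherwise accept at random.
        if e_candidate <= e_x or rng.random() < math.exp(e_x - e_candidate):
            x, e_x = candidate, e_candidate
        samples.append(x)
    return samples


def gaussian_energy(x):
    """Hypothetical energy: unnormalized negative log-density of N(0, 1)."""
    return 0.5 * x * x


if __name__ == "__main__":
    draws = metropolis_hastings(gaussian_energy, 0.0)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print(mean, var)
```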


[99] 2510.22270

Distributed Stochastic Proximal Algorithm on Riemannian Submanifolds for Weakly-convex Functions

This paper aims to investigate the distributed stochastic optimization problems on compact embedded submanifolds (in the Euclidean space) for multi-agent network systems. To address the manifold structure, we propose a distributed Riemannian stochastic proximal algorithm framework by utilizing the retraction and Riemannian consensus protocol, and analyze three specific algorithms: the distributed Riemannian stochastic subgradient, proximal point, and prox-linear algorithms. When the local costs are weakly-convex and the initial points satisfy certain conditions, we show that the iterates generated by this framework converge to a nearly stationary point in expectation while achieving consensus. We further establish the convergence rate of the algorithm framework as $\mathcal{O}(\frac{1+\kappa_g}{\sqrt{k}})$ where $k$ denotes the number of iterations and $\kappa_g$ shows the impact of manifold geometry on the algorithm performance. Finally, numerical experiments are implemented to demonstrate the theoretical results and show the empirical performance.


[100] 2510.22283

Adapting Noise-Driven PUF and AI for Secure WBG ICS: A Proof-of-Concept Study

Wide-bandgap (WBG) technologies offer unprecedented improvements in power system efficiency, size, and performance, but also introduce unique sensor corruption and cybersecurity risks in industrial control systems (ICS), particularly due to high-frequency noise and sophisticated cyber-physical threats. This proof-of-concept (PoC) study demonstrates the adaptation of a noise-driven physically unclonable function (PUF) and machine learning (ML)-assisted anomaly detection framework to the demanding environment of WBG-based ICS sensor pathways. By extracting entropy from unavoidable WBG switching noise (up to 100 kHz) as a PUF source, and simultaneously using this noise as a real-time threat indicator, the proposed system unites hardware-level authentication and anomaly detection. Our approach integrates hybrid ML models with adaptive Bayesian filtering, providing robust and low-latency detection capabilities resilient to both natural electromagnetic interference (EMI) and active adversarial manipulation. Through detailed simulations of WBG modules under benign and attack scenarios (including EMI injection, signal tampering, and node impersonation), we achieve 95% detection accuracy and sub-millisecond processing latency. These results demonstrate the feasibility of physics-driven, dual-use noise exploitation as a scalable ICS defense primitive. Our findings lay the groundwork for next-generation security strategies that leverage inherent device characteristics, bridging hardware and artificial intelligence (AI) for enhanced protection of critical ICS infrastructure.


[101] 2510.22364

Tuned for Creativity? Graph-Theoretical Mapping of Resting-State EEG Reveals Neural Signatures of Creativity

Understanding how creativity is represented in the brain's intrinsic functional architecture remains a central challenge in cognitive neuroscience. While resting-state fMRI studies have revealed large-scale network correlates of creative potential, electroencephalography (EEG) offers a temporally precise and scalable approach to capture the fast oscillatory dynamics that underlie spontaneous neural organization. In this study, we used a data-driven network approach to examine whether resting-state EEG connectivity patterns differentiate individuals according to their creative abilities. Creativity was evaluated in 30 healthy young adults using the Inventory of Creative Activities and Achievements (ICAA), the Divergent Association Task (DAT), the Matchstick Arithmetic Puzzles Task (MAPT), and a self-rating (SR) of creative ability. Graph-theoretical analyses were applied to functional connectivity matrices, and participants were clustered based on graph similarity. Two distinct participant clusters emerged, differing systematically across multiple dimensions of creativity. Cluster 1, characterized by consistently higher performance across multiple creativity variables (ICAA, DAT, MAPT and SR), showed broad alpha-band hypoconnectivity, relatively preserved left frontal connectivity and greater network modularity. Cluster 0, associated with lower creativity scores, exhibited stronger overall connectivity strength, reduced modularity and higher local clustering. These findings suggest that resting-state EEG connectivity patterns can index stable cognitive traits such as creativity. More broadly, they point to an intrinsic neural signature of adaptive brain function marked by efficient yet flexible network organization that may support creative and adaptive cognition.


[102] 2510.22411

Politics, Inequality, and the Robustness of Shared Infrastructure Systems

Our infrastructure systems enable our well-being by allowing us to move, store, and transform materials and information given considerable social and environmental variation. Critically, this ability is shaped by the degree to which society invests in infrastructure, a fundamentally political question in large public systems. There, infrastructure providers are distinguished from users through political processes, such as elections, and there is considerable heterogeneity among users. Previous political economic models have not taken into account (i) dynamic infrastructures, (ii) dynamic user preferences, and (iii) alternatives to rational actor theory. Meanwhile, engineering often neglects politics. We address these gaps with a general dynamic model of shared infrastructure systems that incorporates theories from political economy, social-ecological systems, and political psychology. We use the model to develop propositions on how multiple characteristics of the political process impact the robustness of shared infrastructure systems to capacity shocks and unequal opportunity for private infrastructure investment. Under user fees, inequality decreases robustness, but taxing private infrastructure use can increase robustness if non-elites have equal political influence. Election cycle periods have a nonlinear effect where increasing them increases robustness up to a point but decreases robustness beyond that point. Further, there is a negative relationship between the ideological sensitivity of candidates and robustness. Overall, the biases of voters and candidates (whether they favor tax increases or decreases) mediate these political-economic effects on robustness because biases may or may not match the reality of system needs (whether system recovery requires tax increases).


[103] 2510.22420

A Novel Multi-Timescale Stability-Preserving Hierarchical Reinforcement Learning Controller Framework for Adaptive Control in High-Dimensional Dynamical Systems

Controlling high-dimensional stochastic systems, critical in robotics, autonomous vehicles, and hyperchaotic systems, faces the curse of dimensionality, lacks temporal abstraction, and often fails to ensure stochastic stability. To overcome these limitations, this study introduces the Multi-Timescale Lyapunov-Constrained Hierarchical Reinforcement Learning (MTLHRL) framework. MTLHRL integrates a hierarchical policy within a semi-Markov Decision Process (SMDP), featuring a high-level policy for strategic planning and a low-level policy for reactive control, which effectively manages complex, multi-timescale decision-making and reduces dimensionality overhead. Stability is rigorously enforced using a neural Lyapunov function optimized via Lagrangian relaxation and multi-timescale actor-critic updates, ensuring mean-square boundedness or asymptotic stability in the face of stochastic dynamics. The framework promotes efficient and reliable learning through trust-region constraints and decoupled optimization. Extensive simulations on an 8D hyperchaotic system and a 5-DOF robotic manipulator demonstrate MTLHRL's empirical superiority. It significantly outperforms baseline methods in both stability and performance, recording the lowest error indices (e.g., Integral Absolute Error (IAE): 3.912 in hyperchaotic control and IAE: 1.623 in robotics), achieving faster convergence, and exhibiting superior disturbance rejection. MTLHRL offers a theoretically grounded and practically viable solution for robust control of complex stochastic systems.
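
For readers unfamiliar with the reported metric, the Integral Absolute Error is the time integral of the absolute tracking error, IAE = ∫ |e(t)| dt. The sketch below approximates it from sampled data with the trapezoidal rule; the function name and the sample values are assumptions for illustration, and the paper's own sampling and integration scheme is not stated in the abstract.

```python
def integral_absolute_error(times, errors):
    """Approximate IAE = integral of |e(t)| dt with the trapezoidal rule.

    `times` must be increasing and the same length as `errors`.
    """
    if len(times) != len(errors):
        raise ValueError("times and errors must have the same length")
    total = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        # Average the absolute error over the interval, then scale by dt.
        total += 0.5 * (abs(errors[i]) + abs(errors[i - 1])) * dt
    return total


if __name__ == "__main__":
    # Hypothetical tracking error sampled every 0.5 s.
    t = [0.0, 0.5, 1.0, 1.5, 2.0]
    e = [1.0, -0.5, 0.25, -0.125, 0.0]
    print(integral_absolute_error(t, e))
```

A lower IAE indicates that the controlled output stays closer to its reference over the whole run, which is why it is commonly used to compare controllers.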


[104] 2510.22455

Evaluating Multimodal Large Language Models on Core Music Perception Tasks

Multimodal Large Language Models (LLMs) claim "musical understanding" via evaluations that conflate listening with score reading. We benchmark three SOTA LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, and Qwen2.5-Omni) across three core music skills: Syncopation Scoring, Transposition Detection, and Chord Quality Identification. Moreover, we separate three sources of variability: (i) perceptual limitations (audio vs. MIDI inputs), (ii) exposure to examples (zero- vs. few-shot manipulations), and (iii) reasoning strategies (Standalone, CoT, LogicLM). For the latter we adapt LogicLM, a framework combining LLMs with symbolic solvers to perform structured reasoning, to music. Results reveal a clear perceptual gap: models perform near ceiling on MIDI but show accuracy drops on audio. Reasoning and few-shot prompting offer minimal gains. This is expected for MIDI, where performance reaches saturation, but more surprising for audio, where LogicLM, despite near-perfect MIDI accuracy, remains notably brittle. Among models, Gemini Pro achieves the highest performance across most conditions. Overall, current systems reason well over symbols (MIDI) but do not yet "listen" reliably from audio. Our method and dataset make the perception-reasoning boundary explicit and offer actionable guidance for building robust, audio-first music systems.


[105] 2510.22517

Smart Sensor Placement: A Correlation-Aware Attribution Framework (CAAF) for Real-world Data Modeling

Optimal sensor placement (OSP) is critical for efficient, accurate monitoring, control, and inference in complex real-world systems. We propose a machine-learning-based feature attribution framework to identify OSP for the prediction of quantities of interest. Feature attribution quantifies input contributions to a model's output; however, it struggles with highly correlated input data often encountered in real-world applications. To address this, we propose a Correlation-Aware Attribution Framework (CAAF), which introduces a clustering step before performing feature attribution to reduce redundancy and enhance generalizability. We first illustrate the core principles of the proposed framework through a series of validation cases, then demonstrate its effectiveness in real-world dynamical systems, such as structural health monitoring, airfoil lift prediction, and wall-normal velocity estimation for turbulent channel flow. The results show that the CAAF outperforms alternative approaches that typically struggle due to the presence of nonlinear dynamics, chaotic behavior, and multi-scale interactions, and enables the effective application of feature attribution for identifying OSP in real-world environments.


[106] 2510.22568

SPIRAL: Self-Play Incremental Racing Algorithm for Learning in Multi-Drone Competitions

This paper introduces SPIRAL (Self-Play Incremental Racing Algorithm for Learning), a novel approach for training autonomous drones in multi-agent racing competitions. SPIRAL distinctively employs a self-play mechanism to incrementally cultivate complex racing behaviors within a challenging, dynamic environment. Through this self-play core, drones continuously compete against increasingly proficient versions of themselves, naturally escalating the difficulty of competitive interactions. This progressive learning journey guides agents from mastering fundamental flight control to executing sophisticated cooperative multi-drone racing strategies. Our method is designed for versatility, allowing integration with any state-of-the-art Deep Reinforcement Learning (DRL) algorithms within its self-play framework. Simulations demonstrate the significant advantages of SPIRAL and benchmark the performance of various DRL algorithms operating within it. Consequently, we contribute a versatile, scalable, and self-improving learning framework to the field of autonomous drone racing. SPIRAL's capacity to autonomously generate appropriate and escalating challenges through its self-play dynamic offers a promising direction for developing robust and adaptive racing strategies in multi-agent environments. This research opens new avenues for enhancing the performance and reliability of autonomous racing drones in increasingly complex and competitive scenarios.


[107] 2510.22570

Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing

The coordination of multiple autonomous agents in high-speed, competitive environments represents a significant engineering challenge. This paper presents CRUISE (Curriculum-Based Iterative Self-Play for Scalable Multi-Drone Racing), a reinforcement learning framework designed to solve this challenge in the demanding domain of multi-drone racing. CRUISE overcomes key scalability limitations by synergistically combining a progressive difficulty curriculum with an efficient self-play mechanism to foster robust competitive behaviors. Validated in high-fidelity simulation with realistic quadrotor dynamics, the resulting policies significantly outperform both a standard reinforcement learning baseline and a state-of-the-art game-theoretic planner. CRUISE achieves nearly double the planner's mean racing speed, maintains high success rates, and demonstrates robust scalability as agent density increases. Ablation studies confirm that the curriculum structure is the critical component for this performance leap. By providing a scalable and effective training methodology, CRUISE advances the development of autonomous systems for dynamic, competitive tasks and serves as a blueprint for future real-world deployment.


[108] 2510.22674

Approximate Signed Multiplier with Sign-Focused Compressor for Edge Detection Applications

This paper presents an approximate signed multiplier architecture that incorporates a sign-focused compressor, specifically designed for edge detection applications in machine learning and signal processing. The multiplier incorporates two types of sign-focused compressors: A + B + C + 1 and A + B + C + D + 1. Both exact and approximate compressor designs are utilized, with a focus on efficiently handling the constant value "1" and negative partial products, which frequently appear in the partial product matrices of signed multipliers. To further enhance efficiency, the lower N - 1 columns of the partial product matrix are truncated, followed by an error compensation mechanism. Experimental results show that the proposed 8-bit approximate multiplier achieves a 29.21% reduction in power delay product (PDP) and a 14.39% reduction in power compared to the best existing multipliers. The proposed multiplier is integrated into a custom convolution layer and performs edge detection, demonstrating its practical utility in real-world applications.


[109] 2510.22702

Atlas Urban Index: A VLM-Based Approach for Spatially and Temporally Calibrated Urban Development Monitoring

We introduce the Atlas Urban Index (AUI), a metric for measuring urban development computed using Sentinel-2 \citep{spoto2012sentinel2} satellite imagery. Existing approaches, such as the Normalized Difference Built-up Index (NDBI), often struggle to accurately capture urban development due to factors like atmospheric noise, seasonal variation, and cloud cover. These limitations hinder large-scale monitoring of human development and urbanization. To address these challenges, we propose an approach that leverages Vision-Language Models (VLMs) to provide a development score for regions. Specifically, we collect a time series of Sentinel-2 images for each region. Then, we further process the images within fixed time windows to get an image with minimal cloud cover, which serves as the representative image for that time window. To ensure consistent scoring, we adopt two strategies: (i) providing the VLM with a curated set of reference images representing different levels of urbanization, and (ii) supplying the most recent past image to both anchor temporal consistency and mitigate cloud-related noise in the current image. Together, these components enable AUI to overcome the challenges of traditional urbanization indices and produce more reliable and stable development scores. Our qualitative experiments on Bangalore suggest that AUI outperforms standard indices such as NDBI.
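
For context on the baseline, NDBI is conventionally defined per pixel as (SWIR - NIR) / (SWIR + NIR). The sketch below computes it from two reflectance values. The abstract does not give the band choice; using Sentinel-2 band B11 for SWIR and B8 for NIR is a common convention and is assumed here, as are the function name and the sample reflectances.

```python
def ndbi(swir, nir):
    """Normalized Difference Built-up Index for one pixel.

    `swir` and `nir` are surface reflectances (assumed here to come from
    Sentinel-2 bands B11 and B8). Returns a value in [-1, 1]; higher values
    suggest built-up surfaces.
    """
    denominator = swir + nir
    if denominator == 0:
        # Undefined for a zero-signal pixel; return a neutral value.
        return 0.0
    return (swir - nir) / denominator


if __name__ == "__main__":
    # Hypothetical reflectances: a built-up pixel and a vegetated pixel.
    print(ndbi(0.30, 0.10))
    print(ndbi(0.10, 0.30))
```

Because the index depends directly on the two band values, per-image effects such as haze or seasonal vegetation change shift it, which is the sensitivity the abstract describes.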


[110] 2510.22821

Analytical Swarm Chemistry: Characterization and Analysis of Emergent Swarm Behaviors

Swarm robotics has potential for a wide variety of applications, but real-world deployments remain rare due to the difficulty of predicting emergent behaviors arising from simple local interactions. Traditional engineering approaches design controllers to achieve desired macroscopic outcomes under idealized conditions, while agent-based and artificial life studies explore emergent phenomena in a bottom-up, exploratory manner. In this work, we introduce Analytical Swarm Chemistry, a framework that integrates concepts from engineering, agent-based and artificial life research, and chemistry. This framework combines macrostate definitions with phase diagram analysis to systematically explore how swarm parameters influence emergent behavior. Inspired by concepts from chemistry, the framework treats parameters like thermodynamic variables, enabling visualization of regions in parameter space that give rise to specific behaviors. Applying this framework to agents with minimally viable capabilities, we identify sufficient conditions for behaviors such as milling and diffusion and uncover regions of the parameter space that reliably produce these behaviors. Preliminary validation on real robots demonstrates that these regions correspond to observable behaviors in practice. By providing a principled, interpretable approach, this framework lays the groundwork for predictable and reliable emergent behavior in real-world swarm systems.


[111] 2510.22892

Never Too Rigid to Reach: Adaptive Virtual Model Control with LLM- and Lyapunov-Based Reinforcement Learning

Robotic arms are increasingly deployed in uncertain environments, yet conventional control pipelines often become rigid and brittle when exposed to perturbations or incomplete information. Virtual Model Control (VMC) enables compliant behaviors by embedding virtual forces and mapping them into joint torques, but its reliance on fixed parameters and limited coordination among virtual components constrains adaptability and may undermine stability as task objectives evolve. To address these limitations, we propose Adaptive VMC with Large Language Model (LLM)- and Lyapunov-Based Reinforcement Learning (RL), which preserves the physical interpretability of VMC while supporting stability-guaranteed online adaptation. The LLM provides structured priors and high-level reasoning that enhance coordination among virtual components, improve sample efficiency, and facilitate flexible adjustment to varying task requirements. Complementarily, Lyapunov-based RL enforces theoretical stability constraints, ensuring safe and reliable adaptation under uncertainty. Extensive simulations on a 7-DoF Panda arm demonstrate that our approach effectively balances competing objectives in dynamic tasks, achieving superior performance while highlighting the synergistic benefits of LLM guidance and Lyapunov-constrained adaptation.


[112] 2510.22949

End-to-End Design and Validation of a Low-Cost Stewart Platform with Nonlinear Estimation and Control

This paper presents the complete design, control, and experimental validation of a low-cost Stewart platform prototype developed as an affordable yet capable robotic testbed for research and education. The platform combines off-the-shelf components with 3D-printed and custom-fabricated parts to deliver full six-degree-of-freedom motion using six linear actuators connecting a moving platform to a fixed base. The system software integrates dynamic modeling, data acquisition, and real-time control within a unified framework. A robust trajectory tracking controller based on feedback linearization, augmented with an LQR scheme, compensates for the platform's nonlinear dynamics to achieve precise motion control. In parallel, an Extended Kalman Filter fuses IMU and actuator encoder feedback to provide accurate and reliable state estimation under sensor noise and external disturbances. Unlike prior efforts that emphasize only isolated aspects such as modeling or control, this work delivers a complete hardware-software platform validated through both simulation and experiments on static and dynamic trajectories. Results demonstrate effective trajectory tracking and real-time state estimation, highlighting the platform's potential as a cost-effective and versatile tool for advanced research and educational applications.


[113] 2510.23003

An Intelligent Water-Saving Irrigation System Based on Multi-Sensor Fusion and Visual Servoing Control

This paper introduces an intelligent water-saving irrigation system designed to address critical challenges in precision agriculture, such as inefficient water use and poor terrain adaptability. The system integrates advanced computer vision, robotic control, and real-time stabilization technologies via a multi-sensor fusion approach. A lightweight YOLO model, deployed on an embedded vision processor (K210), enables real-time plant container detection with over 96% accuracy under varying lighting conditions. A simplified hand-eye calibration algorithm, designed for 'handheld camera' robot arm configurations, ensures that the end effector can be precisely positioned, with a success rate exceeding 90%. The active leveling system, driven by the STM32F103ZET6 main control chip and JY901S inertial measurement data, can stabilize the irrigation platform on slopes up to 10 degrees, with a response time of 1.8 seconds. Experimental results across three simulated agricultural environments (standard greenhouse, hilly terrain, complex lighting) demonstrate a 30-50% reduction in water consumption compared to conventional flood irrigation, with water use efficiency exceeding 92% in all test cases.


[114] 2510.23057

Seq-DeepIPC: Sequential Sensing for End-to-End Control in Legged Robot Navigation

We present Seq-DeepIPC, a sequential end-to-end perception-to-control model for legged robot navigation in real-world environments. Seq-DeepIPC advances intelligent sensing for autonomous legged navigation by tightly integrating multi-modal perception (RGB-D + GNSS) with temporal fusion and control. The model jointly predicts semantic segmentation and depth estimation, giving richer spatial features for planning and control. For efficient deployment on edge devices, we use EfficientNet-B0 as the encoder, reducing computation while maintaining accuracy. Heading estimation is simplified by removing the noisy IMU and instead computing the bearing angle directly from consecutive GNSS positions. We collected a larger and more diverse dataset that includes both road and grass terrains, and validated Seq-DeepIPC on a robot dog. Comparative and ablation studies show that sequential inputs improve perception and control in our models, while other baselines do not benefit. Seq-DeepIPC achieves competitive or better results with reasonable model size; although GNSS-only heading is less reliable near tall buildings, it is robust in open areas. Overall, Seq-DeepIPC extends end-to-end navigation beyond wheeled robots to more versatile and temporally aware systems. To support future research, we will release the code in our GitHub repository at this https URL.
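
The GNSS-based heading step mentioned in the abstract can be illustrated with the standard initial great-circle bearing formula between two consecutive fixes. This is a minimal sketch, not the authors' implementation: the function name `gnss_bearing_deg` and the choice of the great-circle formula (rather than, say, a local planar approximation) are assumptions made here for illustration.

```python
import math


def gnss_bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from fix 1 to fix 2, in degrees clockwise from north.

    Hypothetical helper: inputs are latitude/longitude in decimal degrees.
    """
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    # East and north components of the direction of travel on the sphere.
    east = math.sin(dlon) * math.cos(phi2)
    north = (math.cos(phi1) * math.sin(phi2)
             - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    # atan2(east, north) measures the angle clockwise from north.
    return math.degrees(math.atan2(east, north)) % 360.0
```

When two consecutive fixes are nearly identical (a stationary or slow robot), both components approach zero and the bearing is dominated by position noise, so a practical system would likely gate the update on a minimum travelled distance.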


[115] 2510.23060

zkSTAR: A zero knowledge system for time series attack detection enforcing regulatory compliance in critical infrastructure networks

Industrial control systems (ICS) form the operational backbone of critical infrastructure networks (CIN) such as power grids, water supply systems, and gas pipelines. As cyber threats to these systems escalate, regulatory agencies are imposing stricter compliance requirements to ensure system-wide security and reliability. A central challenge, however, is enabling regulators to verify the effectiveness of detection mechanisms without requiring utilities to disclose sensitive operational data. In this paper, we introduce zkSTAR, a cyberattack detection framework that leverages zk-SNARKs to reconcile these requirements and enable provable detection guarantees while preserving data confidentiality. Our approach builds on established residual-based statistical hypothesis testing methods applied to state-space detection models. Specifically, we design a two-pronged zk-SNARK architecture that enforces temporal consistency of the state-space dynamics and statistical consistency of the detection tests, allowing regulators to temporally verify alarm correctness without visibility into utility-level data. We formally analyze the soundness and zero-knowledge properties of our framework and validate its practical feasibility through computational experiments on real-world ICS datasets. As a result, our work demonstrates a scalable, privacy-preserving alternative for regulatory compliance in ICS-driven critical infrastructure networks.
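
The residual-based hypothesis test that the abstract builds on can be sketched for a scalar state-space model. This is an illustrative plaintext version only: it contains no zero-knowledge component, and the function name `residual_alarms`, the scalar model, the fixed observer gain, and the default threshold (6.635, the 99% quantile of a chi-square distribution with one degree of freedom) are assumptions made here, not details from the paper.

```python
def residual_alarms(measurements, a, c, gain, sigma2, threshold=6.635, x0=0.0):
    """Flag measurements whose normalized squared residual exceeds a threshold.

    Assumed scalar model: x[k+1] = a * x[k], y[k] = c * x[k] + noise,
    with residual variance sigma2 under the no-attack hypothesis.
    """
    x_hat = x0
    alarms = []
    for y in measurements:
        residual = y - c * x_hat              # innovation at this step
        statistic = residual * residual / sigma2
        alarms.append(statistic > threshold)  # reject "no attack" if too large
        x_hat = a * x_hat + gain * residual   # predictor-form observer update
    return alarms
```

In this sketch the observer update corresponds to the temporal consistency of the state-space dynamics, and the threshold comparison to the statistical consistency of the detection test, the two properties the abstract says the zk-SNARK architecture enforces.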


[116] 2510.23078

Numerical Spectrum Linking: Identification of Governing PDE via Koopman-Chebyshev Approximation

A numerical framework is proposed for identifying partial differential equations (PDEs) governing dynamical systems directly from their observation data using Chebyshev polynomial approximation. In contrast to data-driven approaches such as dynamic mode decomposition (DMD), which approximate the Koopman operator without a clear connection to differential operators, the proposed method constructs finite-dimensional Koopman matrices by projecting the dynamics onto a Chebyshev basis, thereby capturing both differential and nonlinear terms. This establishes a numerical link between the Koopman and differential operators. Numerical experiments on benchmark dynamical systems confirm the accuracy and efficiency of the approach, underscoring its potential for interpretable operator learning. The framework also lays a foundation for future integration with symbolic regression, enabling the construction of explicit mathematical models directly from data.


[117] 2510.23129

Combining High Level Scheduling and Low Level Control to Manage Fleets of Mobile Robots

The deployment of mobile robots for material handling in industrial environments requires scalable coordination of large fleets in dynamic settings. This paper presents a two-layer framework that combines high-level scheduling with low-level control. Tasks are assigned and scheduled using the compositional algorithm ComSat, which generates time-parameterized routes for each robot. These schedules are then used by a distributed Model Predictive Control (MPC) system in real time to compute local reference trajectories, accounting for static and dynamic obstacles. The approach ensures safe, collision-free operation, and supports rapid rescheduling in response to disruptions such as robot failures or environmental changes. We evaluate the method in simulated 2D environments with varying road capacities and traffic conditions, demonstrating high task completion rates and robust behavior even under congestion. The modular structure of the framework allows for computational tractability and flexibility, making it suitable for deployment in complex, real-world industrial scenarios.


[118] 2510.23148

Adapting Interleaved Encoders with PPO for Language-Guided Reinforcement Learning in BabyAI

Deep reinforcement learning agents often struggle when tasks require understanding both vision and language. Conventional architectures typically isolate perception (for example, CNN-based visual encoders) from decision-making (policy networks). This separation can be inefficient, since the policy's failures do not directly help the perception module learn what is important. To address this, we implement the Perception-Decision Interleaving Transformer (PDiT) architecture introduced by Mao et al. (2023), a model that alternates between perception and decision layers within a single transformer. This interleaving allows feedback from decision-making to refine perceptual features dynamically. In addition, we integrate a contrastive loss inspired by CLIP to align textual mission embeddings with visual scene features. We evaluate the PDiT encoders on the BabyAI GoToLocal environment and find that the approach achieves more stable rewards and stronger alignment compared to a standard PPO baseline. The results suggest that interleaved transformer encoders are a promising direction for developing more integrated autonomous agents.


[119] 2510.23235

Grassmanian Interpolation of Low-Pass Graph Filters: Theory and Applications

Low-pass graph filters are fundamental for signal processing on graphs and other non-Euclidean domains. However, the computation of such filters for parametric graph families can be prohibitively expensive, as computing the corresponding low-frequency subspaces requires the repeated solution of an eigenvalue problem. We suggest a novel algorithm for low-pass graph filter interpolation based on Riemannian interpolation in normal coordinates on the Grassmann manifold. We derive an error bound estimate for the subspace interpolation and suggest two possible applications for induced parametric graph families. First, we argue that the temporal evolution of the node features may be translated to the evolving graph topology via a similarity correction to adjust the homophily degree of the network. Second, we suggest a dot product graph family induced by a given static graph, which allows an improved message-passing scheme for node classification to be inferred, facilitated by the filter interpolation.
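
For orientation, interpolation in normal coordinates on the Grassmann manifold is commonly written with the logarithm and exponential maps below. This is the standard textbook construction, given here as a sketch; the symbols $Y_0$, $Y_i$, $\Delta_i$, and the interpolation weights $\ell_i(p)$ are notation assumed for this note, and the paper's exact algorithm may differ.

```latex
% Orthonormal bases Y_0, Y_i \in \mathbb{R}^{n \times k} of the base and sampled
% low-frequency subspaces, with Y_0^\top Y_i assumed invertible.

% Logarithm map (normal coordinates of span(Y_i) at span(Y_0)):
(I - Y_0 Y_0^\top)\, Y_i\, (Y_0^\top Y_i)^{-1} = U_i \Sigma_i V_i^\top
  \quad \text{(thin SVD)}, \qquad
\Delta_i = U_i \arctan(\Sigma_i)\, V_i^\top .

% Interpolation in the tangent space, with weights \ell_i(p) (e.g. Lagrange):
\Delta(p) = \sum_i \ell_i(p)\, \Delta_i .

% Exponential map back to the manifold:
\Delta(p) = U \Theta V^\top \quad \text{(thin SVD)}, \qquad
Y(p) = Y_0 V \cos(\Theta)\, V^\top + U \sin(\Theta)\, V^\top .
```

Only one SVD of an $n \times k$ matrix is needed per query parameter $p$, which is what makes this cheaper than solving the eigenvalue problem again for every graph in the family.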


[120] 2510.23274

Privacy-Preserving Semantic Communication over Wiretap Channels with Learnable Differential Privacy

While semantic communication (SemCom) improves transmission efficiency by focusing on task-relevant information, it also raises critical privacy concerns. Many existing secure SemCom approaches rely on restrictive or impractical assumptions, such as favorable channel conditions for the legitimate user or prior knowledge of the eavesdropper's model. To address these limitations, this paper proposes a novel secure SemCom framework for image transmission over wiretap channels, leveraging differential privacy (DP) to provide approximate privacy guarantees. Specifically, our approach first extracts disentangled semantic representations from source images using a generative adversarial network (GAN) inversion method, and then selectively perturbs private semantic representations with approximate DP noise. Distinct from conventional DP-based protection methods, we introduce DP noise with a learnable pattern, instead of traditional white Gaussian or Laplace noise, achieved through adversarial training of neural networks (NNs). This design mitigates the inherent non-invertibility of DP while effectively protecting private information. Moreover, it enables explicitly controllable security levels by adjusting the privacy budget according to specific security requirements, which is not achieved in most existing secure SemCom approaches. Experimental results demonstrate that, compared with the previous DP-based method and direct transmission, the proposed method significantly degrades the reconstruction quality for the eavesdropper, while introducing only slight degradation in task performance. Under comparable security levels, our approach achieves an LPIPS advantage of 0.06-0.29 and an FPPSR advantage of 0.10-0.86 for the legitimate user compared with the previous DP-based method.
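
To make the role of the privacy budget concrete, the sketch below shows the traditional Laplace mechanism that the abstract contrasts against, not the learnable-noise scheme proposed in the paper. The function names `laplace_scale` and `perturb` are hypothetical, and the per-feature sensitivity is assumed to be known.

```python
import random


def laplace_scale(sensitivity, epsilon):
    """Noise scale of the Laplace mechanism: a smaller budget gives larger noise."""
    return sensitivity / epsilon


def perturb(features, sensitivity, epsilon, rng):
    """Add independent Laplace noise to each (private) feature value.

    The difference of two i.i.d. exponential draws with rate 1/scale
    follows a zero-mean Laplace distribution with that scale.
    """
    rate = 1.0 / laplace_scale(sensitivity, epsilon)
    return [v + rng.expovariate(rate) - rng.expovariate(rate) for v in features]
```

For example, `perturb(private_features, 1.0, 0.5, random.Random(0))` uses a noise scale of 2.0. Lowering `epsilon` strengthens privacy at the cost of fidelity, which is the trade-off a controllable privacy budget exposes.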


[121] 2510.23312

Low-Resource Audio Codec (LRAC): 2025 Challenge Description

While recent neural audio codecs deliver superior speech quality at ultralow bitrates over traditional methods, their practical adoption is hindered by obstacles related to low-resource operation and robustness to acoustic distortions. Edge deployment scenarios demand codecs that operate under stringent compute constraints while maintaining low latency and bitrate. The presence of background noise and reverberation further necessitates designs that are resilient to such degradations. The performance of neural codecs under these constraints and their integration with speech enhancement remain largely unaddressed. To catalyze progress in this area, we introduce the 2025 Low-Resource Audio Codec Challenge, which targets the development of neural and hybrid codecs for resource-constrained applications. Participants are supported with a standardized training dataset, two baseline systems, and a comprehensive evaluation framework. The challenge is expected to yield valuable insights applicable to both codec design and related downstream audio tasks.


[122] 2510.23352

Flexibility aggregation via set projection for distribution grids with multiple interconnections

With the increasing number of flexible energy devices in distribution grids, coordination between Transmission System Operators (TSOs) and Distribution System Operators (DSOs) becomes critical for optimal system operation. One form of coordination is to solve the overall system operation problem in a hierarchical way, computing Feasible Operational Regions (FORs) for the interconnection between TSO/DSO. Most methods for computing FORs rely on the assumption of only one interconnection point between TSO and DSOs, which is often violated in practice. In this work, we propose a method for computing FORs in distribution grids with multiple interconnection points to the transmission grid. We test our method in a grid with two interconnecting points and analyze the properties of the resulting high-dimensional FOR from a power systems perspective.


[123] 2510.23416

Quality-controlled registration of urban MLS point clouds reducing drift effects by adaptive fragmentation

This study presents a novel workflow designed to efficiently and accurately register large-scale mobile laser scanning (MLS) point clouds to a target model point cloud in urban street scenarios. This workflow specifically targets the complexities inherent in urban environments and adeptly addresses the challenges of integrating point clouds that vary in density, noise characteristics, and occlusion scenarios, which are common in bustling city centers. Two methodological advancements are introduced. First, the proposed Semi-sphere Check (SSC) preprocessing technique optimally fragments MLS trajectory data by identifying mutually orthogonal planar surfaces. This step reduces the impact of MLS drift on the accuracy of the entire point cloud registration, while ensuring sufficient geometric features within each fragment to avoid local minima. Second, we propose Planar Voxel-based Generalized Iterative Closest Point (PV-GICP), a fine registration method that selectively utilizes planar surfaces within voxel partitions. This pre-process strategy not only improves registration accuracy but also reduces computation time by more than 50% compared to conventional point-to-plane ICP methods. Experiments on real-world datasets from Munich's inner city demonstrate that our workflow achieves sub-0.01 m average registration accuracy while significantly shortening processing times. The results underscore the potential of the proposed methods to advance automated 3D urban modeling and updating, with direct applications in urban planning, infrastructure management, and dynamic city monitoring.


[124] 2510.23503

Bayes-Split-Edge: Bayesian Optimization for Constrained Collaborative Inference in Wireless Edge Systems

Mobile edge devices (e.g., AR/VR headsets) typically need to complete timely inference tasks while operating with limited on-board computing and energy resources. In this paper, we investigate the problem of collaborative inference in wireless edge networks, where energy-constrained edge devices aim to complete inference tasks within given deadlines. These tasks are carried out using neural networks, and the edge device seeks to optimize inference performance under energy and delay constraints. The inference process can be split between the edge device and an edge server, thereby achieving collaborative inference over wireless networks. We formulate an inference utility optimization problem subject to energy and delay constraints, and propose a novel solution called Bayes-Split-Edge, which leverages Bayesian optimization for collaborative split inference over wireless edge networks. Our solution jointly optimizes the transmission power and the neural network split point. The Bayes-Split-Edge framework incorporates a novel hybrid acquisition function that balances inference task utility, sample efficiency, and constraint violation penalties. We evaluate our approach using the VGG19 model on the ImageNet-Mini dataset and ResNet101 on Tiny-ImageNet, together with real-world mMobile wireless channel datasets. Numerical results demonstrate that Bayes-Split-Edge achieves up to 2.4x reduction in evaluation cost compared to standard Bayesian optimization and achieves near-linear convergence. It also outperforms several baselines, including CMA-ES, DIRECT, exhaustive search, and Proximal Policy Optimization (PPO), while matching exhaustive search performance under tight constraints. These results confirm that the proposed framework provides a sample-efficient, constraint-aware solution for wireless split inference in edge computing systems, requiring at most 20 function evaluations.


[125] 2510.23530

Learning Linearity in Audio Consistency Autoencoders via Implicit Regularization

Audio autoencoders learn useful, compressed audio representations, but their non-linear latent spaces prevent intuitive algebraic manipulation such as mixing or scaling. We introduce a simple training methodology to induce linearity in a high-compression Consistency Autoencoder (CAE) by using data augmentation, thereby inducing homogeneity (equivariance to scalar gain) and additivity (the decoder preserves addition) without altering the model's architecture or loss function. When trained with our method, the CAE exhibits linear behavior in both the encoder and decoder while preserving reconstruction fidelity. We test the practical utility of our learned space on music source composition and separation via simple latent arithmetic. This work presents a straightforward technique for constructing structured latent spaces, enabling more intuitive and efficient audio processing.
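
The two properties named in the abstract, homogeneity and additivity, can be checked numerically for any map from signals to latents. The sketch below is a generic test harness, not the paper's training procedure: the function name `linearity_errors` is hypothetical, signals and latents are represented as plain lists of floats, and the map `f` stands in for an encoder or decoder.

```python
def linearity_errors(f, x, y, gain):
    """Return (homogeneity_error, additivity_error) as max absolute deviations.

    Homogeneity: f(gain * x) should equal gain * f(x).
    Additivity:  f(x + y) should equal f(x) + f(y).
    """
    fx, fy = f(x), f(y)
    f_scaled = f([gain * v for v in x])
    f_sum = f([a + b for a, b in zip(x, y)])
    homogeneity = max(abs(s - gain * v) for s, v in zip(f_scaled, fx))
    additivity = max(abs(s - (a + b)) for s, a, b in zip(f_sum, fx, fy))
    return homogeneity, additivity
```

Both errors are zero for an exactly linear map. For a trained autoencoder one would expect small but nonzero values, and tracking them on held-out audio gives a simple measure of how linear the learned space is.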


[126] 2510.23558

ISA-Bench: Benchmarking Instruction Sensitivity for Large Audio Language Models

Large Audio Language Models (LALMs), which couple acoustic perception with large language models (LLMs) to extract and understand diverse information from audio, have attracted intense interest from both academic and industrial communities. However, existing LALMs are highly sensitive to how instructions are phrased, affecting both (i) instruction-following rates and (ii) task performance. Yet, no existing benchmarks offer a systematic and comprehensive evaluation of this sensitivity. We introduce ISA-Bench, a dynamic benchmark evaluating instruction sensitivity for LALMs along three axes: instruction description, output format, and task composition. We assess recent open-source and proprietary LALMs using ISA-Bench, profiling both compliance and accuracy under controlled instruction variations. Experimental results reveal that even state-of-the-art LALMs suffer significant instruction sensitivity, leading to degraded performance on fundamental audio understanding tasks. To mitigate this issue, we fine-tune Qwen2-Audio on a specifically constructed complex instruction-variant dataset, achieving a marked improvement in instruction-following performance. However, this also induces nontrivial catastrophic forgetting: the model loses some previously mastered task capabilities when exposed to new instruction styles. Our benchmark provides a standardized basis for assessing and improving instruction sensitivity in LALMs, underscoring the need for instruction-robust audio understanding in real-world pipelines.


[127] 2510.23586

From Zonal to Nodal Capacity Expansion Planning: Spatial Aggregation Impacts on a Realistic Test-Case

Solving power system capacity expansion planning (CEP) problems at realistic spatial resolutions is computationally challenging. Thus, a common practice is to solve CEP over zonal models with low spatial resolution rather than over full-scale nodal power networks. Due to improvements in solving large-scale stochastic mixed integer programs, these computational limitations are becoming less relevant, and the assumption that zonal models are realistic and useful approximations of nodal CEP is worth revisiting. This work is the first to conduct a systematic computational study on the assumption that spatial aggregation can reasonably be used for ISO- and interconnect-scale CEP. By considering a realistic, large-scale test network based on the state of California with over 8,000 buses and 10,000 transmission lines, we demonstrate that well-designed small spatial aggregations can yield good approximations but that coarser zonal models result in large distortions of investment decisions.


[128] 2206.09340

A Note on Comparator-Overdrive-Delay Conditioning for Current-Mode Control

Comparator-overdrive-delay conditioning is a new control conditioning approach for high-frequency current-mode control. No existing literature rigorously studies the effect of the comparator overdrive delay on the current-mode control. The results in this paper provide insights into the mechanism of comparator-overdrive-delay conditioning.


[129] 2303.11554

Extended Depth-of-Field Lensless Imaging using an Optimized Radial Mask

The freedom of design of coded masks used by mask-based lensless cameras is an advantage these systems have when compared to lens-based ones. We leverage this freedom of design to propose a shape-preserving optimization scheme for a radial-type amplitude coded mask. Because the point spread function of a radial mask is depth-independent, such masks can be used for extending the effective depth of field (DOF) of a lensless imaging system. In this paper we optimized a coded mask for improved frequency response, while retaining its radial characteristics and therefore extended-DOF capabilities. We show that our optimized radial mask achieved better overall frequency response when compared to naive implementations of a radial mask. We also quantitatively and qualitatively demonstrated the extended DOF imaging achieved by our optimized radial mask in simulations by comparing it to different non-radial coded masks. Finally, we built a prototype camera to validate the extended DOF capabilities of our coded mask in real scenarios.


[130] 2309.02265

PESTO: Pitch Estimation with Self-supervised Transposition-equivariant Objective

In this paper, we address the problem of pitch estimation using Self-Supervised Learning (SSL). The SSL paradigm we use is equivariance to pitch transposition, which enables our model to accurately perform pitch estimation on monophonic audio after being trained only on a small unlabeled dataset. We use a lightweight (< 30k parameters) Siamese neural network that takes as inputs two different pitch-shifted versions of the same audio represented by its Constant-Q Transform. To prevent the model from collapsing in an encoder-only setting, we propose a novel class-based transposition-equivariant objective which captures pitch information. Furthermore, we design the architecture of our network to be transposition-preserving by introducing learnable Toeplitz matrices. We evaluate our model for the two tasks of singing voice and musical instrument pitch estimation and show that our model is able to generalize across tasks and datasets while being lightweight, hence remaining compatible with low-resource devices and suitable for real-time applications. In particular, our results surpass self-supervised baselines and narrow the performance gap between self-supervised and supervised methods for pitch estimation.


[131] 2310.12144

Dynamic financial processes identification using sparse regressive reservoir computers

In this document, we present key findings in structured matrix approximation theory, with applications to the regressive representation of dynamic financial processes. Initially, we explore a comprehensive approach involving generic nonlinear time delay embedding for time series data extracted from a financial or economic system under examination. Subsequently, we employ sparse least-squares and structured matrix approximation methods to discern approximate representations of the output coupling matrices. These representations play a pivotal role in establishing the regressive models corresponding to the recursive structures inherent in a given financial system. The document further introduces prototypical algorithms that leverage the aforementioned techniques. These algorithms are demonstrated through applications in approximate identification and predictive simulation of dynamic financial and economic processes, encompassing scenarios that may or may not exhibit chaotic behavior.


[132] 2311.07140

A Linear Parameter-Varying Approach to Data Predictive Control

By means of the linear parameter-varying (LPV) Fundamental Lemma, we derive novel data-driven predictive control (DPC) methods for LPV systems. In particular, we present output-feedback and state-feedback-based LPV-DPC methods with terminal ingredients, which guarantee exponential stability and recursive feasibility. We provide methods for the data-based computation of these terminal ingredients. Furthermore, an in-depth analysis of the application and implementation aspects of the LPV-DPC schemes is given, including application for nonlinear systems and handling noisy data. We compare and demonstrate the performance of the proposed methods in a detailed simulation example involving a nonlinear unbalanced disc system.


[133] 2312.06154

Predictive Reliability Assessment of Distribution Grids with Residential Distributed Energy Resources

Distribution system end users are transforming from passive to active participants, marked by the push towards widespread adoption of edge-level Distributed Energy Resources (DERs). This paper addresses the challenges in distribution system planning arising from these dynamic changes. We introduce a bottom-up probabilistic approach that integrates these edge-level DERs into the reliability evaluation process. Our methodology leverages joint probability distributions to characterize and model the penetration of rooftop photovoltaic (PV) systems and energy storage across a distribution network at the individual residential level. Employing a scenario-based approach, we showcase the application of our probabilistic method using a Monte Carlo Simulation process to assess average system reliability indices and their variations at the user level. To validate our approach, we applied this methodology to the RBTS test system across various adoption scenarios, effectively showcasing the capability of our proposed method in quantifying the variation in end-user reliability indices for each scenario within the distribution system.


[134] 2404.14583

A general framework for supporting economic feasibility of generator and storage energy systems through capacity and dispatch optimization

Integration of various electricity-generating technologies (such as natural gas, wind, nuclear, etc.) with storage systems (such as thermal, battery electric, hydrogen, etc.) has the potential to improve the economic competitiveness of modern energy systems. Driven by the need to efficiently assess the economic feasibility of various energy system configurations in early system concept development, this work outlines a versatile computational framework for assessing the net present value of various integrated storage technologies. The subsystems' fundamental dynamics are defined, with a particular emphasis on balancing critical physical and economic domains to enable optimal decision-making in the context of capacity and dispatch optimization. In its presented form, the framework formulates a linear, convex optimization problem that can be efficiently solved using a direct transcription approach in the open-source software DTQP. Three case studies demonstrate and validate the framework's capabilities, highlighting its value and computational efficiency in facilitating the economic assessment of various energy system configurations. In particular, natural gas with thermal storage and carbon capture, wind energy with battery storage, and nuclear with hydrogen are demonstrated.


[135] 2408.01731

Composite Learning Adaptive Control under Non-Persistent Partial Excitation

This paper focuses on relaxing the excitation conditions for the adaptive control of uncertain nonlinear systems. By adopting the spectral decomposition technique, a linear regression equation (LRE) is constructed to quantitatively collect historical excitation information, based on which the parameter estimation error is decomposed into the excited component and the unexcited component. By sufficiently utilizing the collected excitation information, the composite learning and μ-modification terms are designed and incorporated into the "Lyapunov-based" parameter update law. By developing a novel Lyapunov function, it is demonstrated that under non-persistent partial excitation, the control error and the excited parameter estimation error component converge to zero, while the unexcited component remains bounded. Furthermore, the proposed adaptive control scheme can effectively eliminate the effects of parametric uncertainties and enhance the robustness of the closed-loop systems. Simulation results are provided to verify the theoretical findings.


[136] 2411.00617

Continuous and complete liver vessel segmentation with graph-attention guided diffusion

Improving connectivity and completeness is the most challenging aspect of liver vessel segmentation, especially for small vessels. These challenges require both learning the continuous vessel geometry and focusing on small vessel detection. However, current methods do not explicitly address these two aspects and cannot generalize well when constrained by inconsistent annotations. Here, we take advantage of the generalization of the diffusion model and explicitly integrate connectivity and completeness in our diffusion-based segmentation model. Specifically, we use a graph-attention module that adds knowledge about vessel geometry, and thus adds continuity. Additionally, we perform the graph-attention at multiple scales, thus focusing on small liver vessels. Our method outperforms eight state-of-the-art medical segmentation methods on two public datasets: 3D-ircadb-01 and LiVS. Our code is available at this https URL.


[137] 2411.04364

Efficient Localization of Directional RF Emitters via Iterated Beampattern Analysis

The localization of directional RF emitters presents significant challenges for electronic warfare applications. Traditional localization methods, designed for omnidirectional emitters, experience degraded performance when applied to directional sources due to pronounced received signal strength (RSS) modulations introduced by directive beampatterns. This paper presents a robust direct position determination (DPD) approach that jointly estimates emitter position and beampattern parameters by incorporating RSS modulation from both path attenuation and directional gain alongside angle of arrival (AOA) and time difference of arrival (TDOA) information. To address the computational challenge of joint optimization over position and beampattern parameters, we develop an alternating maximization algorithm that decomposes the four-dimensional search into efficient iterative two-dimensional optimizations using a generalized beampattern model. Cramer-Rao Lower Bound (CRLB) analysis establishes theoretical performance limits, and numerical simulations demonstrate substantial improvements over conventional methods. At -10 dB SNR, the proposed approach achieves 49% to 61% error reduction compared to AOA-TDOA baselines, with performance approaching the CRLB above -10 dB. The algorithm converges rapidly, requiring 3 to 4 iterations on average, and exhibits robustness to beampattern model mismatch. A contrast-expanded half-power uncertainty metric is introduced to quantify localization confidence, revealing that the proposed method produces concentrated unimodal likelihood surfaces while conventional approaches generate spurious peaks. Sensitivity analysis demonstrates that optimal performance occurs when receivers are positioned at beampattern main lobe edges where RSS gradients are maximized.


[138] 2412.11277

Macro2Micro: A Rapid and Precise Cross-modal Magnetic Resonance Imaging Synthesis using Multi-scale Structural Brain Similarity

The human brain is a complex system requiring both macroscopic and microscopic components for comprehensive understanding. However, mapping nonlinear relationships between these scales remains challenging due to technical limitations and the high cost of multimodal Magnetic Resonance Imaging (MRI) acquisition. To address this, we introduce Macro2Micro, a deep learning framework that predicts brain microstructure from macrostructure using a Generative Adversarial Network (GAN). Based on the hypothesis that microscale structural information can be inferred from macroscale structures, Macro2Micro explicitly encodes multiscale brain information into distinct processing branches. To enhance artifact elimination and output quality, we propose a simple yet effective auxiliary discriminator and learning objective. Extensive experiments demonstrated that Macro2Micro faithfully translates T1-weighted MRIs into corresponding Fractional Anisotropy (FA) images, achieving a 6.8% improvement in the Structural Similarity Index Measure (SSIM) compared to previous methods, while retaining the individual biological characteristics of the brain. With an inference time of less than 0.01 seconds per MR modality translation, Macro2Micro introduces the potential for real-time multimodal rendering in medical and research applications. The code will be made available upon acceptance.


[139] 2501.08458

RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation

In recent years, significant advancements have been made in deep learning for medical image segmentation, particularly with convolutional neural networks (CNNs) and transformer models. However, CNNs face limitations in capturing long-range dependencies, while transformers suffer from high computational complexity. To address this, we propose RWKV-UNet, a novel model that integrates the RWKV (Receptance Weighted Key Value) structure into the U-Net architecture. This integration enhances the model's ability to capture long-range dependencies and to improve contextual understanding, which is crucial for accurate medical image segmentation. We build a strong encoder with developed Global-Local Spatial Perception (GLSP) blocks combining CNNs and RWKVs. We also propose a Cross-Channel Mix (CCM) module to improve skip connections with multi-scale feature fusion, achieving global channel information integration. Experiments on 11 benchmark datasets show that the RWKV-UNet achieves state-of-the-art performance on various types of medical image segmentation tasks. Additionally, smaller variants, RWKV-UNet-S and RWKV-UNet-T, balance accuracy and computational efficiency, making them suitable for broader clinical applications.


[140] 2501.09049

Dynamic-Aware Spatio-temporal Representation Learning for Dynamic MRI Reconstruction

Dynamic MRI reconstruction, an inverse problem, has seen a surge in the use of deep learning techniques. In particular, the practical difficulty of obtaining ground truth data has led to the emergence of unsupervised learning approaches. A promising recent method among them is implicit neural representation (INR), which defines the data as a continuous function that maps coordinate values to the corresponding signal values. This allows missing information to be filled in from incomplete measurements alone, so that the inverse problem can be solved effectively. Nevertheless, previous works incorporating this method have faced drawbacks such as long optimization time and the need for extensive hyperparameter tuning. To address these issues, we propose Dynamic-Aware INR (DA-INR), an INR-based model for dynamic MRI reconstruction that captures the spatial and temporal continuity of dynamic MRI data in the image domain and explicitly incorporates the temporal redundancy of the data into the model structure. As a result, DA-INR outperforms other models in reconstruction quality even at extreme undersampling ratios while significantly reducing optimization time and requiring minimal hyperparameter tuning.


[141] 2501.12477

Slot-BERT: Self-supervised Object Discovery in Surgical Video

Object-centric slot attention is a powerful framework for unsupervised learning of structured and explainable representations that can support reasoning about objects and actions, including in surgical videos. While conventional object-centric methods for videos leverage recurrent processing to achieve efficiency, they often struggle to maintain the long-range temporal coherence required for long videos in surgical applications. On the other hand, fully parallel processing of entire videos enhances temporal consistency but introduces significant computational overhead, making it impractical for implementation on hardware in medical facilities. We present Slot-BERT, a bidirectional long-range model that learns object-centric representations in a latent space while ensuring robust temporal coherence. Slot-BERT scales object discovery seamlessly to long videos of unconstrained lengths. A novel slot contrastive loss further reduces redundancy and improves the representation disentanglement by enhancing slot orthogonality. We evaluate Slot-BERT on real-world surgical video datasets from abdominal, cholecystectomy, and thoracic procedures. Our method surpasses state-of-the-art object-centric approaches under unsupervised training, achieving superior performance across diverse domains. We also demonstrate efficient zero-shot domain adaptation to data from diverse surgical specialties and databases.


[142] 2502.05365

Dimensionality Reduction with Koopman Generalized Eigenfunctions

This paper presents a methodology to achieve lower-dimensional Koopman quasi-linear representations of nonlinear system dynamics using Koopman generalized eigenfunctions. The proposed approach considers the analytically derived Koopman formulation of rigid body dynamics, but it can be extended to any data-driven or analytically derived generalized eigenfunction set. It achieves a representation for which the number of Koopman observables matches the number of inputs, allowing for Koopman linearization control solutions rather than resorting to the least squares approximation method adopted in high-dimensional Koopman formulations. Through a linear combination of Koopman generalized eigenfunctions, a new set of Koopman generalized eigenfunctions is constructed so that the zero-order truncation approximates a Koopman eigenfunction, which can be used to design linear control strategies to steer the dynamics of the original nonlinear system. The proposed methodology is tested by designing a linear quadratic (LQ) flight controller for a quadrotor UAV. Numerical and Hardware-in-the-loop (HIL) simulations validate the applicability and real-time implementability of the proposed approach in the presence of noise and sensor delays. The main advantage of the proposed method is the realization of a fully actuated Koopman-based model which, in the case of the underactuated quadrotor system, allows trajectory tracking to be achieved through a single linear control loop.


[143] 2503.00654

ExAMPC: the Data-Driven Explainable and Approximate NMPC with Physical Insights

Amidst the surge in the use of Artificial Intelligence (AI) for control purposes, classical and model-based control methods maintain their popularity due to their transparency and deterministic nature. However, advanced controllers like Nonlinear Model Predictive Control (NMPC), despite proven capabilities, face adoption challenges due to their computational complexity and unpredictable closed-loop performance in complex validation systems. This paper introduces ExAMPC, a methodology bridging classical control and explainable AI by augmenting the NMPC with data-driven insights to improve trustworthiness and to reveal how sensitive the optimization solution and the closed-loop performance are to physical variables and system parameters. By employing a low-order spline embedding, we reduce the open-loop trajectory dimensionality by over 95%, and integrate it with SHAP and Symbolic Regression from eXplainable AI (XAI) for an approximate NMPC, enabling intuitive physical insights into the NMPC's optimization routine. The prediction accuracy of the approximate NMPC is enhanced through physics-inspired continuous-time constraint penalties, reducing the predicted continuous trajectory violations by 93%. ExAMPC also enables accurate forecasting of the NMPC's computational requirements with explainable insights on worst-case scenarios. Experimental validation on automated valet parking and autonomous racing with lap-time optimization demonstrates the methodology's practical effectiveness for potential real-world applications.


[144] 2503.04129

Formally Verified Neural Network Controllers for Incremental Input-to-State Stability of Unknown Discrete-Time Systems

This work aims to synthesize a controller that ensures that an unknown discrete-time system is incrementally input-to-state stable ($\delta$-ISS). We introduce the notion of a $\delta$-ISS control Lyapunov function ($\delta$-ISS-CLF), which, in conjunction with the controller, ensures that the closed-loop system is incrementally ISS. To address the unknown dynamics of the system, we parameterize the controller as well as the $\delta$-ISS-CLF as neural networks and learn them by utilizing the sampled data from the state space of the unknown system. To formally verify the obtained $\delta$-ISS-CLF, we develop a validity condition and incorporate the condition into the training framework to ensure a provable correctness guarantee at the end of the training process. Finally, the usefulness of the proposed approach is demonstrated using multiple case studies: the first is a scalar system with a non-affine non-polynomial structure, the second is a one-link manipulator system, the third is a nonlinear Moore-Greitzer model of the jet engine, and the final one is a rotating rigid spacecraft model.


[145] 2503.12926

Task-Oriented Feature Compression for Multimodal Understanding via Device-Edge Co-Inference

With the rapid development of large multimodal models (LMMs), multimodal understanding applications are emerging. As most LMM inference requests originate from edge devices with limited computational capabilities, the predominant inference pipeline involves directly forwarding the input data to an edge server which handles all computations. However, this approach introduces high transmission latency due to limited uplink bandwidth of edge devices and significant computation latency caused by the prohibitive number of visual tokens, thus hindering delay-sensitive tasks and degrading user experience. To address this challenge, we propose a task-oriented feature compression (TOFC) method for multimodal understanding in a device-edge co-inference framework, where visual features are merged by clustering and encoded by a learnable and selective entropy model before feature projection. Specifically, we employ density peaks clustering based on K nearest neighbors to reduce the number of visual features, thereby minimizing both data transmission and computational complexity. Subsequently, a learnable entropy model with hyperprior is utilized to encode and decode merged features, further reducing transmission overhead. To enhance compression efficiency, multiple entropy models are adaptively selected based on the characteristics of the visual features, enabling a more accurate estimation of the probability distribution. Comprehensive experiments on seven visual question answering benchmarks validate the effectiveness of the proposed TOFC method. Results show that TOFC achieves up to 52% reduction in data transmission overhead and 63% reduction in system latency while maintaining identical task performance, compared with neural compression ELIC.
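
The clustering step described above can be illustrated with a minimal NumPy sketch of density peaks clustering based on K nearest neighbors. This is a generic textbook-style formulation, not the authors' implementation: the function name `dpc_knn_merge`, the exponential density estimate, and merging by simple averaging are all assumptions made for illustration.

```python
import numpy as np

def dpc_knn_merge(tokens, num_clusters, k=3):
    """Merge feature vectors with density peaks clustering (KNN density).

    Hypothetical helper: tokens has shape (n, d) with n > k.
    Returns (merged, labels) with merged of shape (num_clusters, d).
    """
    # Pairwise Euclidean distances between all tokens, shape (n, n).
    diff = tokens[:, None, :] - tokens[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density: large when the k nearest neighbours are close.
    # Column 0 of the sorted row is the zero distance to the token itself.
    knn = np.sort(dist, axis=1)[:, 1:k + 1]
    rho = np.exp(-(knn ** 2).mean(axis=1))

    # delta: distance to the nearest token with strictly higher density.
    higher = rho[None, :] > rho[:, None]      # higher[i, j] is True if rho_j > rho_i
    delta = np.where(higher, dist, np.inf).min(axis=1)
    delta[np.isinf(delta)] = dist.max()       # densest token(s) get the maximum distance

    # Cluster centres are the tokens with the largest rho * delta score.
    centers = np.argsort(-(rho * delta))[:num_clusters]

    # Assign every token to its nearest centre; centres keep their own label.
    labels = dist[:, centers].argmin(axis=1)
    labels[centers] = np.arange(num_clusters)

    # Merge each cluster into a single feature by averaging its members.
    merged = np.stack([tokens[labels == c].mean(axis=0) for c in range(num_clusters)])
    return merged, labels
```

Called on an `(n, d)` array with `num_clusters` much smaller than `n`, the sketch returns far fewer features than it received, which is the mechanism by which clustering reduces the amount of data to transmit and process.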


[146] 2504.05657

Nes2Net: A Lightweight Nested Architecture for Foundation Model Driven Speech Anti-spoofing

Speech foundation models have significantly advanced various speech-related tasks by providing exceptional representation capabilities. However, their high-dimensional output features often create a mismatch with downstream task models, which typically require lower-dimensional inputs. A common solution is to apply a dimensionality reduction (DR) layer, but this approach increases parameter overhead, computational costs, and risks losing valuable information. To address these issues, we propose Nested Res2Net (Nes2Net), a lightweight back-end architecture designed to directly process high-dimensional features without DR layers. The nested structure enhances multi-scale feature extraction, improves feature interaction, and preserves high-dimensional information. We first validate Nes2Net on CtrSVDD, a singing voice deepfake detection dataset, and report a 22% performance improvement and an 87% back-end computational cost reduction over the state-of-the-art baseline. Additionally, extensive testing across four diverse datasets: ASVspoof 2021, ASVspoof 5, PartialSpoof, and In-the-Wild, covering fully spoofed speech, adversarial attacks, partial spoofing, and real-world scenarios, consistently highlights Nes2Net's superior robustness and generalization capabilities. The code package and pre-trained models are available at this https URL.


[147] 2505.14473

Security of Gradient Tracking Algorithms Against Malicious Agents

Consensus algorithms are fundamental to multi-agent distributed optimization, and their security under adversarial conditions is an active area of research. While prior works primarily establish conditions for successful global consensus under attack, little is known about system behavior when these conditions are violated. This paper addresses this gap by investigating the robustness of the Wang--Elia algorithm, a noise-robust version of the gradient tracking algorithm, in the presence of malicious agents. We consider a network of agents collaboratively minimizing a global cost function, where a subset of agents may transmit faulty information to disrupt consensus. To quantify resilience, we formulate a security metric as an optimization problem, which is rooted in the centralized attack detection literature. We provide a tractable reformulation of the optimization problem, and derive conditions under which the metric becomes unbounded, identifying undetectable attack signals that reveal inherent vulnerabilities. To facilitate design and analysis, we propose a well-posed variant of the metric, together with design methods to enhance network robustness against stealthy adversarial attacks. Numerical examples demonstrate the effectiveness of the proposed framework in enhancing the resilience of multi-agent distributed optimization.


[148] 2505.18190

PhySense: Sensor Placement Optimization for Accurate Physics Sensing

Physics sensing plays a central role in many scientific and engineering domains, which inherently involves two coupled tasks: reconstructing dense physical fields from sparse observations and optimizing scattered sensor placements to observe maximum information. While deep learning has made rapid advances in sparse-data reconstruction, existing methods generally omit optimization of sensor placements, leaving the mutual enhancement between reconstruction and placement on the shelf. To change this suboptimal practice, we propose PhySense, a synergistic two-stage framework that learns to jointly reconstruct physical fields and to optimize sensor placements, both aiming for accurate physics sensing. The first stage involves a flow-based generative model enhanced by cross-attention to adaptively fuse sparse observations. Leveraging the reconstruction feedback, the second stage performs sensor placement via projected gradient descent to satisfy spatial constraints. We further prove that the learning objectives of the two stages are consistent with classical variance-minimization principles, providing theoretical guarantees. Extensive experiments across three challenging benchmarks, especially a 3D geometry dataset, indicate PhySense achieves state-of-the-art physics sensing accuracy and discovers informative sensor placements previously unconsidered. Code is available at this repository: this https URL.


[149] 2506.00011

Movable Antenna Enhanced Federated Fine-Tuning of Large Language Models via Hybrid Client Selection Optimization

Federated fine-tuning of large language models (LLMs) over bandwidth-limited 6G links must meet strict round-time and energy budgets. Analog over-the-air (OTA) aggregation reduces uplink cost but is sensitive to fading and interference, which distort the aggregated gradient. We consider a two-phase workflow (centralized pre-training followed by federated fine-tuning) where the base station uses a movable-antenna (MA) array. In each round, MA element positions and the receive/transmit beamformers are adjusted under minimum-spacing constraints to reshape the channel and improve OTA aggregation without increasing user transmit power. We formulate a mixed-integer, nonconvex resource-allocation problem that jointly selects clients and optimizes the number of global rounds, CPU frequencies, mini-batch sizes, MA positions, and analog weights under end-to-end latency and energy limits. A successive convex approximation-penalty dual decomposition (SCA-PDD) routine alternates convex updates with oblique-manifold beamforming and spacing-aware MA placement. Experiments on OpenLLaMA-v2 (3B) with LoRA and 4-bit quantization on Alpaca and Dolly (10 clients) attain round-30 validation perplexities as low as 2.94 (Alpaca, K=1) and 4.62 (Dolly, K=1). Relative to the strongest non-MA baseline at the same concurrency, this corresponds to 17.4 percent (Alpaca, K=1) and 54.4 percent (Dolly, K=1) lower perplexity; at K=2 the reductions are 14.2 percent (Alpaca) and 13.7 percent (Dolly). Participation fairness also improves across all uplink concurrencies K in {1,2,4,8}, with the largest margins when fewer clients transmit per round.


[150] 2506.01358

Ensemble-Based Peak Demand Probability Density Forecasting with Application to Risk-Aware Power System Scheduling

Power systems face increasing challenges in maintaining resource adequacy due to lower operating margins, rising renewable energy uncertainty, and demand variability. Forecasting the probability distribution of peak demand on shorter timescales is a critical forward-facing issue under increasing volatility. This study introduces a novel ensemble-based machine learning method for peak demand probability density forecasting that extends classical extreme value theory to model time series peaks as nonstationary statistical distributions. The approach employs an ensemble of tree-based learners that recursively partition the covariate space and estimate local generalized extreme value distributions, allowing it to automatically capture complex covariate-dependent parameter variations. Unlike existing approaches, which often suffer from convergence issues or restrictive functional forms, this framework is both flexible and robust. Validation on a case study based on the PJM interconnection demonstrates that the method achieves a 38 percent reduction in committed capacity when generation is scheduled based on a reliability criterion. These improvements provide practical value for power system operation, enabling risk-aware capacity scheduling under peak demand uncertainty and supporting reliability-driven decision making in future energy systems.


[151] 2506.01382

Enabling Scalable Distributed Beamforming via Networked LEO Satellites Towards 6G

In this paper, we propose scalable distributed beamforming schemes over low Earth orbit (LEO) satellite networks that rely solely on statistical channel state information for downlink orthogonal frequency division multiplexing systems. We begin by introducing the system model and presenting a pragmatic yet effective analog beamformer and user-scheduling design. We then derive a closed-form lower bound on the ergodic sum rate, based on the hardening bound, for the digital beamformer design. Next, we formulate a per-satellite power-constrained sum-rate maximization problem, whose centralized solution, obtained via the weighted minimum mean squared error (WMMSE) framework, establishes performance limits and motivates decentralized strategies. We subsequently introduce two decentralized optimization schemes, based on approximating the hardening bound and decentralizing the WMMSE framework, for representative inter-satellite link topologies. In the Ring scheme, satellites update beamformers locally and exchange intermediate parameters sequentially. In the Star scheme, edge satellites update beamformers locally and in parallel, achieving consensus on intermediate parameters at a central satellite using a penalty-dual decomposition framework. Extensive simulations demonstrate that our distributed designs achieve near-centralized performance with superior scalability, substantially outperforming simple closed-form beamformers and single-satellite baselines in sum rate. Additionally, the delay-overhead trade-off between the two topologies is revealed.


[152] 2506.02093

Are Pixel-Wise Metrics Reliable for Sparse-View Computed Tomography Reconstruction?

Widely adopted evaluation metrics for sparse-view CT reconstruction--such as Structural Similarity Index Measure and Peak Signal-to-Noise Ratio--prioritize pixel-wise fidelity but often fail to capture the completeness of critical anatomical structures, particularly small or thin regions that are easily missed. To address this limitation, we propose a suite of novel anatomy-aware evaluation metrics designed to assess structural completeness across anatomical structures, including large organs, small organs, intestines, and vessels. Building on these metrics, we introduce CARE, a Completeness-Aware Reconstruction Enhancement framework that incorporates structural penalties during training to encourage anatomical preservation of significant structures. CARE is model-agnostic and can be seamlessly integrated into analytical, implicit, and generative methods. When applied to these methods, CARE substantially improves structural completeness in CT reconstructions, achieving up to +32% improvement for large organs, +22% for small organs, +40% for intestines, and +36% for vessels.


[153] 2506.04470

A Poisson-Guided Decomposition Network for Extreme Low-Light Image Enhancement

Low-light image denoising and enhancement are challenging, especially when traditional noise assumptions, such as Gaussian noise, do not hold in most cases. In many real-world scenarios, such as low-light imaging, noise is signal-dependent and is better represented as Poisson noise. In this work, we address the problem of denoising images degraded by Poisson noise under extreme low-light conditions. We introduce a lightweight deep learning-based method that integrates Retinex-based decomposition with Poisson denoising into a unified encoder-decoder network. The model simultaneously enhances illumination and suppresses noise by incorporating a Poisson denoising loss to address signal-dependent noise. Without requiring priors on reflectance and illumination, the network learns an effective decomposition process while ensuring consistent reflectance and smooth illumination without causing any form of color distortion. The experimental results demonstrate the effectiveness and practicality of the proposed low-light illumination enhancement method. Our method significantly improves visibility and brightness in low-light conditions, while preserving image structure and color constancy under ambient illumination.


[154] 2506.06758

A Novel Spreading-Factor-Index-Aided LoRa Scheme: Design and Performance Analysis

LoRa is a widely recognized modulation technology in the field of low power wide area networks (LPWANs). However, the data rate of LoRa is too low to satisfy the requirements of Internet of Things applications. To address this issue, we propose a novel high-data-rate LoRa scheme based on the spreading factor index (SFI). In the proposed SFI-LoRa scheme, the starting frequency bin of a chirp signal is used to transmit information bits, while the combinations of spreading factors are exploited as a set of indices to convey additional information bits. Moreover, the theoretical symbol error rate, data rate, transmission throughput, complexity and energy efficiency of the proposed SFI-LoRa scheme are carefully analyzed. Simulation results not only verify the accuracy of our theoretical analysis, but also demonstrate that the proposed SFI-LoRa scheme can improve the transmission throughput of existing LoRa schemes without sacrificing the BER performance over additive white Gaussian noise, Rayleigh fading, and multipath flat-fading channels. Therefore, the proposed SFI-LoRa scheme is a potential solution for applications requiring a high data rate in the LPWAN domain.
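
The first half of the scheme above, carrying bits in the starting frequency bin of a chirp, is how conventional LoRa modulation works and can be sketched in a few lines of standard-library Python. The sketch assumes an ideal noiseless baseband channel sampled at the chirp bandwidth; the function names are hypothetical, and the spreading-factor-index part of SFI-LoRa is deliberately not modelled because the abstract does not specify how the indices are mapped.

```python
import cmath

def lora_modulate(symbol, sf):
    """Return one baseband up-chirp of 2**sf samples starting at bin `symbol`."""
    n = 2 ** sf
    # Quadratic phase gives the linear frequency sweep; the linear phase term
    # shifts the starting frequency by `symbol` bins (wrap-around is implicit
    # because the signal is sampled at the bandwidth).
    return [
        cmath.exp(2j * cmath.pi * (k * k / (2 * n) + (symbol / n - 0.5) * k))
        for k in range(n)
    ]

def lora_demodulate(samples, sf):
    """Recover the starting bin by dechirping and locating the DFT peak."""
    n = 2 ** sf
    base = lora_modulate(0, sf)
    # Multiplying by the conjugate base chirp leaves a pure tone at bin `symbol`.
    tone = [x * b.conjugate() for x, b in zip(samples, base)]
    best_bin, best_mag = 0, -1.0
    for m in range(n):
        # Direct O(n^2) DFT, kept dependency-free for illustration only.
        acc = sum(t * cmath.exp(-2j * cmath.pi * m * k / n) for k, t in enumerate(tone))
        if abs(acc) > best_mag:
            best_bin, best_mag = m, abs(acc)
    return best_bin
```

With spreading factor `sf`, each chirp carries `sf` bits through the choice of starting bin. In the scheme described above, the choice of spreading-factor combination conveys additional bits on top of this baseline.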


[155] 2506.21208

Adversarial Training: Enhancing Out-of-Distribution Generalization for Learning Wireless Resource Allocation

Deep neural networks (DNNs) are widely used to optimize resource allocation. Yet their performance is vulnerable to distribution shifts between training and test data, for example in wireless channels. In this paper, we resort to adversarial training (AT) to enhance the out-of-distribution (OOD) generalizability of DNNs trained in an unsupervised manner. We reformulate the AT problem to reflect the OOD degradation, and propose a one-step gradient ascent algorithm to solve the AT problem for training DNNs. The proposed method is evaluated by optimizing hybrid precoding. Simulation results showcase the enhanced OOD performance of multiple kinds of DNNs, with approximately 5% to 20% improvement across various channel distributions, even when only samples from a single distribution (e.g., Rayleigh fading) are used for training.


[156] 2507.21704

Affine Frequency Division Multiplexing (AFDM) for 6G: Properties, Features, and Challenges

Affine frequency division multiplexing (AFDM) is an emerging waveform candidate for future sixth generation (6G) systems offering a range of promising features, such as enhanced robustness in heterogeneous and high-mobility environments, as well as inherent suitability for integrated sensing and communications (ISAC) applications. In addition, unlike other candidates such as orthogonal time-frequency space (OTFS) modulation, AFDM provides several unique advantages that strengthen its relevance to practical deployment and standardization in 6G. Notably, as a natural generalization of orthogonal frequency division multiplexing (OFDM), strong backward compatibility with existing conventional systems is guaranteed, while also offering novel possibilities in waveform design, for example to enable physical-layer security through its inherent chirp parametrization. In all, this article provides an overview of AFDM, emphasizing its suitability as a candidate waveform for 6G standardization. First, we provide a concise introduction to the fundamental properties and unique characteristics of AFDM, followed by highlights of its advantageous features, and finally a discussion of its potential and challenges in 6G standardization efforts and representative requirements.


[157] 2507.22030

ReXGroundingCT: A 3D Chest CT Dataset for Segmentation of Findings from Free-Text Reports

We introduce ReXGroundingCT, the first publicly available dataset linking free-text findings to pixel-level 3D segmentations in chest CT scans. The dataset includes 3,142 non-contrast chest CT scans paired with standardized radiology reports from CT-RATE. Construction followed a structured three-stage pipeline. First, GPT-4 was used to extract and standardize findings, descriptors, and metadata from reports originally written in Turkish and machine-translated into English. Second, GPT-4o-mini categorized each finding into a hierarchical ontology of lung and pleural abnormalities. Third, 3D annotations were produced for all CT volumes: the training set was quality-assured by board-certified radiologists, and the validation and test sets were fully annotated by board-certified radiologists. Additionally, a complementary chain-of-thought dataset was created to provide step-by-step hierarchical anatomical reasoning for localizing findings within the CT volume, using GPT-4o and localization coordinates derived from organ segmentation models. ReXGroundingCT contains 16,301 annotated entities across 8,028 text-to-3D-segmentation pairs, covering diverse radiological patterns from 3,142 non-contrast CT scans. About 79% of findings are focal abnormalities and 21% are non-focal. The dataset includes a public validation set of 50 cases and a private test set of 100 cases, both annotated by board-certified radiologists. The dataset establishes a foundation for enabling free-text finding segmentation and grounded radiology report generation in CT imaging. Model performance on the private test set is hosted on a public leaderboard at this https URL. The dataset is available at this https URL.


[158] 2509.05464

Developing an Open-Source Framework for Quantitative Simulation of Blood Flow and Tissue Motion for Ultrafast Doppler Ultrasound

Ultrafast power Doppler imaging (uPDI) has become a powerful tool for both research and clinical applications. However, existing simulation tools are insufficient for generating quantitatively accurate three-dimensional (3D) flow fields with tissue motion mimicking in vivo conditions. In this study, we present an open-source framework, named 3D-Fully Quantitative Flow (3D-FQFlow), to provide quantitative modeling of 3D vascular hemodynamics with physiologically realistic tissue motion for uPDI. The framework can perform quantitative modeling of both hemodynamics and tissue motion for either user-defined or clinical-derived vasculatures. Besides, it also integrates a GPU-accelerated image processing and reconstruction module. We demonstrate the performance of 3D-FQFlow using both synthetic vascular structures and clinical datasets. This framework could provide essential ground-truth simulation models to support the development, validation, and benchmarking of uPDI techniques. The source code is freely available online at https://github.com/FortuneOU/3D-FQFlow.


[159] 2510.04264

A Hybrid GNN-IZR Framework for Fast and Empirically Robust AC Power Flow Analysis in Radial Distribution Systems

The Alternating Current Power Flow (ACPF) problem forces a trade-off between the speed of data-driven models and the reliability of analytical solvers. This paper introduces a hybrid framework that synergizes a Graph Neural Network (GNN) with the Implicit Z-Bus Recursive (IZR) method, a robust, non-iterative solver for radial distribution networks. The framework employs a physics-informed GNN for rapid initial predictions and invokes the IZR solver as a failsafe for stressed cases identified by a two-stage trigger. A failure is defined as any solution with a maximum power mismatch exceeding 0.1 p.u., a significant operational deviation. On a challenging test set of 7,500 stressed scenarios for the IEEE 33-bus system, the GNN-only model failed on 13.11% of cases. In contrast, the hybrid framework identified all potential failures, delegating them to the IZR solver to achieve a 0.00% failure rate, empirically matching the 100% success rate of the analytical solver on this specific test set. An expanded ablation study confirms that both physics-informed training and Z-bus sensitivity features are critical, collaboratively reducing the GNN's failure rate from 98.72% (data-only) to 13.11%. The hybrid approach demonstrates a pragmatic path to achieving the empirical reliability of an analytical solver while leveraging GNN speed, enabling a significant increase in the number of scenarios analyzable in near real-time.


[160] 2510.06170

Smartphone-based iris recognition through high-quality visible-spectrum iris image capture.V2

Smartphone-based iris recognition in the visible spectrum (VIS) remains difficult due to illumination variability, pigmentation differences, and the absence of standardized capture controls. This work presents a compact end-to-end pipeline that enforces ISO/IEC 29794-6 quality compliance at acquisition and demonstrates that accurate VIS iris recognition is feasible on commodity devices. Using a custom Android application performing real-time framing, sharpness evaluation, and feedback, we introduce the CUVIRIS dataset of 752 compliant images from 47 subjects. A lightweight MobileNetV3-based multi-task segmentation network (LightIrisNet) is developed for efficient on-device processing, and a transformer matcher (IrisFormer) is adapted to the VIS domain. Under a standardized protocol and comparative benchmarking against prior CNN baselines, OSIRIS attains a TAR of 97.9% at FAR=0.01 (EER=0.76%), while IrisFormer, trained only on UBIRIS.v2, achieves an EER of 0.057% on CUVIRIS. The acquisition app, trained models, and a public subset of the dataset are released to support reproducibility. These results confirm that standardized capture and VIS-adapted lightweight models enable accurate and practical iris recognition on smartphones.


[161] 2510.13887

Incomplete Multi-view Clustering via Hierarchical Semantic Alignment and Cooperative Completion

Incomplete multi-view data, where certain views are entirely missing for some samples, poses significant challenges for traditional multi-view clustering methods. Existing deep incomplete multi-view clustering approaches often rely on static fusion strategies or two-stage pipelines, leading to suboptimal fusion results and error propagation issues. To address these limitations, this paper proposes a novel incomplete multi-view clustering framework based on Hierarchical Semantic Alignment and Cooperative Completion (HSACC). HSACC achieves robust cross-view fusion through a dual-level semantic space design. In the low-level semantic space, consistency alignment is ensured by maximizing mutual information across views. In the high-level semantic space, adaptive view weights are dynamically assigned based on the distributional affinity between individual views and an initial fused representation, followed by weighted fusion to generate a unified global representation. Additionally, HSACC implicitly recovers missing views by projecting aligned latent representations into high-dimensional semantic spaces and jointly optimizes reconstruction and clustering objectives, enabling cooperative learning of completion and clustering. Experimental results demonstrate that HSACC significantly outperforms state-of-the-art methods on five benchmark datasets. Ablation studies validate the effectiveness of the hierarchical alignment and dynamic weighting mechanisms, while parameter analysis confirms the model's robustness to hyperparameter variations.


[162] 2510.19360

Multi-Rate Task-Oriented Communication for Multi-Edge Cooperative Inference

The integration of artificial intelligence (AI) with the Internet of Things (IoT) enables task-oriented communication for multi-edge cooperative inference systems, where edge devices transmit extracted features of local sensory data to an edge server to perform AI-driven tasks. However, privacy concerns and limited communication bandwidth pose fundamental challenges, since simultaneous transmission of extracted features with a single fixed compression ratio from all devices leads to severe inefficiency in communication resource utilization. To address this challenge, we propose a framework that dynamically adjusts the code rate in feature extraction based on its importance to the downstream inference task by adopting a rate-adaptive quantization (RAQ) scheme. Furthermore, to select the code rate for each edge device under a limited bandwidth constraint, a dynamic programming (DP) approach is leveraged to allocate the code rate across discrete code rate options. Experiments on multi-view datasets demonstrate that the proposed framework significantly outperforms frameworks using fixed-rate quantization, achieving a favorable balance between communication efficiency and inference performance under limited bandwidth conditions.
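
As an illustration of allocating discrete code rates under a bandwidth budget with dynamic programming, the sketch below solves a multiple-choice knapsack: each device picks exactly one (cost, utility) option and the total cost must not exceed the budget. The integer costs, the utility values, and the function name `allocate_rates` are assumptions made for this example, not the formulation used in the paper.

```python
def allocate_rates(options, budget):
    """Pick one rate option per device to maximize total utility.

    options[d] is a list of (cost, utility) pairs for device d, where cost is
    a non-negative integer number of bandwidth units (an assumption here).
    Returns (chosen option index per device, total utility).
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (budget + 1)  # best[b]: max utility using exactly b units
    best[0] = 0.0
    picks = []  # picks[d][b] = (option index, budget used before device d)

    for opts in options:
        new_best = [neg_inf] * (budget + 1)
        pick = [None] * (budget + 1)
        for used in range(budget + 1):
            if best[used] == neg_inf:
                continue  # this budget level is unreachable so far
            for i, (cost, utility) in enumerate(opts):
                total = used + cost
                if total <= budget and best[used] + utility > new_best[total]:
                    new_best[total] = best[used] + utility
                    pick[total] = (i, used)
        best = new_best
        picks.append(pick)

    end = max(range(budget + 1), key=lambda b: best[b])
    if best[end] == neg_inf:
        raise ValueError("no feasible allocation within the budget")

    # Walk the recorded choices backwards to recover the per-device selection.
    chosen, b = [], end
    for pick in reversed(picks):
        i, b = pick[b]
        chosen.append(i)
    chosen.reverse()
    return chosen, best[end]
```

For example, with two devices that each offer options of cost 1, 2, or 4 and a budget of 4 units, the table forces a trade-off: giving one device its highest rate leaves nothing feasible for the other, so the DP settles on the middle rate for both.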


[163] 2510.21014

ReFESS-QI: Reference-Free Evaluation For Speech Separation With Joint Quality And Intelligibility Scoring

Source separation is a crucial pre-processing step for various speech processing tasks, such as automatic speech recognition (ASR). Traditionally, evaluation metrics for speech separation rely on matched reference audio and the corresponding transcriptions to assess audio quality and intelligibility. However, they cannot be used to evaluate real-world mixtures for which no reference exists. This paper introduces a text-free, reference-free evaluation framework based on self-supervised learning (SSL) representations. The proposed framework uses the mixture and the separated tracks to jointly predict audio quality, through the Scale Invariant Signal to Noise Ratio (SI-SNR) metric, and speech intelligibility, through the Word Error Rate (WER) metric. Experiments on the WHAMR! dataset show WER estimation with a mean absolute error (MAE) of 17% and a Pearson correlation coefficient (PCC) of 0.77, and SI-SNR estimation with an MAE of 1.38 and a PCC of 0.95. We further demonstrate the robustness of our estimator by using various SSL representations.
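
For readers unfamiliar with the quality target, the following sketch computes the standard reference-based SI-SNR, which is the quantity a reference-free estimator like the one above would be trained to predict. It is the textbook definition written in plain Python, not code from the paper; the `eps` guard and the function name are choices made for this example.

```python
import math


def si_snr_db(estimate, reference, eps=1e-12):
    """Scale-invariant signal-to-noise ratio in dB (reference-based)."""
    n = len(reference)
    if n == 0 or len(estimate) != n:
        raise ValueError("signals must be non-empty and of equal length")

    # Remove the mean so the metric ignores DC offsets.
    mean_e = sum(estimate) / n
    mean_r = sum(reference) / n
    e = [x - mean_e for x in estimate]
    r = [x - mean_r for x in reference]

    # Project the estimate onto the reference to get the target component.
    dot = sum(a * b for a, b in zip(e, r))
    ref_energy = sum(b * b for b in r)
    scale = dot / (ref_energy + eps)
    target = [scale * b for b in r]
    noise = [a - t for a, t in zip(e, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))
```

Because the estimate is projected onto the reference before the ratio is taken, rescaling the estimate leaves the score unchanged, which is the property the "scale invariant" name refers to.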


[164] 2510.21280

WhaleVAD-BPN: Improving Baleen Whale Call Detection with Boundary Proposal Networks and Post-processing Optimisation

While recent sound event detection (SED) systems can identify baleen whale calls in marine audio, challenges related to false positive and minority-class detection persist. We propose the boundary proposal network (BPN), which extends an existing lightweight SED system. The BPN is inspired by work in image object detection and aims to reduce the number of false positive detections. It achieves this by using intermediate latent representations computed within the backbone classification model to gate the final output. When added to an existing SED system, the BPN achieves a 16.8 % absolute increase in precision, as well as 21.3 % and 9.4 % improvements in the F1-score for minority-class d-calls and bp-calls, respectively. We further consider two approaches to the selection of post-processing hyperparameters: a forward-search and a backward-search. By separately optimising event-level and frame-level hyperparameters, these two approaches lead to considerable performance improvements over parameters selected using empirical methods. The complete WhaleVAD-BPN system achieves a cross-validated development F1-score of 0.475, which is a 9.8 % absolute improvement over the baseline.


[165] 2001.08747

Reducing the Representation Error of GAN Image Priors Using the Deep Decoder

Generative models, such as GANs, learn an explicit low-dimensional representation of a particular class of images, and so they may be used as natural image priors for solving inverse problems such as image restoration and compressive sensing. GAN priors have demonstrated impressive performance on these tasks, but they can exhibit substantial representation error for both in-distribution and out-of-distribution images, because of the mismatch between the learned, approximate image distribution and the data generating distribution. In this paper, we demonstrate a method for reducing the representation error of GAN priors by modeling images as the linear combination of a GAN prior with a Deep Decoder. The deep decoder is an underparameterized and most importantly unlearned natural signal model similar to the Deep Image Prior. No knowledge of the specific inverse problem is needed in the training of the GAN underlying our method. For compressive sensing and image superresolution, our hybrid model exhibits consistently higher PSNRs than both the GAN priors and Deep Decoder separately, both on in-distribution and out-of-distribution images. This model provides a method for extensibly and cheaply leveraging both the benefits of learned and unlearned image recovery priors in inverse problems.


[166] 2301.00922

Faster Reinforcement Learning by Freezing Slow States

We study infinite horizon Markov decision processes (MDPs) with "fast-slow" structure, where some state variables evolve rapidly ("fast states") while others change more gradually ("slow states"). This structure commonly arises in practice when decisions must be made at high frequencies over long horizons, and where slowly changing information still plays a critical role in determining optimal actions. Examples include inventory control under slowly changing demand indicators or dynamic pricing with gradually shifting consumer behavior. Modeling the problem at the natural decision frequency leads to MDPs with discount factors close to one, making them computationally challenging. We propose a novel approximation strategy that "freezes" slow states during phases of lower-level planning and subsequently applies value iteration to an auxiliary upper-level MDP that evolves on a slower timescale. Freezing states for short periods of time leads to easier-to-solve lower-level problems, while a slower upper-level timescale allows for a more favorable discount factor. On the theoretical side, we analyze the regret incurred by our frozen-state approach, which leads to simple insights on how to trade off regret versus computational cost. Empirically, we benchmark our new frozen-state methods on three domains, (i) inventory control with fixed order costs, (ii) a gridworld problem with spatial tasks, and (iii) dynamic pricing with reference-price effects. We demonstrate that the new methods produce high-quality policies with significantly less computation, and we show that simply omitting slow states is often a poor heuristic.


[167] 2307.07030

Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima

This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on smooth nonconvex functions. Motivated by some recent works that have proposed effective algorithms, based on Polyak's heavy ball method and the Nesterov accelerated gradient method, to achieve convergence to a local minimum of nonconvex functions, this work proposes a broad class of Nesterov-type accelerated methods and puts forth a rigorous study of these methods encompassing the escape from saddle points and convergence to local minima through both an asymptotic and a non-asymptotic analysis. In the asymptotic regime, this paper answers an open question of whether Nesterov's accelerated gradient method (NAG) with variable momentum parameter avoids strict saddle points almost surely. This work also develops two metrics of asymptotic rates of convergence and divergence, and evaluates these two metrics for several popular standard accelerated methods such as the NAG and Nesterov's accelerated gradient with constant momentum (NCM) near strict saddle points. In the non-asymptotic regime, this work provides an analysis that leads to the "linear" exit time estimates from strict saddle neighborhoods for trajectories of these accelerated methods as well as the necessary conditions for the existence of such trajectories. Finally, this work studies a sub-class of accelerated methods that can converge in convex neighborhoods of nonconvex functions with a near optimal rate to a local minimum and at the same time this sub-class offers superior saddle-escape behavior compared to that of NAG.


[168] 2310.14283

Bandwidth Efficient Livestreaming in Mobile Wireless Networks: A Peer-to-Peer ACIDE Solution

In mobile wireless networks, livestreaming in high user density areas presents two typical challenges: the wireless bandwidth is depleted and the number of users is limited. In this study, a media distribution model utilizing peer to peer communications, Active Control in an Intelligent and Distributed Environment, is proposed for bandwidth efficient livestreaming. The basic idea is to group users with identical livestream interest in a cluster of n peers. Instead of sending n copies of a livestream package, only one copy is sent to the cluster. A package is divided into n blocks. Each user receives one block from the base station and the remaining n-1 blocks from the other peers. Two optimization problems are addressed. The first problem is minimizing the bandwidth needed to guarantee a continuous live media play on all peers. A solution is proposed to find the optimal block sizes such that the wireless bandwidth is minimized. The second problem is maximizing the number of peers admitted to a cluster, given a fixed wireless bandwidth. This problem is NP-complete and a greedy strategy is proposed to calculate a feasible solution for peer selection. The proposed model improves the bandwidth efficiency and allows more users to be served.
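
The bandwidth saving from sending one copy per cluster can be checked with simple arithmetic. The sketch below assumes equal block sizes for illustration only; the paper instead optimizes the block sizes, and the function and key names are hypothetical.

```python
def cluster_traffic(package_size, n_peers):
    """Traffic per livestream package for a cluster of n_peers.

    Assumes the package is split into n_peers equal blocks (an illustrative
    simplification; the block sizes are an optimization variable in the paper).
    """
    if n_peers < 1:
        raise ValueError("a cluster needs at least one peer")
    block = package_size / n_peers
    return {
        # Conventional delivery: the base station sends a full copy to each user.
        "unicast_downlink": n_peers * package_size,
        # Clustered delivery: one block per peer, one full copy in total.
        "cluster_downlink": package_size,
        "per_peer_from_base_station": block,
        # Each peer fetches the remaining n-1 blocks from the other peers...
        "per_peer_from_peers": (n_peers - 1) * block,
        # ...and forwards its own block to each of the n-1 other peers.
        "per_peer_upload_to_peers": (n_peers - 1) * block,
    }
```

Under this equal-split assumption the base-station downlink shrinks by a factor of n, while the peer-to-peer links carry the remaining (n-1)/n of each package.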


[169] 2403.10990

$Δ_T$ Noise in Mesoscopic Hybrid Junctions: Influence of Barrier Strength and Thermal Bias

Quantum noise is a fundamental probe of quantum transport phenomena, offering insights into current correlations and wave-particle duality. A particularly intriguing form of such noise, $\Delta_T$ noise, emerges under a finite temperature difference in the absence of charge current at zero voltage bias. In this work, we investigate $\Delta_T$ noise in mesoscopic hybrid junctions incorporating insulating barriers, where the average charge current remains zero at zero bias. Using quantum shot noise measurements, we demonstrate that $\Delta_T$ noise in metal-insulator-superconductor (NIS) junctions is approximately $16$ times greater than in metal-insulator-metal (NIN) counterparts. Our analysis further reveals that $\Delta_T$ noise exhibits a non-monotonic dependence on barrier strength, rising to a peak before declining, while increasing monotonically with the applied temperature bias. These findings underscore the rich interplay between thermal gradients and barrier properties in determining quantum noise characteristics in hybrid mesoscopic systems.


[170] 2404.15243

UCINet0: A Machine Learning based Receiver for 5G NR PUCCH Format 0

Accurate decoding of Uplink Control Information (UCI) on the Physical Uplink Control Channel (PUCCH) is essential for enabling 5G wireless links. This paper explores an AI/ML-based receiver design for PUCCH Format 0. Format 0 signaling encodes the UCI content within the phase of a known base waveform and even supports multiplexing of up to 12 users within the same time-frequency resources. The proposed neural network classifier, which we term UCINet0, is capable of predicting when no user is transmitting on the PUCCH, as well as decoding the UCI content for any number of multiplexed users (up to 12). The test results with simulated, hardware-captured (lab) and field datasets show that the UCINet0 model outperforms conventional correlation-based decoders across all SNR ranges and multiple fading scenarios.


[171] 2405.14144

A Single Motor Nano Aerial Vehicle with Novel Peer-to-Peer Communication and Sensing Mechanism

Communication and position sensing are among the most important capabilities for swarm robots to interact with their peers and perform tasks collaboratively. However, the hardware required to facilitate communication and position sensing is often too complicated, expensive, and bulky to be carried on swarm robots. Here we present Maneuverable Piccolissimo 3 (MP3), a minimalist, single motor drone capable of executing inter-robot communication via infrared light and triangulation-based sensing of relative bearing, distance, and elevation using message arrival time. Thanks to its novel design, MP3 can communicate with peers and localize itself using simple components, keeping its size and mass small and making it inherently safe for human interaction. We present the hardware and software design of MP3 and demonstrate its capability to localize itself, fly stably, and maneuver in the environment using peer-to-peer communication and sensing.


[172] 2407.11654

R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models

Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML), where components of large ML models are outsourced to remote servers. A significant challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming that could jeopardize the learning process. This is particularly pronounced for embedding parameters in large language models (LLMs) and vision language models (VLMs), which are learned feature vectors essential for domain understanding. In this paper, rigorous insights are provided into the influence of jamming embeddings in SFL by deriving an expression for the ML training loss divergence and showing that it is upper-bounded by the mean squared error (MSE). Based on this analysis, a physical layer framework is developed for resilient SFL with LLMs (R-SFLLM) over wireless networks. R-SFLLM leverages wireless sensing data to gather information on the jamming directions-of-arrival (DoAs) for the purpose of devising a novel, sensing-assisted anti-jamming strategy while jointly optimizing beamforming, user scheduling, and resource allocation. Extensive experiments using both LLMs and VLMs demonstrate R-SFLLM's effectiveness, achieving close-to-baseline performance across various natural language processing (NLP) and computer vision (CV) tasks, datasets, and modalities. The proposed methodology further introduces an adversarial training component, where controlled noise exposure significantly enhances the model's resilience to perturbed parameters during training. The results show that more noise-sensitive models, such as RoBERTa, benefit from this feature, especially when resource allocation is unfair. It is also shown that worst-case jamming in particular translates into worst-case model outcomes, thereby necessitating jamming-resilient SFL protocols.


[173] 2409.18361

iWalker: Imperative Visual Planning for Walking Humanoid Robot

Humanoid robots, designed to operate in human-centric environments, serve as a fundamental platform for a broad range of tasks. Although humanoid robots have been extensively studied for decades, a majority of existing humanoid robots still heavily rely on complex modular frameworks, leading to inflexibility and potential compounded errors from independent sensing, planning, and acting components. In response, we propose an end-to-end humanoid sense-plan-act walking system, enabling vision-based obstacle avoidance and footstep planning for whole body balancing simultaneously. We designed two imperative learning (IL)-based bilevel optimizations for model-predictive step planning and whole body balancing, respectively, to achieve self-supervised learning for humanoid robot walking. This enables the robot to learn from arbitrary unlabeled data, improving its adaptability and generalization capabilities. We refer to our method as iWalker and demonstrate its effectiveness in both simulated and real-world environments, representing a significant advancement toward autonomous humanoid robots.


[174] 2410.15742

DeepVigor+: Scalable and Accurate Semi-Analytical Fault Resilience Analysis for Deep Neural Network

The growing exploitation of Machine Learning (ML) in safety-critical applications necessitates rigorous safety analysis. Hardware reliability assessment is a major concern with respect to measuring the level of safety in ML-based systems. Quantifying the reliability of emerging ML models, including Convolutional Neural Networks (CNNs), is highly complex due to their enormous size in terms of the number of parameters and computations. Conventionally, Fault Injection (FI) is applied to perform a reliability measurement. However, performing FI on modern-day CNNs is prohibitively time-consuming if an acceptable confidence level is to be achieved. To speed up FI for large CNNs, statistical FI (SFI) has been proposed, but its runtimes are still considerably long. In this work, we introduce DeepVigor+, a scalable, fast, and accurate semi-analytical method as an efficient alternative for reliability measurement in CNNs. DeepVigor+ implements a fault propagation analysis model and attempts to acquire Vulnerability Factors (VFs) as reliability metrics in an optimal way. The results indicate that DeepVigor+ obtains VFs for CNN models with an error less than $1\%$, i.e., the objective in SFI, but with $14.9$ up to $26.9$ times fewer simulations than the best-known state-of-the-art SFI. DeepVigor+ enables an accurate reliability analysis for large and deep CNNs within a few minutes, rather than achieving the same results in days or weeks.


[175] 2411.04949

Global Optimal Closed-Form Solutions for Intelligent Surfaces With Mutual Coupling: Is Mutual Coupling Detrimental or Beneficial?

Reconfigurable Intelligent Surface (RIS) is a breakthrough technology enabling the dynamic control of the propagation environment in wireless communications through programmable surfaces. To improve the flexibility of conventional diagonal RIS (D-RIS), beyond diagonal RIS (BD-RIS) has emerged as a family of more general RIS architectures. However, D-RIS and BD-RIS have been commonly explored neglecting mutual coupling effects, while the global optimization of RIS with mutual coupling, its performance limits, and scaling laws remain unexplored. This study addresses these gaps by deriving global optimal closed-form solutions for BD-RIS with mutual coupling to maximize the channel gain, specifically fully- and tree-connected RISs. Besides, we provide the expression of the maximum channel gain achievable in the presence of mutual coupling and its scaling law in closed form. By using the derived scaling laws, we analytically prove that mutual coupling increases the channel gain on average under Rayleigh fading channels. Our theoretical analysis, confirmed by numerical simulations, shows that both fully- and tree-connected RISs with mutual coupling achieve the same channel gain upper bound when optimized with the proposed global optimal solutions. Furthermore, we observe that a mutual coupling-unaware optimization of RIS can cause a channel gain degradation of up to 5 dB.


[176] 2411.06309

Physics-Compliant Modeling and Scaling Laws of Multi-RIS Aided MIMO Systems

Reconfigurable intelligent surface (RIS) enables the control of wireless channels to improve coverage. To further extend coverage, multi-RIS aided systems have been explored, where multiple RISs steer the signal via a multi-hop path. However, deriving a physics-compliant channel model for multi-RIS aided systems is still an open problem. In this study, we fill this gap by modeling multi-RIS aided systems through multiport network theory, and deriving a channel model accounting for impedance mismatch, mutual coupling, and structural scattering. The derived physics-compliant model differs from the model widely used in literature, which omits the RIS structural scattering. To quantify this difference, we derive the channel gain scaling laws of the two models under line-of-sight (LoS) and multipath channels. Theoretical insights, validated by numerical results, show an important discrepancy between the physics-compliant and the widely used models, increasing with the number of RISs and multipath richness. In a multi-hop system aided by four 128-element RISs with multipath channels, optimizing the RISs using the widely used model and applying their solutions to the physics-compliant model achieves only 7% of the maximum channel gain. This highlights how severely mismatched channel models can be, calling for more accurate models in communication theory.


[177] 2411.19611

Memristive Nanowire Network for Energy Efficient Audio Classification: Pre-Processing-Free Reservoir Computing with Reduced Latency

Efficient audio feature extraction is critical for low-latency, resource-constrained speech recognition. Conventional preprocessing techniques, such as Mel Spectrogram, Perceptual Linear Prediction (PLP), and Learnable Spectrogram, achieve high classification accuracy but require large feature sets and significant computation. The low-latency and power efficiency benefits of neuromorphic computing offer a strong potential for audio classification. Here, we introduce memristive nanowire networks as a neuromorphic hardware preprocessing layer for spoken-digit classification, a capability not previously demonstrated. Nanowire networks extract compact, informative features directly from raw audio, achieving a favorable trade-off between accuracy, dimensionality reduction from the original audio size (data compression), and training time efficiency. Compared with state-of-the-art software techniques, nanowire features reach 98.95% accuracy with 66 times data compression (XGBoost) and 97.9% accuracy with 255 times compression (Random Forest) in sub-second training latency. Across multiple classifiers, nanowire features consistently achieve more than 90% accuracy with more than 62.5 times compression, outperforming features extracted by conventional state-of-the-art techniques such as MFCC in efficiency without loss of performance. Moreover, nanowire features achieve 96.5% accuracy classifying multispeaker audio, outperforming all state-of-the-art feature accuracies while achieving the highest data compression and lowest training time. Nanowire network preprocessing also enhances linear separability of audio data, improving simple classifier performance and generalizing across speakers. These results demonstrate that memristive nanowire networks provide a novel, low-latency, and data-efficient feature extraction approach, enabling high-performance neuromorphic audio classification.


[178] 2412.00538

Prognostic Framework for Robotic Manipulators Operating Under Dynamic Task Severities

Robotic manipulators are critical in many applications but are known to degrade over time. This degradation is influenced by the nature of the tasks performed by the robot. Tasks with higher severity, such as handling heavy payloads, can accelerate the degradation process. One way this degradation is reflected is in the position accuracy of the robot's end-effector. In this paper, we present a prognostic modeling framework that predicts a robotic manipulator's Remaining Useful Life (RUL) while accounting for the effects of task severity. Our framework represents the robot's position accuracy as a Brownian motion process with a random drift parameter that is influenced by task severity. The dynamic nature of task severity is modeled using a continuous-time Markov chain (CTMC). To evaluate RUL, we discuss two approaches -- (1) a novel closed-form expression for Remaining Lifetime Distribution (RLD), and (2) Monte Carlo simulations, commonly used in prognostics literature. Theoretical results establish the equivalence between these RUL computation approaches. We validate our framework through experiments using two distinct physics-based simulators for planar and spatial robot fleets. Our findings show that robots in both fleets experience shorter RUL when handling a higher proportion of high-severity tasks.
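
The Monte Carlo route to RUL mentioned above can be sketched as a first-passage simulation. This is a simplified illustration, not the paper's model: it uses a two-state task-severity chain discretized in time, a fixed drift per state rather than a random drift parameter, and hypothetical names and default values throughout.

```python
import math
import random


def simulate_rul(drifts, switch_rates, sigma, threshold,
                 dt=0.01, horizon=1000.0, state=0, rng=None):
    """Simulate one degradation path and return its first-passage time.

    drifts[s]       : degradation drift while in severity state s (0 or 1)
    switch_rates[s] : rate of leaving state s; the continuous-time chain is
                      approximated by switching with probability rate * dt
    sigma           : diffusion coefficient of the Brownian term
    threshold       : degradation level that counts as failure
    """
    rng = rng or random.Random()
    level, t = 0.0, 0.0
    while t < horizon:
        if rng.random() < switch_rates[state] * dt:
            state = 1 - state  # toggle between low and high severity
        # Euler-Maruyama step for Brownian motion with state-dependent drift.
        level += drifts[state] * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
        if level >= threshold:
            return t
    return horizon  # no failure observed within the simulated horizon


def mean_rul(n_paths, **kwargs):
    """Average first-passage time over n_paths independent simulations."""
    return sum(simulate_rul(**kwargs) for _ in range(n_paths)) / n_paths
```

Setting `sigma=0` and both switching rates to zero reduces the path to a straight line, so the returned time is approximately `threshold / drift`, which gives a quick sanity check before adding noise and task switching.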


[179] 2501.06488

NVS-SQA: Exploring Self-Supervised Quality Representation Learning for Neurally Synthesized Scenes without References

Neural View Synthesis (NVS), such as NeRF and 3D Gaussian Splatting, effectively creates photorealistic scenes from sparse viewpoints, typically evaluated by quality assessment methods like PSNR, SSIM, and LPIPS. However, these full-reference methods, which compare synthesized views to reference views, may not fully capture the perceptual quality of neurally synthesized scenes (NSS), particularly due to the limited availability of dense reference views. Furthermore, the challenges in acquiring human perceptual labels hinder the creation of extensive labeled datasets, risking model overfitting and reduced generalizability. To address these issues, we propose NVS-SQA, an NSS quality assessment method to learn no-reference quality representations through self-supervision without reliance on human labels. Traditional self-supervised learning predominantly relies on the "same instance, similar representation" assumption and extensive datasets. However, given that these conditions do not apply in NSS quality assessment, we employ heuristic cues and quality scores as learning objectives, along with a specialized contrastive pair preparation process to improve the effectiveness and efficiency of learning. The results show that NVS-SQA outperforms 17 no-reference methods by a large margin (i.e., on average 109.5% in SRCC, 98.6% in PLCC, and 91.5% in KRCC over the second best) and even exceeds 16 full-reference methods across all evaluation metrics (i.e., 22.9% in SRCC, 19.1% in PLCC, and 18.6% in KRCC over the second best).


[180] 2501.13457

Zero-Shot Trajectory Planning for Signal Temporal Logic Tasks

Signal Temporal Logic (STL) is a powerful specification language for describing complex temporal behaviors of continuous signals, making it well-suited for high-level robotic task descriptions. However, generating executable plans for STL tasks is challenging, as it requires consideration of the coupling between the task specification and the system dynamics. Existing approaches either follow a model-based setting that explicitly requires knowledge of the system dynamics or adopt a task-oriented data-driven approach to learn plans for specific tasks. In this work, we address the problem of generating executable STL plans for systems with unknown dynamics. We propose a hierarchical planning framework that enables zero-shot generalization to new STL tasks by leveraging only task-agnostic trajectory data during offline training. The framework consists of three key components: (i) decomposing the STL specification into several progresses and time constraints, (ii) searching for timed waypoints that satisfy all progresses under time constraints, and (iii) generating trajectory segments using a pre-trained diffusion model and stitching them into complete trajectories. We formally prove that our method guarantees STL satisfaction, and simulation results demonstrate its effectiveness in generating dynamically feasible trajectories across diverse long-horizon STL tasks.


[181] 2502.04465

FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples and code are available at this https URL.


[182] 2503.10919

Data-Driven Soft Robot Control via Adiabatic Spectral Submanifolds

The mechanical complexity of soft robots creates significant challenges for their model-based control. Specifically, linear data-driven models have struggled to control soft robots on complex, spatially extended paths that explore regions with significant nonlinear behavior. To account for these nonlinearities, we develop here a model-predictive control strategy based on the recent theory of adiabatic spectral submanifolds (aSSMs). This theory is applicable because the internal vibrations of heavily overdamped robots decay at a speed that is much faster than the desired speed of the robot along its intended path. In that case, low-dimensional attracting invariant manifolds (aSSMs) emanate from the path and carry the dominant dynamics of the robot. Aided by this recent theory, we devise an aSSM-based model-predictive control scheme purely from data. We demonstrate our data-driven model's effectiveness in tracking dynamic trajectories across diverse tasks, validated on a high-fidelity, high-dimensional finite-element model of a soft trunk robot and a Cosserat rod-based elastic soft arm. Notably, we find that five- or six-dimensional aSSM-reduced models outperform the tracking performance of other data-driven modeling methods by a factor up to $10$ across all closed-loop control tasks.


[183] 2504.15863

DERD-Net: Learning Depth from Event-based Ray Densities

Event cameras offer a promising avenue for multi-view stereo depth estimation and Simultaneous Localization And Mapping (SLAM) due to their ability to detect blur-free 3D edges at high speed and over broad illumination conditions. However, traditional deep learning frameworks designed for conventional cameras struggle with the asynchronous, stream-like nature of event data, as their architectures are optimized for discrete, image-like inputs. We propose a scalable, flexible and adaptable framework for pixel-wise depth estimation with event cameras in both monocular and stereo setups. The 3D scene structure is encoded into disparity space images (DSIs), representing spatial densities of rays obtained by back-projecting events into space via known camera poses. Our neural network processes local subregions of the DSIs, combining 3D convolutions and a recurrent structure to recognize valuable patterns for depth prediction. Local processing enables fast inference with full parallelization and ensures constant ultra-low model complexity and memory costs, regardless of camera resolution. Experiments on standard benchmarks (MVSEC and DSEC datasets) demonstrate unprecedented effectiveness: (i) using purely monocular data, our method achieves comparable results to existing stereo methods; (ii) when applied to stereo data, it strongly outperforms all state-of-the-art (SOTA) approaches, reducing the mean absolute error by at least 42%; (iii) our method also increases depth completeness more than 3-fold while still reducing the median absolute error by at least 30%. Given its remarkable performance and effective processing of event data, our framework holds strong potential to become a standard approach for using deep learning for event-based depth estimation and SLAM. Project page: this https URL


[184] 2504.17959

CIVIL: Causal and Intuitive Visual Imitation Learning

Today's robots attempt to learn new tasks by imitating human examples. These robots watch the human complete the task, and then try to match the actions taken by the human expert. However, this standard approach to visual imitation learning is fundamentally limited: the robot observes what the human does, but not why the human chooses those behaviors. Without understanding which features of the system or environment factor into the human's decisions, robot learners often misinterpret the human's examples. In practice, this results in causal confusion, inefficient learning, and robot policies that fail when the environment changes. We therefore propose a shift in perspective: instead of asking human teachers just to show what actions the robot should take, we also enable humans to intuitively indicate why they made those decisions. Under our paradigm human teachers attach markers to task-relevant objects and use natural language prompts to describe their state representation. Our proposed algorithm, CIVIL, leverages this augmented demonstration data to filter the robot's visual observations and extract a feature representation that aligns with the human teacher. CIVIL then applies these causal features to train a transformer-based policy that -- when tested on the robot -- is able to emulate human behaviors without being confused by visual distractors or irrelevant items. Our simulations and real-world experiments demonstrate that robots trained with CIVIL learn both what actions to take and why to take those actions, resulting in better performance than state-of-the-art baselines. From the human's perspective, our user study reveals that this new training paradigm actually reduces the total time required for the robot to learn the task, and also improves the robot's performance in previously unseen scenarios. See videos at our project website: this https URL


[185] 2504.20383

Neural Stereo Video Compression with Hybrid Disparity Compensation

Disparity compensation represents the primary strategy in stereo video compression (SVC) for exploiting cross-view redundancy. These mechanisms can be broadly categorized into two types: one that employs explicit horizontal shifting, and another that utilizes an implicit cross-attention mechanism to reduce cross-view disparity redundancy. In this work, we propose a hybrid disparity compensation (HDC) strategy that leverages explicit pixel displacement as a robust prior feature to simplify optimization and perform implicit cross-attention mechanisms for subsequent warping operations, thereby capturing a broader range of disparity information. Specifically, HDC first computes a similarity map by fusing the horizontally shifted cross-view features to capture pixel displacement information. This similarity map is then normalized into an "explicit pixel-wise attention score" to perform the cross-attention mechanism, implicitly aligning features from one view to another. Building upon HDC, we introduce a novel end-to-end optimized neural stereo video compression framework, which integrates HDC-based modules into key coding operations, including cross-view feature extraction and reconstruction (HDC-FER) and cross-view entropy modeling (HDC-EM). Extensive experiments on SVC benchmarks, including KITTI 2012, KITTI 2015, and Nagoya, which cover both autonomous driving and general scenes, demonstrate that our framework outperforms both neural and traditional SVC methodologies.


[186] 2505.04472

Opinion Dynamics on Signed Graphs and Graphons

In this paper, we make use of graphon theory to study opinion dynamics on large undirected networks. The opinion dynamics models that we take into consideration allow for negative interactions between the individuals, whose opinions can thus grow apart. We consider both the repelling and the opposing models of negative interactions, which have been studied in the literature. We define the repelling and the opposing dynamics on signed graphons and we show that their initial value problem solutions exist and are unique. We then show that, in a suitable sense, the graphon dynamics is a good approximation of the dynamics on large graphs that converge to a graphon. This result applies to large random graphs that are sampled according to a graphon (W-random graphs), for which we provide a new convergence result under very general assumptions.
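
The two negative-interaction models named in the abstract can be illustrated on a finite signed graph. The following is a minimal sketch, assuming the forms in which the repelling and opposing rules are commonly written in the signed-network literature; the paper's own definitions on graphons may differ in detail. The function names, the forward-Euler discretization, and the two-agent example are illustrative assumptions.

```python
def step(x, A, dt, model):
    """One forward-Euler step of signed opinion dynamics on a weighted graph.

    x: list of opinions; A: symmetric signed adjacency matrix (list of lists).
    model: "repelling" or "opposing".
    """
    n = len(x)
    new_x = []
    for i in range(n):
        drift = 0.0
        for j in range(n):
            a = A[i][j]
            if model == "repelling":
                # A negative weight pushes x_i away from x_j: a * (x_j - x_i).
                drift += a * (x[j] - x[i])
            elif model == "opposing":
                # A negative weight pulls x_i toward -x_j:
                # |a| * (sign(a) * x_j - x_i).
                sign = (a > 0) - (a < 0)
                drift += abs(a) * (sign * x[j] - x[i])
            else:
                raise ValueError(f"unknown model: {model}")
        new_x.append(x[i] + dt * drift)
    return new_x


def simulate(x0, A, model, dt=0.01, steps=500):
    """Integrate the chosen model from the initial opinions x0."""
    x = list(x0)
    for _ in range(steps):
        x = step(x, A, dt, model)
    return x


# Two agents joined by a single negative edge (hypothetical example).
A_neg = [[0.0, -1.0], [-1.0, 0.0]]
x_rep = simulate([1.0, 0.0], A_neg, "repelling")
x_opp = simulate([1.0, 0.0], A_neg, "opposing")
```

In this two-agent example the repelling rule makes the opinions grow apart, while the opposing rule drives them toward values of equal magnitude and opposite sign.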


[187] 2505.10438

Koopman Eigenfunction-Based Identification and Optimal Nonlinear Control of Turbojet Engine

Gas turbine engines are complex and highly nonlinear dynamical systems. Deriving their physics-based models can be challenging because it requires performance characteristics that are not always available, often leading to many simplifying assumptions. This paper discusses the limitations of conventional experimental methods used to derive component-level and locally linear parameter-varying models, and addresses these issues by employing identification techniques based on data collected from standard engine operation under closed-loop control. The rotor dynamics are estimated using the sparse identification of nonlinear dynamics. Subsequently, the autonomous part of the dynamics is mapped into an optimally constructed Koopman eigenfunction space. This process involves eigenvalue optimization using metaheuristic algorithms and temporal projection, followed by gradient-based eigenfunction identification. The resulting Koopman model is validated against an in-house reference component-level model. A globally optimal nonlinear feedback controller and a Kalman estimator are then designed within the eigenfunction space and compared to traditional and gain-scheduled proportional-integral controllers, as well as a proposed internal model control approach. The eigenmode structure enables targeting individual modes during optimization, leading to improved performance tuning. Results demonstrate that the Koopman-based controller surpasses other benchmark controllers in both reference tracking and disturbance rejection under sea-level and varying flight conditions, due to its global nature.


[188] 2505.12258

An Information-Theoretic Framework for Receiver Quantization in Communication

We investigate information-theoretic limits and design of communication under receiver quantization. Unlike most existing studies, this work focuses on the impact of reducing the resolution from high to low. We consider a standard transceiver architecture, which includes an i.i.d. complex Gaussian codebook at the transmitter, and a symmetric quantizer cascaded with a nearest neighbor decoder at the receiver. Employing the generalized mutual information (GMI), an achievable rate under general quantization rules is obtained in an analytical form, which shows that the rate loss due to quantization is $\log\left(1+\gamma\mathsf{SNR}\right)$, where $\gamma$ is determined by thresholds and levels of the quantizer. Based on this result, the performance under uniform receiver quantization is analyzed comprehensively. We show that the front-end gain control, which determines the loading factor of quantization, has an increasing impact on performance as the resolution decreases. In particular, we prove that the unique loading factor that minimizes the MSE also maximizes the GMI, and the corresponding irreducible rate loss is given by $\log\left(1+\mathsf {mmse}\cdot\mathsf{SNR}\right)$, where mmse is the minimum MSE normalized by the variance of quantizer input, and is equal to the minimum of $\gamma$. A geometrical interpretation for the optimal uniform quantization at the receiver is further established. Moreover, by asymptotic analysis, we characterize the impact of biased gain control, showing how small rate losses decay to zero and providing rate approximations under large bias. From asymptotic expressions of the optimal loading factor and mmse, approximations and several per-bit rules for performance are also provided. Finally, we discuss more types of receiver quantization and show that the consistency between achievable rate maximization and MSE minimization does not hold in general.
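
To make the role of the loading factor concrete, the following is a minimal numerical sketch. It assumes a real-valued, unit-variance Gaussian quantizer input (the abstract considers a complex Gaussian codebook), a mid-rise uniform quantizer, and a base-2 logarithm; all function names are hypothetical. It evaluates the normalized MSE for a given loading factor and plugs the minimum into the rate-loss expression $\log\left(1+\mathsf{mmse}\cdot\mathsf{SNR}\right)$ quoted above.

```python
import math


def gaussian_pdf(x):
    """Density of a zero-mean, unit-variance Gaussian."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def uniform_quantize(x, bits, loading):
    """Symmetric mid-rise uniform quantizer that clips at +/- loading."""
    levels = 2 ** bits
    step = 2.0 * loading / levels
    # Index of the cell containing x, clamped to the outermost cells.
    k = math.floor(x / step)
    k = max(-(levels // 2), min(levels // 2 - 1, k))
    return (k + 0.5) * step


def normalized_mse(bits, loading, grid=4001, span=8.0):
    """E[(x - Q(x))^2] for unit-variance Gaussian x, via a Riemann sum."""
    dx = 2.0 * span / (grid - 1)
    total = 0.0
    for i in range(grid):
        x = -span + i * dx
        err = x - uniform_quantize(x, bits, loading)
        total += err * err * gaussian_pdf(x) * dx
    return total


def rate_loss_bits(gamma, snr):
    """Rate loss log2(1 + gamma * SNR); the base-2 log is an assumption."""
    return math.log2(1.0 + gamma * snr)


# 1-bit case: the MSE-optimal output levels are +/- sqrt(2/pi).
mse_1bit = normalized_mse(bits=1, loading=2.0 * math.sqrt(2.0 / math.pi))

# 2-bit case: sweep candidate loading factors and keep the MSE minimizer.
candidates = [0.5 + 0.05 * i for i in range(60)]
best_loading_2bit = min(candidates, key=lambda L: normalized_mse(2, L))
mse_2bit = normalized_mse(2, best_loading_2bit)
loss_2bit_at_10db = rate_loss_bits(mse_2bit, snr=10.0)
```

The sweep is a brute-force stand-in for the analytical characterization of the optimal loading factor; it only illustrates that a single loading factor minimizes the MSE and thereby fixes the irreducible rate loss.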


[189] 2505.16327

Cooperative NOMA Meets Emerging Technologies: A Survey for Next-Generation Wireless Networks

The emerging demands of sixth-generation wireless networks, such as ultra-connectivity, native intelligence, and cross-domain convergence, are bringing renewed focus to cooperative non-orthogonal multiple access (C-NOMA) as a fundamental enabler of scalable, efficient, and intelligent communication systems. C-NOMA builds on the core benefits of NOMA by leveraging user cooperation and relay strategies to enhance spectral efficiency, coverage, and energy performance. This article presents a unified and forward-looking survey on the integration of C-NOMA with key enabling technologies, including radio frequency energy harvesting, cognitive radio networks, reconfigurable intelligent surfaces, space-air-ground integrated networks, and integrated sensing and communication-assisted semantic communication. Foundational principles and relaying protocols are first introduced to establish the technical relevance of C-NOMA. Then, a focused investigation is conducted into protocol-level synergies, architectural models, and deployment strategies across these technologies. Beyond integration, this article emphasizes the orchestration of C-NOMA across future application domains such as digital twins, extended reality, and e-health. In addition, it provides an extensive and in-depth review of recent literature, categorized by relaying schemes, system models, performance metrics, and optimization paradigms, including model-based, heuristic, and AI-driven approaches. Finally, open challenges and future research directions are outlined, spanning standardization, security, and cross-layer design, positioning C-NOMA as a key pillar of intelligent next-generation network architectures.


[190] 2506.00358

$\texttt{AVROBUSTBENCH}$: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test-time is not yet fully understood. Existing robustness benchmarks mainly focus on single modalities, making them insufficient for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios where shifts can occur $\textit{simultaneously}$ in both audio and visual modalities, we introduce $\texttt{AVROBUSTBENCH}$, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. $\texttt{AVROBUSTBENCH}$ comprises four audio-visual benchmark datasets, $\texttt{AUDIOSET-2C}$, $\texttt{VGGSOUND-2C}$, $\texttt{KINETICS-2C}$, and $\texttt{EPICKITCHENS-2C}$, each incorporating 75 bimodal audio-visual corruptions that are $\textit{co-occurring}$ and $\textit{correlated}$. Through extensive evaluations, we observe that state-of-the-art supervised and self-supervised audio-visual models exhibit declining robustness as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods, on $\texttt{VGGSOUND-2C}$ and $\texttt{KINETICS-2C}$, offer minimal improvements in performance under bimodal corruptions. We further propose $\texttt{AV2C}$, a simple TTA approach enabling on-the-fly cross-modal fusion by penalizing high-entropy samples, which achieves improvements on $\texttt{VGGSOUND-2C}$. We hope that $\texttt{AVROBUSTBENCH}$ will steer the development of more effective and robust audio-visual TTA approaches. Our code is available at this https URL.


[191] 2506.01213

On the Stability of Graph Convolutional Neural Networks: A Probabilistic Perspective

Graph convolutional neural networks (GCNNs) have emerged as powerful tools for analyzing graph-structured data, achieving remarkable success across diverse applications. However, the theoretical understanding of the stability of these models, i.e., their sensitivity to small changes in the graph structure, remains in rather limited settings, hampering the development and deployment of robust and trustworthy models in practice. To fill this gap, we study how perturbations in the graph topology affect GCNN outputs and propose a novel formulation for analyzing model stability. Unlike prior studies that focus only on worst-case perturbations, our distribution-aware formulation characterizes output perturbations across a broad range of input data. This way, our framework enables, for the first time, a probabilistic perspective on the interplay between the statistical properties of the node data and perturbations in the graph topology. We conduct extensive experiments to validate our theoretical findings and demonstrate their benefits over existing baselines, in terms of both representation stability and adversarial attacks on downstream tasks. Our results demonstrate the practical significance of the proposed formulation and highlight the importance of incorporating data distribution into stability analysis.


[192] 2506.19885

FlightKooba: A Fast Interpretable FTP Model

Flight trajectory prediction (FTP) and similar time series tasks typically require capturing smooth latent dynamics hidden within noisy signals. However, existing deep learning models face significant challenges of high computational cost and insufficient interpretability due to their complex black-box nature. This paper introduces FlightKooba, a novel modeling approach designed to extract such underlying dynamics analytically. Our framework uniquely integrates HiPPO theory, Koopman operator theory, and control theory. By leveraging Legendre polynomial bases, it constructs Koopman operators analytically, thereby avoiding large-scale parameter training. The method's core strengths lie in its exceptional computational efficiency and inherent interpretability. Experiments on multiple public datasets validate our design philosophy: for signals exhibiting strong periodicity or clear physical laws (e.g., in aviation, meteorology, and traffic flow), FlightKooba delivers competitive prediction accuracy while reducing trainable parameters by several orders of magnitude and achieving the fastest training speed. Furthermore, we analyze the model's theoretical boundaries, clarifying its inherent low-pass filtering characteristics that render it unsuitable for sequences dominated by high-frequency noise. In summary, FlightKooba offers a powerful, efficient, and interpretable new alternative for time series analysis, particularly in resource-constrained environments.


[193] 2507.05177

OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

Empathetic interaction is a cornerstone of human-machine communication, due to the need for understanding speech enriched with paralinguistic cues and generating emotional and expressive responses. However, the most powerful empathetic large speech language models (LSLMs) are increasingly closed off, leaving the crucial details about the architecture, data and development opaque to researchers. Given the critical need for transparent research into LSLMs and empathetic behavior, we present OpenS2S, a fully open-source, transparent and end-to-end LSLM designed to enable empathetic speech interactions. Based on our empathetic speech-to-text model BLSP-Emo, OpenS2S further employs a streaming interleaved decoding architecture to achieve low-latency speech generation. To facilitate end-to-end training, OpenS2S incorporates an automated data construction pipeline that synthesizes diverse, high-quality empathetic speech dialogues at low cost. By leveraging large language models to generate empathetic content and controllable text-to-speech systems to introduce speaker and emotional variation, we construct a scalable training corpus with rich paralinguistic diversity and minimal human supervision. We release the fully open-source OpenS2S model, including the dataset, model weights, and pre-training and fine-tuning code, to empower the broader research community and accelerate innovation in empathetic speech systems. The project webpage can be accessed at this https URL


[194] 2507.05604

Kernel Density Steering: Inference-Time Scaling via Mode Seeking for Image Restoration

Diffusion models show promise for image restoration, but existing methods often struggle with inconsistent fidelity and undesirable artifacts. To address this, we introduce Kernel Density Steering (KDS), a novel inference-time framework promoting robust, high-fidelity outputs through explicit local mode-seeking. KDS employs an $N$-particle ensemble of diffusion samples, computing patch-wise kernel density estimation gradients from their collective outputs. These gradients steer patches in each particle towards shared, higher-density regions identified within the ensemble. This collective local mode-seeking mechanism, acting as "collective wisdom", steers samples away from spurious modes prone to artifacts, arising from independent sampling or model imperfections, and towards more robust, high-fidelity structures. This allows us to obtain better quality samples at the expense of higher compute by simultaneously sampling multiple particles. As a plug-and-play framework, KDS requires no retraining or external verifiers, seamlessly integrating with various diffusion samplers. Extensive numerical validations demonstrate KDS substantially improves both quantitative and qualitative performance on challenging real-world super-resolution and image inpainting tasks.
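
The mode-seeking idea can be illustrated with a plain mean-shift update on an ensemble of scalar values. This is only a sketch of the general technique under simplifying assumptions: KDS as described above operates patch-wise on diffusion samples during sampling, whereas here each "particle" is a single number, the kernel is Gaussian, and the function name, bandwidth, and step size are hypothetical.

```python
import math


def mean_shift_step(particles, bandwidth, step_size=1.0):
    """Move each scalar particle toward the kernel-weighted ensemble mean.

    For a Gaussian kernel, (target - x) is proportional to the gradient of
    the log kernel density estimate at x, so this is a density-ascent step.
    """
    updated = []
    for x in particles:
        weights = [
            math.exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in particles
        ]
        target = sum(w * y for w, y in zip(weights, particles)) / sum(weights)
        updated.append(x + step_size * (target - x))
    return updated


# Three particles near a shared mode and one outlier (hypothetical values).
particles = [0.0, 0.1, -0.1, 0.5]
steered = mean_shift_step(particles, bandwidth=0.5)
```

After one step the outlier is pulled toward the region where the ensemble is densest, which is the behavior the abstract describes as steering samples away from spurious modes.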


[195] 2507.16343

Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries

Most existing sound event detection (SED) algorithms operate under a closed-set assumption, restricting their detection capabilities to predefined classes. While recent efforts have explored language-driven zero-shot SED by exploiting audio-language models, their performance is still far from satisfactory due to the lack of fine-grained alignment and cross-modal feature fusion. In this work, we propose the Detect Any Sound Model (DASM), a query-based framework for open-vocabulary SED guided by multi-modal queries. DASM formulates SED as a frame-level retrieval task, where audio features are matched against query vectors derived from text or audio prompts. To support this formulation, DASM introduces a dual-stream decoder that explicitly decouples event recognition and temporal localization: a cross-modality event decoder performs query-feature fusion and determines the presence of sound events at the clip level, while a context network models temporal dependencies for frame-level localization. Additionally, an inference-time attention masking strategy is proposed to leverage semantic relations between base and novel classes, substantially enhancing generalization to novel classes. Experiments on the AudioSet Strong dataset demonstrate that DASM effectively balances localization accuracy with generalization to novel classes, outperforming CLAP-based methods in the open-vocabulary setting (+7.8 PSDS) and the baseline in the closed-set setting (+6.9 PSDS). Furthermore, in cross-dataset zero-shot evaluation on DESED, DASM achieves a PSDS1 score of 42.2, even exceeding the supervised CRNN baseline. The project page is available at this https URL.
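
The frame-level retrieval formulation can be sketched as similarity matching between per-frame audio features and per-class query vectors. This is a toy illustration only: it omits the dual-stream decoder and attention masking described above, and the cosine similarity, the threshold, the two-dimensional features, and all names are assumptions rather than details taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def frame_level_scores(frames, queries):
    """Score every frame against every query (class name -> query vector)."""
    return {
        name: [cosine(frame, query) for frame in frames]
        for name, query in queries.items()
    }


def active_frames(scores, threshold):
    """Indices of frames whose score reaches the threshold, per query."""
    return {
        name: [t for t, s in enumerate(row) if s >= threshold]
        for name, row in scores.items()
    }


# Hypothetical 2-D frame features and two open-vocabulary queries.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
queries = {"dog": [1.0, 0.0], "siren": [0.0, 1.0]}
detections = active_frames(frame_level_scores(frames, queries), threshold=0.8)
```

Because the classes enter only through their query vectors, a new class can be added at inference time by supplying a new query, which is what makes the formulation open-vocabulary.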


[196] 2507.16390

DASPack: Controlled Data Compression for Distributed Acoustic Sensing

We present DASPack, a high-performance, open-source compression tool specifically designed for distributed acoustic sensing (DAS) data. As DAS becomes a key technology for real-time, high-density, and long-range monitoring in fields such as geophysics, infrastructure surveillance, and environmental sensing, the volume of collected data is rapidly increasing. Large-scale DAS deployments already generate hundreds of terabytes, and this volume is expected to grow in the coming years, making long-term storage a major challenge. Despite this urgent need, few compression methods have proven to be both practical and scalable in real-world scenarios. DASPack is a fully operational solution that consistently outperforms existing techniques for DAS data. It enables both controlled lossy and lossless compression by allowing users to choose the maximum absolute difference per datum between the original and compressed data. The compression pipeline combines wavelet transforms, linear predictive coding, and entropy coding to optimise efficiency. Our method achieves up to 3x file size reductions for strain and strain rate data in lossless mode across diverse datasets. In lossy mode, compression improves to 6x with near-perfect signal fidelity, and up to 10x is reached with acceptable signal degradation. It delivers fast throughput (100-200 MB/s using a single thread and up to 750 MB/s using 8 threads), enabling real-time deployment even under high data rates. We validated its performance on 15 datasets from a variety of acquisition environments, demonstrating its speed, robustness, and broad applicability. DASPack provides a practical foundation for long-term, sustainable DAS data management in large-scale monitoring networks.
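
The notion of a user-chosen maximum absolute difference per datum can be illustrated with error-bounded quantization of integer samples followed by a first-order predictor. This is a minimal sketch and not DASPack's actual pipeline or API: the wavelet and entropy-coding stages are omitted, the simple difference predictor stands in for linear predictive coding, and every name below is hypothetical.

```python
def quantize(samples, max_abs_error):
    """Map integer samples to indices so reconstruction error <= max_abs_error.

    With max_abs_error = 0 the step is 1 and the mapping is lossless.
    """
    step = 2 * max_abs_error + 1
    return [(s + max_abs_error) // step for s in samples]


def dequantize(indices, max_abs_error):
    """Reconstruct samples from quantization indices."""
    step = 2 * max_abs_error + 1
    return [q * step for q in indices]


def delta_encode(values):
    """First-order prediction residuals (each value minus its predecessor)."""
    residuals = []
    previous = 0
    for v in values:
        residuals.append(v - previous)
        previous = v
    return residuals


def delta_decode(residuals):
    """Invert delta_encode by cumulative summation."""
    values = []
    previous = 0
    for r in residuals:
        previous += r
        values.append(previous)
    return values


# Hypothetical integer strain-rate samples and a tolerance of 2 counts.
samples = [1000, 1004, 1009, 1011, 1008, 1002, 995, -3, -7]
tolerance = 2
residuals = delta_encode(quantize(samples, tolerance))
restored = dequantize(delta_decode(residuals), tolerance)
lossless = dequantize(delta_decode(delta_encode(quantize(samples, 0))), 0)
```

The residuals are typically small integers, which is what makes a subsequent entropy coder effective; setting the tolerance to zero recovers the lossless case.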


[197] 2507.22313

Time-Resolved EEG Decoding of Semantic Processing Reveals Altered Neural Dynamics in Depression and Suicidality

Depression and suicidality affect cognitive and emotional processes, yet objective, task-evoked neural readouts of mental health remain limited. We investigated the spatiotemporal dynamics of affective semantic processing using multivariate decoding of time-resolved, 64-channel electroencephalography (EEG). Participants (N=137) performed a sentence-evaluation task with emotionally salient, self-referential statements. We identified robust neural signatures of semantic processing, with peak decoding accuracy between 300-600 ms -- a window associated with rapid, stimulus-driven semantic evaluation and conflict monitoring. Relative to healthy controls, individuals with depression and suicidal ideation showed earlier onset, longer duration, and greater amplitude decoding responses, along with broader cross-temporal generalization and enhanced contributions from frontocentral and parietotemporal components. These findings suggest altered sensitivity and impaired disengagement from emotionally salient content in the clinical groups, advancing our understanding of the neurocognitive basis of mental health and establishing a compact and interpretable EEG-based index of semantic-evaluation dynamics with potential diagnostic relevance.


[198] 2508.01488

PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective

In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performance while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO's practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model's low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
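
The reason a Toeplitz fully-connected layer is translation-equivariant can be checked directly: its weight depends only on the offset between output and input index, so shifting the input shifts the output. The sketch below is a generic illustration with integer weights, not PESTO's layer; the kernel values, sizes, and function names are assumptions.

```python
def toeplitz_layer(x, kernel, out_len):
    """Compute y[i] = sum_j kernel[i - j] * x[j].

    kernel maps an index offset (i - j) to a weight; missing offsets are 0.
    """
    return [
        sum(kernel.get(i - j, 0) * x_j for j, x_j in enumerate(x))
        for i in range(out_len)
    ]


def shift_right(v):
    """Translate a vector by one bin, padding with zero on the left."""
    return [0] + v[:-1]


# Hypothetical weights and an input frame whose last bin is zero, so that
# shifting does not push any energy out of the frame.
kernel = {-1: 2, 0: 5, 1: -3}
frame = [0, 1, 4, 2, 0, 0]

output = toeplitz_layer(frame, kernel, out_len=6)
output_of_shifted = toeplitz_layer(shift_right(frame), kernel, out_len=6)
```

Away from the frame boundary, `output_of_shifted` equals `output` translated by one bin. On a log-frequency axis a pitch transposition is such a translation, which is why this property matters for the objective described above.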


[199] 2508.03543

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Text-to-speech (TTS) has shown great progress in recent years. However, most existing TTS systems offer only coarse and rigid emotion control, typically via discrete emotion labels or a carefully crafted and detailed emotional text prompt, making fine-grained emotion manipulation either inaccessible or unstable. These models also require extensive, high-quality datasets for training. To address these limitations, we propose EmoSteer-TTS, a novel training-free approach, to achieve fine-grained speech emotion control (conversion, interpolation, erasure) by activation steering. We first empirically observe that modifying a subset of the internal activations within a flow matching-based TTS model can effectively alter the emotional tone of synthesized speech. Building on this insight, we then develop a training-free and efficient algorithm, including activation extraction, emotional token searching, and inference-time steering, which can be seamlessly integrated into a wide range of pretrained models (e.g., F5-TTS, CosyVoice2, and E2-TTS). In addition, to derive effective steering vectors, we construct a curated emotional speech dataset with diverse speakers. Extensive experiments demonstrate that EmoSteer-TTS enables fine-grained, interpretable, and continuous control over speech emotion, outperforming the state-of-the-art (SOTA). To the best of our knowledge, this is the first method that achieves training-free and continuous fine-grained emotion control in TTS. Demo samples are available at this https URL.


[200] 2508.04273

Audio Does Matter: Importance-Aware Multi-Granularity Fusion for Video Moment Retrieval

Video Moment Retrieval (VMR) aims to retrieve a specific moment semantically related to the given query. To tackle this task, most existing VMR methods solely focus on the visual and textual modalities while neglecting the complementary but important audio modality. Although a few recent works try to tackle the joint audio-vision-text reasoning, they treat all modalities equally and simply embed them without fine-grained interaction for moment retrieval. These designs are impractical because not all audio is helpful for video moment retrieval: the audio of some videos may be pure noise or background sound that is irrelevant to determining the moment. To this end, we propose a novel Importance-aware Multi-Granularity fusion model (IMG), which learns to dynamically and selectively aggregate the audio-vision-text contexts for VMR. Specifically, after integrating the textual guidance with vision and audio separately, we first design a pseudo-label-supervised audio importance predictor that predicts the importance score of the audio, and accordingly assigns weights to mitigate the interference caused by noisy audio. Then, we design a multi-granularity audio fusion module that adaptively fuses audio and visual modalities at local-, event-, and global-level, fully capturing their complementary contexts. We further propose a cross-modal knowledge distillation strategy to address the challenge of missing audio modality during inference. To evaluate our method, we further construct a new VMR dataset, i.e., Charades-AudioMatter, where audio-related samples are manually selected and re-organized from the original Charades-STA to validate the model's capability in utilizing audio modality. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art results among VMR methods that use audio-video fusion. Our code is available at this https URL.


[201] 2508.07841

Learning Robust Satellite Attitude Dynamics with Physics-Informed Normalising Flow

Attitude control is a fundamental aspect of spacecraft operations. Model Predictive Control (MPC) has emerged as a powerful strategy for these tasks, relying on accurate models of the system dynamics to optimize control actions over a prediction horizon. In scenarios where physics models are incomplete, difficult to derive, or computationally expensive, machine learning offers a flexible alternative by learning the system behavior directly from data. However, purely data-driven models often struggle with generalization and stability, especially when applied to inputs outside their training domain. To address these limitations, we investigate the benefits of incorporating Physics-Informed Neural Networks (PINNs) into the learning of spacecraft attitude dynamics, comparing their performance with that of purely data-driven approaches. Using a Real-valued Non-Volume Preserving (Real NVP) neural network architecture with a self-attention mechanism, we trained several models on simulated data generated with the Basilisk simulator. Two training strategies were considered: a purely data-driven baseline and a physics-informed variant to improve robustness and stability. Our results demonstrate that including physics-based information improves the mean relative error of the best architectures found by 27.08%. These advantages are particularly evident when the learned models are integrated into an MPC framework, where PINN-based models consistently outperform their purely data-driven counterparts in terms of control accuracy and robustness, and achieve improved settling times compared to traditional MPC approaches, yielding improvements of up to 62% when subject to observation noise and reaction wheel (RW) friction.


[202] 2508.08141

Pindrop it! Audio and Visual Deepfake Countermeasures for Robust Detection and Fine Grained-Localization

The field of visual and audio generation is burgeoning with new state-of-the-art methods. This rapid proliferation of new techniques underscores the need for robust solutions for detecting synthetic content in videos. In particular, when fine-grained alterations via localized manipulations are performed in visual, audio, or both domains, these subtle modifications add challenges to the detection algorithms. This paper presents solutions for the problems of deepfake video classification and localization. The methods were submitted to the ACM 1M Deepfakes Detection Challenge, achieving the best performance in the temporal localization task and a top four ranking in the classification task for the TestA split of the evaluation dataset.


[203] 2509.23729

LUQ: Layerwise Ultra-Low Bit Quantization for Multimodal Large Language Models

Large Language Models (LLMs) with multimodal capabilities have revolutionized vision-language tasks, but their deployment often requires huge memory and computational resources. While post-training quantization (PTQ) has successfully compressed language models to as low as 1-bit precision without significant performance loss, its effectiveness for multimodal LLMs (MLLMs) remains relatively unexplored. In this paper, we present the first study on ultra-low bit (<4-bit) quantization for multimodal LLMs. Our analysis reveals that multimodal tokens and the intermediate layer activations they produce exhibit significantly higher statistical variance and entropy than text tokens, making them less tolerant to ultra-low bit quantization. However, the activation distributions of multimodal tokens vary significantly across layers, with some layers having lower-entropy activation distributions. We empirically show that such layers can better tolerate ultra-low bit quantization. Building on these insights, we propose a novel strategy for MLLM quantization, LUQ: Layerwise Ultra-Low Bit Quantization, which selectively applies ultra-low bit quantization to layers that are more resilient to it. Additionally, we show that using a mix of multimodal tokens (image and text) for PTQ boosts VQA performance in the ultra-low bit regime. We evaluate our method on LLaVA-1.5 and Qwen-2.5-VL across 9 popular VQA benchmarks. The resulting LUQ models use 40% and 31% less memory than their 4-bit counterparts, respectively, while exhibiting a performance degradation of less than 10% on the MME benchmark.
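
The idea of choosing layers by activation entropy can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the entropy estimator, the selection rule, or the bit widths, so the histogram estimator, the "lowest-entropy first" ranking, and the names `histogram_entropy` and `assign_bit_widths` are all hypothetical.

```python
import math
from collections import Counter


def histogram_entropy(values, num_bins=16):
    """Shannon entropy (in bits) of a fixed-width histogram of the values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant signal carries no information
    width = (hi - lo) / num_bins
    # Clamp the top edge into the last bin.
    counts = Counter(min(int((v - lo) / width), num_bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def assign_bit_widths(layer_activations, num_low_bit_layers,
                      low_bits=2, default_bits=4):
    """Give the lowest-entropy layers the ultra-low bit width.

    layer_activations: dict mapping a layer name to a list of sampled
    activation values (assumed to come from a calibration pass).
    """
    entropies = {name: histogram_entropy(acts)
                 for name, acts in layer_activations.items()}
    ranked = sorted(entropies, key=entropies.get)
    low = set(ranked[:num_low_bit_layers])
    return {name: (low_bits if name in low else default_bits)
            for name in layer_activations}
```

For example, a layer whose sampled activations are almost all identical would be ranked ahead of a layer whose activations spread evenly over their range, and would therefore receive the lower bit width first.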


[204] 2510.05109

Tiny but Mighty: A Software-Hardware Co-Design Approach for Efficient Multimodal Inference on Battery-Powered Small Devices

Large Multimodal Models (LMMs) are inherently modular, consisting of vision and audio encoders, projectors, and large language models. Yet, they are almost always executed monolithically, which underutilizes the heterogeneous accelerators (NPUs, GPUs, DSPs) in modern SoCs and leads to high end-to-end latency. In this paper, we present NANOMIND, a hardware-software co-design inference framework for LMMs that breaks large models into modular "bricks" (vision, language, audio, etc.) and maps each to its ideal accelerator. The key insight is that large models can be broken into modular components and scheduled to run on the most appropriate compute units. The framework performs module-level dynamic offloading across accelerators on unified-memory SoCs. By combining customized hardware design, system-level scheduling, and optimized low-bit computation kernels, we demonstrate our framework with a compact, battery-powered device capable of running LMMs entirely on device. This prototype functions as a self-contained intelligent assistant that requires no network connectivity, while achieving higher throughput and superior power efficiency under strict resource constraints. The design further bypasses CPU bottlenecks and reduces redundant memory usage through token-aware buffer management and module-level coordination. Our system outperforms existing implementations in resource efficiency, cutting energy consumption by 42.3% and GPU memory usage by 11.2%. This enables a battery-powered device to run LLaVA-OneVision with a camera for nearly half a day and LLaMA-3-8B for voice interactions for almost 20.8 hours.


[205] 2510.11507

Automatic Music Sample Identification with Multi-Track Contrastive Learning

Sampling, the technique of reusing pieces of existing audio tracks to create new music content, is a very common practice in modern music production. In this paper, we tackle the challenging task of automatic sample identification, that is, detecting such sampled content and retrieving the material from which it originates. To do so, we adopt a self-supervised learning approach that leverages a multi-track dataset to create positive pairs of artificial mixes, and design a novel contrastive learning objective. We show that this method significantly outperforms previous state-of-the-art baselines, that it is robust to various genres, and that it scales well when increasing the number of noise songs in the reference database. In addition, we extensively analyze the contribution of the different components of our training pipeline and highlight, in particular, the need for high-quality separated stems for this task.
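
The paper's own contrastive objective is described as novel and is not specified in the abstract. For orientation only, the sketch below shows the standard InfoNCE loss that such objectives usually build on: each anchor embedding should be closer to its own positive than to the positives of the other pairs in the batch. The function names and the `temperature` default are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss; positives[i] is the positive for anchors[i].

    All other entries of `positives` act as in-batch negatives.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the whole batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)
```

In a setting like the one described, the two members of a positive pair would be embeddings of two artificial mixes that share source material, and the loss is small only when matching pairs are the most similar ones in the batch.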


[206] 2510.14511

Stability Criteria and Motor Performance in Delayed Haptic Dyadic Interactions Mediated by Robots

This paper establishes analytical stability criteria for robot-mediated human-human (dyadic) interaction systems, focusing on haptic communication under network-induced time delays. Through frequency-domain analysis supported by numerical simulations, we identify both delay-independent and delay-dependent stability criteria. The delay-independent criterion guarantees stability irrespective of the delay, whereas the delay-dependent criterion is characterised by a maximum tolerable delay before instability occurs. The criteria demonstrate dependence on controller and robot dynamic parameters, where increasing stiffness reduces the maximum tolerable delay in a non-linear manner, thereby heightening system vulnerability. The proposed criteria can be generalised to a wide range of robot-mediated interactions and serve as design guidelines for stable remote dyadic systems. Experiments with robots performing human-like movements further illustrate the correlation between stability and motor performance. The findings of this paper suggest the prerequisites for effective delay-compensation strategies.
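
To illustrate what a delay-dependent criterion looks like, consider a toy system that is not the paper's model: the scalar delayed feedback loop x'(t) = -k x(t - tau) with gain k > 0. A classical result states that it is stable exactly when k * tau < pi / 2, so the maximum tolerable delay pi / (2k) shrinks non-linearly as the gain grows. The sketch below computes that bound and checks it with a simple forward-Euler simulation; the function names, step size, and horizon are assumptions chosen for the example.

```python
import math


def max_tolerable_delay(gain):
    """Largest delay for which x'(t) = -gain * x(t - delay) stays stable."""
    return math.pi / (2.0 * gain)


def tail_amplitude(gain, delay, t_end=100.0, dt=0.01):
    """Simulate the delayed loop and return max |x| over the last 20 time units."""
    d = round(delay / dt)            # delay expressed in time steps
    x = [1.0] * (d + 1)              # constant history on [-delay, 0]
    for _ in range(round(t_end / dt)):
        # Forward Euler step using the state from `d` steps ago.
        x.append(x[-1] - dt * gain * x[-1 - d])
    tail = x[-round(20.0 / dt):]
    return max(abs(v) for v in tail)
```

With k = 1 the bound is about 1.57, so a delay of 1.0 gives a decaying response while a delay of 2.0 gives growing oscillations. Doubling the gain halves the bound, which mirrors, in a much simpler setting, the reported trade-off between stiffness and tolerable delay.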