New articles on Electrical Engineering and Systems Science


[1] 2508.16601

Notes on Deterministic and Stochastic Approaches in Electromagnetic Information Theory

This paper investigates the relationship between the Number of Degrees of Freedom ($N_{\rm DoF}$) of the field in deterministic and stochastic source models within Electromagnetic Information Theory (EIT). Our findings demonstrate a fundamental connection between these two approaches. Specifically, we show that a deterministic model and a stochastic model with a spatially incoherent and homogeneous source yield not only the same $N_{\rm DoF}$ but also identical eigenvalues and basis functions for field representation. This key equivalence not only explains the effectiveness of deterministic approaches in EIT but also corroborates the use of classical electromagnetic methods within this new discipline.


[2] 2508.16650

Predicting brain tumour enhancement from non-contrast MR imaging with artificial intelligence

Brain tumour imaging assessment typically requires both pre- and post-contrast MRI, but gadolinium administration is not always desirable, such as in frequent follow-up, renal impairment, allergy, or paediatric patients. We aimed to develop and validate a deep learning model capable of predicting brain tumour contrast enhancement from non-contrast MRI sequences alone. We assembled 11,089 brain MRI studies from 10 international datasets spanning adult and paediatric populations with various neuro-oncological states, including glioma, meningioma, metastases, and post-resection appearances. Deep learning models (nnU-Net, SegResNet, SwinUNETR) were trained to predict and segment enhancing tumour using only non-contrast T1-, T2-, and T2/FLAIR-weighted images. Performance was evaluated on 1,109 held-out test patients using patient-level detection metrics and voxel-level segmentation accuracy. Model predictions were compared against 11 expert radiologists, who each reviewed 100 randomly selected patients. The best-performing nnU-Net achieved 83% balanced accuracy, 91.5% sensitivity, and 74.4% specificity in detecting enhancing tumour. Enhancement volume predictions strongly correlated with ground truth ($R^2 = 0.859$). The model outperformed the expert radiologists, who achieved 69.8% accuracy, 75.9% sensitivity, and 64.7% specificity. Among test patients, 76.8% had a Dice score over 0.3 (acceptable detection), 67.5% over 0.5 (good detection), and 50.2% over 0.7 (excellent detection). Deep learning can identify contrast-enhancing brain tumours from non-contrast MRI with clinically relevant performance. These models show promise as screening tools and may reduce gadolinium dependence in neuro-oncology imaging. Future work should evaluate clinical utility alongside radiology experts.
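The detection and overlap metrics quoted above have standard definitions; a minimal sketch (not the study's evaluation code) of balanced accuracy and the Dice coefficient behind the 0.3/0.5/0.7 detection grades:

```python
def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (true-positive rate) and specificity
    (true-negative rate); robust to class imbalance."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2

def dice(pred_voxels, truth_voxels):
    """Dice overlap between predicted and ground-truth voxel sets:
    2|A & B| / (|A| + |B|), i.e. 1.0 for perfect agreement."""
    a, b = set(pred_voxels), set(truth_voxels)
    if not a and not b:
        return 1.0  # both empty: treat as perfect agreement
    return 2 * len(a & b) / (len(a) + len(b))
```

Under this convention, the reported 91.5% sensitivity and 74.4% specificity average to the 83% balanced accuracy quoted in the abstract.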


[3] 2508.16730

Analysis of Transferability Estimation Metrics for Surgical Phase Recognition

Fine-tuning pre-trained models has become a cornerstone of modern machine learning, allowing practitioners to achieve high performance with limited labeled data. In surgical video analysis, where expert annotations are especially time-consuming and costly, identifying the most suitable pre-trained model for a downstream task is both critical and challenging. Source-independent transferability estimation (SITE) offers a solution by predicting how well a model will fine-tune on target data using only its embeddings or outputs, without requiring full retraining. In this work, we formalize SITE for surgical phase recognition and provide the first comprehensive benchmark of three representative metrics (LogME, H-Score, and TransRate) on two diverse datasets (RAMIE and AutoLaparo). Our results show that LogME, particularly when aggregated by the minimum per-subset score, aligns most closely with fine-tuning accuracy; H-Score yields only weak predictive power; and TransRate often inverts the true model rankings. Ablation studies show that when candidate models have similar performance, transferability estimates lose discriminative power, emphasizing the importance of maintaining model diversity or using additional validation. We conclude with practical guidelines for model selection and outline future directions toward domain-specific metrics, theoretical foundations, and interactive benchmarking tools.
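Of the three metrics, H-Score is the simplest to state: the trace of the pseudo-inverse of the feature covariance times the covariance of the class-conditional feature means. A minimal NumPy sketch of our reading of that standard definition (not the benchmark's code):

```python
import numpy as np

def h_score(features, labels):
    """H-Score: tr(pinv(cov(F)) @ cov(F_bar)), where F_bar replaces each
    sample's features by its class mean. Higher = more transferable."""
    f = features - features.mean(axis=0)
    cov_f = f.T @ f / len(f)
    g = np.zeros_like(f)
    for c in np.unique(labels):
        mask = labels == c
        g[mask] = f[mask].mean(axis=0)  # class-conditional means
    cov_g = g.T @ g / len(g)
    return float(np.trace(np.linalg.pinv(cov_f) @ cov_g))
```

Labels that align with a dominant feature direction score highly; labels uncorrelated with the features score near zero.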


[4] 2508.16735

A Practical Approach to the Design of an S-Band Image-Rejecting Dual-Conversion Super-Heterodyne RF Chain of a Receiver Considering Spur Signals

This paper presents a practical design of the RF section of a radar receiver: the chain within a dual-conversion superheterodyne architecture. A significant challenge in this framework is the occurrence of spurious signals, which degrade the dynamic range of the RF chain. To address this issue, the paper introduces an innovative approach to mitigate (or even eliminate) these undesired effects, utilizing two mutually verifying MATLAB codes. These codes have been tested with two distinct commercial mixers and can be applied to any superheterodyne configuration with various mixers. The presented method minimizes the difference between the Spurious-Free Dynamic Range (SFDR) of the chain and its overall dynamic range. The selection of other components is also optimized to account for spurious signals, with explanations provided for these choices. Moreover, two filters of the RF chain, the second and the third, have been designed to reduce implementation costs. Various microwave software tools and full-wave analyses were employed for detailed design and analysis, and the results were compared to evaluate their performance.


[5] 2508.16803

A predictive modular approach to constraint satisfaction under uncertainty - with application to glycosylation in continuous monoclonal antibody biosimilar production

The paper proposes a modular approach to constraint handling in process optimization and control. This is partly motivated by the recent interest in learning-based methods, e.g., within bioproduction, for which constraint handling under uncertainty is a challenge. The proposed constraint handler, called the predictive filter, is combined with an adaptive constraint margin and a constraint violation cost monitor to minimize the cost of violating soft constraints due to model uncertainty and disturbances. The module can be combined with any controller and is based on minimally modifying the controller output, in a least squares sense, such that constraints are satisfied within the considered horizon. The proposed method is computationally efficient and suitable for real-time applications. The effectiveness of the method is illustrated through a realistic simulation case study of glycosylation constraint satisfaction in continuous monoclonal antibody biosimilar production using Chinese hamster ovary cells, for which the metabolic network model consists of 23 extracellular metabolites and 126 reactions.
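The core idea, minimally modifying the controller output in a least-squares sense so that a constraint holds, has a closed form for a single linear constraint; a toy sketch (the paper's filter additionally handles prediction horizons, adaptive margins, and a violation-cost monitor):

```python
import numpy as np

def least_squares_filter(u_nom, a, b):
    """Return the closest input (in the l2 sense) to u_nom satisfying
    a @ u <= b: the Euclidean projection onto one half-space."""
    u = np.asarray(u_nom, dtype=float)
    a = np.asarray(a, dtype=float)
    violation = a @ u - b
    if violation <= 0:
        return u  # already feasible: leave the controller output untouched
    return u - violation * a / (a @ a)
```

The filter is inactive whenever the nominal input is feasible, which is what lets it wrap any controller, learning-based or otherwise.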


[6] 2508.16814

Optimal Coordination of Local Flexibility from Electric Vehicles with Social Impact Consideration

The integration of renewable energy sources (RES) and the parallel electrification of transport create significant challenges for distribution network management, e.g., voltage and frequency violations, particularly in rural and remote areas. This paper investigates how smart charging of electric vehicles (EVs) can help reduce renewable energy curtailment and alleviate stress on local distribution networks. We implement a customised AC Optimal Power Flow (AC OPF) formulation which integrates into the optimisation an indicator reflecting the social impact of flexibility on EV users, based on analysis of historical EV charging behaviours. The contribution of EV owners to reducing wind curtailment is optimised to enhance the acceptability of flexibility procurement, as the method targets EV users whose charging habits are most likely to align with flexibility requirements. Our method integrates social, technological, and economic perspectives with optimal flexibility coordination, and clusters EVs using a k-means algorithm. To ensure scalability, we introduce a polar coordinate-based dimension reduction technique. The flexibility optimisation approach is demonstrated on the Orkney grid model, incorporating demand and wind farm generation data as well as multi-year charging data from 106 EVs. Results indicate that, by building upon the existing habits of EV users, curtailment can be reduced by 99.5% during a typical summer week, the period when curtailment is most prevalent. This research demonstrates a foundational and transferable approach, cognisant of socio-techno-economic factors, towards accelerating decarbonisation and tackling the stochastic challenges of new demand and generation patterns on local distribution networks.
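As an illustration of the clustering step, daily charging start times can be embedded on the unit circle so that times either side of midnight are close, then clustered with k-means. This is a hypothetical stand-in for the paper's polar coordinate-based reduction, whose exact features we do not know:

```python
import numpy as np

def embed_hours(hours):
    """Map start-of-charge hours (0-24) onto the unit circle, so 23:00
    and 01:00 end up close together (unlike on the raw hour axis)."""
    theta = 2 * np.pi * np.asarray(hours, dtype=float) / 24.0
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def kmeans(X, k, iters=50):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])  # seed next center far from the rest
    centers = np.array(centers)
    for _ in range(iters):
        labels = ((X[:, None] - centers) ** 2).sum(axis=2).argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels
```

Overnight chargers (23:00-01:00) and midday chargers then fall into distinct clusters, the kind of behavioural grouping a social-impact indicator can target.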


[7] 2508.16827

Grid-Aware Flexibility Operation of Behind-the-Meter Assets: A review of Objectives and Constraints

The high penetration of distributed energy resources (DERs) in low-voltage distribution networks (LVDNs) often leads to network instability and congestion. Discovering the flexibility potential of behind-the-meter (BTM) assets offers a promising solution to these challenges, providing benefits for both prosumers and grid operators. This review focuses on the objectives and constraints associated with the operation of BTM flexibility resources in LVDNs. We propose a new classification framework for network-aware flexibility modelling that incorporates prosumer objectives, flexibility sources, and both local and grid-level constraints. This review identifies research gaps in prosumer-centric grid considerations, control strategies, flexibility preferences, and scenarios in the use of BTM resources.


[8] 2508.16834

Fairness for distribution network hosting capacity

The integration of distributed generation (DG) is essential to the energy transition but poses challenges for low-voltage (LV) distribution networks (DNs) with limited hosting capacity (HC). This study incorporates multiple fairness criteria (utilitarian, egalitarian, bounded, and bargaining) into the HC optimisation framework to assess their impact. When applied to LV feeders of different sizes and topologies, the analysis shows that bargaining and upper-bounded fairness provide the best balance between efficiency and fairness. Efficiency refers to maximising the social welfare of the LV DNs, while fairness is proportional to the minimisation of disparity in opportunity for installing DG. Feeder topology significantly influences fairness outcomes, while feeder size affects total HC and the inherent fairness of feeders. These results emphasise the importance of regulatory incentives and network designs in order to facilitate fair and efficient DG integration.
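The tension between the criteria is easy to see on a toy allocation problem; a pure-Python sketch with hypothetical per-customer limits and a shared feeder limit (the paper's optimisation additionally models voltages and feeder topology):

```python
from itertools import product

caps = (8.0, 8.0)       # per-customer DG connection limits (hypothetical)
feeder_limit = 10.0     # shared feeder hosting capacity (hypothetical)

grid = [0.5 * i for i in range(17)]  # candidate DG sizes, 0..8 kW
feasible = [(a, b) for a, b in product(grid, grid)
            if a <= caps[0] and b <= caps[1] and a + b <= feeder_limit]

# utilitarian: maximise total hosted capacity (social welfare)
utilitarian = max(feasible, key=lambda x: x[0] + x[1])
# egalitarian: maximise the worst-off customer, break ties by welfare
egalitarian = max(feasible, key=lambda x: (min(x), x[0] + x[1]))
```

Both solutions host the full 10 kW, but the utilitarian objective is indifferent between lopsided splits (with this enumeration order it lands on (2, 8)), while the egalitarian one settles on (5, 5): equal efficiency, very different disparity, which is the trade-off the fairness criteria formalise.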


[9] 2508.16882

Multimodal Medical Endoscopic Image Analysis via Progressive Disentangle-aware Contrastive Learning

Accurate segmentation of laryngo-pharyngeal tumors is crucial for precise diagnosis and effective treatment planning. However, traditional single-modality imaging methods often fall short of capturing the complex anatomical and pathological features of these tumors. In this study, we present an innovative multi-modality representation learning framework based on the 'Align-Disentangle-Fusion' mechanism that seamlessly integrates 2D White Light Imaging (WLI) and Narrow Band Imaging (NBI) pairs to enhance segmentation performance. A cornerstone of our approach is multi-scale distribution alignment, which mitigates modality discrepancies by aligning features across multiple transformer layers. Furthermore, a progressive feature disentanglement strategy is developed with the designed preliminary disentanglement and disentangle-aware contrastive learning to effectively separate modality-specific and shared features, enabling robust multimodal contrastive learning and efficient semantic fusion. Comprehensive experiments on multiple datasets demonstrate that our method consistently outperforms state-of-the-art approaches, achieving superior accuracy across diverse real clinical scenarios.


[10] 2508.16888

Dual Orthogonal Projections-Based Multiuser Interference Cancellation for mmWave Beamforming in XL-MIMO Systems

This paper investigates multiuser interference (MUI) cancellation for millimeter-wave (mmWave) beamforming in extremely large-scale multiple-input multiple-output (XL-MIMO) communication systems. We propose a linear algorithm, termed iterative dual orthogonal projections (DOP), which alternates between two orthogonal projections: one to eliminate MUI and the other to refine combiners, ensuring a monotonic increase in spectral efficiency. Theoretical analysis and simulation results show that, with each iteration, the signal power for each user increases monotonically, the equivalent noise power after receive combining decreases monotonically, and the spectral efficiency improves accordingly and converges rapidly, closely approaching the theoretical optimum determined by dirty paper coding (DPC), outperforming existing linear algorithms in spectral efficiency.
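The MUI-cancelling step can be pictured with a classical zero-forcing-style orthogonal projection: project user k's channel onto the orthogonal complement of the other users' channels. A NumPy sketch of that single projection (the paper's DOP algorithm alternates this with a combiner-refining projection, which we do not reproduce):

```python
import numpy as np

def mui_free_combiner(H, k):
    """H: (K, N) matrix of effective user channels. Returns a unit-norm
    combiner for user k lying in the null space of the other users'
    channels, so their interference is cancelled exactly."""
    others = np.delete(H, k, axis=0)  # (K-1, N) channels of other users
    # orthogonal projector onto the null space of `others`
    P = np.eye(H.shape[1]) - np.linalg.pinv(others) @ others
    w = P @ H[k]
    return w / np.linalg.norm(w)
```

The residual component of user k's channel after the projection determines the post-combining signal power, which the second projection in DOP then iteratively improves.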


[11] 2508.16897

Generating Synthetic Contrast-Enhanced Chest CT Images from Non-Contrast Scans Using Slice-Consistent Brownian Bridge Diffusion Network

Contrast-enhanced computed tomography (CT) imaging is essential for diagnosing and monitoring thoracic diseases, including aortic pathologies. However, contrast agents pose risks such as nephrotoxicity and allergic-like reactions. The ability to generate high-fidelity synthetic contrast-enhanced CT angiography (CTA) images without contrast administration would be transformative, enhancing patient safety and accessibility while reducing healthcare costs. In this study, we propose the first bridge diffusion-based solution for synthesizing contrast-enhanced CTA images from non-contrast CT scans. Our approach builds on the Slice-Consistent Brownian Bridge Diffusion Model (SC-BBDM), leveraging its ability to model complex mappings while maintaining consistency across slices. Unlike conventional slice-wise synthesis methods, our framework preserves full 3D anatomical integrity while operating in a high-resolution 2D fashion, allowing seamless volumetric interpretation under a low memory budget. To ensure robust spatial alignment, we implement a comprehensive preprocessing pipeline that includes resampling, registration using the Symmetric Normalization method, and a sophisticated dilated segmentation mask to extract the aorta and surrounding structures. We create two datasets from the Coltea-Lung dataset: one containing only the aorta and another including both the aorta and heart, enabling a detailed analysis of anatomical context. We compare our approach against baseline methods on both datasets, demonstrating its effectiveness in preserving vascular structures while enhancing contrast fidelity.


[12] 2508.16908

Localization using Angle-of-Arrival Triangulation

Indoor localization is a long-standing challenge in mobile computing, with significant implications for enabling location-aware and intelligent applications within smart environments such as homes, offices, and retail spaces. As AI assistants such as Amazon Alexa and Google Nest become increasingly pervasive, microphone-equipped devices are emerging as key components of everyday life and home automation. This paper introduces a passive, infrastructure-light system for localizing human speakers using speech signals captured by two or more spatially distributed smart devices. The proposed approach, GCC+, extends the Generalized Cross-Correlation with Phase Transform (GCC-PHAT) method to estimate the Angle-of-Arrival (AoA) of audio signals at each device and applies robust triangulation techniques to infer the speaker's two-dimensional position. To further improve temporal resolution and localization accuracy, feature-space expansion and subsample interpolation techniques are employed for precise Time Difference of Arrival (TDoA) estimation. The system operates without requiring hardware modifications, prior calibration, explicit user cooperation, or knowledge of the speaker's signal content, thereby offering a highly practical solution for real-world deployment. Experimental evaluation in a real-world home environment yields a median AoA estimation error of 2.2 degrees and a median localization error of 1.25 m, demonstrating the feasibility and effectiveness of audio-based localization for enabling context-aware, privacy-preserving ambient intelligence.
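For concreteness, the GCC-PHAT core that the system builds on can be sketched in a few lines of NumPy: whiten the cross-spectrum to keep only phase, then read the TDoA off the correlation peak, with frequency-domain zero-padding providing the subsample interpolation mentioned above. A minimal sketch (not the GCC+ implementation itself):

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, interp=4):
    """Estimate the time difference of arrival (seconds) of `sig`
    relative to `ref` using GCC-PHAT."""
    n = len(sig) + len(ref)
    R = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    R /= np.abs(R) + 1e-12               # PHAT weighting: keep phase only
    cc = np.fft.irfft(R, n=interp * n)   # zero-padding = subsample interp
    max_shift = interp * n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / float(interp * fs)
```

Given the TDoA τ between two microphones spaced d apart, the bearing follows as arcsin(c·τ/d) with c ≈ 343 m/s; intersecting bearings from two or more devices gives the 2-D position.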


[13] 2508.16913

Chat-Driven Reconfiguration of Model Predictive Control

Traditional control personalization requires users to understand optimization parameters and provide repetitive numerical feedback, creating significant barriers for non-expert users. To deal with this issue, we propose ChatMPC, a model predictive control framework that enables users to personalize control systems and adapt to environmental changes through natural language interaction. The framework operates in two modes: personalization, where users iteratively adjust control behavior to their preferences, and co-development, where users provide real-time environmental information that complements sensor data. We establish convergence guarantees under different user behavior models, demonstrating exponential convergence for consistent feedback and finite-time convergence with logarithmic interaction complexity for tolerance-based users. We validate ChatMPC through experiments in robot navigation with personalized obstacle avoidance and semi-autonomous driving with conversational obstacle reporting. Both experiments achieve real-time performance and demonstrate effective adaptation to user preferences and environmental changes.


[14] 2508.16918

An Adaptive Environment-Aware Transformer Autoencoder for UAV-FSO with Dynamic Complexity Control

The rise of sixth-generation (6G) wireless networks sets high demands on UAV-assisted Free Space Optical (FSO) communications, where the channel environment becomes more complex and variable due to both atmospheric turbulence and UAV-induced vibrations. These factors increase the challenge of maintaining reliable communication and require adaptive processing methods. Autoencoders are promising as they learn optimal encodings from channel data. However, existing autoencoder designs are generic and lack the specific adaptability and computational flexibility needed for UAV-FSO scenarios. To address this, we propose AEAT-AE (Adaptive Environment-aware Transformer Autoencoder), a Transformer-based framework that integrates environmental parameters into both encoder and decoder via a cross-attention mechanism. Moreover, AEAT-AE incorporates a Deep Q-Network (DQN) that dynamically selects which layers of the Transformer autoencoder to activate based on real-time environmental inputs, effectively balancing performance and computational cost. Experiments demonstrate that AEAT-AE outperforms conventional methods in bit error rate while maintaining efficient runtime, representing a novel tailored solution for next-generation UAV-FSO communications.


[15] 2508.16930

HunyuanVideo-Foley: Multimodal Diffusion with Representation Alignment for High-Fidelity Foley Audio Generation

Recent advances in video generation produce visually realistic content, yet the absence of synchronized audio severely compromises immersion. To address key challenges in video-to-audio generation, including multimodal data scarcity, modality imbalance and limited audio quality in existing methods, we propose HunyuanVideo-Foley, an end-to-end text-video-to-audio framework that synthesizes high-fidelity audio precisely aligned with visual dynamics and semantic context. Our approach incorporates three core innovations: (1) a scalable data pipeline curating 100k-hour multimodal datasets through automated annotation; (2) a representation alignment strategy using self-supervised audio features to guide latent diffusion training, efficiently improving audio quality and generation stability; (3) a novel multimodal diffusion transformer resolving modal competition, containing dual-stream audio-video fusion through joint attention, and textual semantic injection via cross-attention. Comprehensive evaluations demonstrate that HunyuanVideo-Foley achieves new state-of-the-art performance across audio fidelity, visual-semantic alignment, temporal alignment and distribution matching. The demo page is available at: this https URL.


[16] 2508.16946

Spatially Correlated Blockage Aware Placement of RIS in IIoT Networks

We study the impact of deploying reconfigurable intelligent surfaces (RISs) in mitigating coverage gaps and enhancing transmission reliability in an industrial internet of things (IIoT) network. First, we consider a single blockage scenario and characterize the correlation between blocking events of the base station (BS)-user and the RIS-user links and study its impact on the probability of establishing a viable reflected link. Then, by considering multiple blockages, we derive the distribution of the signal-to-noise ratio (SNR) as a function of data size, blockage density, the number of RISs, and the deployment area. We analyze the impact of normalized blockage radius and identify the threshold beyond which the assumption of independent blockages deviates from the ground truth of correlated blocking. Finally, we compare the outage performance of this RIS-assisted system with that operated with network-controlled relays, and demonstrate that while the relays provide a higher reliability beyond a certain blockage threshold, increasing the number of RISs may help mitigate this effect. These insights offer valuable design guidelines for deploying RIS-aided IIoT networks in dense blockage environments.


[17] 2508.16980

Beamforming Control in RIS-Aided Wireless Communications: A Predictive Physics-Based Approach

Integrating reconfigurable intelligent surfaces (RIS) into wireless communication systems is a promising approach for enhancing coverage and data rates by intelligently redirecting signals, through a process known as beamforming. However, the process of RIS beamforming (or passive beamforming) control is associated with multiple latency-inducing factors. As a result, by the time the beamforming is effectively updated, the channel conditions may have already changed. For example, the low update rate of localization systems becomes a critical limitation, as the position of a mobile user equipment (UE) may change significantly between two consecutive measurements. To address this issue, this work proposes a practical and scalable physics-based solution that is effective across a wide range of UE movement models. Specifically, we propose a kinematic observer and predictor to enable proactive RIS control. From low-rate position estimates provided by a localizer, the kinematic observer infers the UE's speed and acceleration. These motion parameters are then used by a predictor to estimate the UE's future positions at a higher rate, allowing the RIS to adjust promptly and compensate for inherent delays in both the RIS control and localization systems. Numerical results validate the effectiveness of the proposed approach, demonstrating real-time RIS adjustments with low computational complexity, even in scenarios involving rapid UE movement.
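The observer/predictor idea can be illustrated with finite differences under a constant-acceleration assumption: recover speed and acceleration from the last three low-rate fixes, then extrapolate ahead. A toy 1-D sketch (the paper's observer is more general; the finite-difference choice here is our assumption):

```python
def kinematic_predict(p0, p1, p2, T, t_ahead):
    """p0, p1, p2: last three position fixes spaced T seconds apart.
    Returns the predicted position t_ahead seconds after p2, assuming
    constant acceleration over the window."""
    v_mid1 = (p1 - p0) / T     # velocity at the midpoint of [t0, t1]
    v_mid2 = (p2 - p1) / T     # velocity at the midpoint of [t1, t2]
    a = (v_mid2 - v_mid1) / T  # acceleration estimate
    v = v_mid2 + a * (T / 2)   # velocity at the latest fix t2
    return p2 + v * t_ahead + 0.5 * a * t_ahead ** 2
```

The prediction is exact for constant-acceleration trajectories, and the RIS controller can query it at a much higher rate than the localizer delivers fixes.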


[18] 2508.17033

Geometric Decentralized Stability Condition for Power Systems Based on Projecting DW Shells

The development of decentralized stability conditions has gained considerable attention due to the need to analyze heterogeneous multi-converter power systems. A recent advance is the application of the small-phase theorem, which extends passivity theory. However, it requires the transfer function matrix to be sectorial, which may not hold in some frequency ranges and results in conservatism. This letter tackles this problem by leveraging Davis-Wielandt (DW) shells for decentralized stability analysis. We develop a geometric decentralized stability condition that visually displays how heterogeneous converters interact with the power grid and enables modular system analysis.


[19] 2508.17051

Radio Frequency Identification: Decades at a Time

In this article, we briefly review the history of the use of radio signals to identify objects, and of the key Radio Frequency Identification (RFID) standards for ultra-high-frequency (UHF) and near-field communications that enabled broad use of these technologies in daily life. We will compare the vision for the future presented by the Auto-ID Lab in the early 21st century with the reality we see today, a little more than two decades later. We will review some of the applications in which UHF RFID technology has become hugely successful, others where High Frequency Near-field Communications (HF NFC) is preferred, and applications where optical identification or active wireless communications are dominant. We will then examine some possible future paths for RFID technology. We anticipate that UHF read capability will become widely available for cellphones, making it as universal as NFC and Bluetooth are today. We will look at more sophisticated radio interfaces, such as multiple-antenna phased arrays for readers, and tunnel diode reflection for tags. We will discuss the integration of information from Artificial Intelligence (AI)-based image processing, barcodes, NFC and UHF tags, into a digital twin of the real environment experienced by the human user. We will examine the role of RFID with sensing in improving the management of perishable goods. The role that RFID might play in a truly circular economy, with intelligent recycling and reuse, will be discussed. Finally, we survey the many hazards and obstacles that obstruct the path to an RF-informed future.


[20] 2508.17134

Pinhole Effect on Linkability and Dispersion in Speaker Anonymization

Speaker anonymization aims to conceal speaker-specific attributes in speech signals, making the anonymized speech unlinkable to the original speaker identity. Recent approaches achieve this by disentangling speech into content and speaker components, replacing the latter with pseudo speakers. The anonymized speech can be mapped either to a common pseudo speaker shared across utterances or to distinct pseudo speakers unique to each utterance. This paper investigates the impact of these mapping strategies on three key dimensions: speaker linkability, dispersion in the anonymized speaker space, and de-identification from the original identity. Our findings show that using distinct pseudo speakers increases speaker dispersion and reduces linkability compared to common pseudo-speaker mapping, thereby enhancing privacy preservation. These observations are interpreted through the proposed pinhole effect, a conceptual framework introduced to explain the relationship between mapping strategies and anonymization performance. The hypothesis is validated through empirical evaluation.


[21] 2508.17142

Frequency Response Identification of Low-Order Systems: Finite-Sample Analysis

This paper proposes a frequency-domain system identification method for learning low-order systems. The identification problem is formulated as the minimization of the $\ell_2$ norm between the identified and measured frequency responses, with the nuclear norm of the Loewner matrix serving as a regularization term. This formulation results in an optimization problem that can be efficiently solved using standard convex optimization techniques. We derive an upper bound on the sampled-frequency complexity of the identification process and subsequently extend this bound to characterize the identification error over all frequencies. A detailed analysis of the sample complexity is provided, along with a thorough interpretation of its terms and dependencies. Finally, the efficacy of the proposed method is demonstrated through an example, along with numerical simulations validating the growth rate of the sample complexity bound.
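The regularizer acts on the Loewner matrix built from the frequency-response samples; its rank generically equals the order of the underlying rational model, which is why its nuclear norm is a natural convex surrogate for low order. A NumPy sketch of the construction (notation is ours):

```python
import numpy as np

def loewner_matrix(mu, f_mu, lam, f_lam):
    """L[i, j] = (f(mu_i) - f(lam_j)) / (mu_i - lam_j) for two disjoint
    sets of sample points mu, lam and the measured responses at them."""
    return (f_mu[:, None] - f_lam[None, :]) / (mu[:, None] - lam[None, :])
```

For the first-order system H(s) = 1/(s+1), the matrix has rank one however many frequencies are sampled; the nuclear norm (sum of singular values, `np.linalg.norm(L, 'nuc')`) is the quantity the identification problem penalises.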


[22] 2508.17149

Enhancing Energy and Spectral Efficiency in IoT-Cellular Networks via Active SIM-Equipped LEO Satellites

This paper investigates a low Earth orbit (LEO) satellite communication system enhanced by an active stacked intelligent metasurface (ASIM), mounted on the backplate of the satellite solar panels to efficiently utilize limited onboard space and reduce the main satellite power amplifier requirements. The system serves multiple ground users via rate-splitting multiple access (RSMA) and IoT devices through a symbiotic radio network. Multi-layer sequential processing in the ASIM improves effective channel gains and suppresses inter-user interference, outperforming active RIS and beyond-diagonal RIS designs. Three optimization approaches are evaluated: block coordinate descent with successive convex approximation (BCD-SCA), model-assisted multi-agent constraint soft actor-critic (MA-CSAC), and multi-constraint proximal policy optimization (MCPPO). Simulation results show that BCD-SCA converges fast and stably in convex scenarios without learning, MCPPO achieves rapid initial convergence with moderate stability, and MA-CSAC attains the highest long-term spectral and energy efficiency in large-scale networks. Energy-spectral efficiency trade-offs are analyzed for different ASIM elements, satellite antennas, and transmit power. Overall, the study demonstrates that integrating multi-layer ASIM with suitable optimization algorithms offers a scalable, energy-efficient, and high-performance solution for next-generation LEO satellite communications.


[23] 2508.17223

Deep Learning Architectures for Medical Image Denoising: A Comparative Study of CNN-DAE, CADTra, and DCMIEDNet

Medical imaging modalities are inherently susceptible to noise contamination that degrades diagnostic utility and clinical assessment accuracy. This paper presents a comprehensive comparative evaluation of three state-of-the-art deep learning architectures for MRI brain image denoising: CNN-DAE, CADTra, and DCMIEDNet. We systematically evaluate these models across multiple Gaussian noise intensities ($\sigma = 10, 15, 25$) using the Figshare MRI Brain Dataset. Our experimental results demonstrate that DCMIEDNet achieves superior performance at lower noise levels, with PSNR values of $32.921 \pm 2.350$ dB and $30.943 \pm 2.339$ dB for $\sigma = 10$ and $15$ respectively. However, CADTra exhibits greater robustness under severe noise conditions ($\sigma = 25$), achieving the highest PSNR of $27.671 \pm 2.091$ dB. All deep learning approaches significantly outperform traditional wavelet-based methods, with improvements ranging from 5-8 dB across tested conditions. This study establishes quantitative benchmarks for medical image denoising and provides insights into architecture-specific strengths for varying noise intensities.
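The PSNR figures quoted above follow the standard definition; a minimal sketch for 8-bit images (not the study's evaluation code):

```python
import numpy as np

def psnr(reference, estimate, data_range=255.0):
    """Peak signal-to-noise ratio in dB between a clean reference and a
    denoised estimate; higher is better, infinite for identical images."""
    mse = np.mean((reference.astype(float) - estimate.astype(float)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

On this logarithmic scale, the reported 5-8 dB gap over wavelet baselines corresponds to roughly a 3x-6x reduction in mean squared error.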


[24] 2508.17226

Safety Under State Uncertainty: Robustifying Control Barrier Functions

Safety-critical control is a crucial aspect of modern systems, and Control Barrier Functions (CBFs) have gained popularity as the framework of choice for ensuring safety. However, implementing a CBF requires exact knowledge of the true state, a requirement that is often violated in real-world applications where only noisy or estimated state information is available. This paper introduces the notion of Robust Control Barrier Functions (R-CBF) for ensuring safety under such state uncertainty without requiring prior knowledge of the magnitude of uncertainty. We formally characterize the class of robustifying terms that ensure robust closed-loop safety and show how a robustly safe controller can be constructed. We demonstrate the effectiveness of this approach through simulations and compare it to existing methods, highlighting the additional robustness and convergence guarantees it provides.
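The flavour of the approach can be shown on a scalar toy problem: a single integrator ẋ = u with barrier h(x) = x (safe set x ≥ 0), where the CBF condition ḣ ≥ -αh becomes a lower bound on u, robustified by evaluating h at the worst-case state consistent with the estimate. A sketch under our own simplifying assumptions (the paper's robustifying terms are more general and, notably, need no known error bound):

```python
def robust_cbf_filter(u_nom, x_hat, err_bound, alpha=1.0):
    """Safety filter for x' = u with h(x) = x: enforce the CBF condition
    h' >= -alpha * h at the worst-case state x_hat - err_bound, and
    otherwise stay as close as possible to the nominal input."""
    h_worst = x_hat - err_bound  # smallest state consistent with the estimate
    u_min = -alpha * h_worst     # CBF lower bound on the input
    return max(u_nom, u_min)
```

With α = 1, estimate x̂ = 1 and error bound 0.5, a nominal command of -5 is clipped to -0.5, keeping even the worst-case state nonnegative; a naive filter using x̂ directly would allow -1 and could let the true state exit the safe set.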


[25] 2508.17246

Graphon Signal Processing for Spiking and Biological Neural Networks

Graph Signal Processing (GSP) extends classical signal processing to signals defined on graphs, enabling filtering, spectral analysis, and sampling of data generated by networks of various kinds. Graphon Signal Processing (GnSP) develops this framework further by employing the theory of graphons. Graphons are measurable functions on the unit square that represent graphs and limits of convergent graph sequences. The use of graphons provides stability of GSP methods to stochastic variability in network data and improves computational efficiency for very large networks. We use GnSP to address the stimulus identification problem (SIP) in computational and biological neural networks. The SIP is an inverse problem that aims to infer the unknown stimulus $s$ from the observed network output $f$. We first validate the approach in spiking neural network simulations and then analyze calcium imaging recordings. Graphon-based spectral projections yield trial-invariant, low-dimensional embeddings that improve stimulus classification over Principal Component Analysis and discrete GSP baselines. The embeddings remain stable under variations in network stochasticity, providing robustness to different network sizes and noise levels. To the best of our knowledge, this is the first application of GnSP to biological neural networks, opening new avenues for graphon-based analysis in neuroscience.
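The spectral-projection step can be sketched directly: treat the scaled adjacency matrix A/n as an empirical version of the graphon operator, and project an observed activity signal onto its leading eigenvectors to get a low-dimensional embedding. A NumPy sketch (the paper's pipeline, e.g. for calcium imaging, involves estimating the graphon itself; names here are ours):

```python
import numpy as np

def graphon_embedding(A, signal, r):
    """Project `signal` (one value per node) onto the r eigenvectors of
    A / n with the largest-magnitude eigenvalues (A assumed symmetric)."""
    n = A.shape[0]
    eigvals, eigvecs = np.linalg.eigh(A / n)
    order = np.argsort(-np.abs(eigvals))
    return eigvecs[:, order[:r]].T @ signal
```

For classification, responses to the same stimulus across trials are compared in this r-dimensional space rather than the raw n-dimensional one, which is where the trial invariance and stability to network size come from.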


[26] 2508.17248

One Equation to Rule Them All -- Part I: Direct Data-Driven Cascade Stabilisation

In this article we present a framework for direct data-driven control for general problems involving interconnections of dynamical systems. We first develop a method to determine the solution of a Sylvester equation from data. This solution is used to describe a subspace that plays a role in a large variety of problems. We then provide an error analysis of the impact that noise has on this solution. This is a crucial contribution because, thanks to the interconnection approach developed throughout the article, we are able to track how the noise propagates at each stage, and thereby provide bounds on the final designs. Among the many potential problems that can be solved with this framework, we focus on three representatives: cascade stabilisation, model order reduction, and output regulation. This manuscript studies the first problem, while the companion Part II addresses the other two. For each of these settings we show how the problems can be recast in our framework. In the context of cascade stabilisation, we consider the 2-cascade problem, the effect of noise through the cascade, as well as the N-cascade case, and we demonstrate that our proposed method is data-efficient. The proposed designs are illustrated by means of a numerical example.
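For reference, the central object here is the solution of a Sylvester equation AX + XB = C. A model-based baseline (which the paper replaces with a data-driven construction) solves it directly via the Kronecker-product vectorisation identity:

```python
import numpy as np

# Solve the Sylvester equation A X + X B = C via vectorisation:
# (I_m kron A + B^T kron I_n) vec(X) = vec(C), with column-major vec.
rng = np.random.default_rng(1)
n, m = 4, 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((m, m))
C = rng.standard_normal((n, m))

K = np.kron(np.eye(m), A) + np.kron(B.T, np.eye(n))
X = np.linalg.solve(K, C.flatten(order="F")).reshape((n, m), order="F")

print(float(np.linalg.norm(A @ X + X @ B - C)) < 1e-9)
```

A unique solution exists whenever A and -B share no eigenvalues; for the random matrices above this holds almost surely.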


[27] 2508.17251

One Equation to Rule Them All -- Part II: Direct Data-Driven Reduction and Regulation

The Sylvester equation underpins a wide spectrum of control synthesis and systems analysis tools associated with cascade interconnections. In the preceding Part I [1] of this article, it was shown that such an equation can be reformulated using data, enabling the production of a collection of data-driven stabilisation procedures. In this second part of the article, we continue to develop the framework established in Part I to solve two important control-theoretic problems: model order reduction and output regulation. For the model order reduction problem we provide a solution from input-state measurements, from input-output measurements, and we study the effect of the noise. For the output regulation problem, we provide data-driven solutions for the static and dynamic feedback problem. The proposed designs are illustrated by means of examples.


[28] 2508.17326

Semantic Diffusion Posterior Sampling for Cardiac Ultrasound Dehazing

Echocardiography plays a central role in cardiac imaging, offering dynamic views of the heart that are essential for diagnosis and monitoring. However, image quality can be significantly degraded by haze arising from multipath reverberations, particularly in difficult-to-image patients. In this work, we propose a semantic-guided, diffusion-based dehazing algorithm developed for the MICCAI Dehazing Echocardiography Challenge (DehazingEcho2025). Our method integrates a pixel-wise noise model, derived from semantic segmentation of hazy inputs, into a diffusion posterior sampling framework guided by a generative prior trained on clean ultrasound data. Quantitative evaluation on the challenge dataset demonstrates strong performance across contrast and fidelity metrics. Code for the submitted algorithm is available at this https URL.


[29] 2508.17351

A Hybrid Approach for Unified Image Quality Assessment: Permutation Entropy-Based Features Fused with Random Forest for Natural-Scene and Screen-Content Images for Cross-Content Applications

Image Quality Assessment (IQA) plays a vital role in applications such as image compression, restoration, and multimedia streaming. However, existing metrics often struggle to generalize across diverse image types - particularly between natural-scene images (NSIs) and screen-content images (SCIs) - due to their differing structural and perceptual characteristics. To address this limitation, we propose a novel full-reference IQA framework: Permutation Entropy-based Features Fused with Random Forest (PEFRF). PEFRF captures structural complexity by extracting permutation entropy from the gradient maps of reference, distorted, and fused images, forming a robust feature vector. These features are then input into a Random Forest regressor trained on subjective quality scores to predict final image quality. The framework is evaluated on 13 benchmark datasets comprising over 21,000 images and 40+ state-of-the-art IQA metrics. Experimental results demonstrate that PEFRF consistently outperforms existing methods across various distortion types and content domains, establishing its effectiveness as a unified and statistically significant solution for cross-content image quality assessment.
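The feature at the core of PEFRF is permutation entropy. The sketch below implements standard Bandt-Pompe permutation entropy for a 1-D signal only; the full framework also extracts it from 2-D gradient maps and feeds the features to a trained Random Forest, which is omitted here.

```python
import numpy as np
from math import factorial

def permutation_entropy(x, order=3, delay=1):
    """Normalised permutation entropy of a 1-D signal (Bandt-Pompe).

    Counts ordinal patterns of length `order` and returns the Shannon
    entropy of their distribution, normalised to [0, 1].
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (order - 1) * delay
    patterns = np.array([np.argsort(x[i:i + order * delay:delay]) for i in range(n)])
    _, counts = np.unique(patterns, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(factorial(order)))

ramp = np.arange(100.0)                 # fully ordered: a single ordinal pattern
rng = np.random.default_rng(0)
noise = rng.standard_normal(100)        # disordered: near-uniform patterns
print(permutation_entropy(ramp), permutation_entropy(noise) > 0.9)
```

A monotone signal yields entropy 0 (one pattern), while white noise approaches 1, which is why the measure discriminates structural complexity in gradient maps.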


[30] 2508.17354

Toward Multi-Functional LAWNs with ISAC: Opportunities, Challenges, and the Road Ahead

Integrated sensing and communication (ISAC) has been envisioned as a foundational technology for future low-altitude wireless networks (LAWNs), enabling real-time environmental perception and data exchange across aerial-ground systems. In this article, we first explore the roles of ISAC in LAWNs from both node-level and network-level perspectives. We highlight the performance gains achieved through hierarchical integration and cooperation, and demonstrate the key design trade-offs involved. Beyond physical-layer enhancements, emerging LAWN applications demand broader functionalities. To this end, we propose a multi-functional LAWN framework that extends ISAC with capabilities in control, computation, wireless power transfer, and large language model (LLM)-based intelligence. We further provide a representative case study demonstrating the benefits of ISAC-enabled LAWNs, and finally outline promising research directions.


[31] 2508.17374

Analysis of Circuit-based Per-Panel Diode Model of Photovoltaic Array

Solar photovoltaic systems are increasing in size and number on the grid. In regions with high penetration, such as California, PV systems serve multiple functions, including peak shaving and demand response. Therefore, the criticality of PV systems to grid operations calls for accurate models. The current practice is to represent the PV array, composed of multiple PV panels, with an aggregated single-diode model (SDM). The highly abstract model has a limited ability to capture real-world behaviors, such as partial shading and hotspots. Thus, we develop a circuit-based per-panel PV array model that uses a single diode model for each panel and interconnects them to form an array. This approach bridges the tradeoff between cell-level physics and control-dependent system-level behavior. We establish conditions for mathematical equivalence between the proposed per-panel array circuit model and the aggregated single-diode array model. We generate empirical evidence by running simulations using parameters derived from real-world PV panels. Results indicate that the proposed per-panel array model can represent the electrical behavior of the array under non-ideal conditions, such as partial shading, more accurately. With maximum power point tracking control, the proposed model is 21.2% more accurate when estimating the real power output of an array under a partial shading scenario and 8.1% more accurate under a hot spot scenario.
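The single-diode model underlying both the aggregated and per-panel representations is an implicit equation in the panel current. The minimal sketch below uses illustrative parameter values (not taken from the paper or any datasheet), solves the implicit equation by bisection, and sweeps voltage to locate the maximum power point:

```python
import numpy as np

def sdm_current(V, Iph=8.0, I0=1e-9, n=1.0, Rs=0.2, Rsh=300.0, Vt=0.02585, cells=60):
    """Panel current from the single-diode model, solved by bisection.

    Implicit equation (illustrative parameters, not a specific datasheet):
    I = Iph - I0*(exp((V + I*Rs)/(n*cells*Vt)) - 1) - (V + I*Rs)/Rsh
    """
    def f(I):
        Vd = V + I * Rs
        return Iph - I0 * (np.exp(Vd / (n * cells * Vt)) - 1) - Vd / Rsh - I
    lo, hi = 0.0, Iph + 1.0
    for _ in range(80):                 # f is monotone decreasing in I
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    return 0.5 * (lo + hi)

# Sweep voltage to locate the maximum power point of one panel
V = np.linspace(0.0, 40.0, 400)
P = np.array([v * sdm_current(v) for v in V])
print(round(float(V[P.argmax()]), 1), round(float(P.max()), 1))
```

In a per-panel array model, one such equation is solved per panel with panel-specific irradiance (Iph), which is what lets partial shading and hot spots show up in the array-level I-V curve.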


[32] 2508.17390

Modular electronic microrobots with on board sensor-program steered locomotion

True microrobots, in contrast with externally controlled microparticles, must harvest or carry their own source of energy, as well as their own (preferably programmable) microcontroller of actuators for locomotion, using information acquired from their own sensors. Building on recent published work [1], we demonstrate here, for the first time, that microrobotic smartlets, hitherto buoyancy divers, can also be equipped to navigate in 2D on surfaces, with on-board control responding to both sensor information and their internal electronic program. Fabricating modular microrobots, with all dimensions of 1 mm and below, has been difficult to achieve because of competing demands for the limited surface area and the challenges of integrating and interconnecting the diverse functionalities of energy harvesting, actuation, sensing, communication, docking and control. A novel high-density heterogeneous integration, via soft-substrate micro flip-chip bonding of custom CMOS and LED microchiplets onto fold-up polymer surfaces, compatible with roll-up isotropic ambient light harvesting, now makes this possible. Fabricating electrolytic bubble actuators on multiple cube faces and connecting them to a custom sensor-controlled on-board microchiplet (lablet) allows the smartlets to locomote on wet surfaces, changing direction in response to both timed programmed control and programmed response to locally sensed signals. Such locomoting robotic microcubes can also move to and selectively dock with other modules via patterned surfaces. All of this is powered by ambient light in natural aqueous media on smooth surfaces.


[33] 2508.17428

py360tool: A framework for tiled 360$^\circ$ video manipulation

Streaming 360$^\circ$ videos for virtual reality demands a lot of bandwidth. To optimize this transmission, videos are divided into "tiles" and selectively distributed to the user based on what they are looking at. This interactive approach makes it difficult to assess quality and user experience. To solve this, the paper presents py360tools, a Python library that automates client-side tasks like video reconstruction, tile selection, and viewport extraction. This facilitates the reproduction, simulation, and analysis of 360$^\circ$ video streaming sessions.


[34] 2508.17430

Input-Output Data-Driven Sensor Selection for Cyber-Physical Systems

In this paper, we consider the problem of input-output data-driven sensor selection for unknown cyber-physical systems (CPS). In particular, out of a large set of sensors available for use, we choose a subset of them that maximizes a metric of observability of the CPS. The considered observability metric is related to the $\mathcal{H}_2$ system norm, which quantifies the average output energy of the selected sensors over a finite or an infinite horizon. However, its computation inherently requires knowledge of the unknown matrices of the system, so we draw connections from the reinforcement learning literature and design an input-output data-driven algorithm to compute it in a model-free manner. We then use the derived data-driven metric expression to choose the best sensors of the system in polynomial time, effectively obtaining a provably convergent model-free sensor selection process. Additionally, we show how the proposed data-driven approach can be exploited to select sensors that optimize volumetric measures of observability, while also noting its applicability to the dual problem of actuator selection. Simulations are performed to demonstrate the validity and effectiveness of the proposed approach.
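To fix ideas, a model-based proxy for the observability metric scores each candidate sensor by its finite-horizon output energy; since the trace of the observability Gramian is additive across sensor rows, selection then reduces to picking the top scores. The paper's contribution is estimating such a metric from input-output data without the model; the sketch below assumes the model is known and uses hypothetical matrices:

```python
import numpy as np

def sensor_scores(A, C, horizon=20):
    """Finite-horizon observability energy contributed by each candidate sensor.

    score_i = sum_k ||c_i A^k||^2, the contribution of row c_i to the trace of
    the observability Gramian over `horizon` steps (a model-based proxy for
    the H2-type average-output-energy metric).
    """
    scores = np.zeros(C.shape[0])
    M = np.eye(A.shape[0])
    for _ in range(horizon):
        scores += np.sum((C @ M) ** 2, axis=1)
        M = M @ A
    return scores

A = np.array([[0.9, 0.2], [0.0, 0.5]])
C = np.array([[1.0, 0.0],    # sensor 0: sees the slow, lightly damped state
              [0.0, 1.0],    # sensor 1: sees the fast-decaying state
              [0.0, 0.1]])   # sensor 2: weak copy of sensor 1
best = np.argsort(-sensor_scores(A, C))[:2]
print(sorted(best.tolist()))
```

Volumetric measures such as log-det of the Gramian are not additive in this way, which is why the paper treats them separately.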


[35] 2508.17433

Coordinated UAV Beamforming and Control for Directional Jamming and Nulling

Efficient mobile jamming against eavesdroppers in wireless networks necessitates accurate coordination between mobility and antenna beamforming. We study the coordinated beamforming and control problem for a UAV that carries two omnidirectional antennas, and which uses them to jam an eavesdropper while leaving a friendly client unaffected. The UAV can shape its jamming beampattern by controlling its position, its antennas' orientation, and the phases of the antennas' interference signals. We derive a closed-form expression for the antennas' phases that guarantees zero jamming impact on the client. In addition, we determine the antennas' orientation and the UAV's position that maximizes jamming impact on the eavesdropper through an optimal control problem, optimizing the orientation pointwise and the position through the UAV's control input. Simulations show how this coordinated beamforming and control scheme enables directional GPS denial while guaranteeing zero interference towards a friendly direction.
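The zero-interference condition for two omnidirectional antennas admits a simple closed form: choose the phase offset so the two signal paths cancel toward the client. The sketch below illustrates this for a hypothetical two-element geometry with half-wavelength spacing; it is not the paper's exact expression, which also couples the phases to UAV position and antenna orientation.

```python
import numpy as np

def array_factor(theta, phases, d=0.5):
    """Far-field response of two omni antennas at x = +/- d/2 (d in wavelengths)."""
    beta = np.pi * d * np.cos(theta)          # (2*pi/lambda)*(d/2)*cos(theta)
    return np.exp(1j * (phases[0] + beta)) + np.exp(1j * (phases[1] - beta))

# Closed-form phase offset placing a null toward the client at theta_c:
# phi2 - phi1 = (2*pi*d)*cos(theta_c) - pi makes the two paths cancel there.
theta_c = np.deg2rad(60.0)
d = 0.5
phi = np.array([0.0, 2 * np.pi * d * np.cos(theta_c) - np.pi])

print(abs(array_factor(theta_c, phi, d)) < 1e-9,           # client: nulled
      abs(array_factor(np.deg2rad(150.0), phi, d)) > 0.5)  # elsewhere: jammed
```

Energy removed from the null direction is redistributed elsewhere, which is what the UAV's position and orientation control then steers toward the eavesdropper.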


[36] 2508.17473

A Consensus Algorithm for Second-Order Systems Evolving on Lie Groups

In this paper, a consensus algorithm is proposed for interacting multi-agent systems that can be modeled as simple Mechanical Control Systems (MCS) evolving on a general Lie group. The standard Laplacian-flow consensus algorithm for double-integrator systems evolving on Euclidean spaces is extended to a general Lie group. A tracking error function is defined on a general smooth manifold for measuring the error between the configurations of two interacting agents. The stability of the desired consensus equilibrium is proved using a generalized version of Lyapunov theory and LaSalle's invariance principle applicable to systems evolving on a smooth manifold. The proposed consensus control input requires only the configuration information of the neighboring agents and does not require their velocities or inertia tensors. The design of the tracking error function and consensus control inputs is demonstrated through an application to the attitude consensus problem for multiple communicating rigid bodies, on which the algorithm is also numerically validated.


[37] 2508.17505

A Data-Driven Forced Oscillation Locating Method for Power Systems with Inverter-Based Resources

Forced Oscillations (FOs) stemming from external periodic disturbances threaten power system security and stability. The increasing penetration of Inverter-Based Resources (IBRs) further introduces FOs, leading to new challenges in identifying and locating FO sources in modern power systems. In this paper, a novel data-driven method for locating FOs in power systems with IBRs is proposed. Unlike previous works, a unified representation of FOs originating from IBRs is considered, which further facilitates the development of the FO locating algorithm. Leveraging Sparse Identification of Nonlinear Dynamics (SINDy), a purely data-driven methodology is developed for locating the source of an FO by identifying the proposed model from measurements. Numerical results on the WECC 240-bus system validate the performance of the proposed approach in successfully locating FOs in the presence of IBRs.
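SINDy's core regression is sequentially thresholded least squares over a library of candidate terms. The minimal single-state sketch below uses a hypothetical cubic library, unrelated to the paper's FO model, purely to show the mechanism:

```python
import numpy as np

def sindy_stls(Theta, dxdt, threshold=0.1, iters=10):
    """Sequentially thresholded least squares, the core SINDy regression."""
    xi = np.linalg.lstsq(Theta, dxdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                 # prune weak candidate terms
        big = ~small
        if big.any():                   # refit on the surviving terms
            xi[big] = np.linalg.lstsq(Theta[:, big], dxdt, rcond=None)[0]
    return xi

# Recover dx/dt = -2 x + 0.5 x^3 from noisy samples using a cubic library.
rng = np.random.default_rng(0)
x = rng.uniform(-1.5, 1.5, 400)
dxdt = -2.0 * x + 0.5 * x ** 3 + 0.01 * rng.standard_normal(400)
Theta = np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])  # candidate terms

xi = sindy_stls(Theta, dxdt)
print(np.round(xi, 2))   # close to [0, -2, 0, 0.5]
```

The sparsity of the recovered coefficient vector is what makes the identified model interpretable, which the paper exploits to localise the oscillation source.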


[38] 2508.17526

Near-Field Integrated Imaging and Communication in Distributed MIMO Networks

In this work, we propose a general framework for wireless imaging in distributed MIMO wideband communication systems, considering multi-view non-isotropic targets and near-field propagation effects. For indoor scenarios where the objective is to image small-scale objects with high resolution, we propose a range migration algorithm (RMA)-based scheme using three kinds of array architectures: the full array, boundary array, and distributed boundary array. With non-isotropic near-field channels, we establish the Fourier transformation (FT)-based relationship between the imaging reflectivity and the distributed spatial-domain signals and discuss the corresponding theoretical properties. Next, for outdoor scenarios where the objective is to reconstruct the large-scale three-dimensional (3D) environment with coarse resolution, we propose a sparse Bayesian learning (SBL)-based algorithm to solve the multiple measurement vector (MMV) problem, which further addresses the non-isotropic reflectivity across different subcarriers. Numerical results demonstrate the effectiveness of the proposed algorithms in acquiring high-resolution small objects and accurately reconstructing large-scale environments.


[39] 2508.17577

Fast RLS Identification Leveraging the Linearized System Sparsity: Predictive Cost Adaptive Control for Quadrotors

This paper presents a centralized predictive cost adaptive control (PCAC) strategy for the position and attitude control of quadrotors. PCAC is an optimal, prediction-based control method that uses recursive least squares (RLS) to identify model parameters online, enabling adaptability in dynamic environments. Addressing challenges with black-box approaches in systems with complex couplings and fast dynamics, this study leverages the unique sparsity of quadrotor models linearized around hover points. By identifying only essential parameters related to nonlinear couplings and dynamics, this approach reduces the number of parameters to estimate, accelerates identification, and enhances stability during transients. Furthermore, the proposed control scheme removes the need for an attitude setpoint, typically required in conventional cascaded control designs.
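The RLS update at the heart of PCAC can be stated in a few lines. The toy below identifies a hypothetical two-parameter scalar ARX-style model with a forgetting factor, rather than the sparse linearized quadrotor structure the paper exploits:

```python
import numpy as np

def rls_update(theta, P, phi, y, lam=0.99):
    """One recursive least-squares step with forgetting factor `lam`."""
    Pphi = P @ phi
    k = Pphi / (lam + phi @ Pphi)          # gain vector
    theta = theta + k * (y - phi @ theta)  # prediction-error correction
    P = (P - np.outer(k, Pphi)) / lam      # covariance update
    return theta, P

# Identify y_{k+1} = a*y_k + b*u_k from simulated data (a=0.8, b=0.5)
a_true, b_true = 0.8, 0.5
rng = np.random.default_rng(0)
u = rng.standard_normal(300)
y = np.zeros(301)
for k in range(300):
    y[k + 1] = a_true * y[k] + b_true * u[k] + 0.01 * rng.standard_normal()

theta, P = np.zeros(2), 1e3 * np.eye(2)
for k in range(300):
    theta, P = rls_update(theta, P, np.array([y[k], u[k]]), y[k + 1])
print(np.round(theta, 2))   # close to [0.8, 0.5]
```

Restricting the regressor to only the structurally nonzero entries, as the paper does for the hover-linearized model, shrinks theta and P and speeds up exactly this recursion.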


[40] 2508.17607

Steerable Invariant Beamformer Using a Differential Line Array of Omnidirectional and Directional Microphones with Null Constraints

Line differential microphone arrays have attracted attention for their ability to achieve frequency-invariant beampatterns and high directivity. Recently, the Jacobi-Anger expansion-based approach has enabled the design of fully steerable-invariant differential beamformers for line arrays combining omnidirectional and directional microphones. However, this approach relies on the analytical expression of the ideal beampattern and the proper selection of the truncation order, which is not always practical. This paper introduces a null-constraint-based method for designing frequency- and steerable-invariant differential beamformers using a line array of omnidirectional and directional microphones. The approach employs a multi-constraint optimisation framework, where the reference filter and ideal beampattern are first determined based on the specified nulls and desired direction. Subsequently, the white noise gain constraint is derived from the reference filter, and the beampattern constraint from the ideal beampattern. The optimal filter is then obtained by considering constraints related to the beampattern, nulls, and white noise gain. This method achieves a balance between white noise gain and mean square error, allowing robust, frequency- and steerable-invariant differential beamforming performance. It addresses limitations in beampattern flexibility and truncation errors, offering greater design freedom and improved practical applicability. Simulations and experiments demonstrate that this method outperforms the Jacobi-Anger expansion-based approach in three key aspects: an extended effective range, improved main lobe and null alignment, and greater flexibility in microphone array configuration and beampattern design, requiring only the steering direction and nulls instead of an analytic beampattern expression.
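A stripped-down version of null-constrained design: the minimum-norm filter meeting a unit-gain look-direction constraint and hard nulls is w = C (C^H C)^{-1} f, where the columns of C are the constraint steering vectors. The sketch below assumes a hypothetical six-element omni line array and omits the paper's white-noise-gain and beampattern constraints:

```python
import numpy as np

def steering(theta, n=6, d=0.5):
    """Steering vector of an n-element omni line array (d in wavelengths)."""
    return np.exp(-2j * np.pi * d * np.arange(n) * np.cos(theta))

def null_constrained_weights(look, nulls, n=6):
    """Minimum-norm filter with unit gain at `look` and hard zeros at `nulls`:
    w = C (C^H C)^{-1} f, the minimum-norm solution of C^H w = f."""
    C = np.column_stack([steering(look, n)] + [steering(t, n) for t in nulls])
    f = np.zeros(C.shape[1], dtype=complex)
    f[0] = 1.0                                   # unit gain toward `look`
    return C @ np.linalg.solve(C.conj().T @ C, f)

w = null_constrained_weights(np.deg2rad(90.0), [np.deg2rad(40.0), np.deg2rad(140.0)])
resp = lambda th: w.conj() @ steering(th)
print(abs(resp(np.deg2rad(90.0))), abs(resp(np.deg2rad(40.0))) < 1e-10)
```

As in the paper, only the steering direction and null angles are needed as inputs; no analytic beampattern expression appears anywhere in the construction.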


[41] 2508.17640

Multimodal Radio and Vision Fusion for Robust Localization in Urban V2I Communications

Accurate localization is critical for vehicle-to-infrastructure (V2I) communication systems, especially in urban areas where GPS signals are often obstructed by tall buildings, leading to significant positioning errors. This necessitates alternative or complementary techniques for reliable and precise positioning in applications such as autonomous driving and smart city infrastructure. This paper proposes a multimodal contrastive-learning-based regression framework for localization in V2I scenarios that combines channel state information (CSI) with visual information to achieve improved accuracy and reliability. The approach leverages the complementary strengths of wireless and visual data to overcome the limitations of traditional localization methods, offering a robust solution for V2I applications. Simulation results demonstrate that the proposed CSI-vision fusion model significantly outperforms traditional methods and single-modal models, achieving superior localization accuracy and precision in complex urban environments.


[42] 2508.17704

Symbol Detection Using an Integrate-and-Fire Time Encoding Receiver

Event-driven sampling is a promising alternative to uniform sampling methods, particularly for systems constrained by power and hardware cost. A notable example of this sampling approach is the integrate-and-fire time encoding machine (IF-TEM), which encodes an analog signal into a sequence of time stamps by generating an event each time the integral of the input signal reaches a fixed threshold. In this paper, we propose a receiver architecture that estimates the sequence of transmitted symbols directly from the encoded time stamps, called time encodings, produced by the IF-TEM sampler on the received signal. We show that waveform reconstruction from time encodings is not necessary for symbol detection. We develop an analytical approximation for the symbol error probability (SEP) of the proposed IF-TEM-based receiver and show that it closely matches the SEP results obtained through Monte Carlo simulations. Additionally, we demonstrate that narrowing the 3 dB bandwidth of the transmit pulse shaping filter degrades the proposed IF-TEM receiver's performance, highlighting a trade-off between spectral efficiency and error resilience.
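The IF-TEM encoder itself is simple to state: integrate the (biased) input and emit a time stamp at each threshold crossing. A minimal sketch, with hypothetical threshold and bias values rather than the paper's receiver design:

```python
import numpy as np

def if_tem(signal, dt, threshold, bias):
    """Integrate-and-fire time encoding: integrate (signal + bias) and emit a
    time stamp each time the running integral crosses `threshold`."""
    stamps, acc = [], 0.0
    for k, s in enumerate(signal):
        acc += (s + bias) * dt
        if acc >= threshold:
            stamps.append(k * dt)   # firing time
            acc -= threshold        # reset the integrator
    return np.array(stamps)

# Encode one period of a cosine pulse; the bias keeps the integrand positive,
# so larger signal values simply produce denser firing times.
t = np.arange(0, 1, 1e-3)
pulse = 0.3 * np.cos(2 * np.pi * t)
stamps = if_tem(pulse, dt=1e-3, threshold=0.05, bias=0.61)
print(len(stamps))
```

The inter-stamp intervals carry the amplitude information, which is why the paper's receiver can detect symbols from the time stamps directly, without first reconstructing the waveform.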


[43] 2508.17710

Blind Channel Estimation for RIS-Assisted Millimeter Wave Communication Systems

In research on RIS-assisted communication systems, channel estimation is of vital importance for further performance optimization. To reduce the pilot overhead to the greatest extent, blind channel estimation methods are required, which estimate the channel and the transmit signals at the same time without transmitting pilot sequences. Different from existing research on traditional MIMO systems, the RIS-assisted two-hop channel brings new challenges to blind channel estimation design. Hence, a novel blind channel estimation method based on compressed sensing for RIS-assisted multiuser millimeter wave communication systems is proposed for the first time in this paper. Specifically, to accurately estimate the RIS-assisted two-hop channel without transmitting pilots, we propose a block-wise transmission scheme. Across different blocks of data transmission, the RIS elements are reconfigured to better estimate the cascaded channel. Inside each block, the data for each user are mapped to a codeword, enabling transmit signal recovery and equivalent channel estimation simultaneously. Simulation results demonstrate that our method achieves considerable accuracy in both channel estimation and transmit signal recovery.


[44] 2508.17717

Deception in Asymmetric Information Homicidal Chauffeur Game

The classic Homicidal Chauffeur game is a pursuit-evasion game played in an unbounded planar environment between a pursuer constrained to move with fixed speed on curves with bounded curvature, and a slower evader with fixed speed but with simple kinematics. We introduce a new variant of this game with asymmetric information in which the evader has the ability to choose its speed among a finite set of choices that is unknown to the pursuer a priori. Therefore the pursuer is required to estimate the evader's maximum speed based on the observations so far. This formulation leads to the question of whether the evader can exploit this asymmetry by moving deceptively by first picking a slower speed to move with and then switching to a faster speed when a specified relative configuration is attained to increase the capture time as compared to moving with the maximum speed at all times. Our contributions are as follows. First, we derive optimal feedback Nash equilibrium strategies for the complete information case of this game in which the evader is allowed to vary its speed in a given interval. Second, for the version with asymmetric information, we characterize regions of initial player locations in the game space from which the evader does not have any advantage in using deceptive strategies. Finally, we provide numerical evidence of regions in the game space from which the evader can increase the capture time by moving deceptively.


[45] 2508.17742

EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation of EEG Foundation Models

Electroencephalography (EEG) foundation models are poised to significantly advance brain signal analysis by learning robust representations from large-scale, unlabeled datasets. However, their rapid proliferation has outpaced the development of standardized evaluation benchmarks, which complicates direct model comparisons and hinders systematic scientific progress. This fragmentation fosters scientific inefficiency and obscures genuine architectural advancements. To address this critical gap, we introduce EEG-FM-Bench, the first comprehensive benchmark for the systematic and standardized evaluation of EEG foundation models (EEG-FMs). Our contributions are threefold: (1) we curate a diverse suite of downstream tasks and datasets from canonical EEG paradigms, implementing standardized processing and evaluation protocols within a unified open-source framework; (2) we benchmark prominent state-of-the-art foundation models to establish comprehensive baseline results for a clear comparison of the current landscape; (3) we perform qualitative analyses of the learned representations to provide insights into model behavior and inform future architectural design. Through extensive experiments, we find that fine-grained spatio-temporal feature interaction, multitask unified training and neuropsychological priors would contribute to enhancing model performance and generalization capabilities. By offering a unified platform for fair comparison and reproducible research, EEG-FM-Bench seeks to catalyze progress and guide the community toward the development of more robust and generalizable EEG-FMs. Code is released at this https URL.


[46] 2508.17768

Towards Trustworthy Breast Tumor Segmentation in Ultrasound using Monte Carlo Dropout and Deep Ensembles for Epistemic Uncertainty Estimation

Automated segmentation of breast ultrasound (BUS) images is important for precise lesion delineation and tumor characterization, but is challenged by inherent artifacts and dataset inconsistencies. In this work, we evaluate the use of a modified Residual Encoder U-Net for breast ultrasound segmentation, with a focus on uncertainty quantification. We identify and correct for data duplication in the BUSI dataset, and use a deduplicated subset for more reliable estimates of generalization performance. Epistemic uncertainty is quantified using Monte Carlo dropout, deep ensembles, and their combination. Models are benchmarked on both in-distribution and out-of-distribution datasets to demonstrate how they generalize to unseen cross-domain data. Our approach achieves state-of-the-art segmentation accuracy on the Breast-Lesion-USG dataset with in-distribution validation, and provides calibrated uncertainty estimates that effectively signal regions of low model confidence. Performance declines and increased uncertainty observed in out-of-distribution evaluation highlight the persistent challenge of domain shift in medical imaging, and the importance of integrated uncertainty modeling for trustworthy clinical deployment. Code available at: this https URL.


[47] 2508.17769

Multiple STAR-RISs-Empowered Multi-User Communications with Diversified QoS Provisioning

This paper proposes a quality-of-service (QoS)-aware multi-user communication framework facilitated by multiple simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RISs). The user groups are established based on their QoS requirements specified by the minimum data rate, which is provisioned by the optimized transmission and reflection configurations of the STAR-RISs. Particularly, we formulate an optimization problem to maximize the aggregate link rate across all users, under group-specified rate requirements by jointly considering the transmit beamforming and STAR-RIS configurations. Then, we employ the Lagrangian duality with quadratic transformation to tackle the non-convexity of the objective. We decompose the problem within a block coordinate descent framework, and the subproblems are solved through convex approximation and iterated to approach the optimal solution. Simulation results demonstrate the effectiveness of the proposed method in enhancing the system sum rate with guaranteed QoS performance for heterogeneous users, offering valuable insights for the deployment of STAR-RISs in future QoS-aware wireless networks.


[48] 2508.17772

A Comprehensive Incremental and Ensemble Learning Approach for Forecasting Individual Electric Vehicle Charging Parameters

Electric vehicles (EVs) have the potential to reduce grid stress through smart charging strategies while simultaneously meeting user demand. This requires accurate forecasts of key charging parameters, such as energy demand and connection time. Although previous studies have made progress in this area, they have overlooked the importance of dynamic training to capture recent patterns and have excluded EV sessions with limited information, missing potential opportunities to use these data. To address these limitations, this study proposes a dual-model approach incorporating incremental learning with six machine-learning models to predict EV charging session parameters. This approach includes dynamic training updates and the selection of an optimal feature and hyperparameter set for each model, making it more robust and inclusive. Using a dataset of 170,000 measurements from real-world EV charging sessions, week-long charging parameters were predicted over a one-year period. The findings reveal a significant difference between workplace and residential charging locations regarding connection-duration predictability, with workplace sessions being more predictable. The proposed stacking ensemble learning method enhanced forecasting accuracy, improving R2 by 2.83% to 43.44% across all parameters and location settings. A comparison of the two models reveals that incorporating user IDs as a feature, along with the associated historical data, is the most significant factor influencing forecast accuracy. The forecasts can be used effectively in smart charging and grid management applications by incorporating uncertainty quantification techniques, allowing charge point operators to optimize charging schedules and energy management.


[49] 2508.17774

Linear Power System Modeling and Analysis Across Wide Operating Ranges: A Hierarchical Neural State-Space Equation Approach

Developing a unified small-signal model for modern, large-scale power systems that remains accurate across wide operating ranges presents a formidable challenge. Traditional methods, spanning mechanistic modeling, modal identification, and deep learning, have yet to fully overcome persistent limitations in accuracy, universal applicability, and interpretability. In this paper, a novel hierarchical neural state-space equation approach is proposed to overcome these obstacles, achieving strong representation, high interpretability, and superior adaptability to both system scale and varying operating points. Specifically, we first introduce neural state-space equations integrated with virtual state observers to accurately characterize the dynamics of power system devices, even in the presence of unmeasurable states. Subsequently, a hierarchical architecture is designed to handle the modeling complexity across a wide range of operating conditions, flexibly decoupling device and grid models to effectively mitigate the curse of dimensionality. Finally, a set of spatiotemporal data transformations and a multi-stage training strategy with a multi-objective loss function are employed to enhance the model's efficiency and generalization. Numerical results on the two-machine three-bus system and the Guangdong Power Grid verify the superior performance of the proposed method, presenting it as a powerful new tool for small-signal stability analysis.


[50] 2508.17805

A Predictive Framework for Adversarial Energy Depletion in Inbound Threat Scenarios

This paper presents a predictive framework for adversarial energy-depletion defense against a maneuverable inbound threat (IT). The IT solves a receding-horizon problem to minimize its own energy while reaching a high-value asset (HVA) and avoiding interceptors and static lethal zones modeled by Gaussian barriers. Expendable interceptors (EIs), coordinated by a central node (CN), maintain proximity to the HVA and patrol centers via radius-based tether costs, deny attack corridors by harassing and containing the IT, and commit to intercept only when a geometric feasibility test is confirmed. No explicit opponent-energy term is used, and the formulation is optimization-implementable. No simulations are included.


[51] 2508.17840

Optimal Pairwise Comparison Procedures for Subjective Evaluation

Audio signal processing algorithms are frequently assessed through subjective listening tests in which participants directly score degraded signals on a unidimensional numerical scale. However, this approach is susceptible to inconsistencies in scale calibration between assessors. Pairwise comparisons between degraded signals offer a more intuitive alternative, eliciting the relative scores of candidate signals with lower measurement error and reduced participant fatigue. Yet, due to the quadratic growth of the number of necessary comparisons, a complete set of pairwise comparisons becomes infeasible for large datasets. This paper compares pairwise comparison procedures to identify the most efficient methods for approximating true quality scores with minimal comparisons. A novel sampling procedure is proposed and benchmarked against state-of-the-art methods on simulated datasets. Bayesian sampling produces the most robust score estimates among previously established methods, while the proposed procedure consistently converges fastest on the underlying ranking with comparable score accuracy.
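
A standard way to recover scores from pairwise comparisons is the Bradley-Terry model. The sketch below (a generic baseline under the classical minorization-maximization updates, not the paper's proposed sampling procedure; all quality values are illustrative) simulates comparisons from known qualities and recovers the ranking:

```python
# Score estimation from pairwise comparisons via the Bradley-Terry model,
# fitted with the classical minorization-maximization (MM) updates.
import random

def bradley_terry(wins, n, iters=200):
    """wins[i][j] = number of times item i beat item j; returns scores."""
    s = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])                       # total wins of item i
            denom = sum((wins[i][j] + wins[j][i]) / (s[i] + s[j])
                        for j in range(n) if j != i)
            new.append(w_i / denom if denom > 0 else s[i])
        total = sum(new)
        s = [v * n / total for v in new]             # normalize scale
    return s

# Simulate comparisons from known "true" qualities (illustrative values).
random.seed(0)
true_q = [1.0, 2.0, 4.0, 8.0]
n = len(true_q)
wins = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        for _ in range(200):                         # 200 comparisons per pair
            p_i = true_q[i] / (true_q[i] + true_q[j])
            if random.random() < p_i:
                wins[i][j] += 1
            else:
                wins[j][i] += 1

scores = bradley_terry(wins, n)
ranking = sorted(range(n), key=lambda k: scores[k])  # worst to best
```

With well-separated qualities the fitted scores recover the true ordering; the interesting regime studied in the paper is how few comparisons suffice when a full pairing is infeasible.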


[52] 2508.17852

Cross-Domain Lifelong Reinforcement Learning for Wireless Sensor Networks

Wireless sensor networks (WSNs) with energy harvesting (EH) are expected to play a vital role in intelligent 6G systems, especially in industrial sensing and control, where continuous operation and sustainable energy use are critical. Given limited energy resources, WSNs must operate efficiently to ensure long-term performance. Their deployment, however, is challenged by dynamic environments where EH conditions, network scale, and traffic rates change over time. In this work, we address system dynamics that yield different learning tasks, where decision variables remain fixed but strategies vary, as well as learning domains, where both decision space and strategies evolve. To handle such scenarios, we propose a cross-domain lifelong reinforcement learning (CD-L2RL) framework for energy-efficient WSN design. Our CD-L2RL algorithm leverages prior experience to accelerate adaptation across tasks and domains. Unlike conventional approaches based on Markov decision processes or Lyapunov optimization, which assume relatively stable environments, our solution achieves rapid policy adaptation by reusing knowledge from past tasks and domains to ensure continuous operations. We validate the approach through extensive simulations under diverse conditions. Results show that our method improves adaptation speed by up to 35% over standard reinforcement learning and up to 70% over Lyapunov-based optimization, while also increasing total harvested energy. These findings highlight the strong potential of CD-L2RL for deployment in dynamic 6G WSNs.


[53] 2508.17873

Compressed Learning for Nanosurface Deficiency Recognition Using Angle-resolved Scatterometry Data

Nanoscale manufacturing requires high-precision surface inspection to guarantee the quality of the produced nanostructures. For production environments, angle-resolved scatterometry offers a non-invasive and in-line compatible alternative to traditional surface inspection methods, such as scanning electron microscopy. However, angle-resolved scatterometry currently suffers from long data acquisition time. Our study addresses the issue of slow data acquisition by proposing a compressed learning framework for the accurate recognition of nanosurface deficiencies using angle-resolved scatterometry data. The framework uses the particle swarm optimization algorithm with a sampling scheme customized for scattering patterns. This combination allows the identification of optimal sampling points in scatterometry data that maximize the detection accuracy of five different levels of deficiency in ZnO nanosurfaces. The proposed method significantly reduces the amount of sampled data while maintaining a high accuracy in deficiency detection, even in noisy environments. Notably, by sampling only 1% of the data, the method achieves an accuracy of over 86%, which further improves to 94% when the sampling rate is increased to 6%. These results demonstrate a favorable balance between data reduction and classification performance. The obtained results also show that the compressed learning framework effectively identifies critical sampling areas.
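
The particle swarm optimization (PSO) core of such a framework can be sketched generically. The toy below shows only the standard PSO update rule on an assumed illustrative objective, not the paper's scatterometry-specific sampling scheme:

```python
# Minimal particle swarm optimization (PSO) sketch: particles track their
# personal best and are attracted toward the swarm's global best.
import random

def pso(f, dim, n_particles=20, iters=100, seed=1):
    rng = random.Random(seed)
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # each particle's best position
    pbest_val = [f(p) for p in pos]
    g = pbest[min(range(n_particles), key=lambda i: pbest_val[i])][:]
    w, c1, c2 = 0.7, 1.5, 1.5                    # inertia and attraction weights
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (g[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = f(pos[i])
            if val < pbest_val[i]:               # improved personal best
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < f(g):                   # improved global best
                    g = pos[i][:]
    return g

best = pso(lambda p: sum(x * x for x in p), dim=3)  # minimum at the origin
```

In the paper's setting, the objective would instead score a candidate set of sampling points by the downstream deficiency-classification accuracy.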


[54] 2508.17920

Prompt-based Multimodal Semantic Communication for Multi-spectral Image Segmentation

Multimodal semantic communication has gained widespread attention due to its ability to enhance downstream task performance. A key challenge in such systems is the effective fusion of features from different modalities, which requires the extraction of rich and diverse semantic representations from each modality. To this end, we propose ProMSC-MIS, a Prompt-based Multimodal Semantic Communication system for Multi-spectral Image Segmentation. Specifically, we propose a pre-training algorithm where features from one modality serve as prompts for another, guiding unimodal semantic encoders to learn diverse and complementary semantic representations. We further introduce a semantic fusion module that combines cross-attention mechanisms and squeeze-and-excitation (SE) networks to effectively fuse cross-modal features. Simulation results show that ProMSC-MIS significantly outperforms benchmark methods across various channel-source compression levels, while maintaining low computational complexity and storage overhead. Our scheme has great potential for applications such as autonomous driving and nighttime surveillance.


[55] 2508.17942

Synchrosqueezed X-Ray Wavelet-Chirplet Transform for Accurate Chirp Rate Estimation and Retrieval of Modes from Multicomponent Signals with Crossover Instantaneous Frequencies

Recent advances in the chirplet transform and wavelet-chirplet transform (WCT) have enabled the estimation of instantaneous frequencies (IFs) and chirprates, as well as mode retrieval from multicomponent signals with crossover IF curves. However, chirprate estimation via these approaches remains less accurate than IF estimation, primarily due to the slow decay of the chirplet transform or WCT along the chirprate direction. To address this, the synchrosqueezed chirplet transform (SCT) and multiple SCT methods were proposed, achieving moderate improvements in IF and chirprate estimation accuracy. Nevertheless, a novel approach is still needed to enhance the transform's decay along the chirprate direction. This paper introduces an X-ray transform-based wavelet-chirprate transform, termed the X-ray wavelet-chirplet transform (XWCT), which exhibits superior decay along the chirprate direction compared to the WCT. Furthermore, third-order synchrosqueezed variants of the WCT and XWCT are developed to yield sharp time-frequency-chirprate representations of signals. Experimental results demonstrate that the XWCT achieves significantly faster decay along the chirprate axis, while the third-order synchrosqueezed XWCT enables accurate IF and chirprate estimation, as well as mode retrieval, without requiring multiple synchrosqueezing operations.


[56] 2508.17960

A Unified Transformer Architecture for Low-Latency and Scalable Wireless Signal Processing

We propose a unified Transformer-based architecture for wireless signal processing tasks, offering a low-latency, task-adaptive alternative to conventional receiver pipelines. Unlike traditional modular designs, our model integrates channel estimation, interpolation, and demapping into a single, compact attention-driven architecture designed for real-time deployment. The model's structure allows dynamic adaptation to diverse output formats by simply modifying the final projection layer, enabling consistent reuse across receiver subsystems. Experimental results demonstrate strong generalization to varying user counts, modulation schemes, and pilot configurations, while satisfying latency constraints imposed by practical systems. The architecture is evaluated across three core use cases: (1) an End-to-End Receiver, which replaces the entire baseband processing pipeline from pilot symbols to bit-level decisions; (2) Channel Frequency Interpolation, implemented and tested within a 3GPP-compliant OAI+Aerial system; and (3) Channel Estimation, where the model infers full-band channel responses from sparse pilot observations. In all cases, our approach outperforms classical baselines in terms of accuracy, robustness, and computational efficiency. This work presents a deployable, data-driven alternative to hand-engineered PHY-layer blocks, and lays the foundation for intelligent, software-defined signal processing in next-generation wireless communication systems.


[57] 2508.17965

TuningIQA: Fine-Grained Blind Image Quality Assessment for Livestreaming Camera Tuning

Livestreaming has become increasingly prevalent in modern visual communication, where automatic camera quality tuning is essential for delivering superior user Quality of Experience (QoE). Such tuning requires accurate blind image quality assessment (BIQA) to guide parameter optimization decisions. Unfortunately, the existing BIQA models typically only predict an overall coarse-grained quality score, which cannot provide fine-grained perceptual guidance for precise camera parameter tuning. To bridge this gap, we first establish FGLive-10K, a comprehensive fine-grained BIQA database containing 10,185 high-resolution images captured under varying camera parameter configurations across diverse livestreaming scenarios. The dataset features 50,925 multi-attribute quality annotations and 19,234 fine-grained pairwise preference annotations. Based on FGLive-10K, we further develop TuningIQA, a fine-grained BIQA metric for livestreaming camera tuning, which integrates human-aware feature extraction and graph-based camera parameter fusion. Extensive experiments and comparisons demonstrate that TuningIQA significantly outperforms state-of-the-art BIQA methods in both score regression and fine-grained quality ranking, achieving superior performance when deployed for livestreaming camera tuning.


[58] 2508.17980

Objective and Subjective Evaluation of Diffusion-Based Speech Enhancement for Dysarthric Speech

Dysarthric speech poses significant challenges for automatic speech recognition (ASR) systems due to its high variability and reduced intelligibility. In this work, we explore the use of diffusion models for dysarthric speech enhancement, based on the hypothesis that diffusion-based enhancement moves the distribution of dysarthric speech closer to that of typical speech, which could potentially improve dysarthric speech recognition performance. We assess the effect of two diffusion-based and one signal-processing-based speech enhancement algorithms on the intelligibility and speech quality of two English dysarthric speech corpora. We apply speech enhancement to both typical and dysarthric speech, evaluate ASR performance using Whisper-Turbo, and assess the subjective and objective speech quality of the original and enhanced dysarthric speech. We also fine-tune Whisper-Turbo on the enhanced speech to assess its impact on recognition performance.


[59] 2508.18006

Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters

In this paper we investigate cross-lingual Text-To-Speech (TTS) synthesis through the lens of adapters, in the context of lightweight TTS systems. In particular, we compare the tasks of unseen speaker and language adaptation with the goal of synthesising a target voice in a target language, in which the target voice has no recordings therein. Results from objective evaluations demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information, allowing pre-trained models to learn unseen speaker identities or languages, while avoiding catastrophic forgetting of the original model's speaker or language information. Additionally, to measure how native the generated voices are in terms of accent, we propose and validate an objective metric inspired by mispronunciation detection techniques in second-language (L2) learners. The paper also provides insights into the impact of adapter placement, configuration and the number of speakers used.


[60] 2508.18009

Positioning via Probabilistic Graphical Models in RIS-Aided Systems with Channel Estimation Errors

We propose a 6D Bayesian localization framework to estimate the position and rotation angles of a mobile station (MS) within an indoor reconfigurable intelligent surface (RIS)-aided system. This framework relies on a probabilistic graphical model to represent the joint probability distribution of random variables through their conditional dependencies and employs the No-U-Turn Sampler (NUTS) to approximate the posterior distribution based on the estimated channel parameters. Our framework estimates both the position and rotation of the MS in the presence of channel parameter estimation errors. We derive the Cramer-Rao lower bound (CRLB) for the proposed scenario and use it to evaluate the system's position error bound (PEB) and rotation error bound (REB). We compare system performance with and without the RIS. The results demonstrate that the RIS can enhance positioning accuracy significantly.


[61] 2508.18201

On Asymptotic Analysis of the Two-Stage Approach: Towards Data-Driven Parameter Estimation

In this paper, we analyze the asymptotic properties of the Two-Stage (TS) estimator -- a simulation-based parameter estimation method that constructs estimators offline from synthetic data. While TS offers significant computational advantages compared to standard approaches to estimation, its statistical properties have not been previously analyzed in the literature. Under simple assumptions, we establish that the TS estimator is strongly consistent and asymptotically normal, providing the first theoretical guarantees for this class of estimators.
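
The two-stage idea, building an estimator offline from synthetic data and then applying it to observations, can be sketched in miniature. The AR(1) model, lag-1 statistic, and linear map below are illustrative assumptions, not the paper's setup:

```python
# Two-Stage (TS) estimation sketch. Stage 1 (offline): simulate synthetic
# data over a grid of parameter values, compute a summary statistic, and
# fit a least-squares map statistic -> parameter. Stage 2 (online): apply
# the learned map to observed data.
import random

def simulate_ar1(theta, n, rng):
    x, out = 0.0, []
    for _ in range(n):
        x = theta * x + rng.gauss(0, 1)
        out.append(x)
    return out

def lag1_corr(xs):
    n = len(xs)
    m = sum(xs) / n
    num = sum((xs[t] - m) * (xs[t - 1] - m) for t in range(1, n))
    den = sum((v - m) ** 2 for v in xs)
    return num / den

rng = random.Random(7)
# Stage 1: synthetic (statistic, parameter) pairs, then least-squares fit
# of theta ~ a + b * statistic.
pairs = [(lag1_corr(simulate_ar1(th, 2000, rng)), th)
         for th in [i / 20 for i in range(-15, 16)] for _ in range(5)]
n = len(pairs)
sx = sum(s for s, _ in pairs); sy = sum(t for _, t in pairs)
sxx = sum(s * s for s, _ in pairs); sxy = sum(s * t for s, t in pairs)
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n

# Stage 2: apply the offline-built estimator to a new observed series.
observed = simulate_ar1(0.6, 2000, rng)
theta_hat = a + b * lag1_corr(observed)
```

The computational advantage the paper analyzes is that all simulation cost is paid offline; the online estimator is a cheap function evaluation, and the paper's contribution is proving consistency and asymptotic normality for this construction.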


[62] 2508.18203

Tractable Stochastic Hybrid Model Predictive Control using Gaussian Processes for Repetitive Tasks in Unseen Environments

Improving the predictive accuracy of a dynamics model is crucial to obtaining good control performance and safety from Model Predictive Controllers (MPC). One approach involves learning unmodelled (residual) dynamics, in addition to nominal models derived from first principles. Varying residual models across an environment manifest as modes of a piecewise residual (PWR) model that requires a) identifying how modes are distributed across the environment and b) solving a computationally intensive Mixed Integer Nonlinear Program (MINLP) problem for control. We develop an iterative mapping algorithm capable of predicting time-varying mode distributions. We then develop and solve two tractable approximations of the MINLP to combine with the predictor in closed-loop to solve the overall control problem. In simulation, we first demonstrate how the approximations improve performance by 4-18% in comparison to the MINLP while achieving significantly lower computation times (up to 250x faster). We then demonstrate how the proposed mapping algorithm incrementally improves controller performance (up to 3x) over multiple iterations of a trajectory tracking control task even when the mode distributions change over time.


[63] 2508.18214

AI Data Centers Need Pioneers to Deliver Scalable Power via Offgrid AI

The scalable computing revolution of the late '80s through mid-'00s forged a new technical and economic model for computing that delivered massive societal impact, but its economic benefit has driven scalability to sizes that are now exhausting the energy grid's capacity. Our time demands a new revolution in scalable energy, mirroring in key ways the scalable computing revolution; e.g., compelling economic forces, use of mass-market components, overcoming foibles of those components, judicious use of physical locality, and the difficult integration into an effective system. The offgrid AI approach closely fits this mold, combining local, mostly renewable, generation and storage to power an AI data center, starting offgrid. Obstacles to delivering this approach are social, technical, and project-related, but the potential is massive. I argue that the offgrid-AI approach needs pioneers among both system developers and AI-data-center operators to move it quickly from concept to large-scale deployment.


[64] 2508.18246

Flight-Ready Precise and Robust Carrier-Phase GNSS Navigation Software for Distributed Space Systems

This paper presents the full requirements analysis, design, development, and testing of high-precision navigation flight software for Distributed Space Systems (DSS) using Carrier Phase Differential GNSS (CDGNSS). Five main contributions are made. First, a survey of flown and upcoming DSS missions with stringent precision requirements is conducted, from which a thorough requirements analysis is distilled to guide development and testing. Second, a real-time navigation functional architecture is designed, and adopts a sparse and regularized Consider Kalman Filter with options for numerical stability in-flight. The filter rigorously accounts for uncertainties in process noise, measurement noise, and biases. It tracks float ambiguities with integer resolution where possible. The covariance correlation structure is preserved under all navigation modes, including contingencies and outages. Third, a lightweight, memoryless Fault Detection, Isolation, and Recovery (FDIR) module is developed to guard against anomalous measurements, providing statistical screening and ensuring robust navigation. Fourth, the software architecture is proposed for ease of integration, with strategies presented for modularity and computational efficiency tailored to constrained flight systems. Fifth, a comprehensive test campaign is conducted, mapped to a requirements verification matrix, spanning unit, interface, software-in-the-loop, and real-time hardware-in-the-loop tests, emphasizing gradual test fidelity for efficient fault isolation. Finally, flight-like results are demonstrated using the VISORS mission, chosen for the generalizability of its navigation operations and its stringent requirements of sub-centimeter relative position and sub-millimeter-per-second velocity accuracy. This architecture aims to serve as a reference for next-generation DSS missions adopting CDGNSS.
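
The recursive predict/update structure at the heart of any such filter can be shown in its simplest form. The sketch below is a textbook scalar Kalman filter, a drastic simplification of the paper's sparse Consider Kalman Filter (no ambiguity states, no consider parameters; all noise values are illustrative):

```python
# Textbook scalar Kalman filter: estimate a near-constant state from a
# stream of noisy measurements via alternating predict and update steps.
import random

def kalman_1d(zs, q, r, x0=0.0, p0=1.0):
    """q: process noise variance, r: measurement noise variance."""
    x, p = x0, p0
    for z in zs:
        p = p + q                 # predict: process noise inflates covariance
        k = p / (p + r)           # Kalman gain
        x = x + k * (z - x)       # update with the measurement residual
        p = (1 - k) * p           # covariance shrinks after the update
    return x, p

rng = random.Random(3)
truth = 5.0
zs = [truth + rng.gauss(0, 0.5) for _ in range(200)]  # noisy measurements
x_hat, p_final = kalman_1d(zs, q=1e-6, r=0.25)
```

The flight filter extends this skeleton with a full state vector, consider parameters whose uncertainty is accounted for but not estimated, and integer ambiguity resolution on the carrier-phase states.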


[65] 2508.08237

VGGSounder: Audio-Visual Evaluations for Foundation Models

The emergence of audio-visual foundation models underscores the importance of reliably assessing their multi-modal understanding. The VGGSound dataset is commonly used as a benchmark for evaluating audio-visual classification. However, our analysis identifies several limitations of VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These lead to distorted evaluations of auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set that extends VGGSound and is specifically designed to evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance. Furthermore, we reveal model limitations by analysing performance degradation when adding another input modality with our new modality confusion metric.


[66] 2508.16667

BrainPath: Generating Subject-Specific Brain Aging Trajectories

Quantifying and forecasting individual brain aging trajectories is critical for understanding neurodegenerative disease and the heterogeneity of aging, yet current approaches remain limited. Most models predict chronological age, an imperfect surrogate for biological aging, or generate synthetic MRIs that enhance data diversity but fail to capture subject-specific trajectories. Here, we present BrainPath, a 3D generative framework that learns longitudinal brain aging dynamics during training and, at inference, predicts anatomically faithful MRIs at arbitrary timepoints from a single baseline scan. BrainPath integrates an age calibration loss, a swap learning strategy, and an age perceptual loss to preserve subtle, biologically meaningful variations. Across held-out ADNI and an independent NACC dataset, BrainPath outperforms state-of-the-art reference models in structural similarity (SSIM), mean squared error (MSE), peak signal-to-noise ratio (PSNR), and MRI age-difference accuracy, while capturing realistic and temporally consistent aging patterns. Beyond methodological innovation, BrainPath enables personalized mapping of brain aging, synthetic follow-up scan prediction, and trajectory-based analyses, providing a foundation for precision modeling of brain aging and supporting research into neurodegeneration and aging interventions.


[67] 2508.16774

CarboNet: A Finite-Time Combustion-Tolerant Compartmental Network for Tropospheric Carbon Control

While governments and international organizations have set the net-zero target to prevent a climate event horizon, practical solutions are lacking, mainly because of the impracticability of completely replacing combustion processes. Hence, in this paper, we first design a compartmental network whose states must remain in the nonnegative orthant for physical consistency and in which the carbon dioxide emissions result from the combustion of diesel in vehicles and gas in house heaters. Then, we design both full-state and output-feedback linear-quadratic regulators for the compartmental network to bring the mass of carbon dioxide back to pre-industrial levels, which are reached in approximately 25 and 60 days, respectively. The output feedback tolerates for 6 days the combustion taking place in 5,000 vehicles and 10,000 house heating systems, meets the net-zero target, and nullifies the extraction of finite natural resources. The closed-loop tropospheric temperature reaches equilibrium at 133 °C after 16.4 years; while such a high value calls for further investigation of the temperature dynamics model together with climate experts, this work is a first step in designing optimal network control systems for climate stability. Source code is publicly available.


[68] 2508.16790

TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling

Speech tokenizers serve as foundational components for speech language models, yet current designs exhibit several limitations, including: 1) dependence on multi-layer residual vector quantization structures or high frame rates, 2) reliance on auxiliary pre-trained models for semantic distillation, and 3) requirements for complex two-stage training processes. In this work, we introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach designed to overcome these challenges. TaDiCodec employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS). Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. We also validate the compatibility of TaDiCodec in language-model-based zero-shot text-to-speech with both autoregressive modeling and masked generative modeling, demonstrating its effectiveness and efficiency for speech language modeling, as well as a remarkably small reconstruction-generation gap. We release code and model checkpoints at https://github.com/HeCheng0625/Diffusion-Speech-Tokenizer; audio samples are available at https://tadicodec.github.io/.


[69] 2508.16807

Autonomous UAV Flight Navigation in Confined Spaces: A Reinforcement Learning Approach

Inspecting confined industrial infrastructure, such as ventilation shafts, is a hazardous and inefficient task for humans. Unmanned Aerial Vehicles (UAVs) offer a promising alternative, but GPS-denied environments require robust control policies to prevent collisions. Deep Reinforcement Learning (DRL) has emerged as a powerful framework for developing such policies, and this paper provides a comparative study of two leading DRL algorithms for this task: the on-policy Proximal Policy Optimization (PPO) and the off-policy Soft Actor-Critic (SAC). Training was conducted in procedurally generated duct environments in the Genesis simulation environment. A reward function was designed to guide a drone through a series of waypoints while applying a significant penalty for collisions. PPO learned a stable policy that completed all evaluation episodes without collision, producing smooth trajectories. By contrast, SAC consistently converged to a suboptimal behavior that traversed only the initial segments before failure. These results suggest that, in hazard-dense navigation, the training stability of on-policy methods can outweigh the nominal sample efficiency of off-policy algorithms. More broadly, the study provides evidence that procedurally generated, high-fidelity simulations are effective testbeds for developing and benchmarking robust navigation policies.


[70] 2508.16817

Predictability Enables Parallelization of Nonlinear State Space Models

The rise of parallel computing hardware has made it increasingly important to understand which nonlinear state space models can be efficiently parallelized. Recent advances like DEER (arXiv:2309.12252) or DeepPCR (arXiv:2309.16318) have shown that evaluating a state space model can be recast as solving a parallelizable optimization problem, and sometimes this approach can yield dramatic speed-ups in evaluation time. However, the factors that govern the difficulty of these optimization problems remain unclear, limiting the larger adoption of the technique. In this work, we establish a precise relationship between the dynamics of a nonlinear system and the conditioning of its corresponding optimization formulation. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior, impacts the number of optimization steps required for evaluation. In predictable systems, the state trajectory can be computed in $O((\log T)^2)$ time, where $T$ is the sequence length, a major improvement over the conventional sequential approach. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis demonstrates that for predictable systems, the optimization problem is always well-conditioned, whereas for unpredictable systems, the conditioning degrades exponentially as a function of the sequence length. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized, and highlighting predictability as a key design principle for parallelizable models.
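
The fixed-point recasting can be sketched in miniature. The toy below uses a hand-picked contractive map (an illustrative stand-in for a "predictable" system) and evaluates the recurrence by plain Picard/Jacobi sweeps, in which every time step's update is independent and hence parallelizable; note the paper's O((log T)^2) result relies on Newton-type iterations rather than this simplest scheme:

```python
# Parallel-evaluation sketch: the recurrence x_{t+1} = f(x_t) is recast as
# a fixed point of G(X) = (x0, f(X_0), ..., f(X_{T-1})) and solved by
# Jacobi-style sweeps. For a contractive f, the sweeps converge to the
# sequential trajectory, and each sweep's T map evaluations are independent.
import math

def f(x):
    return 0.5 * math.tanh(x) + 0.1   # contractive map: |f'(x)| <= 0.5

T, x0 = 64, 0.0

# Ground truth by conventional sequential evaluation.
seq = [x0]
for _ in range(T):
    seq.append(f(seq[-1]))

# Jacobi iteration over the whole trajectory at once.
X = [x0] * (T + 1)                    # initial guess for the trajectory
sweeps = 0
while True:
    newX = [x0] + [f(X[t]) for t in range(T)]   # T independent evaluations
    err = max(abs(a - b) for a, b in zip(newX, X))
    X, sweeps = newX, sweeps + 1
    if err < 1e-10:
        break

max_dev = max(abs(a - b) for a, b in zip(X, seq))
```

Because the map contracts by at least 1/2 per sweep, the iteration settles in far fewer than T sweeps; for a chaotic map the same scheme would need on the order of T sweeps, matching the paper's conditioning argument.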


[71] 2508.16819

Fairness of Energy Distribution Mechanisms in Collective Self-Consumption Schemes

In several European countries, regulatory frameworks now allow households to form energy communities and trade energy locally via local energy markets (LEMs). While multiple mechanisms exist to allocate locally produced energy among members, their fairness remains insufficiently understood despite energy justice being a key concern for communities. This paper first provides a thorough description of the collective self-consumption process in France, offering a real-world framework for researchers. We then review the main types of fairness relevant to LEMs and identify appropriate indicators for each, including a new scalable indicator to evaluate meritocratic fairness. Using simulations across 250 randomly generated residential communities of 20 households, we assess and compare fairness across different LEM distribution mechanisms. Results show that average financial savings reach 12% with 40% PV uptake. Among the four widely used LEM mechanisms assessed, glass-filling with prioritization yields the highest egalitarian and min-max fairness. Double auction and pro-rata schemes promote meritocracy, while standard glass-filling offers a strong balance across fairness objectives.
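
Two of the mechanism families compared above can be sketched in minimal form. The pro-rata rule shares production in proportion to demand, while the prioritized variant below fills consumers in priority order; both interfaces and the numbers are illustrative assumptions, and real glass-filling schemes typically fill level by level rather than strictly greedily:

```python
# Minimal allocation-mechanism sketch for a collective self-consumption
# community: distribute local production among household demands.

def pro_rata(production, demands):
    """Each household receives production in proportion to its demand."""
    total = sum(demands)
    if total == 0:
        return [0.0] * len(demands)
    return [min(d, production * d / total) for d in demands]

def glass_filling_priority(production, demands, priority):
    """Strict-priority variant: lower priority value is served first."""
    alloc = [0.0] * len(demands)
    remaining = production
    for i in sorted(range(len(demands)), key=lambda i: priority[i]):
        alloc[i] = min(demands[i], remaining)
        remaining -= alloc[i]
    return alloc

demands = [3.0, 2.0, 5.0]
pr = pro_rata(8.0, demands)                               # 80% of each demand
gf = glass_filling_priority(8.0, demands, priority=[1, 0, 2])
```

Fairness indicators then score the resulting allocation vectors, e.g. egalitarian criteria favor allocations with small spread in satisfied-demand fractions, while meritocratic criteria reward households in proportion to their contribution.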


[72] 2508.16830

AIM 2025 Low-light RAW Video Denoising Challenge: Dataset, Methods and Results

This paper reviews the AIM 2025 (Advances in Image Manipulation) Low-Light RAW Video Denoising Challenge. The task is to develop methods that denoise low-light RAW video by exploiting temporal redundancy while operating under exposure-time limits imposed by frame rate and adapting to sensor-specific, signal-dependent noise. We introduce a new benchmark of 756 ten-frame sequences captured with 14 smartphone camera sensors across nine conditions (illumination: 1/5/10 lx; exposure: 1/24, 1/60, 1/120 s), with high-SNR references obtained via burst averaging. Participants process linear RAW sequences and output the denoised 10th frame while preserving the Bayer pattern. Submissions are evaluated on a private test set using full-reference PSNR and SSIM, with final ranking given by the mean of per-metric ranks. This report describes the dataset, challenge protocol, and submitted approaches.


[73] 2508.16852

Gaussian Primitive Optimized Deformable Retinal Image Registration

Deformable retinal image registration is notoriously difficult due to large homogeneous regions and sparse but critical vascular features, which cause limited gradient signals in standard learning-based frameworks. In this paper, we introduce Gaussian Primitive Optimization (GPO), a novel iterative framework that performs structured message passing to overcome these challenges. After an initial coarse alignment, we extract keypoints at salient anatomical structures (e.g., major vessels) to serve as a minimal set of descriptor-based control nodes (DCN). Each node is modelled as a Gaussian primitive with trainable position, displacement, and radius, thus adapting its spatial influence to local deformation scales. A K-Nearest Neighbors (KNN) Gaussian interpolation then blends and propagates displacement signals from these information-rich nodes to construct a globally coherent displacement field; focusing interpolation on the top-K neighbors reduces computational overhead while preserving local detail. By strategically anchoring nodes in high-gradient regions, GPO ensures robust gradient flow, mitigating vanishing gradient signal in textureless areas. The framework is optimized end-to-end via a multi-term loss that enforces both keypoint consistency and intensity alignment. Experiments on the FIRE dataset show that GPO reduces the target registration error from 6.2 px to ~2.4 px and increases the AUC at 25 px from 0.770 to 0.938, substantially outperforming existing methods. The source code can be accessed via this https URL.
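
The KNN Gaussian interpolation step can be sketched generically. The code below (node positions, displacements, and radii are illustrative assumptions, and the paper's nodes are trainable rather than fixed) blends the displacements of the K nearest control nodes with Gaussian weights:

```python
# Sketch of KNN Gaussian interpolation of a dense displacement field from
# a few control nodes: each node is a Gaussian primitive with a position,
# displacement, and radius; each query point blends its K nearest nodes.
import math

def interpolate(point, nodes, k=2):
    """nodes: list of (x, y, dx, dy, radius). Returns blended (dx, dy)."""
    # pick the K nearest control nodes
    nearest = sorted(nodes,
                     key=lambda n: (n[0] - point[0]) ** 2
                                   + (n[1] - point[1]) ** 2)[:k]
    wsum, dx, dy = 0.0, 0.0, 0.0
    for x, y, ndx, ndy, r in nearest:
        d2 = (x - point[0]) ** 2 + (y - point[1]) ** 2
        w = math.exp(-d2 / (2 * r * r))   # Gaussian spatial influence
        wsum += w
        dx += w * ndx
        dy += w * ndy
    return (dx / wsum, dy / wsum) if wsum > 0 else (0.0, 0.0)

nodes = [(0.0, 0.0, 1.0, 0.0, 5.0),   # node displacing toward +x
         (10.0, 0.0, 0.0, 1.0, 5.0)]  # node displacing toward +y
mid = interpolate((5.0, 0.0), nodes)   # halfway: equal blend of both
near0 = interpolate((1.0, 0.0), nodes) # dominated by the closer node
```

Restricting the blend to the top-K neighbors is what keeps the field construction cheap while the trainable radii adapt each node's footprint to the local deformation scale.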


[74] 2508.16887

MDIQA: Unified Image Quality Assessment for Multi-dimensional Evaluation and Restoration

Recent advancements in image quality assessment (IQA), driven by sophisticated deep neural network designs, have significantly improved the ability to approach human perceptions. However, most existing methods focus on fitting the overall score, neglecting the fact that humans typically evaluate image quality from different dimensions before arriving at an overall quality assessment. To overcome this problem, we propose a multi-dimensional image quality assessment (MDIQA) framework. Specifically, we model image quality across various perceptual dimensions, including five technical and four aesthetic dimensions, to capture the multifaceted nature of human visual perception within distinct branches. Each branch of our MDIQA is initially trained under the guidance of a separate dimension, and the respective features are then amalgamated to generate the final IQA score. Additionally, when the MDIQA model is ready, we can deploy it for a flexible training of image restoration (IR) models, enabling the restoration results to better align with varying user preferences through the adjustment of perceptual dimension weights. Extensive experiments demonstrate that our MDIQA achieves superior performance and can be effectively and flexibly applied to image restoration tasks. The code is available: this https URL.


[75] 2508.16901

Relative Navigation and Dynamic Target Tracking for Autonomous Underwater Proximity Operations

Estimating a target's 6-DoF motion in underwater proximity operations is difficult because the chaser lacks target-side proprioception and the available relative observations are sparse, noisy, and often partial (e.g., Ultra-Short Baseline (USBL) positions). Without a motion prior, factor-graph maximum a posteriori estimation is underconstrained: consecutive target states are weakly linked and orientation can drift. We propose a generalized constant-twist motion prior defined on the tangent space of Lie groups that enforces temporally consistent trajectories across all degrees of freedom; in SE(3) it couples translation and rotation in the body frame. We present a ternary factor and derive its closed-form Jacobians based on standard Lie group operations, enabling drop-in use for trajectories on arbitrary Lie groups. We evaluate two deployment modes: (A) an SE(3)-only representation that regularizes orientation even when only position is measured, and (B) a mode with boundary factors that switches the target representation between SE(3) and 3D position while applying the same generalized constant-twist prior across representation changes. Validation on a real-world dynamic docking scenario dataset shows consistent ego-target trajectory estimation through USBL-only and optical relative measurement segments, with improved relative tracking accuracy compared to the raw noisy measurements of the target. Because the construction relies on standard Lie group primitives, it is portable across state manifolds and sensing modalities.


[76] 2508.16933

TSPC-PFD: TSPC-Based Low-Power High-Resolution CMOS Phase Frequency Detector

Phase Frequency Detectors (PFDs) are essential components in Phase-Locked Loop (PLL) and Delay-Locked Loop (DLL) systems, responsible for comparing phase and frequency differences and generating up/down signals that regulate charge pumps and, consequently, Voltage-Controlled Oscillators (VCOs). Conventional PFD designs often suffer from significant dead zones and blind zones, which degrade phase detection accuracy and increase jitter in high-speed applications. This paper addresses PFD design challenges and presents a novel low-power True Single-Phase Clock (TSPC)-based PFD. The proposed design eliminates the blind zone entirely while achieving a minimal dead zone of 40 ps. The proposed PFD, implemented using TSMC 28 nm technology, demonstrates a low power consumption of 4.41 $\mu$W at a 3 GHz input frequency with a layout area of $10.42\mu m^2$.


[77] 2508.16944

Stability Optimization and Analysis of Energy Flow Networks versus Different Centrality Measurement

Optimizing the stability and control performance of complex networks often hinges on effectively identifying critical nodes for targeted intervention. Due to their inherent complexity and high dimensionality, large-scale energy flow networks, prevalent in domains like power grids, transportation, and financial systems, present unique challenges in selecting optimal nodes for resource allocation. While numerous centrality measurements, such as Katz centrality, eigenvector centrality, closeness centrality, betweenness centrality, and PageRank, have been proposed to evaluate node importance, the impact of different centrality metrics on stability outcomes remains inadequately understood. Moreover, networks manifest diverse structural characteristics, including small-world, scale-free, and random-graph properties, which further complicates the optimization problem. This paper systematically investigates how various node centrality measurements influence control stability across representative complex network structures. A unified energy-flow dynamical model is developed, and performance metrics such as the L1 norm are employed to quantify the network stability implications of employing different centrality metrics. Extensive numerical simulations over statistically generated network ensembles reveal significant variance in stability outcomes, highlighting the crucial role of centrality selection. The findings underscore the sensitivity of energy-flow stability to seemingly minor changes in topological node rankings, providing practical insights for enhancing control efficiency and robustness in real-world networked systems.
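The centrality-driven node selection underlying the study can be illustrated with two of the named measures computed by plain power iteration. This is a toy sketch on a 5-node star graph (function names and the damping choice are illustrative, not the paper's code):

```python
import numpy as np

def pagerank(A, d=0.85, iters=100):
    """PageRank via power iteration on the column-normalized adjacency matrix."""
    n = A.shape[0]
    M = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
    r = np.ones(n) / n
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r
    return r

def eigenvector_centrality(A, iters=200):
    """Dominant eigenvector of the adjacency matrix via power iteration."""
    x = np.ones(A.shape[0])
    for _ in range(iters):
        x = A @ x
        x /= np.linalg.norm(x)
    return x

# A small star graph: node 0 is the hub, nodes 1-4 are leaves.
A = np.zeros((5, 5))
A[0, 1:] = 1.0
A[1:, 0] = 1.0
pr = pagerank(A)
ev = eigenvector_centrality(A)
top_pr, top_ev = int(np.argmax(pr)), int(np.argmax(ev))
```

On this symmetric example both metrics agree that the hub is most central; the paper's point is that on richer topologies such rankings diverge, changing which nodes receive control resources.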


[78] 2508.17038

A Rapid Iterative Trajectory Planning Method for Automated Parking through Differential Flatness

As autonomous driving continues to advance, automated parking is becoming increasingly essential. However, significant challenges arise when implementing path velocity decomposition (PVD) trajectory planning for automated parking. The primary challenge is ensuring trajectory planning that is both rapid and precisely collision-free, two objectives that are often in conflict. The secondary challenge involves maintaining sufficient control feasibility of the planned trajectory, particularly at gear shifting points (GSP). This paper proposes a PVD-based rapid iterative trajectory planning (RITP) method to solve the above challenges. The proposed method effectively balances the necessity for time efficiency and precise collision avoidance through a novel collision avoidance framework. Moreover, it enhances the overall control feasibility of the planned trajectory by incorporating the vehicle kinematics model and including terminal smoothing constraints (TSC) at GSP during path planning. Specifically, the proposed method leverages differential flatness to ensure the planned path adheres to the vehicle kinematic model. Additionally, it utilizes TSC to maintain curvature continuity at GSP, thereby enhancing the control feasibility of the overall trajectory. The simulation results demonstrate superior time efficiency and tracking errors compared to model-integrated and other iteration-based trajectory planning methods. In the real-world experiment, the proposed method was implemented and validated on a ROS-based vehicle, demonstrating the applicability of the RITP method for real vehicles.
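For a kinematic car, differential flatness lets curvature be read off the flat outputs $(x(t), y(t))$ via $\kappa = (x'y'' - y'x'')/(x'^2 + y'^2)^{3/2}$, which is the quantity the TSC must keep continuous at gear shifting points. A minimal numerical check of this standard formula on a circular arc (illustrative only, not the paper's planner):

```python
import numpy as np

def curvature(x, y, t):
    """Path curvature from flat outputs (x(t), y(t)):
    kappa = (x' y'' - y' x'') / (x'^2 + y'^2)^(3/2)."""
    dx, dy = np.gradient(x, t), np.gradient(y, t)
    ddx, ddy = np.gradient(dx, t), np.gradient(dy, t)
    return (dx * ddy - dy * ddx) / (dx**2 + dy**2) ** 1.5

# A circular arc of radius 5 m: its curvature should be 1/5 everywhere.
t = np.linspace(0.0, np.pi / 2, 2001)
R = 5.0
x, y = R * np.cos(t), R * np.sin(t)
kappa = curvature(x, y, t)
```

A planner enforcing curvature continuity at a GSP would constrain this value (and typically its derivative) to match across the forward and reverse path segments.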


[79] 2508.17094

PowerChain: Automating Distribution Grid Analysis with Agentic AI Workflows

Due to the rapid pace of electrification and decarbonization, distribution grid (DG) operation and planning are becoming more complex, necessitating advanced computational analyses to ensure grid reliability and resilience. State-of-the-art DG analyses rely on disparate workflows of complex models, functions, and data pipelines, which require expert knowledge and are challenging to automate. Many small-scale utilities and cooperatives lack a large R&D workforce and therefore cannot use advanced analysis at scale. To address this gap, we develop a novel agentic AI system, PowerChain, to solve unseen DG analysis tasks via automated agentic orchestration and large language model (LLM) function-calling. Given a natural language query, PowerChain dynamically generates and executes an ordered sequence of domain-aware functions guided by the semantics of an expert-built power systems function pool and a select reference set of known, expert-generated workflow-query pairs. Our results show that PowerChain can produce expert-level workflows with both GPT-5 and open-source Qwen models on complex, unseen DG analysis tasks operating on real utility data.
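The orchestration idea (an LLM emitting an ordered sequence of calls drawn from an expert-built function pool, which the system then executes) can be caricatured as follows. All function names, the plan, and the toy power-flow numbers are invented for illustration and are not the PowerChain code:

```python
# Hypothetical "function pool" of DG analysis routines; each consumes
# the previous step's output, so an ordered plan forms a workflow.
def load_feeder(name):
    return {"feeder": name, "buses": [1, 2, 3]}

def run_power_flow(net):
    # Fake per-bus voltage drop in place of a real solver.
    return {**net, "voltages": {b: 1.0 - 0.01 * b for b in net["buses"]}}

def min_voltage(result):
    return min(result["voltages"].values())

POOL = {"load_feeder": load_feeder,
        "run_power_flow": run_power_flow,
        "min_voltage": min_voltage}

def execute(plan, query):
    """Run an ordered sequence of pool functions, threading the state."""
    state = query
    for fn in plan:
        state = POOL[fn](state)
    return state

# A plan like an LLM might emit for "what is the lowest voltage on feeder-12?"
low_v = execute(["load_feeder", "run_power_flow", "min_voltage"], "feeder-12")
```

The real system's contribution is generating such plans reliably for unseen queries; this sketch only shows the execution side.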


[80] 2508.17096

Convolutional Neural Networks for Accurate Measurement of Train Speed

In this study, we explore the use of Convolutional Neural Networks for improving train speed estimation accuracy, addressing the complex challenges of modern railway systems. We investigate three CNN architectures - single-branch 2D, single-branch 1D, and multiple-branch models - and compare them with the Adaptive Kalman Filter. We analyse their performance using simulated train operation datasets with and without Wheel Slide Protection activation. Our results reveal that CNN-based approaches, especially the multiple-branch model, demonstrate superior accuracy and robustness compared to traditional methods, particularly under challenging operational conditions. These findings highlight the potential of deep learning techniques to enhance railway safety and operational efficiency by more effectively capturing intricate patterns in complex transportation datasets.


[81] 2508.17124

Towards Deeper Understanding of Natural User Interactions in Virtual Reality Based Assembly Tasks

We explore natural user interactions using a virtual reality simulation of a robot arm for assembly tasks. Using a Wizard-of-Oz study, participants completed collaborative LEGO and instructive PCB assembly tasks, with the robot responding under experimenter control. We collected voice, hand tracking, and gaze data from users. Statistical analyses revealed that instructive and collaborative scenarios elicit distinct behaviors and adopted strategies, particularly as tasks progress. Users tended to use put-that-there language in spatially ambiguous contexts and more descriptive instructions in spatially clear ones. Our contributions include the identification of natural interaction strategies through analyses of collected data, as well as the supporting dataset, to guide the understanding and design of natural multimodal user interfaces for instructive interaction with systems in virtual reality.


[82] 2508.17143

Performance Validation of Coded Wavefront Sensing for Quantitative Phase Imaging of Static and Dynamic Specimens Using Digital Holographic Microscopy

Coded wavefront sensing (Coded-WFS) is a snapshot quantitative phase imaging (QPI) technique that has been shown to successfully leverage the memory effect to retrieve the phase of biological specimens. In this paper, we perform QPI on static silica beads and dynamic HEK cells using Coded-WFS. The accuracy of the retrieved phase map is validated using digital holographic microscopy (DHM) for the same specimens. We report comparisons of simultaneous bright-field intensity and optical path delay.


[83] 2508.17163

Generative AI for Multimedia Communication: Recent Advances, An Information-Theoretic Framework, and Future Opportunities

Recent breakthroughs in generative artificial intelligence (AI) are transforming multimedia communication. This paper systematically reviews key recent advancements across generative AI for multimedia communication, emphasizing transformative models like diffusion and transformers. However, conventional information-theoretic frameworks fail to address semantic fidelity, critical to human perception. We propose an innovative semantic information-theoretic framework, introducing semantic entropy, mutual information, channel capacity, and rate-distortion concepts specifically adapted to multimedia applications. This framework redefines multimedia communication from purely syntactic data transmission to semantic information conveyance. We further highlight future opportunities and critical research directions. We chart a path toward robust, efficient, and semantically meaningful multimedia communication systems by bridging generative AI innovations with information theory. This exploratory paper aims to inspire a semantic-first paradigm shift, offering a fresh perspective with significant implications for future multimedia research.


[84] 2508.17166

Generative Flow Networks for Personalized Multimedia Systems: A Case Study on Short Video Feeds

Multimedia systems underpin modern digital interactions, facilitating seamless integration and optimization of resources across diverse multimedia applications. To meet growing personalization demands, multimedia systems must efficiently manage competing resource needs, adaptive content, and user-specific data handling. This paper introduces Generative Flow Networks (GFlowNets, GFNs) as a new framework for enabling personalized multimedia systems. By integrating multi-candidate generative modeling with flow-based principles, GFlowNets offer a scalable and flexible solution for enhancing user-specific multimedia experiences. To illustrate the effectiveness of GFlowNets, we focus on short video feeds, a multimedia application characterized by high personalization demands and significant resource constraints, as a case study. Our proposed GFlowNet-based personalized feeds algorithm demonstrates superior performance compared to traditional rule-based and reinforcement learning methods across critical metrics, including video quality, resource utilization efficiency, and delivery cost. Moreover, we propose a unified GFlowNet-based framework generalizable to other multimedia systems, highlighting its adaptability and wide-ranging applicability. These findings underscore the potential of GFlowNets to advance personalized multimedia systems by addressing complex optimization challenges and supporting sophisticated multimedia application scenarios.


[85] 2508.17173

Collaborative-Online-Learning-Enabled Distributionally Robust Motion Control for Multi-Robot Systems

This paper develops a novel COllaborative-Online-Learning (COOL)-enabled motion control framework for multi-robot systems to avoid collision amid randomly moving obstacles whose motion distributions are partially observable through decentralized data streams. To address the notable challenge of data acquisition due to occlusion, a COOL approach based on the Dirichlet process mixture model is proposed to efficiently extract motion distribution information by exchanging among robots selected learning structures. By leveraging the fine-grained local-moment information learned through COOL, a data-stream-driven ambiguity set for obstacle motion is constructed. We then introduce a novel ambiguity set propagation method, which theoretically admits the derivation of the ambiguity sets for obstacle positions over the entire prediction horizon by utilizing obstacle current positions and the ambiguity set for obstacle motion. Additionally, we develop a compression scheme with its safety guarantee to automatically adjust the complexity and granularity of the ambiguity set by aggregating basic ambiguity sets that are close in a measure space, thereby striking an attractive trade-off between control performance and computation time. Then the probabilistic collision-free trajectories are generated through distributionally robust optimization problems. The distributionally robust obstacle avoidance constraints based on the compressed ambiguity set are equivalently reformulated by deriving separating hyperplanes through tractable semi-definite programming. Finally, we establish the probabilistic collision avoidance guarantee and the long-term tracking performance guarantee for the proposed framework. The numerical simulations are used to demonstrate the efficacy and superiority of the proposed approach compared with state-of-the-art methods.


[86] 2508.17185

Linear Dynamics meets Linear MDPs: Closed-Form Optimal Policies via Reinforcement Learning

Many applications -- including power systems, robotics, and economics -- involve a dynamical system interacting with a stochastic and hard-to-model environment. We adopt a reinforcement learning approach to control such systems. Specifically, we consider a deterministic, discrete-time, linear, time-invariant dynamical system coupled with a feature-based linear Markov process with an unknown transition kernel. The objective is to learn a control policy that optimizes a quadratic cost over the system state, the Markov process, and the control input. Leveraging both components of the system, we derive an explicit parametric form for the optimal state-action value function and the corresponding optimal policy. Our model is distinct in combining aspects of both classical Linear Quadratic Regulator (LQR) and linear Markov decision process (MDP) frameworks. This combination retains the implementation simplicity of LQR, while allowing for sophisticated stochastic modeling afforded by linear MDPs, without estimating the transition probabilities, thereby enabling direct policy improvement. We use tools from control theory to provide theoretical guarantees on the stability of the system under the learned policy and provide a sample complexity analysis for its convergence to the optimal policy. We illustrate our results via a numerical example that demonstrates the effectiveness of our approach in learning the optimal control policy under partially known stochastic dynamics.
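The LQR side of the model admits the familiar Riccati machinery. As a reference point, here is a minimal fixed-point solver for the discrete-time algebraic Riccati equation and the resulting feedback gain (a generic sketch of classical LQR, not the paper's closed-form policy, which additionally handles the linear-MDP term):

```python
import numpy as np

def dare_iterate(A, B, Q, R, iters=500):
    """Solve the discrete algebraic Riccati equation by fixed-point
    iteration; return the optimal gain K (u = -K x) and cost matrix P."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K, P

# Double-integrator dynamics with unit state and input costs.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
K, P = dare_iterate(A, B, Q, R)
rho = max(abs(np.linalg.eigvals(A - B @ K)))  # closed-loop spectral radius
```

Stability of the learned policy in the paper corresponds to the same kind of guarantee checked here: the closed-loop spectral radius falls strictly below one.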


[87] 2508.17205

Multi-Agent Visual-Language Reasoning for Comprehensive Highway Scene Understanding

This paper introduces a multi-agent framework for comprehensive highway scene understanding, designed around a mixture-of-experts strategy. In this framework, a large generic vision-language model (VLM), such as GPT-4o, is contextualized with domain knowledge to generate task-specific chain-of-thought (CoT) prompts. These fine-grained prompts are then used to guide a smaller, efficient VLM (e.g., Qwen2.5-VL-7B) in reasoning over short videos, along with complementary modalities as applicable. The framework simultaneously addresses multiple critical perception tasks, including weather classification, pavement wetness assessment, and traffic congestion detection, achieving robust multi-task reasoning while balancing accuracy and computational efficiency. To support empirical validation, we curated three specialized datasets aligned with these tasks. Notably, the pavement wetness dataset is multimodal, combining video streams with road weather sensor data, highlighting the benefits of multimodal reasoning. Experimental results demonstrate consistently strong performance across diverse traffic and environmental conditions. From a deployment perspective, the framework can be readily integrated with existing traffic camera systems and strategically applied to high-risk rural locations, such as sharp curves, flood-prone lowlands, or icy bridges. By continuously monitoring the targeted sites, the system enhances situational awareness and delivers timely alerts, even in resource-constrained environments.


[88] 2508.17210

Blind Deconvolution of Nonstationary Graph Signals over Shift-Invariant Channels

In this paper, we investigate blind deconvolution of nonstationary graph signals from noisy observations, transmitted through an unknown shift-invariant channel. The deconvolution process assumes that the observer has access to the covariance structure of the original graph signals. To evaluate the effectiveness of our channel estimation and blind deconvolution method, we conduct numerical experiments using a temperature dataset in the Brest region of France.


[89] 2508.17229

Multi-Metric Preference Alignment for Generative Speech Restoration

Recent generative models have significantly advanced speech restoration tasks, yet their training objectives often misalign with human perceptual preferences, resulting in suboptimal quality. While post-training alignment has proven effective in other generative domains like text and image generation, its application to generative speech restoration remains largely under-explored. This work investigates the challenges of applying preference-based post-training to this task, focusing on how to define a robust preference signal and curate high-quality data to avoid reward hacking. To address these challenges, we propose a multi-metric preference alignment strategy. We construct a new dataset, GenSR-Pref, comprising 80K preference pairs, where each chosen sample is unanimously favored by a complementary suite of metrics covering perceptual quality, signal fidelity, content consistency, and timbre preservation. This principled approach ensures a holistic preference signal. Applying Direct Preference Optimization (DPO) with our dataset, we observe consistent and significant performance gains across three diverse generative paradigms: autoregressive models (AR), masked generative models (MGM), and flow-matching models (FM) on various restoration benchmarks, in both objective and subjective evaluations. Ablation studies confirm the superiority of our multi-metric strategy over single-metric approaches in mitigating reward hacking. Furthermore, we demonstrate that our aligned models can serve as powerful "data annotators", generating high-quality pseudo-labels to serve as a supervision signal for traditional discriminative models in data-scarce scenarios like singing voice restoration. Demo page: this https URL
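The "unanimously favored" curation rule can be sketched directly: a candidate pair is kept only when every metric agrees on the winner. The metric names and scores below are fabricated placeholders standing in for the paper's actual suite (perceptual quality, signal fidelity, content consistency, timbre preservation):

```python
def unanimous_pairs(samples, metrics):
    """Return (chosen, rejected) id pairs on which all metrics agree;
    pairs with any metric disagreement are discarded."""
    pairs = []
    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            votes = [m(a) > m(b) for m in metrics]  # True: a beats b on this metric
            if all(votes):
                pairs.append((a["id"], b["id"]))
            elif not any(votes):
                pairs.append((b["id"], a["id"]))
            # mixed votes -> dropped, to avoid rewarding a single metric
    return pairs

# Hypothetical metrics (higher is better for all three).
metrics = [lambda s: s["pesq"], lambda s: s["sim"], lambda s: s["dnsmos"]]
samples = [
    {"id": "a", "pesq": 3.1, "sim": 0.90, "dnsmos": 3.5},
    {"id": "b", "pesq": 2.0, "sim": 0.70, "dnsmos": 3.0},
    {"id": "c", "pesq": 2.5, "sim": 0.95, "dnsmos": 2.8},  # metrics disagree with "a"
]
pairs = unanimous_pairs(samples, metrics)
```

Only the (a, b) pair survives; the disagreement involving "c" is exactly the kind of ambiguous signal the multi-metric filter is meant to exclude before DPO.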


[90] 2508.17397

Enhancing Underwater Images via Deep Learning: A Comparative Study of VGG19 and ResNet50-Based Approaches

This paper addresses the challenging problem of image enhancement in complex underwater scenes by proposing a solution based on deep learning. The proposed method skillfully integrates two deep convolutional neural network models, VGG19 and ResNet50, leveraging their powerful feature extraction capabilities to perform multi-scale and multi-level deep feature analysis of underwater images. By constructing a unified model, the complementary advantages of the two models are effectively integrated, achieving a more comprehensive and accurate image enhancement effect. To objectively evaluate the enhancement effect, this paper introduces image quality assessment metrics such as PSNR, UCIQE, and UIQM to quantitatively compare images before and after enhancement and deeply analyzes the performance of different models in different scenarios. Finally, to improve the practicality and stability of the underwater visual enhancement system, this paper also provides practical suggestions from aspects such as model optimization, multi-model fusion, and hardware selection, aiming to provide strong technical support for visual enhancement tasks in complex underwater environments.


[91] 2508.17466

Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation

Quadruped robots have emerged as highly efficient and versatile platforms, excelling in navigating complex and unstructured terrains where traditional wheeled robots might fail. Equipping these robots with manipulator arms unlocks the advanced capability of loco-manipulation to perform complex physical interaction tasks in areas ranging from industrial automation to search-and-rescue missions. However, achieving precise and adaptable grasping in such dynamic scenarios remains a significant challenge, often hindered by the need for extensive real-world calibration and pre-programmed grasp configurations. This paper introduces a deep learning framework designed to enhance the grasping capabilities of quadrupeds equipped with arms, focusing on improved precision and adaptability. Our approach centers on a sim-to-real methodology that minimizes reliance on physical data collection. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. By simulating thousands of interactions from various perspectives, we created pixel-wise annotated grasp-quality maps to serve as the ground truth for our model. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras, including RGB images, depth maps, segmentation masks, and surface normal maps. The trained model outputs a grasp-quality heatmap to identify the optimal grasp point. We validated the complete framework on a four-legged robot. The system successfully executed a full loco-manipulation task: autonomously navigating to a target object, perceiving it with its sensors, predicting the optimal grasp pose using our model, and performing a precise grasp. This work shows that leveraging simulated training with advanced sensing offers a scalable and effective solution for object handling.


[92] 2508.17471

Distributed Implementation of Variational Quantum Eigensolver to Solve QUBO Problems

We present a distributed algorithm and implementation of the variational quantum eigensolver (VQE), termed distributed VQE (DVQE). DVQE, provided as an open-source Python package, enables the execution of parameterized quantum circuits across multiple logical quantum processing units (QPUs) in a distributed fashion. This approach addresses key hardware limitations of near-term quantum devices, including restricted qubit counts and limited circuit depth. Distributed ansatz circuits are constructed to preserve the quantum state fidelity of their monolithic counterparts, allowing consistent energy estimation while distributing the computational load. To improve the convergence and robustness of the optimization loop for identifying the variational parameters of the DVQE ansatz circuit, we use the ADAM optimizer in combination with metaheuristic initialization strategies, which outperform random initialization across various test cases. The complete DVQE pipeline is implemented in a modular Python package that accepts QUBO problems as input and supports monolithic and distributed execution modes. The framework leverages Qiskit to construct and simulate distributed circuits, and includes an internal greedy algorithm for automatic qubit allocation across multiple QPUs. Simulation results on QUBO benchmarks confirm the correctness of the approach, paving the way for real QPU deployment and further exploration of distributed quantum optimization. The simulator is publicly available on GitHub (this https URL) as a package named raiselab, with a collection of tutorial examples.
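Since DVQE takes QUBO problems as input, a tiny brute-force minimizer is a handy correctness baseline for small instances. This is a generic sketch (the matrix below is an arbitrary toy, not taken from the paper or the raiselab package):

```python
import itertools
import numpy as np

def qubo_energy(Q, x):
    """QUBO objective E(x) = x^T Q x for a binary vector x."""
    return float(x @ Q @ x)

def brute_force_qubo(Q):
    """Exhaustively minimize a small QUBO; this is the exact answer a
    variational loop like DVQE's is meant to approximate."""
    n = Q.shape[0]
    best = min((np.array(bits) for bits in itertools.product([0, 1], repeat=n)),
               key=lambda x: qubo_energy(Q, x))
    return best, qubo_energy(Q, best)

# Toy 3-variable QUBO in upper-triangular form.
Q = np.array([[-1.0, 2.0, 0.0],
              [ 0.0, -1.0, 2.0],
              [ 0.0, 0.0, -1.0]])
x_opt, e_opt = brute_force_qubo(Q)
```

Here the optimum is x = (1, 0, 1) with energy -2: the coupling terms penalize adjacent ones, so the minimizer picks the two non-interacting variables.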


[93] 2508.17480

Random-phase Gaussian Wave Splatting for Computer-generated Holography

Holographic near-eye displays offer ultra-compact form factors for virtual and augmented reality systems, but rely on advanced computer-generated holography (CGH) algorithms to convert 3D scenes into interference patterns that can be displayed on spatial light modulators (SLMs). Gaussian Wave Splatting (GWS) has recently emerged as a powerful CGH paradigm that allows for the conversion of Gaussians, a state-of-the-art neural 3D representation, into holograms. However, GWS assumes smooth-phase distributions over the Gaussian primitives, limiting their ability to model view-dependent effects and reconstruct accurate defocus blur, and severely under-utilizing the space-bandwidth product of the SLM. In this work, we propose random-phase GWS (GWS-RP) to improve bandwidth utilization, which has the effect of increasing eyebox size, reconstructing accurate defocus blur and parallax, and supporting time-multiplexed rendering to suppress speckle artifacts. At the core of GWS-RP are (1) a fundamentally new wavefront compositing procedure and (2) an alpha-blending scheme specifically designed for random-phase Gaussian primitives, ensuring physically correct color reconstruction and robust occlusion handling. Additionally, we present the first formally derived algorithm for applying random phase to Gaussian primitives, grounded in rigorous statistical optics analysis and validated through practical near-eye display applications. Through extensive simulations and experimental validations, we demonstrate that these advancements, together with time-multiplexing, uniquely enable full-bandwidth light field CGH that supports accurate parallax and defocus, yielding state-of-the-art image quality and perceptually faithful 3D holograms for next-generation near-eye displays.


[94] 2508.17503

First and Second Order Optimal $\mathcal{H}_2$ Model Reduction for Linear Continuous-Time Systems

In this paper, we investigate the optimal $\mathcal{H}_2$ model reduction problem for single-input single-output (SISO) continuous-time linear time-invariant (LTI) systems. A semi-definite relaxation (SDR) approach is proposed to determine globally optimal interpolation points, providing an effective way to compute the reduced-order models via Krylov projection-based methods. In contrast to iterative approaches, we use the controllability Gramian and the moment-matching conditions to recast the model reduction problem as a convex optimization by introducing an upper bound $\gamma$ to minimize the $\mathcal{H}_2$ norm of the model reduction error system. We also prove that the relaxation is exact for first order reduced models and demonstrate, through examples, that it is exact for second order reduced models. We compare the performance of our proposed method with other iterative approaches and shift-selection methods on examples. Importantly, our approach also provides a means to verify the global optimality of known locally convergent methods.


[95] 2508.17623

EMO-Reasoning: Benchmarking Emotional Reasoning Capabilities in Spoken Dialogue Systems

Speech emotions play a crucial role in human-computer interaction, shaping engagement and context-aware communication. Despite recent advances in spoken dialogue systems, a holistic system for evaluating emotional reasoning is still lacking. To address this, we introduce EMO-Reasoning, a benchmark for assessing emotional coherence in dialogue systems. It leverages a curated dataset generated via text-to-speech to simulate diverse emotional states, overcoming the scarcity of emotional speech data. We further propose the Cross-turn Emotion Reasoning Score to assess the emotion transitions in multi-turn dialogues. Evaluating seven dialogue systems through continuous, categorical, and perceptual metrics, we show that our framework effectively detects emotional inconsistencies, providing insights for improving current dialogue systems. By releasing a systematic evaluation benchmark, we aim to advance emotion-aware spoken dialogue modeling toward more natural and adaptive interactions.


[96] 2508.17756

SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling

Diffusion models have recently achieved remarkable success in generative tasks (e.g., image and video generation), and the demand for high-quality content (e.g., 2K/4K videos) is rapidly increasing across various domains. However, generating ultra-high-resolution videos on existing standard-resolution (e.g., 720p) platforms remains challenging due to the excessive re-training requirements and prohibitively high computational and memory costs. To this end, we introduce SuperGen, an efficient tile-based framework for ultra-high-resolution video generation. SuperGen features a novel training-free algorithmic innovation with tiling to successfully support a wide range of resolutions without additional training efforts while significantly reducing both memory footprint and computational complexity. Moreover, SuperGen incorporates a tile-tailored, adaptive, region-aware caching strategy that accelerates video generation by exploiting redundancy across denoising steps and spatial regions. SuperGen also integrates cache-guided, communication-minimized tile parallelism for enhanced throughput and minimized latency. Evaluations demonstrate that SuperGen harvests the maximum performance gains while achieving high output quality across various benchmarks.


[97] 2508.17796

Zero-shot Context Biasing with Trie-based Decoding using Synthetic Multi-Pronunciation

Contextual automatic speech recognition (ASR) systems allow for recognizing out-of-vocabulary (OOV) words, such as named entities or rare words. However, it remains challenging due to limited training data and ambiguous or inconsistent pronunciations. In this paper, we propose a synthesis-driven multi-pronunciation contextual biasing method that performs zero-shot contextual ASR on a pretrained Whisper model. Specifically, we leverage text-to-speech (TTS) systems to synthesize diverse speech samples containing each target rare word, and then use the pretrained Whisper model to extract multiple predicted pronunciation variants. These variant token sequences are compiled into a prefix-trie, which assigns rewards to beam hypotheses in a shallow-fusion manner during beam-search decoding. Afterwards, any recognized variant is mapped back to the original rare word in the final transcription. The evaluation results on the LibriSpeech dataset show that our method reduces biased word error rate (WER) by 42% on test-clean and 43% on test-other while maintaining unbiased WER essentially unchanged.
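The prefix-trie reward mechanism can be sketched as follows. The tokens and the rare word below are invented placeholders; the real system scores subword hypotheses inside Whisper's beam search rather than whole-word toys:

```python
class Trie:
    """Minimal prefix-trie for shallow-fusion biasing: variant token
    sequences for a rare word are inserted, and a hypothesis earns a
    reward whenever its latest token extends a path in the trie."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens, word):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})
        node["$"] = word  # terminal marker: maps the variant back to the rare word

    def score_step(self, prefix, token, reward=1.0):
        """Return the reward if prefix + [token] is still a trie path."""
        node = self.root
        for t in list(prefix) + [token]:
            if t not in node:
                return 0.0
            node = node[t]
        return reward

trie = Trie()
# Two hypothetical pronunciation variants of the same rare word.
trie.insert(["kau", "ski"], "Kowalski")
trie.insert(["ko", "val", "ski"], "Kowalski")
r1 = trie.score_step(["ko"], "val")  # extends a stored variant -> rewarded
r2 = trie.score_step(["ko"], "zzz")  # falls off the trie -> no reward
```

When decoding reaches a terminal node, the matched variant is rewritten to the canonical rare word, which is the final mapping step the abstract describes.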


[98] 2508.17820

In-Memory Computing Enabled Deep MIMO Detection to Support Ultra-Low-Latency Communications

The development of sixth-generation (6G) mobile networks imposes unprecedented latency and reliability demands on multiple-input multiple-output (MIMO) communication systems, a key enabler of high-speed radio access. Recently, deep unfolding-based detectors, which map iterative algorithms onto neural network architectures, have emerged as a promising approach, combining the strengths of model-driven and data-driven methods to achieve high detection accuracy with relatively low complexity. However, algorithmic innovation alone is insufficient; software-hardware co-design is essential to meet the extreme latency requirements of 6G (i.e., 0.1 milliseconds). This motivates us to propose leveraging in-memory computing, an analog computing technology that integrates memory and computation within memristor circuits, to perform the intensive matrix-vector multiplication (MVM) operations inherent in deep MIMO detection at the nanosecond scale. Specifically, we introduce a novel architecture, called the deep in-memory MIMO (IM-MIMO) detector, characterized by two key features. First, each of its cascaded computational blocks is decomposed into channel-dependent and channel-independent neural network modules. Such a design minimizes memristor reprogramming in response to channel variations, whose latency significantly exceeds the computation time. Second, we develop a customized detector-training method that exploits prior knowledge of memristor-value statistics to enhance robustness against programming noise. Furthermore, we conduct a comprehensive analysis of the IM-MIMO detector's performance, evaluating detection accuracy, processing latency, and hardware complexity. Our study quantifies detection error as a function of various factors, including channel noise, memristor programming noise, and neural network size.


[99] 2508.17868

FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation

A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves competitive VC performance compared to FastVoiceGrad, while running 6.6-6.9 times faster on a GPU and 1.8 times faster on a CPU.


[100] 2508.17874

Vocoder-Projected Feature Discriminator

In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets owing to their compactness and ease of learning. However, because the ultimate goal is to generate high-quality waveforms, employing a vocoder to convert these features into waveforms and applying adversarial training in the time domain is reasonable. Nevertheless, upsampling the waveform introduces significant time and memory overheads. To address this issue, we propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor with a single upsampling step is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators while reducing the training time and memory consumption by 9.6 and 11.4 times, respectively.


[101] 2508.17882

modelSolver: A Symbolic Model-Driven Solver for Power Network Simulation and Monitoring

The development of advanced software tools for power system analysis requires extensive programming expertise. Even when using open-source tools, programming skills are essential to modify built-in models. This can be particularly challenging for domain experts who lack coding proficiency. This paper introduces modelSolver, a software solution with a new framework centered around symbolic mathematical modeling. The proposed paradigm facilitates defining models through intuitive mathematical expressions, thus eliminating the need for traditional programming constructs such as arrays, loops, and sparse matrix computations. modelSolver focuses on power flow and state estimation using an open-box approach, which allows users to specify custom models using either real or complex variables. Unlike existing tools that rely on hard-coded models, modelSolver enables the representation of a wide range of advanced functionalities, including power flow with voltage regulators and load tap changers, continuation power flow, and Gauss-Newton state estimation with equality constraints. Compatibility with MATPOWER is ensured via a converter that automates importing data files. The framework prioritizes model-driven development and empowers domain experts to focus on power system modeling without programming barriers. It aims to simplify power system computations, making them more accessible to students, scientists, and practitioners.
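The symbolic, model-driven paradigm can be sketched with a toy example (this is not modelSolver's API; the two-bus system, admittance, and load values are invented). The power-flow mismatch equations are written as plain symbolic expressions, and the Jacobian and Newton iteration are derived automatically, with no hand-coded arrays or sparse-matrix bookkeeping:

```python
import numpy as np
import sympy as sp

# Symbolic two-bus power-flow model: slack bus (V1 = 1.0 p.u., th1 = 0)
# feeding a PQ bus over a line with series admittance y = g + jb.
V2, th2 = sp.symbols("V2 th2", real=True)
g, b = 1.0, -10.0            # illustrative line admittance (p.u.)
P_load, Q_load = 0.5, 0.2    # illustrative PQ-bus load (p.u.)

# Injections at bus 2 in standard polar form (Ybus of a single line).
P2 = V2 * (-g * sp.cos(th2) - b * sp.sin(th2)) + g * V2**2
Q2 = V2 * (-g * sp.sin(th2) + b * sp.cos(th2)) - b * V2**2

# Mismatch equations and an automatically derived Jacobian.
F = sp.Matrix([P2 + P_load, Q2 + Q_load])
J = F.jacobian([V2, th2])
f = sp.lambdify((V2, th2), F, "numpy")
Jf = sp.lambdify((V2, th2), J, "numpy")

x = np.array([1.0, 0.0])  # flat start
for _ in range(20):       # Newton-Raphson iterations
    Fx = np.asarray(f(*x), dtype=float).ravel()
    if np.linalg.norm(Fx) < 1e-10:
        break
    x = x - np.linalg.solve(np.asarray(Jf(*x), dtype=float), Fx)
print(f"V2 = {x[0]:.4f} p.u., th2 = {x[1]:.4f} rad")
```

Adding a custom device model here means adding another symbolic equation; the Jacobian and solver update automatically, which is the essence of the open-box approach the abstract describes.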


[102] 2508.17976

Propose and Rectify: A Forensics-Driven MLLM Framework for Image Manipulation Localization

The increasing sophistication of image manipulation techniques demands robust forensic solutions that can both reliably detect alterations and precisely localize tampered regions. Recent Multimodal Large Language Models (MLLMs) show promise by leveraging world knowledge and semantic understanding for context-aware detection, yet they struggle with perceiving subtle, low-level forensic artifacts crucial for accurate manipulation localization. This paper presents a novel Propose-Rectify framework that effectively bridges semantic reasoning with forensic-specific analysis. In the proposal stage, our approach utilizes a forensic-adapted LLaVA model to generate initial manipulation analysis and preliminary localization of suspicious regions based on semantic understanding and contextual reasoning. In the rectification stage, we introduce a Forensics Rectification Module that systematically validates and refines these initial proposals through multi-scale forensic feature analysis, integrating technical evidence from several specialized filters. Additionally, we present an Enhanced Segmentation Module that incorporates critical forensic cues into SAM's encoded image embeddings, thereby overcoming inherent semantic biases to achieve precise delineation of manipulated regions. By synergistically combining advanced multimodal reasoning with established forensic methodologies, our framework ensures that initial semantic proposals are systematically validated and enhanced through concrete technical evidence, resulting in comprehensive detection accuracy and localization precision. Extensive experimental validation demonstrates state-of-the-art performance across diverse datasets with exceptional robustness and generalization capabilities.


[103] 2508.18025

AQ-PCDSys: An Adaptive Quantized Planetary Crater Detection System for Autonomous Space Exploration

Autonomous planetary exploration missions are critically dependent on real-time, accurate environmental perception for navigation and hazard avoidance. However, deploying deep learning models on the resource-constrained computational hardware of planetary exploration platforms remains a significant challenge. This paper introduces the Adaptive Quantized Planetary Crater Detection System (AQ-PCDSys), a novel framework specifically engineered for real-time, onboard deployment in the computationally constrained environments of space exploration missions. AQ-PCDSys synergistically integrates a Quantized Neural Network (QNN) architecture, trained using Quantization-Aware Training (QAT), with an Adaptive Multi-Sensor Fusion (AMF) module. The QNN architecture significantly reduces model size and inference latency, making it suitable for real-time onboard deployment while preserving high accuracy. The AMF module intelligently fuses data from Optical Imagery (OI) and Digital Elevation Models (DEMs) at the feature level, utilizing an Adaptive Weighting Mechanism (AWM) to dynamically prioritize the most relevant and reliable sensor modality based on planetary ambient conditions. This approach enhances detection robustness across diverse planetary landscapes. Paired with Multi-Scale Detection Heads specifically designed for robust and efficient detection of craters across a wide range of sizes, AQ-PCDSys provides a computationally efficient, reliable and accurate solution for planetary crater detection, a critical capability for enabling the next generation of autonomous planetary landing, navigation, and scientific exploration.


[104] 2508.18039

Modeling and Control Framework for Autonomous Space Manipulator Handover Operations

Autonomous space robotics is poised to play a vital role in future space missions, particularly for In-space Servicing, Assembly, and Manufacturing (ISAM). A key capability in such missions is the Robot-to-Robot (R2R) handover of mission-critical objects. This work presents a dynamic model of a dual-arm space manipulator system and compares various tracking control laws. The key contributions of this work are the development of a cooperative manipulator dynamic model and the comparative analysis of control laws to support autonomous R2R handovers in ISAM scenarios.


[105] 2508.18045

Riemannian Change Point Detection on Manifolds with Robust Centroid Estimation

Non-parametric change-point detection in streaming time series data is a long-standing challenge in signal processing. Recent advancements in statistics and machine learning have increasingly addressed this problem for data residing on Riemannian manifolds. One prominent strategy involves monitoring abrupt changes in the center of mass of the time series. Implemented in a streaming fashion, this strategy, however, requires careful step size tuning when computing the updates of the center of mass. In this paper, we propose to leverage robust centroid estimation on manifolds, drawing on M-estimation theory, to address this issue. Our proposal consists of comparing two centroid estimates: the classical Karcher mean (sensitive to change) versus one defined from Huber's function (robust to change). This comparison leads to the definition of a test statistic whose performance is less sensitive to the underlying estimation method. We propose a stochastic Riemannian optimization algorithm to estimate both robust centroids efficiently. Experiments conducted on both simulated and real-world data across two representative manifolds demonstrate the superior performance of our proposed method.
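The Karcher-versus-Huber comparison can be sketched in the simplest setting, a Euclidean space, where the Karcher mean reduces to the arithmetic mean (the paper works on general Riemannian manifolds with a stochastic optimizer; the data, window sizes, and Huber threshold below are invented):

```python
import numpy as np

def huber_centroid(x, delta=1.0, iters=50):
    """Huber M-estimate of location via iteratively reweighted averaging.
    Points within `delta` of the centroid get weight 1; farther points are
    down-weighted by delta/r, limiting the influence of post-change samples."""
    m = np.median(x, axis=0)
    for _ in range(iters):
        r = np.linalg.norm(x - m, axis=-1)
        w = np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))
        m = (w[:, None] * x).sum(0) / w.sum()
    return m

rng = np.random.default_rng(0)
pre = rng.normal(0.0, 0.5, size=(200, 2))
post = rng.normal(3.0, 0.5, size=(40, 2))   # abrupt change in the center of mass
window = np.concatenate([pre, post])

karcher = window.mean(axis=0)               # sensitive: dragged by the change
robust = huber_centroid(window, delta=0.5)  # stays near the pre-change center
stat = np.linalg.norm(karcher - robust)     # test statistic: large => change
print(f"test statistic = {stat:.3f}")
```

Before a change the two centroids agree and the statistic stays near zero; after a change the classical mean drifts while the Huber estimate holds, so their distance spikes.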


[106] 2508.18096

Realizing Reduced and Sparse Biochemical Reaction Networks from Dynamics

We propose a direct optimization framework for learning reduced and sparse chemical reaction networks (CRNs) from time-series trajectory data. In contrast to widely used indirect methods, such as those based on sparse identification of nonlinear dynamics (SINDy), which infer reaction dynamics by fitting numerically estimated derivatives, our approach fits entire trajectories by solving a dynamically constrained optimization problem. This formulation enables the construction of reduced CRNs that are both low-dimensional and sparse, while preserving key dynamical behaviors of the original system. We develop an accelerated proximal gradient algorithm to efficiently solve the resulting non-convex optimization problem. Through illustrative examples, including a Drosophila circadian oscillator and a glycolytic oscillator, we demonstrate the ability of our method to recover accurate and interpretable reduced-order CRNs. Notably, the direct approach avoids the derivative estimation step and mitigates error accumulation issues inherent in indirect methods, making it a robust alternative for data-driven CRN realizations.


[107] 2508.18136

BirdRecorder's AI on Sky: Safeguarding birds of prey by detection and classification of tiny objects around wind turbines

The urgent need for renewable energy expansion, particularly wind power, is hindered by conflicts with wildlife conservation. To address this, we developed BirdRecorder, an advanced AI-based anti-collision system to protect endangered birds, especially the red kite (Milvus milvus). Integrating robotics, telemetry, and high-performance AI algorithms, BirdRecorder aims to detect, track, and classify avian species within a range of 800 m to minimize bird-turbine collisions. BirdRecorder integrates advanced AI methods with optimized hardware and software architectures to enable real-time image processing. Leveraging a Single Shot Detector (SSD) for detection, combined with specialized hardware acceleration and tracking algorithms, our system achieves high detection precision while maintaining the speed necessary for real-time decision-making. By combining these components, BirdRecorder outperforms existing approaches in both accuracy and efficiency. In this paper, we summarize field-test results and the performance of the BirdRecorder system. By bridging the gap between renewable energy expansion and wildlife conservation, BirdRecorder contributes to a more sustainable coexistence of technology and nature.


[108] 2404.13484

Joint Quality Assessment and Example-Guided Image Processing by Disentangling Picture Appearance from Content

The deep learning revolution has strongly impacted low-level image processing tasks such as style/domain transfer, enhancement/restoration, and visual quality assessment. Despite often being treated separately, the aforementioned tasks share a common theme of understanding, editing, or enhancing the appearance of input images without modifying the underlying content. We leverage this observation to develop a novel disentangled representation learning method that decomposes inputs into content and appearance features. The model is trained in a self-supervised manner and we use the learned features to develop a new quality prediction model named DisQUE. We demonstrate through extensive evaluations that DisQUE achieves state-of-the-art accuracy across quality prediction tasks and distortion types. Moreover, we demonstrate that the same features may also be used for image processing tasks such as HDR tone mapping, where the desired output characteristics may be tuned using example input-output pairs.


[109] 2405.01197

Optimal Beamforming for Bistatic MIMO Sensing

This paper considers the beamforming optimization for sensing a point-like scatterer using a bistatic multiple-input multiple-output (MIMO) orthogonal frequency-division multiplexing (OFDM) radar, which could be part of a joint communication and sensing system. The goal is to minimize the Cramér-Rao bound on the target position's estimation error, where the radar already knows an approximate position that is taken into account in the optimization. The optimization considers multiple subcarriers, and permits beamforming with more than one beam per subcarrier. We discuss the properties of optimal beamforming solutions, including the case of a known channel gain. Numerical results show that beamforming with at most one beam per subcarrier is optimal for certain parameters, but for other parameters, optimal solutions need two beams on some subcarriers. In addition, we consider the additional degree of freedom in a bidirectional radar of choosing which end of the bistatic pair transmits and which receives.


[110] 2406.05914

Soundscape Captioning using Sound Affective Quality Network and Large Language Model

We live in a rich and varied acoustic world, which is experienced by individuals or communities as a soundscape. Computational auditory scene analysis, disentangling acoustic scenes by detecting and classifying events, focuses on objective attributes of sounds, such as their category and temporal characteristics, ignoring their effects on people, such as the emotions they evoke within a context. To fill this gap, we propose the affective soundscape captioning (ASSC) task, which enables automated soundscape analysis, thus avoiding labour-intensive subjective ratings and surveys in conventional methods. With soundscape captioning, context-aware descriptions are generated for a soundscape by capturing the acoustic scenes (ASs), audio events (AEs) information, and the corresponding human affective qualities (AQs). To this end, we propose an automatic soundscape captioner (SoundSCaper) system composed of an acoustic model, i.e. SoundAQnet, and a large language model (LLM). SoundAQnet simultaneously models multi-scale information about ASs, AEs, and perceived AQs, while the LLM describes the soundscape with captions by parsing the information captured with SoundAQnet. SoundSCaper is assessed by two juries of 32 people. In expert evaluation, the average score of SoundSCaper-generated captions is slightly lower than that of two soundscape experts on the evaluation set D1 and the external mixed dataset D2, but the difference is not statistically significant. In layperson evaluation, SoundSCaper outperforms soundscape experts in several metrics. In addition to human evaluation, compared to other automated audio captioning systems with and without LLM, SoundSCaper performs better on the ASSC task in several NLP-based metrics. Overall, SoundSCaper performs well in human subjective evaluation and various objective captioning metrics, and the generated captions are comparable to those annotated by soundscape experts.


[111] 2407.03911

Geometry-Aware Edge-State Tracking for Resilient Affine Formation Control

Affine formation control (AFC) is a subset of formation control methods that enables coordinated multiagent movement while preserving affine relationships, and has recently gained popularity due to its broad applicability across diverse domains. AFC is inherently distributed, where each agent's local controller relies on the relative displacements of neighboring agents. The unavailability of these measurements in practice, due to node or communication failures, leads to a change in the underlying graph topology and subsequently causes instability or sub-optimal performance. In this work, each edge in the graph is modeled using a state-space framework, allowing the corresponding edge-states to be estimated with or without up-to-date measurements. We then propose a Kalman-based estimation framework where we fuse both temporal information from agents' dynamics and spatial information, which is derived from the geometry of the affine formations. We provide convergence guarantees and an optimality analysis for the proposed algorithm, and numerical validations show the enhanced resilience of AFC against these topology changes in several practical scenarios.


[112] 2409.11823

Robust Sensor-Limited Control with Safe Input-Output Constraints for Hydraulic In-Wheel Motor Drive Mobility Systems

In-wheel drive (IWD) systems enhance the responsiveness, traction, and maintenance efficiency of vehicles by enabling each wheel to operate independently. This paper proposes a novel robust torque-observed valve-based control (RTOVC) framework to address velocity tracking in hydraulic IWDs that actuate heavy-duty wheeled mobile robots (HWMRs), considering such challenges as wheel slippages, sensor limitations, rough terrains, and modeling uncertainties. To avoid dependence on closed-loop torque/pressure sensors in hydraulic IWD-actuated HWMRs, a robust observer network based on an adaptive barrier Lyapunov function (BLF) is proposed to estimate the required in-wheel motor torque to track the velocity references. Then, another adaptive BLF for valve control signals is employed to modulate the hydraulic fluid to generate the estimated torque for each IWD. The RTOVC strategy ensures user-defined safety within the logarithmic BLF framework by constraining the valve control signal, actual velocity, velocity tracking error, and torque of each hydraulic IWD in an HWMR to avoid exceeding specified limits. Despite safety constraints, external disturbances, and modeling uncertainties, robustness and uniformly exponential stability of the RTOVC-applied hydraulic IWD mechanism are ensured in HWMRs. Experimental investigations using a 6,500-kg HWMR, actuated by four independent IWDs under intense disturbances and safety-defined constraints, validate the performance of the RTOVC.


[113] 2409.11828

Model-Free Generic Robust Control for Servo-Driven Actuation Mechanisms with Layered Insight into Energy Conversions

To advance theoretical solutions and address limitations in modeling complex servo-driven actuation systems experiencing high non-linearity and load disturbances, this paper aims to design a practical model-free generic robust control (GRC) framework for these mechanisms. This framework is intended to be applicable across all actuator systems encompassing electrical, hydraulic, or pneumatic servomechanisms, while also functioning within complex interactions among dynamic components and adhering to control input constraints. In this respect, the state-space model of actuator systems is decomposed into smaller subsystems that incorporate the first-principle equation of actuator motion dynamics and interactive energy conversion equations. This decomposition operates under the assumption that the comprehensive model of the servo-driven actuator system and energy conversion, uncertainties, load disturbances, and their bounds are unknown. Then, the GRC employs subsystem-based adaptive control strategies for each state-variant subsystem separately. Despite control input constraints and the unknown interactive system model, the GRC-applied actuator mechanism ensures uniform exponential stability and robustness in tracking desired motions. The framework is straightforward to implement and is experimentally evaluated on two industrial applications.


[114] 2409.14031

Maximum-Likelihood Estimation Based on Diffusion Model For Wireless Communications

Generative Artificial Intelligence (GenAI) models, with their powerful feature learning capabilities, have been applied in many fields. In mobile wireless communications, GenAI can dynamically optimize the network to enhance the user experience. In signal detection and channel estimation tasks especially, because digital signals follow known random distributions, GenAI models can fully exploit their distribution-learning capabilities. For example, diffusion models (DMs) and normalizing flow models have been applied to related tasks. However, since the DM cannot guarantee that the generated results are the maximum-likelihood estimation points of the distribution during the data generation process, the successful task completion rate is reduced. Motivated by this, we propose a Maximum-Likelihood Estimation Inference (MLEI) framework. The framework uses the loss function in the forward diffusion process of the DM to infer the maximum-likelihood estimation points in the discrete space. Then, we present a signal detection task in near-field communication scenarios with unknown noise characteristics. In experiments, numerical results demonstrate that the proposed framework has better performance than state-of-the-art signal estimators.


[115] 2410.18757

Sliding DFT-based Signal Recovery for Modulo ADC with 1-bit Folding Information

The modulo analog-to-digital converter (ADC) is a promising solution to resolve the limited dynamic range (DR) issue of conventional ADCs and achieve an enhanced digital resolution given a fixed quantization bit budget. However, a modulo ADC requires an unfolding scheme to correct the nonlinear distortion introduced by the modulo operation. This paper presents a sliding discrete Fourier Transform (DFT)-based method for fast signal reconstruction given the modulo ADC output sequence and a 1-bit folding information sequence. In contrast to existing DFT-based signal recovery techniques for modulo ADCs, our proposed sliding DFT method reduces the required observation time and minimizes the spectral leakage effects via proper choice of window function parameters. A mean squared error (MSE) performance guarantee is established for the proposed signal recovery algorithm. More precisely, we derive sufficient conditions for the oversampling factor ($\mathrm{OF}$) and the number of quantization bits ($b$) to obtain a specific MSE performance. Our numerical results demonstrate that modulo ADCs equipped with our proposed recovery method can outperform conventional ADCs without modulo for $\mathrm{OF} \geq 4$ and $b \geq 4$. Moreover, the proposed sliding DFT-based recovery method outperforms higher-order difference-based recovery in terms of MSE, particularly in the low-sampling-rate and low-resolution regime. The impact of spectral leakage on the MSE performance of the proposed sliding DFT recovery method is also quantified.
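The sliding-DFT building block the recovery method rests on can be sketched on its own (the modulo unfolding itself, the windowing, and the 1-bit folding sequence are beyond this snippet; the signal and window length are invented). Each new sample updates all N bins in O(N) via the recurrence X_k ← (X_k - x_out + x_in) e^{j2πk/N}, instead of recomputing an O(N log N) FFT per shift:

```python
import numpy as np

def sliding_dft(x, N):
    """Sliding DFT: after an initial N-point FFT, each new sample updates
    all N bins in O(N) using the circular-shift recurrence."""
    k = np.arange(N)
    twiddle = np.exp(2j * np.pi * k / N)
    X = np.fft.fft(x[:N])            # initialize on the first window
    outputs = [X.copy()]
    for n in range(N, len(x)):
        X = (X - x[n - N] + x[n]) * twiddle  # drop oldest, add newest, rotate
        outputs.append(X.copy())
    return outputs                   # outputs[m] is the DFT of x[m:m+N]

rng = np.random.default_rng(1)
x = rng.standard_normal(64)
N = 16
slid = sliding_dft(x, N)
print(np.allclose(slid[10], np.fft.fft(x[10:10 + N])))  # matches a direct FFT
```

In a modulo-ADC pipeline, this recurrence would run on the folded samples, giving the recovery algorithm a cheap spectral view of each observation window.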


[116] 2411.04689

Over-the-Air DPD and Reciprocity Calibration in Massive MIMO and Beyond

Non-linear transceivers and non-reciprocity of downlink and uplink channels are two major challenges in the deployment of massive multiple-input-multiple-output (MIMO) systems. We consider an over-the-air (OTA) approach for digital pre-distortion (DPD) and reciprocity calibration to jointly address these issues. In particular, we consider a memory-less non-linearity model for the base station (BS) transmitters, and we propose a method to perform both linearization and reciprocity calibration based on mutual coupling OTA measurements between BS antennas. We show that, by using only the OTA-based data, we can linearize the transmitters and design the calibration to compensate for both the non-linearity and non-reciprocity of BS transceivers. This allows alleviating the requirement to have dedicated hardware modules for transceiver linearization. Moreover, the proposed reciprocity calibration method is solely based on closed-form linear transformations, achieving a significant complexity reduction over state-of-the-art reciprocity methods, which assume linear transceivers, and rely on iterative methods. Simulation results showcase the potential of our approach in terms of the calibration matrix estimation error and downlink data-rates when applying zero-forcing (ZF) precoding after using our OTA-based DPD and reciprocity calibration method.


[117] 2411.09956

Secure State Estimation of Cyber-Physical Systems via Gaussian Bernoulli Mixture Model

The implementation of cyber-physical systems in real-world applications is challenged by safety requirements in the presence of sensor threats. Most cyber-physical systems, especially multi-sensor systems, struggle to detect sensor attacks when the attack model is unknown. In this paper, we tackle this issue by proposing a Gaussian-Bernoulli Secure (GBS) estimator, which transforms the detection problem into an optimal estimation problem concerning the system state and observation indicators. It encompasses two theoretical sub-problems: sequential state estimation with partial observations and estimation updates with disordered new observations. Within the framework of the Kalman filter, we derive closed-form solutions for these two problems. However, due to their computational inefficiency, we propose an iterative approach employing proximal gradient descent to update the estimates more efficiently. Finally, we conduct experiments from three perspectives: computational efficiency, detection performance, and estimation error. Our GBS estimator demonstrates significant improvements over other methods.


[118] 2411.11610

Approximate predictive control barrier function for discrete-time systems

We propose integrating an approximation of a predictive control barrier function (PCBF) in a safety filter framework, resulting in a prediction horizon independent formulation. The PCBF is defined through the value function of an optimal control problem and ensures invariance as well as stability of a safe set within a larger domain of attraction. We provide a theoretical analysis of the proposed algorithm, establishing input-to-state stability of the safe set with respect to approximation errors as well as exogenous disturbances. Furthermore, we propose a continuous extension of the PCBF within the safe set, reducing the impact of learning errors on filter interventions. We demonstrate the stability properties and computational advantages of the proposed algorithm on a linear system example and its application as a safety filter for miniature race cars in simulation.


[119] 2412.01235

Real-time Traffic Simulation and Management for Large-scale Urban Air Mobility: Integrating Route Guidance and Collision Avoidance

Given the spatial heterogeneity of land use patterns in most cities, large-scale UAM deployments will likely focus on specific areas, such as intertransfer traffic between suburbs and city centers. However, large-scale UAM operations connecting multiple origin-destination pairs raise concerns about air traffic safety and efficiency due to potential conflict movements, particularly at major conflict points analogous to roadway junctions. To meet the safety and efficiency requirements of future UAM operations, this work proposes an air traffic management framework that integrates route guidance and collision avoidance. The route guidance mechanism optimizes aircraft distribution across both spatial and temporal dimensions by regulating their paths (composed of waypoints). Given the optimized paths, the collision avoidance algorithm generates collision-free aircraft trajectories between waypoints in the 3D space. To enable large-scale applications, we develop fast approximation methods for centralized path planning and adopt the velocity obstacle model for distributed collision avoidance. To our knowledge, this work is one of the first to integrate route guidance and collision avoidance for UAM. Simulation results demonstrate that the proposed framework enables efficient and flexible UAM operations, including air traffic assignment, local congestion mitigation, and dynamic no-fly zone management. Compared with a collision-free baseline strategy, the proposed framework achieves considerable improvements in traffic safety and efficiency, with increases in the average minimum separation (+98.2%), the average travel speed (+70.2%), and the trip completion rate (+130%), along with a reduction in the energy consumption (-23.0%). The proposed framework demonstrates its potential for real-time traffic simulation and management in large-scale UAM systems.


[120] 2412.07718

Closed-Form Approximation of the Total Variation Proximal Operator

Total variation (TV) is a widely used function for regularizing imaging inverse problems that is particularly appropriate for images whose underlying structure is piecewise constant. TV regularized optimization problems are typically solved using proximal methods, but the way in which they are applied is constrained by the absence of a closed-form expression for the proximal operator of the TV function. A closed-form approximation of the TV proximal operator has previously been proposed, but its accuracy was not theoretically explored in detail. We address this gap by making several new theoretical contributions, proving that the approximation leads to a proximal operator of some convex function, it is equivalent to a gradient descent step on a smoothed version of TV, and that its error can be fully characterized and controlled with its scaling parameter. We experimentally validate our theoretical results on image denoising and sparse-view computed tomography (CT) image reconstruction.
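The paper's central equivalence, that the closed-form approximation acts as a gradient step on a smoothed TV, can be sketched in 1-D (a rough illustration, not the paper's exact operator; the Charbonnier-style smoothing, step sizes, and signal below are invented):

```python
import numpy as np

def grad_smoothed_tv(x, eps=0.05):
    """Gradient of a smoothed 1-D TV, sum_i sqrt(d_i^2 + eps^2) with
    d_i = x[i+1] - x[i]; the eps-smoothing makes TV differentiable at 0."""
    d = np.diff(x)
    w = d / np.sqrt(d * d + eps * eps)
    g = np.zeros_like(x)
    g[:-1] -= w      # each d_i depends on x[i] (minus) ...
    g[1:] += w       # ... and on x[i+1] (plus)
    return g

def approx_prox_tv(y, lam, eps=0.05):
    """Approximate prox_{lam*TV}(y) as one explicit gradient step on the
    smoothed TV, in the spirit of the closed-form approximation."""
    return y - lam * grad_smoothed_tv(y, eps)

# Denoising sketch: proximal gradient on 0.5||x - y||^2 + lam*TV,
# with the approximate prox standing in for the exact one.
rng = np.random.default_rng(3)
clean = np.repeat([0.0, 1.0, 0.0], 30)            # piecewise-constant signal
noisy = clean + 0.1 * rng.standard_normal(clean.size)
x = noisy.copy()
for _ in range(200):
    x = approx_prox_tv(x - 0.5 * (x - noisy), lam=0.02)
tv = lambda v: np.abs(np.diff(v)).sum()
print(f"TV {tv(noisy):.2f} -> {tv(x):.2f}, "
      f"MSE {np.mean((noisy - clean)**2):.4f} -> {np.mean((x - clean)**2):.4f}")
```

The scaling parameter `eps` plays the role the paper analyzes: larger values give a smoother surrogate (and a more contractive step) at the cost of a larger approximation error relative to the exact TV prox.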


[121] 2501.01191

Data-Driven Yet Formal Policy Synthesis for Stochastic Nonlinear Dynamical Systems

The automated synthesis of control policies for stochastic dynamical systems presents significant challenges. A standard approach is to construct a finite-state abstraction of the continuous system, typically represented as a Markov decision process (MDP). However, generating abstractions is challenging when (1) the system's dynamics are nonlinear, and/or (2) we do not have complete knowledge of the dynamics. In this work, we introduce a novel data-driven abstraction technique for nonlinear Lipschitz continuous dynamical systems with additive stochastic noise that addresses both of these issues. As a key step, we use samples of the dynamics to learn the enabled actions and transition probabilities of the abstraction. We represent abstractions as MDPs with intervals of transition probabilities, known as interval MDPs (IMDPs). These abstractions enable the synthesis of policies for the concrete nonlinear system, with probably approximately correct (PAC) guarantees on the probability of satisfying a specified control objective. Our numerical experiments illustrate the effectiveness and robustness of our approach in achieving reliable control under uncertainty.
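The interval-MDP construction rests on bounding each transition probability from samples. The paper's exact PAC construction is not reproduced in the abstract; a common Hoeffding-style interval (our assumption, with hypothetical function name) looks like this:

```python
import math

def pac_transition_interval(successes, n_samples, delta):
    """PAC interval for a transition probability estimated from
    n_samples i.i.d. samples: with confidence at least 1 - delta,
    the true probability lies in [lo, hi] (Hoeffding bound),
    clipped to [0, 1]."""
    p_hat = successes / n_samples
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * n_samples))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)
```

Each abstract state-action pair gets such an interval, and policy synthesis on the resulting IMDP then inherits the per-transition confidence via a union bound over all estimated transitions.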


[122] 2501.14792

A Wearable Strain-Sensor-Based Shoulder Patch for Fatigue Detection in Bicep Curls

A common challenge in home-based rehabilitation is muscle compensation induced by pain or fatigue, where patients with weakened primary muscles recruit secondary muscle groups to assist their movement, causing issues such as delayed rehabilitation progress or risk of further injury. In a home-based setting, these subtle compensatory actions may go unnoticed because physiotherapists cannot directly observe patients. To address this problem, this study develops a novel wearable strain-sensor-based shoulder patch to detect fatigue-induced muscle compensation during bicep curl exercises. Building on the observation that the amplitude of a strain sensor's resistance signal correlates with the motion of the joint to which the sensor is attached, we develop an algorithm that robustly detects when significant changes appear in the shoulder joint motion, which indicates fatigue-induced muscle compensation in bicep curls. The developed shoulder patch is tested on 13 subjects who perform bicep curl exercises with a 5 kg dumbbell until reaching fatigue. During the experiment, the performance of the shoulder patch is also benchmarked against optical tracking sensors and surface electromyography (sEMG) sensors. Results reveal that the proposed wearable sensor and detection methods effectively monitor fatigue-induced muscle compensation during bicep curl exercises in both real-time and post hoc modes. This development marks a significant step toward enhancing the effectiveness of home-based rehabilitation by providing physiotherapists with a tool to monitor and adjust treatment plans remotely.
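The abstract does not specify the detection rule, so the following is only a minimal stand-in (function name, baseline window, and threshold ratio are all our assumptions): flag compensation when the per-repetition strain amplitude drifts well above the early-exercise baseline, reflecting increased shoulder involvement.

```python
def detect_compensation(rep_amplitudes, n_baseline=3, ratio=1.5):
    """Return the index of the first repetition whose peak-to-peak
    strain amplitude exceeds `ratio` times the baseline (the mean of
    the first n_baseline reps), or None if no repetition does.
    rep_amplitudes: one amplitude value per bicep-curl repetition."""
    baseline = sum(rep_amplitudes[:n_baseline]) / n_baseline
    for i, amp in enumerate(rep_amplitudes[n_baseline:], start=n_baseline):
        if amp > ratio * baseline:
            return i
    return None
```

In a real-time mode this check would run after each completed repetition; in a post hoc mode it would run once over the recorded session.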


[123] 2502.14935

Denoising, segmentation and volumetric rendering of optical coherence tomography angiography (OCTA) image using deep learning techniques: a review

Optical coherence tomography angiography (OCTA) is a non-invasive imaging technique widely used to study vascular structures and micro-circulation dynamics in the retina and choroid. OCTA is commonly used in clinics for diagnosing ocular disease and monitoring its progression, because it is safer and faster than dye-based angiography while retaining the ability to characterize micro-scale structures. However, OCTA data contains inherent noise from the devices and acquisition protocols and suffers from various types of artifacts, which impair diagnostic accuracy and repeatability. Deep learning (DL) based image analysis models can automatically detect and remove artifacts and noise, and enhance the quality of image data. DL is also a powerful tool for segmentation and identification of normal and pathological structures in the images. Thus, the value of OCTA imaging can be significantly enhanced by DL-based approaches for interpreting and performing measurements and predictions on the OCTA data. In this study, we reviewed the literature on DL models for OCTA images from the past five years. In particular, we focused on discussing the current problems in OCTA data and the corresponding design principles of the DL models. We also reviewed state-of-the-art DL models for 3D volumetric reconstruction of vascular networks and pathological structures such as edema and distorted optic disc. In addition, publicly available datasets of OCTA images are summarized at the end of this review. Overall, this review can provide valuable insights for engineers developing novel DL models that exploit the characteristics of OCTA signals and images. The pros and cons of each DL method and its applications discussed in this review can help technicians and clinicians choose proper DL models for fundamental research and disease screening.


[124] 2504.00196

Output-feedback model predictive control under dynamic uncertainties using integral quadratic constraints

In this work, we propose an output-feedback tube-based model predictive control (MPC) scheme for linear systems under dynamic uncertainties that are described via integral quadratic constraints (IQC). By leveraging IQCs, a large class of nonlinear and dynamic uncertainties can be addressed. We leverage recent IQC synthesis tools to design a dynamic controller and an estimator that are robust to these uncertainties and minimize the size of the resulting constraint tightening in the MPC. Thereby, we show that the robust estimation problem using IQCs with peak-to-peak performance can be convexified. We guarantee recursive feasibility, robust constraint satisfaction, and input-to-state stability of the resulting MPC scheme.


[125] 2504.07144

GIGA: Generalizable Sparse Image-driven Gaussian Humans

Driving a high-quality and photorealistic full-body virtual human from a few RGB cameras is a challenging problem that has become increasingly relevant with emerging virtual reality technologies. A promising solution to democratize such technology would be a generalizable method that takes sparse multi-view images of any person and then generates photoreal free-view renderings of them. However, the state-of-the-art approaches are not scalable to very large datasets and, thus, lack diversity and photorealism. To address this problem, we propose GIGA, a novel, generalizable full-body model for rendering photoreal humans in free viewpoint, driven by a single-view or sparse multi-view video. Notably, GIGA can scale training to a few thousand subjects while maintaining high photorealism and synthesizing dynamic appearance. At the core, we introduce a MultiHeadUNet architecture, which takes an approximate RGB texture accumulated from a single or multiple sparse views and predicts 3D Gaussian primitives represented as 2D texels on top of a human body mesh. At test time, our method performs novel view synthesis of a virtual 3D Gaussian-based human from 1 to 4 input views and a tracked body template for unseen identities. Our method excels over prior works by a significant margin in terms of identity generalization capability and photorealism.


[126] 2504.08922

Data-Importance-Aware Power Allocation for Adaptive Semantic Communication in Computer Vision Applications

Life-transformative applications such as immersive extended reality are revolutionizing wireless communications and computer vision (CV). This paper presents a novel framework for importance-aware adaptive data transmissions, designed specifically for real-time CV applications where task-specific fidelity is critical. A novel importance-weighted mean square error (IMSE) metric is introduced as a task-oriented measure of reconstruction quality, considering sub-pixel-level importance (SP-I) and semantic segment-level importance (SS-I) models. To minimize IMSE under total power constraints, data-importance-aware waterfilling approaches are proposed to optimally allocate transmission power according to data importance and channel conditions, prioritizing sub-streams with high importance. Simulation results demonstrate that the proposed approaches significantly outperform margin-adaptive waterfilling and equal power allocation strategies. The data partitioning that combines both SP-I and SS-I models is shown to achieve the most significant improvements, with normalized IMSE gains exceeding $7\,$dB and $10\,$dB over the baselines at high SNRs ($>10\,$dB). These substantial gains highlight the potential of the proposed framework to enhance data efficiency and robustness in real-time CV applications, especially in bandwidth-limited and resource-constrained environments.
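The paper's IMSE waterfilling is not reproduced in the abstract; the sketch below solves a generic importance-weighted allocation of the same flavor (objective, function name, and parameter values are our assumptions): minimize the weighted distortion sum_i w_i / (1 + p_i g_i) under a total-power budget, with the KKT water level found by bisection.

```python
import math

def importance_waterfilling(weights, gains, total_power, iters=100):
    """Allocate power to minimize sum_i w_i / (1 + p_i * g_i)
    subject to sum_i p_i = total_power and p_i >= 0.
    KKT gives p_i(mu) = max(0, (sqrt(w_i*g_i/mu) - 1)/g_i); the
    multiplier mu is found by bisecting the monotone power constraint."""
    def alloc(mu):
        return [max(0.0, (math.sqrt(w * g / mu) - 1.0) / g)
                for w, g in zip(weights, gains)]
    lo, hi = 1e-12, max(w * g for w, g in zip(weights, gains))
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(alloc(mid)) > total_power:
            lo = mid   # allocated too much power -> raise the water level
        else:
            hi = mid
    return alloc(0.5 * (lo + hi))
```

With equal channel gains, sub-streams carrying higher importance weights receive strictly more power, which is the qualitative behavior the data-importance-aware scheme relies on.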


[127] 2504.19062

Versatile Framework for Song Generation with Prompt-based Control

Song generation focuses on producing controllable high-quality songs based on various prompts. However, existing methods struggle to generate vocals and accompaniments with prompt-based control and proper alignment, and they fall short in supporting various tasks. To address these challenges, we introduce VersBand, a multi-task song generation framework for synthesizing high-quality, aligned songs with prompt-based control. VersBand comprises four primary models: 1) VocalBand, a decoupled model, leverages the flow-matching method to generate singing styles, pitches, and mel-spectrograms, allowing fast, high-quality vocal generation with style control. 2) AccompBand, a flow-based transformer model, incorporates the Band-MOE, selecting suitable experts for enhanced quality, alignment, and control; this model generates controllable, high-quality accompaniments aligned with vocals. 3) Two generation models, LyricBand for lyrics and MelodyBand for melodies, complete the multi-task song generation system, allowing extensive control based on multiple prompts. Experimental results show that VersBand outperforms baseline models across multiple song generation tasks using objective and subjective metrics. Demos and code are available at this https URL and this https URL.


[128] 2504.19119

MLICv2: Enhanced Multi-Reference Entropy Modeling for Learned Image Compression

Recent advances in learned image compression (LIC) have achieved remarkable performance improvements over traditional codecs. Notably, the MLIC series (LICs equipped with multi-reference entropy models) has substantially surpassed conventional image codecs such as Versatile Video Coding (VVC) Intra. However, existing MLIC variants suffer from several limitations: performance degradation at high bitrates due to insufficient transform capacity, suboptimal entropy modeling that fails to capture global correlations in initial slices, and a lack of adaptive channel importance modeling. In this paper, we propose MLICv2 and MLICv2+, enhanced successors that systematically address these limitations through improved transform design, advanced entropy modeling, and exploration of the potential of instance-specific optimization. For transform enhancement, we introduce a lightweight token mixing block inspired by the MetaFormer architecture, which effectively mitigates high-bitrate performance degradation while maintaining computational efficiency. For entropy modeling improvements, we propose hyperprior-guided global correlation prediction to extract global context even in the initial slice of the latent representation, complemented by a channel reweighting module that dynamically emphasizes informative channels. We further explore enhanced positional embedding and guided selective compression strategies for superior context modeling. Additionally, we apply Stochastic Gumbel Annealing (SGA) to demonstrate the potential for further performance improvements through input-specific optimization. Extensive experiments demonstrate that MLICv2 and MLICv2+ achieve state-of-the-art results, reducing the Bjøntegaard-Delta rate by 16.54%, 21.61%, 16.05% and 20.46%, 24.35%, 19.14% on the Kodak, Tecnick, and CLIC Pro Val datasets, respectively, compared to VTM-17.0 Intra.


[129] 2505.13237

SAKURA: On the Multi-hop Reasoning of Large Audio-Language Models Based on Speech and Audio Information

Large audio-language models (LALMs) extend large language models with multimodal understanding of speech, audio, and related signals. While their performance on speech and audio-processing tasks has been extensively studied, their reasoning abilities remain underexplored. In particular, their multi-hop reasoning, the ability to recall and integrate multiple facts, lacks systematic evaluation. Existing benchmarks focus on general speech and audio-processing tasks, conversational abilities, and fairness, but overlook this aspect. To bridge this gap, we introduce SAKURA, a benchmark assessing LALMs' multi-hop reasoning based on speech and audio information. Results show that LALMs struggle to integrate speech/audio representations for multi-hop reasoning, even when they extract the relevant information correctly, highlighting a fundamental challenge in multimodal reasoning. Our findings expose a critical limitation in LALMs, offering insights and resources for future research.


[130] 2505.20575

Synergising Hierarchical Data Centers and Power Networks: A Privacy-Preserving Approach

In the era of digitization, data centers have emerged as integral contributors sustaining our interlinked world, bearing responsibility for an increasing proportion of the world's energy consumption. To facilitate their fast rollout while progressing towards net-zero energy systems, the synergy of hierarchical data centers (cloud-fog-edge) and power networks can play a pivotal role. However, existing centralized co-dispatch schemes encroach on the privacy of different agents within the integrated systems, while suffering from combinatorial explosion. In this research, we propose a near-optimal distributed privacy-preserving approach to solve the non-convex synergy (day-ahead co-dispatch) problem. The synergy problem is formulated as a mixed-integer quadratically constrained quadratic program considering both communication and energy conservation, where Lyapunov optimization is introduced to balance operating costs and uncertain communication delays. To mitigate the impact of its highly non-convex nature, the normalized multi-parametric disaggregation technique is leveraged to reformulate the problem as a mixed-integer non-linear program. To further overcome the non-smoothness of the reformulated problem, a customized $\ell_1$-surrogate Lagrangian relaxation method with convergence guarantees is proposed to solve the problem in a distributed privacy-preserving manner. The effectiveness, optimality, and scalability of the proposed methodologies for the synergy problem are validated via numerical simulations. Simulation results also indicate that computing tasks can be delayed and migrated within the hierarchical data centers, demonstrating the flexible resource allocation capabilities of the hierarchical data center architecture and further facilitating peak load balancing in the power network.


[131] 2506.13094

MorphSAM: Learning the Morphological Prompts from Atlases for Spine Image Segmentation

Spine image segmentation is crucial for the clinical diagnosis and treatment of spine diseases. The complex structure of the spine and the high morphological similarity between individual vertebrae and adjacent intervertebral discs make accurate spine segmentation a challenging task. Although the Segment Anything Model (SAM) has been proposed, it still struggles to effectively capture and utilize morphological information, limiting its ability to enhance spine image segmentation performance. To address these challenges, in this paper, we propose MorphSAM, which explicitly learns morphological information from atlases, thereby strengthening the spine image segmentation performance of SAM. Specifically, MorphSAM includes two fully automatic prompt learning networks: 1) an anatomical prompt learning network that directly learns morphological information from anatomical atlases, and 2) a semantic prompt learning network that derives morphological information from text descriptions converted from the atlases. The two learned morphological prompts are then fed into the SAM model to boost segmentation performance. We validate MorphSAM on two spine image segmentation tasks, including a spine anatomical structure segmentation task with CT images and a lumbosacral plexus segmentation task with MR images. Experimental results demonstrate that MorphSAM achieves superior segmentation performance compared to state-of-the-art methods.


[132] 2506.14318

BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification with Swin-HAFNet

Accurate segmentation and classification of brain tumors from Magnetic Resonance Imaging (MRI) remain key challenges in medical image analysis, primarily due to the lack of high-quality, balanced, and diverse datasets. In this work, we present a newly developed MRI dataset, BRISC, designed specifically for brain tumor segmentation and classification tasks. The dataset comprises 6,000 contrast-enhanced T1-weighted MRI scans annotated by certified radiologists and physicians. It includes three major tumor types, namely glioma, meningioma, and pituitary, as well as non-tumorous cases. Each sample includes high-resolution labels and is categorized across axial, sagittal, and coronal imaging planes to facilitate robust model development and cross-view generalization. To demonstrate the utility of the dataset, we propose a transformer-based model leveraging a Swin Transformer backbone for multi-scale feature representation, and benchmark it on both segmentation and classification tasks, establishing a baseline for methodological research in neuro-oncological image analysis. Dataset link: this https URL


[133] 2507.09226

Can We Really Repurpose Multi-Speaker ASR Corpus for Speaker Diarization?

Neural speaker diarization is widely used for overlap-aware speaker diarization, but it requires large multi-speaker datasets for training. To meet this data requirement, large datasets are often constructed by combining multiple corpora, including those originally designed for multi-speaker automatic speech recognition (ASR). However, ASR datasets often feature loosely defined segment boundaries that do not align with the stricter conventions of diarization benchmarks. In this work, we show that such boundary looseness significantly impacts the diarization error rate, reducing evaluation reliability. We also reveal that models trained on data with varying boundary precision tend to learn dataset-specific looseness, leading to poor generalization across out-of-domain datasets. Training with standardized tight boundaries via forced alignment improves not only diarization performance, especially in streaming scenarios, but also ASR performance when combined with simple post-processing.


[134] 2507.12698

Pixel Perfect MegaMed: A Megapixel-Scale Vision-Language Foundation Model for Generating High Resolution Medical Images

Medical image synthesis presents unique challenges due to the inherent complexity and high-resolution details required in clinical contexts. Traditional generative architectures such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs) have shown great promise for high-resolution image generation but struggle with preserving the fine-grained details that are key for accurate diagnosis. To address this issue, we introduce Pixel Perfect MegaMed, the first vision-language foundation model to synthesize medical images at a resolution of 1024x1024. Our method deploys a multi-scale transformer architecture designed specifically for ultra-high resolution medical image generation, enabling the preservation of both global anatomical context and local image-level details. By leveraging vision-language alignment techniques tailored to medical terminology and imaging modalities, Pixel Perfect MegaMed bridges the gap between textual descriptions and visual representations at unprecedented resolution levels. We apply our model to the CheXpert dataset and demonstrate its ability to generate clinically faithful chest X-rays from text prompts. Beyond visual quality, these high-resolution synthetic images prove valuable for downstream tasks such as classification, showing measurable performance gains when used for data augmentation, particularly in low-data regimes. Our code is accessible through the project website - this https URL.


[135] 2507.14165

A Multi-Modal IoT Node for Energy-Efficient Environmental Monitoring with Edge AI Processing

The widespread adoption of Internet of Things (IoT) technologies has significantly advanced environmental monitoring (EM) by enabling cost-effective and scalable sensing solutions. Concurrently, machine learning (ML) and artificial intelligence (AI) are introducing powerful tools for the efficient and accurate analysis of complex environmental data. However, current IoT platforms for environmental sensing are typically limited to a narrow set of sensors, preventing a comprehensive assessment of environmental conditions, and lack sufficient computational capabilities to support the deployment of advanced ML and AI algorithms on the edge. To overcome these limitations, we introduce a compact (17x38 mm^2), multi-modal, MCU-based environmental IoT node integrating 11 sensors, including CO2 concentration, volatile organic compounds (VOCs), light intensity, UV radiation, pressure, temperature, humidity, visual sensing via an RGB camera, and precise geolocation through a GNSS module. It features GAP9, a parallel ultra-low-power system-on-chip, enabling real-time, energy-efficient edge processing of advanced ML models directly on-device. We implemented a YOLOv5-based occupancy detection pipeline (0.3 M parameters, 42 MOP per inference), demonstrating 42% energy savings over raw data streaming. Additionally, we present a smart indoor air quality (IAQ) monitoring setup that combines occupancy detection with adaptive sample rates, achieving operational times of up to 143 h on a single compact 600 mAh, 3.7 V battery. Our platform lays the groundwork for innovative applications such as predictive indoor IAQ monitoring, enabling efficient AI-driven on-edge forecasting for energy-efficient, autonomous, and proactive pollution-mitigation control strategies.
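As a quick back-of-envelope consistency check (assuming the full nominal capacity of the cell is usable), the reported 143 h lifetime on a 600 mAh, 3.7 V battery implies an average draw of roughly 15.5 mW:

```python
# Nominal battery energy of a 600 mAh, 3.7 V cell
energy_wh = 0.600 * 3.7                    # ampere-hours x volts, about 2.22 Wh
# Average power draw implied by the reported 143 h of operation
avg_power_mw = energy_wh / 143.0 * 1000.0  # about 15.5 mW
```

An average draw in the tens of milliwatts is consistent with an aggressively duty-cycled MCU-class node, which is the regime the adaptive-sample-rate design targets.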


[136] 2508.02072

HyTIP: Hybrid Temporal Information Propagation for Masked Conditional Residual Video Coding

Most frame-based learned video codecs can be interpreted as recurrent neural networks (RNNs) propagating reference information along the temporal dimension. This work revisits the limitations of the current approaches from an RNN perspective. The output-recurrence methods, which propagate decoded frames, are intuitive but impose dual constraints on the output decoded frames, leading to suboptimal rate-distortion performance. In contrast, the hidden-to-hidden connection approaches, which propagate latent features within the RNN, offer greater flexibility but require large buffer sizes. To address these issues, we propose HyTIP, a learned video coding framework that combines both mechanisms. Our hybrid buffering strategy uses explicit decoded frames and a small number of implicit latent features to achieve competitive coding performance. Experimental results show that our HyTIP outperforms the sole use of either output-recurrence or hidden-to-hidden approaches. Furthermore, it achieves comparable performance to state-of-the-art methods but with a much smaller buffer size, and outperforms VTM 17.0 (Low-delay B) in terms of PSNR-RGB and MS-SSIM-RGB. The source code of HyTIP is available at this https URL.


[137] 2508.07165

Large-scale Multi-sequence Pretraining for Generalizable MRI Analysis in Versatile Clinical Applications

Multi-sequence Magnetic Resonance Imaging (MRI) offers remarkable versatility, enabling the distinct visualization of different tissue types. Nevertheless, the inherent heterogeneity among MRI sequences poses significant challenges to the generalization capability of deep learning models. These challenges undermine model performance when faced with varying acquisition parameters, thereby severely restricting their clinical utility. In this study, we present PRISM, a foundation model PRe-trained with large-scale multI-Sequence MRI. We collected a total of 64 datasets from both public and private sources, encompassing a wide range of whole-body anatomical structures, with scans spanning diverse MRI sequences. Among them, 336,476 volumetric MRI scans from 34 datasets (8 public and 26 private) were curated to construct the largest multi-organ multi-sequence MRI pretraining corpus to date. We propose a novel pretraining paradigm that disentangles anatomically invariant features from sequence-specific variations in MRI, while preserving high-level semantic representations. We established a benchmark comprising 44 downstream tasks, including disease diagnosis, image segmentation, registration, progression prediction, and report generation. These tasks were evaluated on 32 public datasets and 5 private cohorts. PRISM consistently outperformed both non-pretrained models and existing foundation models, achieving first-rank results in 39 out of 44 downstream benchmarks with statistically significant improvements. These results underscore its ability to learn robust and generalizable representations across unseen data acquired under diverse MRI protocols. PRISM provides a scalable framework for multi-sequence MRI analysis, thereby enhancing the translational potential of AI in radiology. It delivers consistent performance across diverse imaging protocols, reinforcing its clinical applicability.


[138] 2508.07903

Diffusing the Blind Spot: Uterine MRI Synthesis with Diffusion Models

Despite significant progress in generative modelling, existing diffusion models often struggle to produce anatomically precise female pelvic images, limiting their application in gynaecological imaging, where data scarcity and patient privacy concerns are critical. To overcome these barriers, we introduce a novel diffusion-based framework for uterine MRI synthesis, integrating both unconditional and conditioned Denoising Diffusion Probabilistic Models (DDPMs) and Latent Diffusion Models (LDMs) in 2D and 3D. Our approach generates anatomically coherent, high fidelity synthetic images that closely mimic real scans and provide valuable resources for training robust diagnostic models. We evaluate generative quality using advanced perceptual and distributional metrics, benchmarking against standard reconstruction methods, and demonstrate substantial gains in diagnostic accuracy on a key classification task. A blinded expert evaluation further validates the clinical realism of our synthetic images. We release our models with privacy safeguards and a comprehensive synthetic uterine MRI dataset to support reproducible research and advance equitable AI in gynaecology.


[139] 2508.12059

Co-Investment with Payoff-Sharing Mechanism for Cooperative Decision-Making in Network Design Games

Network-based systems are inherently interconnected, with the design and performance of subnetworks being interdependent. However, the decisions of self-interested operators may lead to suboptimal outcomes for users and the overall system. This paper explores cooperative mechanisms that can simultaneously benefit both operators and users. We address this challenge using a game-theoretical framework that integrates both non-cooperative and cooperative game theory. In the non-cooperative stage, we propose a network design game in which subnetwork decision-makers strategically design local infrastructures. In the cooperative stage, a co-investment and payoff-sharing mechanism is developed to enlarge collective benefits and distribute them fairly. To demonstrate the effectiveness of our framework, we conduct case studies on the Sioux Falls network and real-world public transport networks in Zurich and Winterthur, Switzerland. Our evaluation considers impacts on environmental sustainability, social welfare, and economic efficiency. The proposed framework provides a foundation for improving interdependent networked systems by enabling strategic cooperation among self-interested operators.
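The abstract does not name the payoff-sharing rule, so as illustration only: the Shapley value is a standard fair-division rule from cooperative game theory (each player receives its average marginal contribution over all join orders), sketched here for a small coalition game.

```python
from itertools import permutations

def shapley_values(players, value):
    """Shapley value of each player: the average marginal contribution
    over all orders in which players join the coalition.
    `value` maps a frozenset coalition to its collective payoff."""
    phi = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            phi[p] += value(with_p) - value(coalition)  # marginal contribution
            coalition = with_p
    return {p: v / len(orders) for p, v in phi.items()}
```

Two useful sanity properties fall out directly: the shares sum to the grand-coalition value (efficiency), and symmetric operators receive equal shares (fairness), both desirable for a co-investment mechanism.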


[140] 2508.12445

FractMorph: A Fractional Fourier-Based Multi-Domain Transformer for Deformable Image Registration

Deformable image registration (DIR) is a crucial and challenging technique for aligning anatomical structures in medical images and is widely applied in diverse clinical applications. However, existing approaches often struggle to capture fine-grained local deformations and large-scale global deformations simultaneously within a unified framework. We present FractMorph, a novel 3D dual-parallel transformer-based architecture that enhances cross-image feature matching through multi-domain fractional Fourier transform (FrFT) branches. Each Fractional Cross-Attention (FCA) block applies parallel FrFTs at fractional angles of $0^\circ$, $45^\circ$, and $90^\circ$, along with a log-magnitude branch, to effectively extract local, semi-global, and global features at the same time. These features are fused via cross-attention between the fixed and moving image streams. A lightweight U-Net style network then predicts a dense deformation field from the transformer-enriched features. On the intra-patient ACDC cardiac MRI dataset, FractMorph achieves state-of-the-art performance with an overall Dice Similarity Coefficient (DSC) of $86.45\%$, an average per-structure DSC of $75.15\%$, and a 95th-percentile Hausdorff distance (HD95) of $1.54~\mathrm{mm}$ on our data split. FractMorph-Light, a lightweight variant of our model with only 29.6M parameters, preserves high accuracy while halving model complexity. Furthermore, we demonstrate the generality of our approach with solid performance on a cerebral atlas-to-patient dataset. Our results demonstrate that multi-domain spectral-spatial attention in transformers can robustly and efficiently model complex non-rigid deformations in medical images using a single end-to-end network, without the need for scenario-specific tuning or hierarchical multi-scale networks. The source code is available at this https URL.


[141] 2508.13818

Robust Optimization for Movable Antenna-aided Cell-Free ISAC with Time Synchronization Errors

The cell-free integrated sensing and communication (CF-ISAC) system, which effectively mitigates intra-cell interference and provides precise sensing accuracy, is a promising technology for future 6G networks. However, to fully capitalize on the potential of CF-ISAC, accurate time synchronization (TS) between access points (APs) is critical. Due to the limitations of current synchronization technologies, TS errors have become a significant challenge in the development of CF-ISAC systems. In this paper, we propose a novel CF-ISAC architecture based on movable antennas (MAs), which exploits spatial diversity to enhance communication rates, maintain sensing accuracy, and reduce the impact of TS errors. To address this challenge, we formulate a worst-case sensing accuracy optimization problem under TS errors and derive the worst-case Cramér-Rao lower bound (CRLB). Subsequently, we develop a joint optimization framework for AP beamforming and MA positions that satisfies communication rate constraints while improving sensing accuracy. A robust optimization framework is designed for this highly complex and non-convex problem. Specifically, we employ manifold optimization (MO) to solve the worst-case sensing accuracy optimization problem. We then propose MA-enabled meta-reinforcement learning (MA-MetaRL) to design the optimization variables while satisfying constraints on MA positions, communication rate, and transmit power, thereby improving sensing accuracy. Simulation results demonstrate that the proposed robust optimization algorithm significantly improves detection accuracy and is robust to TS errors. Moreover, compared to conventional fixed-position antenna (FPA) technologies, the proposed MA-aided CF-ISAC architecture achieves higher system capacity, validating its effectiveness.
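The worst-case sensing CRLB derived in the paper is not reproduced here; as a self-contained illustration of what a CRLB asserts, the toy example below (all values our own) checks by Monte Carlo that the sample mean of Gaussian observations attains the bound sigma^2/N for estimating a constant.

```python
import random

random.seed(0)
theta, sigma, n, trials = 2.0, 1.0, 25, 4000
crlb = sigma ** 2 / n          # Fisher information is n / sigma^2

# Monte Carlo variance of the sample-mean estimator
estimates = []
for _ in range(trials):
    samples = [theta + random.gauss(0.0, sigma) for _ in range(n)]
    estimates.append(sum(samples) / n)
mean_est = sum(estimates) / trials
var_est = sum((e - mean_est) ** 2 for e in estimates) / trials
# var_est should sit close to (and not significantly below) crlb
```

A worst-case CRLB analysis, as in the paper, bounds this quantity over an uncertainty set (here, over TS errors) instead of at a single nominal operating point.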


[142] 2508.13839

Distributed Distortion-Aware Robust Optimization for Movable Antenna-aided Cell-Free ISAC Systems

The cell-free integrated sensing and communication (CF-ISAC) architecture is a promising enabler for 6G, offering spectrum efficiency and ubiquitous coverage. However, real deployments suffer from hardware impairments, especially nonlinear distortion from power amplifiers (PAs), which degrades both communication and sensing. To address this, we propose a movable antenna (MA)-aided CF-ISAC system that mitigates distortion and enhances robustness. The PAs' nonlinearities are modeled by a third-order memoryless polynomial, where the third-order distortion coefficients (3RDCs) vary across access points (APs) due to hardware differences, aging, and environmental conditions. We design a distributed distortion-aware worst-case robust optimization framework that explicitly incorporates uncertainty in the 3RDCs. First, we analyze the worst-case impact of PA distortion on both the Cramér-Rao lower bound (CRLB) and the communication rate. Then, to address the resulting non-convexity, we apply successive convex approximation (SCA) to estimate the 3RDCs. With these estimates, we jointly optimize beamforming and MA positions under transmit power and sensing constraints. To efficiently solve this highly non-convex problem, we develop an MA-enabled self-attention convolutional graph neural network (SACGNN) algorithm. Simulations demonstrate that our method substantially enhances the communication-sensing trade-off under distortion and outperforms fixed-position antenna baselines in terms of robustness and capacity, highlighting the advantages of MA-aided CF-ISAC systems.


[143] 2508.13937

Evaluating Particle Filtering for RSS-Based Target Localization under Varying Noise Levels and Sensor Geometries

Target localization is a critical task in various applications, such as search and rescue, surveillance, and wireless sensor networks. When a target emits a radio frequency (RF) signal, spatially distributed sensors can collect signal measurements to estimate the target's location. Among various measurement modalities, received signal strength (RSS) is particularly attractive due to its low cost, low power consumption, and ease of deployment. While particle filtering has previously been applied to RSS-based target localization, few studies have systematically analyzed its performance under varying sensor geometries and RSS noise levels. This paper addresses this gap by designing and evaluating a particle filtering algorithm for localizing a stationary target. The proposed method is compared with a conventional RSS-based trilateration approach across different sensor configurations and noise conditions. Simulation results indicate that particle filtering provides more accurate target localization than trilateration, particularly in scenarios with unfavorable sensor geometries and high RSS noise.
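The abstract above compares particle filtering with trilateration for RSS-based localization of a stationary target. A minimal sketch of the particle-filtering side, assuming a standard log-distance path-loss model with illustrative parameter values (not taken from the paper), might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log-distance path-loss model: rss = P0 - 10*n*log10(d) + noise
P0, PLE, SIGMA = -40.0, 2.0, 2.0   # dBm at 1 m, path-loss exponent, noise std (dB)

def rss_model(points, sensors):
    """Predicted RSS (dBm) at each sensor for each candidate point."""
    d = np.linalg.norm(sensors[None, :, :] - points[:, None, :], axis=2)
    return P0 - 10.0 * PLE * np.log10(np.maximum(d, 1e-3))

def particle_filter_localize(sensors, measurements, n_particles=5000, n_iters=30):
    # Particles spread uniformly over a 100 m x 100 m area
    particles = rng.uniform(0.0, 100.0, size=(n_particles, 2))
    for _ in range(n_iters):
        # Gaussian likelihood of the RSS measurements gives the particle weights
        resid = rss_model(particles, sensors) - measurements
        logw = -0.5 * np.sum(resid**2, axis=1) / SIGMA**2
        w = np.exp(logw - logw.max())
        w /= w.sum()
        # Systematic resampling, then jitter (roughening) to keep diversity
        u = (rng.random() + np.arange(n_particles)) / n_particles
        idx = np.minimum(np.searchsorted(np.cumsum(w), u), n_particles - 1)
        particles = particles[idx] + rng.normal(0.0, 0.5, size=(n_particles, 2))
    return particles.mean(axis=0)

sensors = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
target = np.array([37.0, 62.0])
meas = rss_model(target[None, :], sensors)[0] + rng.normal(0.0, SIGMA, size=4)
est = particle_filter_localize(sensors, meas)
print(est)
```

The sensor geometry, area size, and noise level here are placeholders; the paper's systematic evaluation varies exactly these quantities.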


[144] 2508.14715

Recursive Gaussian Process Regression with Integrated Monotonicity Assumptions for Control Applications

In this paper, we present an extension to recursive Gaussian Process (RGP) regression that enables the satisfaction of inequality constraints and is well suited for real-time execution in control applications. The soft inequality constraints are integrated by introducing an additional extended Kalman filter (EKF) update step using pseudo-measurements. The sequential formulation of the algorithm and several developed heuristics ensure both the performance and the low computational effort of the algorithm. A special focus lies on the efficient consideration of monotonicity assumptions for GPs in the form of inequality constraints. The algorithm is statistically validated in simulations, where its possible advantages over the standard RGP algorithm become evident. The paper concludes with a successful experimental validation of the developed algorithm for the monotonicity-preserving learning of heat transfer values for the control of a vapor compression cycle evaporator, leveraging a previously published partial input-output linearization (IOL).
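The pseudo-measurement idea above can be illustrated in isolation: given a Gaussian belief over function values at a few grid points, a soft monotonicity constraint is imposed by fusing a Kalman pseudo-measurement whenever the mean violates it. This is a simplified linear sketch with made-up numbers, not the paper's full RGP/EKF algorithm:

```python
import numpy as np

def pseudo_measurement_update(m, P, H, r=1e-4):
    """Soft inequality constraints H @ x >= 0 via Kalman pseudo-measurements:
    whenever the mean violates a constraint row h, fuse the observation h @ x = 0
    with small pseudo-measurement noise r."""
    for h in H:
        if h @ m < 0.0:                  # constraint currently violated
            S = h @ P @ h + r            # innovation variance (scalar)
            K = P @ h / S                # Kalman gain
            m = m + K * (0.0 - h @ m)    # pull the mean onto the constraint
            P = P - np.outer(K, h @ P)   # shrink uncertainty accordingly
    return m, P

# Toy Gaussian belief over f at three inputs; a monotonically increasing f is wanted
m = np.array([0.0, -0.5, 1.0])           # mean violates f(x1) <= f(x2)
P = 0.5 * np.eye(3)
H = np.array([[-1.0, 1.0, 0.0],          # f(x2) - f(x1) >= 0
              [0.0, -1.0, 1.0]])         # f(x3) - f(x2) >= 0
m2, P2 = pseudo_measurement_update(m, P, H)
print(m2)   # the violated difference is pulled to (approximately) zero
```

In the paper's setting, the Gaussian belief is the recursive GP posterior and the update is an EKF step; the mechanism of correcting the belief with a fictitious constraint observation is the same.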


[145] 2508.15442

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

Language Model (LM)-based Text-to-Speech (TTS) systems often generate hallucinated speech that deviates from the input text. Existing mitigation strategies either demand excessive training resources or introduce significant inference latency. In this paper, we propose GFlOwNet-guided distribution AlignmenT (GOAT) for LM-based TTS, a post-training framework that mitigates hallucinations without relying on massive resources or additional inference cost. Specifically, we first conduct an uncertainty analysis, revealing a strong positive correlation between hallucination and model uncertainty. Based on this, we reformulate TTS generation as a trajectory flow optimization problem and introduce an enhanced Subtrajectory Balance objective together with a sharpened internal reward as the target distribution. We further integrate reward temperature decay and learning rate optimization for stability and performance balance. Extensive experiments show that GOAT reduces character error rates by over 50% on challenging test cases and lowers uncertainty by up to 58%, demonstrating its strong generalization ability and effectiveness.


[146] 2508.15660

Hessian-Based Lightweight Neural Network HessNet for State-of-the-Art Brain Vessel Segmentation on a Minimal Training Dataset

Accurate segmentation of blood vessels in brain magnetic resonance angiography (MRA) is essential for successful surgical procedures, such as aneurysm repair or bypass surgery. Currently, annotation is primarily performed through manual segmentation or classical methods, such as the Frangi filter, which often lack sufficient accuracy. Neural networks have emerged as powerful tools for medical image segmentation, but their development depends on well-annotated training datasets. However, there is a notable lack of publicly available MRA datasets with detailed brain vessel annotations. To address this gap, we propose HessNet, a lightweight semi-supervised neural network that incorporates Hessian matrices for the 3D segmentation of complex tubular structures. HessNet has only 6,000 parameters, can run on a CPU, and significantly reduces the resource requirements for training. Trained on a minimal dataset, it achieves state-of-the-art vessel segmentation accuracy. Using HessNet, we created a large, semi-manually annotated brain vessel dataset of 200 MRA images based on the IXI dataset. Annotation was performed by three experts under the supervision of three neurovascular surgeons after applying HessNet, whose high segmentation accuracy allowed the experts to focus only on the most complex and important cases. The dataset is available at this https URL.


[147] 2508.15883

Beyond Imaging: Vision Transformer Digital Twin Surrogates for 3D+T Biological Tissue Dynamics

Understanding the dynamic organization and homeostasis of living tissues requires high-resolution, time-resolved imaging coupled with methods capable of extracting interpretable, predictive insights from complex datasets. Here, we present the Vision Transformer Digital Twin Surrogate Network (VT-DTSN), a deep learning framework for predictive modeling of 3D+T imaging data from biological tissue. By leveraging Vision Transformers pretrained with DINO (Self-Distillation with NO Labels) and employing a multi-view fusion strategy, VT-DTSN learns to reconstruct high-fidelity, time-resolved dynamics of a Drosophila midgut while preserving morphological and feature-level integrity across imaging depths. The model is trained with a composite loss prioritizing pixel-level accuracy, perceptual structure, and feature-space alignment, ensuring biologically meaningful outputs suitable for in silico experimentation and hypothesis testing. Evaluation across layers and biological replicates demonstrates VT-DTSN's robustness and consistency, achieving low error rates and high structural similarity while maintaining efficient inference through model optimization. This work establishes VT-DTSN as a feasible, high-fidelity surrogate for cross-timepoint reconstruction and for studying tissue dynamics, enabling computational exploration of cellular behaviors and homeostasis to complement time-resolved imaging studies in biological research.


[148] 2307.14072

Negative Spin $\Delta_T$ Noise Induced by Spin-Flip Scattering and Andreev Reflection

We study charge $\Delta_T$ noise, followed by an examination of spin $\Delta_T$ noise, in the normal metal-spin flipper-normal metal-insulator-superconductor (N-sf-N-I-S) junction. Our analysis reveals a key contrast: while charge $\Delta_T$ noise remains strictly positive, spin $\Delta_T$ noise undergoes a sign reversal from positive to negative, driven by the interplay between spin-flip scattering and Andreev reflection. In contrast, both charge and spin quantum shot noise remain positive and sign-definite. The emergence of negative spin $\Delta_T$ noise has two major implications. First, it establishes a clear distinction between spin-resolved $\Delta_T$ noise and quantum shot noise: the former is dominated by opposite-spin correlations, whereas the latter is dominated by same-spin correlations. Second, it provides access to scattering mechanisms that are not captured by quantum shot noise alone. Thus, negative spin $\Delta_T$ noise serves as a unique probe of the cooperative effects of Andreev reflection and spin flipping. We further place our results in context by comparing them with earlier reports of negative $\Delta_T$ noise in strongly correlated systems, such as fractional quantum Hall states, and in multiterminal hybrid superconducting junctions. Overall, this work offers new insights into the mechanisms governing sign reversals in $\Delta_T$ noise and highlights their role as distinctive fingerprints of spin-dependent scattering in superconducting hybrid devices.


[149] 2311.09018

On the Foundation of Distributionally Robust Reinforcement Learning

Motivated by the need for a robust policy in the face of environment shifts between training and deployment, we contribute to the theoretical foundation of distributionally robust reinforcement learning (DRRL). This is accomplished through a comprehensive modeling framework centered around robust Markov decision processes (RMDPs). This framework obliges the decision maker to choose an optimal policy under the worst-case distributional shift orchestrated by an adversary. By unifying and extending existing formulations, we rigorously construct RMDPs that embrace various modeling attributes for both the decision maker and the adversary. These attributes include the structure of information availability (covering history-dependent, Markov, and Markov time-homogeneous dynamics) as well as constraints on the shifts induced by the adversary, with a focus on SA- and S-rectangularity. Within this RMDP framework, we investigate conditions for the existence or absence of the dynamic programming principle (DPP). From an algorithmic standpoint, the existence of the DPP holds significant implications, as the vast majority of existing data-efficient and computationally efficient DRRL algorithms rely on it. To investigate its existence, we systematically analyze various combinations of controller and adversary attributes, presenting streamlined proofs based on a unified methodology. We then construct counterexamples for settings where a fully general DPP fails to hold and establish asymptotically optimal history-dependent policies for key scenarios where the DPP is absent.
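When the DPP does hold, robust value iteration applies a min-max Bellman operator. A tiny sketch under strong simplifying assumptions (a 2-state, 2-action MDP whose SA-rectangular uncertainty set is just two candidate transition kernels, all numbers hypothetical):

```python
import numpy as np

# Tiny robust MDP: adversary picks, per (s, a), the worst of two kernels
nS, nA, gamma = 2, 2, 0.9
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # reward r(s, a)
P = np.array([                             # P[m, s, a, s'] for models m = 0, 1
    [[[0.9, 0.1], [0.5, 0.5]],
     [[0.2, 0.8], [0.7, 0.3]]],
    [[[0.6, 0.4], [0.3, 0.7]],
     [[0.5, 0.5], [0.9, 0.1]]],
])

# Robust value iteration: max over actions, min over kernels; with the DPP
# this operator is a gamma-contraction and converges to the robust value.
V = np.zeros(nS)
for _ in range(500):
    Q = R + gamma * np.min(P @ V, axis=0)  # worst-case expected next value, (nS, nA)
    V = Q.max(axis=1)
print(V)
```

Real S- or SA-rectangular sets are typically continuous (e.g., divergence balls), with the inner minimization solved in closed form or by convex optimization; a finite set keeps the contraction structure visible.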


[150] 2312.12561

Generalizations of data-driven balancing: What to sample for different balancing-based reduced models

The quadrature-based balanced truncation (QuadBT) framework of arXiv:2104.01006 is a non-intrusive reformulation of balanced truncation (BT), a classical projection-based model-order reduction technique for linear systems. QuadBT is non-intrusive in the sense that it builds approximate balanced truncation reduced-order models entirely from system response data, e.g., transfer function measurements, without the need to reference an explicit state-space realization of the underlying full-order model. In this work, we generalize the QuadBT framework to other types of balanced truncation model reduction. Namely, we show what transfer function data are required to compute data-driven reduced models by balanced stochastic truncation, positive-real balanced truncation, and bounded-real balanced truncation. In each case, these data are evaluations of particular spectral factors associated with the system of interest. These results lay the theoretical foundation for data-driven reformulations of the aforementioned BT variants. Although it is not yet clear how to compute or obtain these spectral factor data in a practical real-world setting, examples using synthetic (numerically evaluated) transfer function data are included to validate the data-based reduced models.


[151] 2401.10266

Intelligent Condition Monitoring of Industrial Plants: An Overview of Methodologies and Uncertainty Management Strategies

Condition monitoring is essential for ensuring the safety, reliability, and efficiency of modern industrial systems. With the increasing complexity of industrial processes, artificial intelligence (AI) has emerged as a powerful tool for fault detection and diagnosis, attracting growing interest from both academia and industry. This paper provides a comprehensive overview of intelligent condition monitoring methods, with a particular emphasis on chemical plants and the widely used Tennessee Eastman Process (TEP) benchmark. State-of-the-art machine learning (ML) and deep learning (DL) algorithms are reviewed, highlighting their strengths, limitations, and applicability to industrial fault detection and diagnosis. Special attention is given to key challenges, including imbalanced and unlabeled data, and to strategies by which models can address these issues. Furthermore, comparative analyses of algorithm performance are presented to guide method selection in practical scenarios. This survey is intended to benefit both newcomers and experienced researchers by consolidating fundamental concepts, summarizing recent advances, and outlining open challenges and promising directions for intelligent condition monitoring in industrial plants.


[152] 2403.09110

SINDy-RL: Interpretable and Efficient Model-Based Reinforcement Learning

Deep reinforcement learning (DRL) has shown significant promise for uncovering sophisticated control policies that interact in complex environments, such as stabilizing a tokamak fusion reactor or minimizing the drag force on an object in a fluid flow. However, DRL requires an abundance of training examples and may become prohibitively expensive for many applications. In addition, the reliance on deep neural networks often results in an uninterpretable, black-box policy that may be too computationally expensive to use with certain embedded systems. Recent advances in sparse dictionary learning, such as the sparse identification of nonlinear dynamics (SINDy), have shown promise for creating efficient and interpretable data-driven models in the low-data regime. In this work we introduce SINDy-RL, a unifying framework for combining SINDy and DRL to create efficient, interpretable, and trustworthy representations of the dynamics model, reward function, and control policy. We demonstrate the effectiveness of our approaches on benchmark control environments and flow control problems, including gust mitigation on a 3D NACA 0012 airfoil at $Re=1000$. SINDy-RL achieves comparable performance to modern DRL algorithms using significantly fewer interactions in the environment and results in an interpretable control policy orders of magnitude smaller than a DRL policy.
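The SINDy component mentioned above rests on sequentially thresholded least squares (STLSQ) over a library of candidate terms. A minimal sketch of that core regression on synthetic data from a known linear oscillator (not the paper's RL environments):

```python
import numpy as np

def stlsq(theta, dxdt, threshold=0.05, n_iters=10):
    """Sequentially thresholded least squares, the core SINDy regression."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dxdt.shape[1]):       # refit the surviving terms per state
            big = ~small[:, k]
            if big.any():
                xi[big, k] = np.linalg.lstsq(theta[:, big], dxdt[:, k], rcond=None)[0]
    return xi

# Synthetic data from a linear oscillator: dx/dt = -0.1x + 2y, dy/dt = -2x - 0.1y
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
A = np.array([[-0.1, 2.0], [-2.0, -0.1]])
dX = X @ A.T   # exact derivatives, so no numerical differentiation is needed

# Candidate library: 1, x, y, x^2, xy, y^2
x, y = X[:, 0], X[:, 1]
theta = np.column_stack([np.ones_like(x), x, y, x * x, x * y, y * y])
xi = stlsq(theta, dX)
print(xi.T)   # sparse coefficients recover the two rows of A
```

SINDy-RL applies this kind of sparse regression not only to the dynamics but also to surrogate rewards and policies; the sketch shows only the dynamics-identification step.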


[153] 2405.00947

Co-Optimization of EV Charging Control and Incentivization for Enhanced Power System Stability

We study how high charging rate demands from electric vehicles (EVs) in a power distribution grid may collectively cause poor dynamic performance, and propose a price incentivization strategy to steer customers to settle for lesser charging rate demands so that such performance degradation can be avoided. We pose the problem as a joint optimization and optimal control formulation. The optimization determines the optimal charging setpoints for EVs to minimize the $\mathcal{H}_2$-norm of the transfer function of the grid model, while the optimal control simultaneously develops a linear quadratic regulator (LQR) based state-feedback control signal for the battery currents of those EVs to jointly improve the small-signal dynamic performance of the system states. A subsequent algorithm is developed to determine how much customers may be willing to sacrifice their intended charging rate demands in return for financial incentives. Results are derived for both unidirectional and bidirectional charging, and validated using numerical simulations of multiple EV charging stations (EVCSs) in the IEEE 33-bus power distribution model.
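The LQR piece of the formulation above can be sketched in a few lines. This is a generic continuous-time LQR on a toy two-state system with illustrative values, not the paper's EV/grid model:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Toy two-state linear model (illustrative values only)
A = np.array([[0.0, 1.0], [-2.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 1.0])   # state penalty
R = np.array([[0.1]])      # control penalty

# LQR state feedback u = -K x from the continuous algebraic Riccati equation:
#   A'P + PA - P B R^{-1} B' P + Q = 0,  K = R^{-1} B' P
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)

eigs = np.linalg.eigvals(A - B @ K)   # closed-loop poles
print(K, eigs.real)                   # all real parts negative -> stable
```

In the paper, this state-feedback design is coupled with an outer optimization of the charging setpoints that minimizes the $\mathcal{H}_2$-norm of the grid transfer function.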


[154] 2406.04920

Sim-to-Real Transfer of Deep Reinforcement Learning Agents for Online Coverage Path Planning

Coverage path planning (CPP) is the problem of finding a path that covers the entire free space of a confined area, with applications ranging from robotic lawn mowing to search-and-rescue. While for known environments, offline methods can find provably complete paths, and in some cases optimal solutions, unknown environments need to be planned online during mapping. We investigate the suitability of continuous-space reinforcement learning (RL) for this challenging problem, and propose a computationally feasible egocentric map representation based on frontiers, as well as a novel reward term based on total variation to promote complete coverage. Compared to existing classical methods, this approach allows for a flexible path space, and enables the agent to adapt to specific environment characteristics. Meanwhile, the deployment of RL models on real robot systems is difficult. Training from scratch may be infeasible due to slow convergence times, while transferring from simulation to reality, i.e. sim-to-real transfer, is a key challenge in itself. We bridge the sim-to-real gap through a semi-virtual environment, including a real robot and real-time aspects, while utilizing a simulated sensor and obstacles to enable environment randomization and automated episode resetting. We investigate what level of fine-tuning is needed for adapting to a realistic setting. Through extensive experiments, we show that our approach surpasses the performance of both previous RL-based approaches and highly specialized methods across multiple CPP variations in simulation. Meanwhile, our method successfully transfers to a real robot. Our code implementation can be found online.


[155] 2406.09726

PixRO: Pixel-Distributed Rotational Odometry with Gaussian Belief Propagation

Images are the standard input for most computer vision algorithms. However, their processing often reduces to parallelizable operations applied locally and independently to individual pixels. Yet, many of these low-level raw pixel readings only provide redundant or noisy information for specific high-level tasks, leading to inefficiencies in both energy consumption during their transmission off-sensor and computational resources in their subsequent processing. As novel sensors featuring advanced in-pixel processing capabilities emerge, we envision a paradigm shift toward performing increasingly complex visual processing directly in-pixel, reducing computational overhead downstream. We advocate for synthesizing high-level cues at the pixel level, enabling their off-sensor transmission to directly support downstream tasks more effectively than raw pixel readings. This paper conceptualizes a novel photometric rotation estimation algorithm distributed at the pixel level, where each pixel estimates the global motion of the camera by exchanging information with other pixels to achieve global consensus. We employ a probabilistic formulation and leverage Gaussian Belief Propagation (GBP) for decentralized inference using message passing. The proposed technique is evaluated on real-world public datasets, and we offer an in-depth analysis of the practicality of applying GBP to distributed rotation estimation at the pixel level.


[156] 2409.03010

Diffusion MRI invariants: from the group of rotations to a complete neuroimaging fingerprint

Water diffusion gives rise to micrometer-scale sensitivity of diffusion MRI (dMR) to cellular-level tissue structure. The advent of precision medicine and quantitative imaging hinges on revealing the information content of dMR and providing its parsimonious basis- and hardware-independent "fingerprint". Here we focus on the geometry of a multi-dimensional dMR signal, derive a complete set of 21 diffusion and covariance tensor invariants in terms of irreducible representations of the group of rotations, and relate them to tissue properties. Conventional dMR metrics are shown to be redundant, while most of the invariants provide novel complementary information. Our complete set of invariants for the kurtosis tensor improves multiple sclerosis classification in a cohort of 1189 subjects. We design acquisitions based on icosahedral vertices that guarantee a minimal number of measurements to determine the most widely used invariants in only 1-2 minutes for the whole brain. Representing dMR signals via scalar invariant maps with definite symmetries will underpin machine learning classifiers of brain pathology, development, and aging, while fast protocols will enable the translation of advanced dMR into clinical practice.
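The notion of a rotational invariant can be illustrated with the two most familiar low-order examples, mean diffusivity and fractional anisotropy of the rank-2 diffusion tensor; both are functions of the eigenvalues alone and therefore unchanged by any rotation of the tensor (the paper's 21 invariants extend this idea to the full covariance/kurtosis tensor):

```python
import numpy as np

def tensor_invariants(D):
    """Rotation-invariant scalars of a symmetric 3x3 diffusion tensor:
    mean diffusivity (MD) and fractional anisotropy (FA)."""
    lam = np.linalg.eigvalsh(D)
    md = lam.mean()
    fa = np.sqrt(1.5 * np.sum((lam - md) ** 2) / np.sum(lam ** 2))
    return md, fa

rng = np.random.default_rng(2)
D = np.diag([1.7e-3, 0.4e-3, 0.3e-3])        # prolate tensor (mm^2/s), illustrative

# Random 3D rotation via QR decomposition of a Gaussian matrix
Qm, _ = np.linalg.qr(rng.normal(size=(3, 3)))
D_rot = Qm @ D @ Qm.T

print(tensor_invariants(D), tensor_invariants(D_rot))  # identical up to rounding
```

The tensor values and the rotation here are illustrative; the point is only that eigenvalue-based scalars survive any change of the measurement frame.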


[157] 2409.17054

Using LLM for Real-Time Transcription and Summarization of Doctor-Patient Interactions into ePuskesmas in Indonesia: A Proof-of-Concept Study

One of the critical issues contributing to inefficiency in Puskesmas (Indonesian community health centers) is the time-consuming nature of documenting doctor-patient interactions. Doctors must conduct thorough consultations and manually transcribe detailed notes into ePuskesmas electronic health records (EHR), which creates a substantial administrative burden for already overstretched physicians. This paper presents a proof-of-concept framework using large language models (LLMs) to automate real-time transcription and summarization of doctor-patient conversations in Bahasa Indonesia. Our system combines the Whisper model for transcription with GPT-3.5 for medical summarization, implemented as a browser extension that automatically populates ePuskesmas forms. Through controlled roleplay experiments with medical validation, we demonstrate the technical feasibility of processing detailed consultations of over 300 seconds in under 30 seconds while maintaining clinical accuracy. This work establishes the foundation for AI-assisted clinical documentation in resource-constrained healthcare environments. However, concerns remain regarding privacy compliance and the need for large-scale clinical evaluation that addresses the language and cultural biases of LLMs.


[158] 2411.14618

Active Learning-Based Optimization of Hydroelectric Turbine Startup to Minimize Fatigue Damage

Hydro-generating units (HGUs) play a crucial role in integrating intermittent renewable energy sources into the power grid due to their flexible operational capabilities. This evolving role has led to an increase in transient events, such as startups, which impose significant stresses on turbines, leading to increased turbine fatigue and a reduced operational lifespan. Consequently, optimizing startup sequences to minimize stresses is vital for hydropower utilities. However, this task is challenging, as stress measurements on prototypes can be expensive and time-consuming. To tackle this challenge, we propose an innovative automated approach to optimize the startup parameters of HGUs with a limited budget of measured startup sequences. Our method combines active learning and black-box optimization techniques, utilizing virtual strain sensors and dynamic simulations of HGUs. This approach was tested in real-time during an on-site measurement campaign on an instrumented Francis turbine prototype. The results demonstrate that our algorithm successfully identified an optimal startup sequence using only seven measured sequences. It achieves a remarkable 42% reduction in the maximum strain cycle amplitude compared to the standard startup sequence. This study paves the way for more efficient HGU startup optimization, potentially extending their operational lifespans.


[159] 2412.04100

Missing Melodies: AI Music Generation and its "Nearly" Complete Omission of the Global South

Recent advances in generative AI have sparked renewed interest and expanded possibilities for music generation. However, the performance and versatility of these systems across musical genres are heavily influenced by the availability of training data. We conducted an extensive analysis of over one million hours of audio datasets used in AI music generation research and manually reviewed more than 200 papers from eleven prominent AI and music conferences and organizations (AAAI, ACM, EUSIPCO, EURASIP, ICASSP, ICML, IJCAI, ISMIR, NeurIPS, NIME, SMC) to identify a critical gap in the fair representation and inclusion of the musical genres of the Global South in AI research. Our findings reveal a stark imbalance: approximately 86% of the total dataset hours and over 93% of researchers focus primarily on music from the Global North. While around 40% of these datasets include some form of non-Western music, genres from the Global South account for only 14.6% of the data. Furthermore, approximately 51% of the papers surveyed concentrate on symbolic music generation, a method that often fails to capture the cultural nuances inherent in music from regions such as South Asia, the Middle East, and Africa. As AI increasingly shapes the creation and dissemination of music, the significant underrepresentation of music genres in datasets and research presents a serious threat to global musical diversity. We also propose some important steps to mitigate these risks and foster a more inclusive future for AI-driven music generation.


[160] 2412.06602

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey

Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides the first comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit this https URL for a comprehensive paper list and updates.


[161] 2502.12984

On Erlang mixture approximations for differential equations with distributed time delays

In this paper, we propose a general approach for approximate simulation and analysis of delay differential equations (DDEs) with distributed time delays based on methods for ordinary differential equations (ODEs). The key innovation is that we 1) approximate the kernel by the probability density function of an Erlang mixture and 2) use the linear chain trick to transform the approximate DDEs to ODEs. Furthermore, we prove that an approximation with infinitely many terms converges for continuous and bounded kernels and for specific choices of the coefficients. We show that the approximate ODEs can be used to assess the stability of the steady states of the original DDEs and that the solution to the ODEs converges if the kernel is also exponentially bounded. Additionally, we propose an approach based on bisection and least-squares estimation for determining optimal parameter values in the approximation. Finally, we present numerical examples that demonstrate the accuracy and convergence rate obtained with the optimal parameters and the efficacy of the proposed approach for bifurcation analysis and Monte Carlo simulation. The numerical examples involve a modified logistic equation, chemotherapy-induced myelosuppression, and a point reactor kinetics model of a molten salt nuclear fission reactor.
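The linear chain trick named above can be checked numerically on a single Erlang kernel (the mixture case sums several such chains). Under illustrative parameter choices, the distributed delay term $z(t)=\int_0^t g(s)\,x(t-s)\,ds$ with an Erlang$(k,b)$ kernel $g$ equals the last state of a chain of $k$ linear ODEs:

```python
import math
import numpy as np

k, b = 3, 2.0   # Erlang shape and rate (illustrative choices)

def erlang_pdf(s):
    return b**k * s**(k - 1) * np.exp(-b * s) / math.factorial(k - 1)

# Linear chain trick: z(t) = int_0^t g(s) x(t-s) ds equals y_k(t), where
#   y_1' = b (x - y_1),   y_i' = b (y_{i-1} - y_i),  i = 2..k
x = np.sin          # input signal x(t)
dt, T = 1e-3, 10.0
y = np.zeros(k)
for n in range(int(T / dt)):
    inp = np.concatenate([[x(n * dt)], y[:-1]])
    y = y + dt * b * (inp - y)   # forward Euler step of the ODE chain

# Direct quadrature of the convolution integral for comparison
s = np.linspace(0.0, T, 20001)
f = erlang_pdf(s) * np.sin(T - s)
direct = np.sum((f[1:] + f[:-1]) * np.diff(s)) / 2.0
print(y[-1], direct)   # the two values agree closely
```

This is exactly the transformation that turns the approximate DDE into a system of ODEs amenable to standard stability and bifurcation analysis.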


[162] 2502.20583

LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in reduced dimensionality. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto frontier of accuracy and efficiency. The code of LiteASR is available at this https URL.
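The core compression step described above, replacing one dense linear map with a chain of two thin ones chosen from the PCA of calibration activations, can be sketched on synthetic data (not Whisper weights or activations):

```python
import numpy as np

rng = np.random.default_rng(3)
d_in, d_out, n_calib, rank = 256, 256, 1000, 32

W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)   # dense layer weight
# Synthetic calibration activations constructed to have low-rank structure,
# mimicking the strong low-rank property observed in ASR encoder activations
basis = rng.normal(size=(rank, d_in))
X = rng.normal(size=(n_calib, rank)) @ basis

# PCA of the calibration activations: top-r principal directions Vr
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[:rank].T                  # shape (d_in, rank)

# Replace y = x @ W.T with two thin matmuls: y ~= (x @ Vr) @ (W @ Vr).T
W1, W2 = Vr, W @ Vr               # (d_in, rank) and (d_out, rank)
Y_full = X @ W.T
Y_low = (X @ W1) @ W2.T

rel_err = np.linalg.norm(Y_low - Y_full) / np.linalg.norm(Y_full)
print(rel_err)   # tiny, since the activations lie in a rank-32 subspace
```

With real activations the subspace is only approximately low-rank, so the rank becomes an accuracy/efficiency knob; the two thin matmuls cost $O(r(d_{in}+d_{out}))$ instead of $O(d_{in} d_{out})$ per token.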


[163] 2503.00156

Neural Posterior Estimation for Cataloging Astronomical Images with Spatially Varying Backgrounds and Point Spread Functions

Neural posterior estimation (NPE), a type of amortized variational inference, is a computationally efficient means of constructing probabilistic catalogs of light sources from astronomical images. To date, NPE has not been used to perform inference in models with spatially varying covariates. However, ground-based astronomical images have spatially varying sky backgrounds and point spread functions (PSFs), and accounting for this variation is essential for constructing accurate catalogs of imaged light sources. In this work, we introduce a method of performing NPE with spatially varying backgrounds and PSFs. In this method, we generate synthetic catalogs and semi-synthetic images for these catalogs using randomly sampled PSF and background estimates from existing surveys. Using this data, we train a neural network, which takes an astronomical image and representations of its background and PSF as input, to output a probabilistic catalog. Our experiments with Sloan Digital Sky Survey data demonstrate the effectiveness of NPE in the presence of spatially varying backgrounds and PSFs for light source detection, star/galaxy separation, and flux measurement.


[164] 2504.02913

On Word-of-Mouth and Private-Prior Sequential Social Learning

Social learning constitutes a fundamental framework for studying interactions among rational agents who observe each other's actions but lack direct access to individual beliefs. This paper investigates a specific social learning paradigm known as Word-of-Mouth (WoM), where a series of agents seeks to estimate the state of a dynamical system. The first agent receives noisy measurements of the state, while each subsequent agent relies solely on a degraded version of her predecessor's estimate. A defining feature of WoM is that the final agent's belief is publicly broadcast and subsequently adopted by all agents, in place of their own. We analyze this setting theoretically and through numerical simulations, noting that some agents benefit from using the belief of the last agent, while others experience performance deterioration.


[165] 2504.17709

Fault Detection in New Wind Turbines with Limited Data by Generative Transfer Learning

Intelligent condition monitoring of wind turbines is essential for reducing downtimes. Machine learning models trained on wind turbine operation data are commonly used to detect anomalies and, eventually, operation faults. However, data-driven normal behavior models (NBMs) require a substantial amount of training data, as NBMs trained with scarce data may result in unreliable fault detection. To overcome this limitation, we present a novel generative deep transfer learning approach to make SCADA samples from one wind turbine lacking training data resemble SCADA data from wind turbines with representative training data. Through CycleGAN-based domain mapping, our method enables the application of an NBM trained on an existing wind turbine to a new one with severely limited data. We demonstrate our approach on field data mapping SCADA samples across seven substantially different wind turbines. Our findings show significantly improved fault detection in wind turbines with scarce data. Our method achieves the most similar anomaly scores to an NBM trained with abundant data, outperforming NBMs trained on scarce training data with improvements of +10.3% in F1-score when 1 month of training data is available and +16.8% when 2 weeks are available. The domain mapping approach outperforms conventional fine-tuning at all considered degrees of data scarcity, ranging from 1 to 8 weeks of training data. The proposed technique enables earlier and more reliable fault detection in newly installed wind farms, demonstrating a novel and promising research direction to improve anomaly detection when faced with training data scarcity.
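The cycle-consistency idea at the heart of CycleGAN-based domain mapping can be sketched with linear placeholder "generators" (purely illustrative; the paper's generators are neural networks trained adversarially alongside discriminators):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 4-feature "SCADA" samples from a scarce-data and a data-rich turbine.
x_src = rng.normal(0.0, 1.0, (64, 4))
x_tgt = rng.normal(1.5, 0.5, (64, 4))

# Linear stand-ins for the two generators; contrived so that the cycle
# G followed by F is exactly the identity (showing what the loss measures).
G = np.eye(4) * 0.5   # "source -> target" generator
F = np.eye(4) * 2.0   # "target -> source" generator

def cycle_loss(x, fwd, bwd):
    """L1 cycle-consistency: mapping x forward then back should recover x."""
    return np.abs(x @ fwd @ bwd - x).mean()

loss = cycle_loss(x_src, G, F) + cycle_loss(x_tgt, F, G)
```

In training, this cycle term is added to the adversarial losses so that mapped SCADA samples look like the data-rich turbine's distribution while remaining invertible back to the original turbine.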


[166] 2505.00210

Generative Machine Learning in Adaptive Control of Dynamic Manufacturing Processes: A Review

Dynamic manufacturing processes exhibit complex characteristics defined by time-varying parameters, nonlinear behaviors, and uncertainties. These characteristics require sophisticated in-situ monitoring techniques utilizing multimodal sensor data and adaptive control systems that can respond to real-time feedback while maintaining product quality. Recently, generative machine learning (ML) has emerged as a powerful tool for modeling complex distributions and generating synthetic data while handling these manufacturing uncertainties. However, adopting these generative technologies in dynamic manufacturing systems lacks a functional control-oriented perspective to translate their probabilistic understanding into actionable process controls while respecting constraints. This review presents a functional classification of Prediction-Based, Direct Policy, Quality Inference, and Knowledge-Integrated approaches, offering a perspective for understanding existing ML-enhanced control systems and incorporating generative ML. The analysis of generative ML architectures within this framework demonstrates control-relevant properties and potential to extend current ML-enhanced approaches where conventional methods prove insufficient. We show generative ML's potential for manufacturing control through decision-making applications, process guidance, simulation, and digital twins, while identifying critical research gaps: separation between generation and control functions, insufficient physical understanding of manufacturing phenomena, and challenges adapting models from other domains. To address these challenges, we propose future research directions aimed at developing integrated frameworks that combine generative ML and control technologies to address the dynamic complexities of modern manufacturing systems.


[167] 2505.01263

FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing

Movie Dubbing aims to convert scripts into speeches that align with the given movie clip in both temporal and emotional aspects while preserving the vocal timbre of a given brief reference audio. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. To address these issues, we propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber, which achieves high-quality audio-visual sync and pronunciation by incorporating a large speech language model and dual contrastive aligning, while achieving better acoustic quality than previous works via the proposed voice-enhanced flow matching. First, we introduce Qwen2.5 as the LLM backbone to learn the in-context sequence from movie scripts and reference audio. Then, the proposed semantic-aware learning focuses on capturing LLM semantic knowledge at the phoneme level. Next, dual contrastive aligning (DCA) boosts mutual alignment with lip movement, reducing ambiguities where similar phonemes might be confused. Finally, the proposed Flow-based Voice Enhancing (FVE) improves acoustic quality in two aspects: it introduces an LLM-based acoustics flow matching guidance to strengthen clarity, and uses an affine style prior to enhance speaker identity when recovering noise into mel-spectrograms via gradient vector field prediction. Extensive experiments demonstrate that our method outperforms several state-of-the-art methods on two primary benchmarks.


[168] 2505.07996

An Ultra-Sub-Wavelength Microwave Polarization Switching Antenna for Covert Communication Implemented With Directed Surface Acoustic Waves in an Artificial Multiferroic Magnonic Crystal

Polarization switches are of great technological interest because of their many applications in long distance electromagnetic communication (e.g., polarization division multiplexing). Binary bits can be encoded in the two orthogonal polarizations and transmitted from point to point. Polarization switches, however, are usually much larger than the wavelength of the electromagnetic wave that they transmit. Consequently, most research in this area has focused on the optical regime where the wavelength is relatively short (~1 micron), so that the switch being much larger than the wavelength is not too inconvenient. However, this changes in the microwave regime where the wavelength is much larger (typically > 1 cm). That makes a microwave ultra-sub-wavelength polarization switch very attractive. Here, for the first time to the authors' knowledge, such a switch made of an array of magnetostrictive nanomagnets (~100 nm lateral dimension, ~5 nm thickness) deposited on a piezoelectric substrate to make an "artificial multiferroic magnonic crystal (AMMC)" is reported. A surface acoustic wave (SAW) launched in the substrate with suitable electrodes excites confined spin waves in the nanomagnets via phonon-magnon coupling, which then radiate electromagnetic waves in space via magnon-photon coupling. In some particular direction(s), determined by the AMMC parameters, the polarization of the beam at a given frequency can be rotated through ~90 degrees by switching the direction of SAW propagation in the piezoelectric substrate between two mutually orthogonal directions via activation of two different pairs of SAW launching electrodes. By aligning the transmitter and the receiver along such a direction (known only to authorized users), one can communicate covertly from point to point, without the need for encryption or cryptography.


[169] 2506.05140

AudioLens: A Closer Look at Auditory Attribute Perception of Large Audio-Language Models

Understanding the internal mechanisms of large audio-language models (LALMs) is crucial for interpreting their behavior and improving performance. This work presents the first in-depth analysis of how LALMs internally perceive and recognize auditory attributes. By applying vocabulary projection on three state-of-the-art LALMs, we track how attribute information evolves across layers and token positions. We find that attribute information generally decreases with layer depth when recognition fails, and that resolving attributes at earlier layers correlates with better accuracy. Moreover, LALMs heavily rely on querying auditory inputs for predicting attributes instead of aggregating necessary information in hidden states at attribute-mentioning positions. Based on our findings, we demonstrate a method to enhance LALMs. Our results offer insights into auditory attribute processing, paving the way for future improvements.
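Vocabulary projection (often called the logit lens) can be sketched as follows: the hidden state at each layer is projected through the unembedding matrix, and the rank of the attribute token among all vocabulary logits is tracked across depth. Dimensions and weights below are toy random values, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, vocab, n_layers = 8, 50, 4
W_U = rng.normal(size=(d_model, vocab))        # unembedding / output head
hidden = rng.normal(size=(n_layers, d_model))  # hidden state at one position, per layer
attr_token = 7                                 # vocabulary id of the attribute word

def token_rank(h, W_U, tok):
    """Rank of `tok` among all vocabulary logits (0 = top-1 prediction)."""
    logits = h @ W_U
    return int((logits > logits[tok]).sum())

ranks = [token_rank(hidden[layer], W_U, attr_token) for layer in range(n_layers)]
# A rank that falls toward 0 in earlier layers indicates the attribute is
# being resolved early -- the pattern the paper correlates with accuracy.
```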


[170] 2506.06888

Automatic Speech Recognition of African American English: Lexical and Contextual Effects

Automatic Speech Recognition (ASR) models often struggle with the phonetic, phonological, and morphosyntactic features found in African American English (AAE). This study focuses on two key AAE variables: Consonant Cluster Reduction (CCR) and ING-reduction. It examines whether the presence of CCR and ING-reduction increases ASR misrecognition. Subsequently, it investigates whether end-to-end ASR systems without an external Language Model (LM) are more influenced by lexical neighborhood effect and less by contextual predictability compared to systems with an LM. The Corpus of Regional African American Language (CORAAL) was transcribed using wav2vec 2.0 with and without an LM. CCR and ING-reduction were detected using the Montreal Forced Aligner (MFA) with pronunciation expansion. The analysis reveals a small but significant effect of CCR and ING on Word Error Rate (WER) and indicates a stronger presence of lexical neighborhood effect in ASR systems without LMs.


[171] 2506.09375

CoLMbo: Speaker Language Model for Descriptive Profiling

Speaker recognition systems are often limited to classification tasks and struggle to generate detailed speaker characteristics or provide context-rich descriptions. These models primarily extract embeddings for speaker identification but fail to capture demographic attributes such as dialect, gender, and age in a structured manner. This paper introduces CoLMbo, a Speaker Language Model (SLM) that addresses these limitations by integrating a speaker encoder with prompt-based conditioning. This allows for the creation of detailed captions based on speaker embeddings. CoLMbo utilizes user-defined prompts to adapt dynamically to new speaker characteristics and provides customized descriptions, including regional dialect variations and age-related traits. This innovative approach not only enhances traditional speaker profiling but also excels in zero-shot scenarios across diverse datasets, marking a significant advancement in the field of speaker recognition.


[172] 2506.11281

Constrained Diffusion Models for Synthesizing Representative Power Flow Datasets

High-quality power flow datasets are essential for training machine learning models in power systems. However, security and privacy concerns restrict access to real-world data, making statistically accurate and physically consistent synthetic datasets a viable alternative. We develop a diffusion model for generating synthetic power flow datasets from real-world power grids that both replicate the statistical properties of the real-world data and ensure AC power flow feasibility. To enforce the constraints, we incorporate gradient guidance based on the power flow constraints to steer diffusion sampling toward feasible samples. For computational efficiency, we further leverage insights from the fast decoupled power flow method and propose a variable decoupling strategy for the training and sampling of the diffusion model. These solutions lead to a physics-informed diffusion model, generating power flow datasets that outperform those from the standard diffusion in terms of feasibility and statistical similarity, as shown in experiments across IEEE benchmark systems.
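The gradient-guidance step can be sketched with a trivial stand-in constraint (a sum-to-one residual in place of the AC power-flow equations): at each sampling step, the candidate sample is nudged down the gradient of the squared constraint residual. Everything below is a toy illustration of the guidance mechanism, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(4)

def constraint_residual(x):
    """Stand-in for the AC power-flow residual (here: components must sum to 1)."""
    return x.sum() - 1.0

def guided_denoise_step(x, eta=0.1):
    """One guidance step: descend the gradient of 0.5 * residual^2.
    For this residual, the gradient w.r.t. x is residual * ones(len(x))."""
    r = constraint_residual(x)
    return x - eta * r * np.ones_like(x)

x = rng.normal(size=5)        # a "denoised" sample from the diffusion model
for _ in range(200):          # guidance would be applied at each sampling step
    x = guided_denoise_step(x)
```

In the actual method this gradient is taken through the power-flow equations, steering each reverse-diffusion step toward AC-feasible samples.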


[173] 2507.06407

GloBIAS: strengthening the foundations of BioImage Analysis

There is a global need for BioImage Analysis (BIA) as advances in life sciences increasingly rely on cutting-edge imaging systems that have dramatically expanded the complexity and dimensionality of biological images. Turning these data into scientific discoveries requires people with effective data management skills and knowledge of state-of-the-art image processing and data analysis, in other words, BioImage Analysts. The Global BioImage Analysts' Society (GloBIAS) aims to enhance the profile of BioImage Analysts as a key role in science and research. Its vision encompasses fostering a global network, democratising access to BIA by providing educational resources tailored to various proficiency levels and disciplines, while also establishing guidelines for BIA courses. By collaboratively shaping the education of BioImage Analysts, GloBIAS aims to unlock the full potential of BIA in advancing life science research and to consolidate BIA as a career path. To better understand the needs and geographical representation of the BIA community, a worldwide survey was conducted, collecting 291 responses from people across all career stages and continents. This work discusses how GloBIAS aims to address community-identified shortcomings in work environment, funding, and scientific activities. The survey underscores a strong interest from the BIA community in activities proposed by GloBIAS and their interest in actively contributing. With 72% of respondents willing to pay for membership, the community's enthusiasm for both online and in-person events is set to drive the growth and sustainability of GloBIAS.
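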


[174] 2507.07953

Incremental Collision Laws Based on the Bouc-Wen Model: Improved Collision Models and Further Results

In the article titled "The Bouc-Wen Model for Binary Direct Collinear Collisions of Convex Viscoplastic Bodies" and published in the Journal of Computational and Nonlinear Dynamics (Volume 20, Issue 6, June 2025), the authors studied mathematical models of binary direct collinear collisions of convex viscoplastic bodies that employed two incremental collision laws based on the Bouc-Wen differential model of hysteresis. It was shown that the models possess favorable analytical properties, and several model parameter identification studies were conducted, demonstrating that the models can accurately capture the nature of a variety of collision phenomena. In this article, the aforementioned models are augmented by modeling the effects of external forces as time-dependent inputs that belong to a certain function space. Furthermore, the range of the parameters under which the models possess favorable analytical properties is extended to several corner cases that were not considered in the prior publication. Finally, the previously conducted model parameter identification studies are extended, and an additional model parameter identification study is provided in an attempt to validate the ability of the augmented models to represent the effects of external forces.
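For readers unfamiliar with the model, the standard Bouc-Wen hysteresis law on which such incremental collision laws are built reads as follows (notation may differ from the cited article's):

```latex
% z: hysteretic variable, x: relative displacement,
% A, beta, gamma: shape parameters, n >= 1: smoothness exponent
\dot{z}(t) = A\,\dot{x}(t)
           - \beta\,\lvert\dot{x}(t)\rvert\,\lvert z(t)\rvert^{\,n-1} z(t)
           - \gamma\,\dot{x}(t)\,\lvert z(t)\rvert^{\,n}
```

The incremental collision laws couple a variable of this kind to the contact force, which is what gives the models their hysteretic (viscoplastic) character.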


[175] 2507.16190

LABNet: A Lightweight Attentive Beamforming Network for Ad-hoc Multichannel Microphone Invariant Real-Time Speech Enhancement

Multichannel speech enhancement (SE) aims to restore clean speech from noisy measurements by leveraging spatiotemporal signal features. In ad-hoc array conditions, microphone invariance (MI) requires systems to handle different microphone numbers and array geometries. From a practical perspective, multichannel recordings inevitably increase the computational burden for edge-device applications, highlighting the necessity of lightweight and efficient deployments. In this work, we propose a lightweight attentive beamforming network (LABNet) to integrate MI in a low-complexity real-time SE system. We design a three-stage framework for efficient intra-channel modeling and inter-channel interaction. A cross-channel attention module is developed to aggregate features from each channel selectively. Experimental results demonstrate that our LABNet achieves impressive performance with ultra-light resource overhead while maintaining the MI, indicating great potential for ad-hoc array processing.
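Selective aggregation over a variable number of channels — the property that gives microphone invariance — can be sketched with a single attention query from a reference channel. Toy dimensions and random weights; this is not the LABNet architecture itself:

```python
import numpy as np

rng = np.random.default_rng(5)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_channel_attention(feats, w_q, w_k):
    """Aggregate per-channel features into one representation.
    feats: (n_channels, d) -- works for ANY n_channels, which is the point."""
    q = feats[0] @ w_q                       # reference channel as query
    k = feats @ w_k                          # keys from every channel
    attn = softmax(k @ q / np.sqrt(len(q)))  # one weight per channel
    return attn @ feats                      # attention-weighted channel mixture

d = 8
w_q, w_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
for n_mics in (2, 4, 7):                     # same weights, varying array size
    out = cross_channel_attention(rng.normal(size=(n_mics, d)), w_q, w_k)
    assert out.shape == (d,)
```

Because the attention weights are computed per channel, the same parameters apply unchanged to any microphone count or geometry.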


[176] 2507.21642

Whilter: A Whisper-based Data Filter for "In-the-Wild" Speech Corpora Using Utterance-level Multi-Task Classification

Large-scale in-the-wild speech datasets have become more prevalent in recent years due to increased interest in models that can learn useful features from unlabelled data for tasks such as speech recognition or synthesis. These datasets often contain undesirable features, such as multiple speakers, non-target languages, and music, which may impact model learning. The Whilter model is proposed as a multitask solution to identify these undesirable samples. Whilter uses a Whisper encoder with an attention-based classifier to solve five diverse classification problems at once. In addition, an annotated dataset is published for a subset of two popular in-the-wild corpora. Whilter achieves F1 scores above 85% and equal error rates of 6.5% to 7.8% for three of five subtasks, outperforming a state-of-the-art BEATs classifier on speech-specific classes, with a notable decrease in processing time compared to a combination of single-task alternatives.


[177] 2507.21886

Efficient Pain Recognition via Respiration Signals: A Single Cross-Attention Transformer Multi-Window Fusion Pipeline

Pain is a complex condition that affects a large portion of the population. Accurate and consistent evaluation is essential for individuals experiencing pain and supports the development of effective and advanced management strategies. Automatic pain assessment systems provide continuous monitoring, aid clinical decision-making, and aim to reduce distress while preventing functional decline. This study has been submitted to the Second Multimodal Sensing Grand Challenge for Next-Gen Pain Assessment (AI4PAIN). The proposed method introduces a pipeline that employs respiration as the input signal and integrates a highly efficient cross-attention transformer with a multi-windowing strategy. Extensive experiments demonstrate that respiration serves as a valuable physiological modality for pain assessment. Furthermore, results show that compact and efficient models, when properly optimized, can deliver strong performance, often surpassing larger counterparts. The proposed multi-window strategy effectively captures short-term and long-term features, along with global characteristics, enhancing the model's representational capacity.


[178] 2508.02148

Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation

Large-scale models (LSMs) can be an effective framework for semantic representation and understanding, thereby providing a suitable tool for designing semantic communication (SC) systems. However, their direct deployment is often hindered by high computational complexity and resource requirements. In this paper, a novel robust knowledge distillation based semantic communication (RKD-SC) framework is proposed to enable efficient and channel-noise-robust LSM-powered SC. The framework addresses two key challenges: determining optimal compact model architectures and effectively transferring knowledge while maintaining robustness against channel noise. First, a knowledge distillation-based lightweight differentiable architecture search (KDL-DARTS) algorithm is proposed. This algorithm integrates knowledge distillation loss and a complexity penalty into the neural architecture search process to identify high-performance, lightweight semantic encoder architectures. Second, a novel two-stage robust knowledge distillation (RKD) algorithm is developed to transfer semantic capabilities from an LSM (teacher) to a compact encoder (student) and subsequently enhance system robustness. To further improve resilience to channel impairments, a channel-aware transformer (CAT) block is introduced as the channel codec, trained under diverse channel conditions with variable-length outputs. Extensive simulations on image classification tasks demonstrate that the RKD-SC framework significantly reduces model parameters while preserving a high degree of the teacher model's performance and exhibiting superior robustness compared to existing methods.
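A generic knowledge-distillation objective of the kind used in such frameworks combines a hard-label cross-entropy term with a temperature-softened KL term between teacher and student. This is the standard formulation, shown for a single example; it is not necessarily the paper's exact loss:

```python
import numpy as np

def softmax(z, T=1.0):
    e = np.exp((z - z.max()) / T)
    return e / e.sum()

def kd_loss(student_logits, teacher_logits, label, T=2.0, alpha=0.5):
    """Distillation objective: hard-label CE plus temperature-softened KL.
    The T*T factor keeps the soft term's gradient scale comparable to CE."""
    p_s, p_t = softmax(student_logits, T), softmax(teacher_logits, T)
    ce = -np.log(softmax(student_logits)[label])            # hard-label term
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s))) * T * T  # soft-label term
    return alpha * ce + (1 - alpha) * kl
```

In KDL-DARTS, a loss of this kind is combined with a complexity penalty inside the architecture search, so candidates are scored jointly on distillation quality and model size.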


[179] 2508.13521

AI-Augmented Photon-Trapping Spectrometer-on-a-Chip on Silicon Platform with Extended Near-Infrared Sensitivity

We present a compact, noise-resilient reconstructive spectrometer-on-a-chip that achieves high-resolution hyperspectral imaging across an extended near-infrared (NIR) range up to 1100nm. The device integrates monolithically fabricated silicon photodiodes enhanced with photon-trapping surface textures (PTST), enabling improved responsivity in the low-absorption NIR regime. Leveraging a fully connected neural network, we demonstrate accurate spectral reconstruction from only 16 uniquely engineered detectors, achieving <0.05 RMSE and 8nm resolution over a wide spectral range of 640nm to 1100nm. Our system outperforms conventional spectrometers, maintaining signal-to-noise ratio above 30dB even with 40dB of added detector noise, and extending functionality to longer wavelengths up to 1100nm, whereas traditional spectrometers fail to perform beyond 950nm due to poor detector efficiency and noise performance. With a footprint of 0.4 mm^2, dynamic range of 50dB, ultrafast time response (57ps), and high photodiode gain (>7000), this AI-augmented silicon spectrometer is well-suited for portable, real-time, and low-light applications in biomedical imaging, environmental monitoring, and remote sensing. The results establish a pathway toward fully integrated, high-performance hyperspectral sensing in a CMOS-compatible platform.
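Reconstructive spectrometry of this kind solves an inverse problem: the 16 detector readings y = R s, with R the per-detector responsivity matrix, are inverted to recover the spectrum s. A ridge-regularized least-squares solver can stand in for the neural reconstructor in a toy sketch (responsivities and spectrum below are made up):

```python
import numpy as np

rng = np.random.default_rng(6)
n_det, n_wl = 16, 100                       # 16 detectors, 100 spectral bins
R = rng.uniform(0.0, 1.0, (n_det, n_wl))    # distinct per-detector responsivities

s_true = np.exp(-0.5 * ((np.arange(n_wl) - 40) / 5.0) ** 2)  # a test spectrum
y = R @ s_true + rng.normal(0.0, 1e-3, n_det)                # noisy readings

# Ridge-regularized least squares as a stand-in for the neural network:
lam = 1e-3
s_hat = np.linalg.solve(R.T @ R + lam * np.eye(n_wl), R.T @ y)
```

The problem is heavily underdetermined (16 measurements, 100 unknowns), which is why the paper trains a neural network on the engineered responsivities rather than relying on a generic linear inversion like this one.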


[180] 2508.14106

High-Throughput Low-Cost Segmentation of Brightfield Microscopy Live Cell Images

Live cell culture is crucial in biomedical studies for analyzing cell properties and dynamics in vitro. This study focuses on segmenting unstained live cells imaged with bright-field microscopy. While many segmentation approaches exist for microscopic images, none consistently address the challenges of bright-field live-cell imaging with high throughput, where temporal phenotype changes, low contrast, noise, and motion-induced blur from cellular movement remain major obstacles. We developed a low-cost CNN-based pipeline incorporating comparative analysis of frozen encoders within a unified U-Net architecture enhanced with attention mechanisms, instance-aware systems, adaptive loss functions, hard instance retraining, dynamic learning rates, progressive mechanisms to mitigate overfitting, and an ensemble technique. The model was validated on a public dataset featuring diverse live cell variants, showing consistent competitiveness with state-of-the-art methods, achieving 93% test accuracy and an average F1-score of 89% (std. 0.07) on low-contrast, noisy, and blurry images. Notably, the model was trained primarily on bright-field images with limited exposure to phase-contrast microscopy (<20%), yet it generalized effectively to the phase-contrast LIVECell dataset, demonstrating modality robustness and strong performance. This highlights its potential for real-world laboratory deployment across imaging conditions. The model requires minimal compute power and is adaptable using basic deep learning setups such as Google Colab, making it practical for training on other cell variants. Our pipeline outperforms existing methods in robustness and precision for bright-field microscopy segmentation. The code and dataset are available for reproducibility.