Channel estimation is fundamental to wireless communications, yet it becomes increasingly challenging in massive multiple-input multiple-output (MIMO) systems where base stations employ hundreds of antennas. Traditional least-squares methods require prohibitive pilot overhead that scales with antenna count, while sparse estimation methods depend on precise channel models that may not always be practical. This paper proposes a model-free approach combining deep autoencoders and LSTM networks. The method first learns low-dimensional channel representations preserving temporal correlation through augmenting a channel charting-inspired loss function, then tracks these features to recover full channel information from limited pilots. Simulation results using ray-tracing datasets show that the proposed approach achieves up to 9 dB improvement in normalized mean square error compared to the least-squares methods under ill-conditioned scenarios, while maintaining scalability across MIMO configurations.
In a world increasingly powered by renewables and aiming for greenhouse gas-neutral industrial production, the future competitiveness of todays top manufacturing countries is questioned. This study applies detailed energy system modeling to quantify the Renewable Pull, an incentive for industry relocation exerted by countries with favorable renewable conditions. Results reveal that the Renewable Pull is not a cross-industrial phenomenon but strongly depends on the relationship between energy costs and transport costs. The intensity of the Renewable Pull varies, with China, India, and Japan facing a significantly stronger effect than Germany and the United States. Incorporating national capital cost assumptions proves critical, reducing Germanys Renewable Pull by a factor of six and positioning it as the second least affected top manufacturing country after Saudi Arabia. Using Germany as a case study, the analysis moreover illustrates that targeted import strategies, especially within the EU, can nearly eliminate the Renewable Pull, offering policymakers clear options for risk mitigation.
Everyday speech conveys far more than words, it reflects who we are, how we feel, and the circumstances surrounding our interactions. Yet, most existing speech datasets are acted, limited in scale, and fail to capture the expressive richness of real-life communication. With the rise of large neural networks, several large-scale speech corpora have emerged and been widely adopted across various speech processing tasks. However, the field of voice conversion (VC) still lacks large-scale, expressive, and real-life speech resources suitable for modeling natural prosody and emotion. To fill this gap, we release NaturalVoices (NV), the first large-scale spontaneous podcast dataset specifically designed for emotion-aware voice conversion. It comprises 5,049 hours of spontaneous podcast recordings with automatic annotations for emotion (categorical and attribute-based), speech quality, transcripts, speaker identity, and sound events. The dataset captures expressive emotional variation across thousands of speakers, diverse topics, and natural speaking styles. We also provide an open-source pipeline with modular annotation tools and flexible filtering, enabling researchers to construct customized subsets for a wide range of VC tasks. Experiments demonstrate that NaturalVoices supports the development of robust and generalizable VC models capable of producing natural, expressive speech, while revealing limitations of current architectures when applied to large-scale spontaneous data. These results suggest that NaturalVoices is both a valuable resource and a challenging benchmark for advancing the field of voice conversion. Dataset is available at: this https URL
Network digital twin (NDT) models are virtual models that replicate the behavior of physical communication networks and are considered a key technology component to enable novel features and capabilities in future 6G networks. In this work, we focus on NDTs that model the communication quality properties of a multi-cell, dynamically changing wireless network over a workspace populated with multiple moving users. We propose an NDT modeled as a hybrid system, where each mode corresponds to a different base station and comprises sub-modes that correspond to areas of the workspace with similar network characteristics. The proposed hybrid NDT is identified and continuously improved through an annealing optimization-based learning algorithm, driven by online data measurements collected by the users. The advantages of the proposed hybrid NDT are studied with respect to memory and computational efficiency, data consumption, and the ability to timely adapt to network changes. Finally, we validate the proposed methodology on real experimental data collected from a two-cell 5G testbed.
In low-carbon grids, system flexibility can be enhanced through mechanisms such as Demand Response (DR), enabling the efficient utilization of renewable energy. However, as Synchronous Generators (SGs) are being replaced with renewable energy characterized by Inverter-Based Resources (IBR), system stability is severely affected. Due to the limited overload capability of IBR, their Short-Circuit Current (SCC) contribution is much smaller than that of SGs, which may result in protection devices failing to trip during faults. Consequently, the remaining SGs play a key role in offering sufficient SCC volumes. Given that the commitment of SGs is closely related to system load, DR can thus indirectly affect their SCC provision, a relationship that has not been investigated. Therefore, this paper incorporates both DR and SCC constraints into a unit commitment model and conducts studies on an IEEE 30-bus system. The results show that although DR can reduce social costs by lowering power demand, it may also lead to inadequate SCC levels. Nevertheless, the cost increases by only 0.3% when DR is combined with SCC constraints, indicating that DR can actually help achieve a stable system in a cost-effective manner.
The escalating adoption of electric vehicles (EVs) and the growing demand for charging solutions are driving a surge in EV charger installations in distribution networks. However, this rising EV load strains the distribution grid, causing severe voltage drops, particularly at feeder extremities. This study proposes a proactive voltage management (PVM) framework that can integrate Monte Carlo-based simulations of varying EV charging loads to (i) identify potential voltage violations through a voltage violation analysis (VVA) model, and (ii) then mitigate those violations with optimally-invested battery energy storage systems (BESS) through an optimal expansion planning (OEP) model. A novel spatio-temporal adaptive targeting (STAT) strategy is proposed to alleviate the computational complexity of the OEP model by defining a targeted OEP (T-OEP) model, solved by applying the OEP model to (i) a reduced set of representative critical time periods and (ii) candidate BESS installation nodes. The efficacy and scalability of the proposed approach are validated on 33-bus, 69-bus, and a large-scale 240-bus system. Results demonstrate that the strategic sizing and placement of BESS not only effectively mitigate voltage violations but also yield substantial cost savings on electricity purchases under time-of-use tariffs. This research offers a cost-effective and scalable solution for integrating high penetrations of EVs, providing crucial insights for future distribution network planning.
This paper investigates using large language models (LLMs) to generate control actions directly, without requiring control-engineering expertise or hand-tuned algorithms. We implement several variants: (i) prompt-only, (ii) tool-assisted with access to historical data, and (iii) prediction-assisted using learned or simple models to score candidate actions. We compare them on tracking accuracy and actuation effort, with and without a prompt that requests lower actuator usage. Results show prompt-only LLMs already produce viable control, while tool-augmented versions adapt better to changing objectives but can be more sensitive to constraints, supporting LLM-in-the-loop control for evolving cyber-physical systems today and operator and human inputs.
Hybrid dynamical systems are viewed as the most complicated systems with continuous and event-based behaviors. Since traditional controllers cannot handle these systems, some newly-developed controllers have been published in recent decades to deal with them. This paper presents a novel implementable constrained final-state controller based on partitioning the system's state-space, computational simulations, and graph theory. Experimental results and a comparison with Model Predictive Controller on the three tank benchmark and swing-up control of a pendulum show the effectiveness of the proposed Computational Hybrid Controller(CHC).
Accurate segmentation of pediatric brain tumors in multi-parametric magnetic resonance imaging (mpMRI) is critical for diagnosis, treatment planning, and monitoring, yet faces unique challenges due to limited data, high anatomical variability, and heterogeneous imaging across institutions. In this work, we present an advanced nnU-Net framework tailored for BraTS 2025 Task-6 (PED), the largest public dataset of pre-treatment pediatric high-grade gliomas. Our contributions include: (1) a widened residual encoder with squeeze-and-excitation (SE) attention; (2) 3D depthwise separable convolutions; (3) a specificity-driven regularization term; and (4) small-scale Gaussian weight initialization. We further refine predictions with two postprocessing steps. Our models achieved first place on the Task-6 validation leaderboard, attaining lesion-wise Dice scores of 0.759 (CC), 0.967 (ED), 0.826 (ET), 0.910 (NET), 0.928 (TC) and 0.928 (WT).
Invariant extended Kalman filter (InEKF) possesses excellent trajectory-independent property and better consistency compared to conventional extended Kalman filter (EKF). However, when applied to scenarios involving both global-frame and body-frame observations, InEKF may fail to preserve its trajectory-independent property. This work introduces the concept of equivalence between error states and covariance matrices among different error-state Kalman filters, and shows that although InEKF exhibits trajectory independence, its covariance propagation is actually equivalent to EKF. A covariance transformation-based error-state Kalman filter (CT-ESKF) framework is proposed that unifies various error-state Kalman filtering algorithms. The framework gives birth to novel filtering algorithms that demonstrate improved performance in integrated navigation systems that incorporate both global and body-frame observations. Experimental results show that the EKF with covariance transformation outperforms both InEKF and original EKF in a representative INS/GNSS/Odometer integrated navigation system.
Algorithmic bias in medical imaging can perpetuate health disparities, yet its causes remain poorly understood in segmentation tasks. While fairness has been extensively studied in classification, segmentation remains underexplored despite its clinical importance. In breast cancer segmentation, models exhibit significant performance disparities against younger patients, commonly attributed to physiological differences in breast density. We audit the MAMA-MIA dataset, establishing a quantitative baseline of age-related bias in its automated labels, and reveal a critical Biased Ruler effect where systematically flawed labels for validation misrepresent a model's actual bias. However, whether this bias originates from lower-quality annotations (label bias) or from fundamentally more challenging image characteristics remains unclear. Through controlled experiments, we systematically refute hypotheses that the bias stems from label quality sensitivity or quantitative case difficulty imbalance. Balancing training data by difficulty fails to mitigate the disparity, revealing that younger patient cases are intrinsically harder to learn. We provide direct evidence that systemic bias is learned and amplified when training on biased, machine-generated labels, a critical finding for automated annotation pipelines. This work introduces a systematic framework for diagnosing algorithmic bias in medical segmentation and demonstrates that achieving fairness requires addressing qualitative distributional differences rather than merely balancing case counts.
This paper investigates the ambiguity function (AF) of communication signals carrying random data payloads, which is a fundamental metric characterizing sensing capability in ISAC systems. We first develop a unified analytical framework to evaluate the AF of communication-centric ISAC signals constructed from arbitrary orthonormal bases and independent identically distributed (i.i.d.) constellation symbols. Subsequently, we derive the discrete periodic ambiguity function (DP-AF) and provide closed-form expressions for its expected integrated sidelobe level (EISL) and average sidelobe level. Notably, we prove that the normalized EISL is invariant across all constellations and modulation bases. Finally, the theoretical findings are validated through simulations.
The rapid development of technology has led to an increase in the number of devices that rely on position, velocity, and time (PVT) information to perform their functions. As such, the Global Navigation Satellite Systems (GNSS) have been adopted as one of the most promising solutions to provide PVT. Consequently, there are renewed efforts aimed at enhancing GNSS capabilities to meet emerging use cases and their requirements. For example, GNSS is evolving to rely on low-earth-orbit satellites, shifting the focus from traditional medium-earth-orbit satellites. Unfortunately, these developments also bring forth higher risks of interference signals such as spoofers, which pose serious security threats. To address this challenge, artificial intelligence (AI)-inspired solutions are being developed to overcome the limitations of conventional mathematics-based approaches, which have proven inflexible when dealing with diverse forms of interference. In this paper, we advance this direction by proposing a meta-learning framework that enables GNSS receivers to detect various types of spoofers. Specifically, our approach exploits the radio frequency fingerprints present in the signal at both the pre-correlation and post-correlation stages of the receiver. The proposed solution has superior generalization properties compared to the state-of-the-art solutions. Numerical results demonstrate that our proposed solution significantly detects spoofers of different forms, with spoofing detection accuracies of more than 95% on multiple datasets from the Texas Spoofing Test Battery (TEXBAT) and the Oak Ridge Spoofing and Interference Test Battery (OAKBAT) repositories
The growing number of smart devices supporting bandwidth-intensive and latency-sensitive applications, such as real-time video analytics, smart sensing, and Extended Reality (XR), necessitates reliable wireless connectivity in indoor environments. Therein, accurate estimation of Radio Environment Maps (REMs) enables adaptive wireless network planning and optimization of Access Point (AP) placement. However, generating realistic REMs remains challenging due to the complexity of indoor spaces. To overcome this challenge, this paper introduces a multimodal dataset that integrates high-resolution 3D LiDAR scans with Wi-Fi Received Signal Strength Indicator (RSSI) measurements collected under 20 distinct AP configurations in a multi-room indoor environment. The dataset captures two measurement scenarios: the first without human presence in the environment, and the second with human presence. Thus, the presented dataset supports the study of dynamic environmental effects on wireless signal propagation. This resource is designed to facilitate research in data-driven wireless modeling, particularly in the context of emerging high-frequency standards such as IEEE 802.11be (Wi-Fi 7), and aims to advance the development of robust, high-capacity indoor communication systems.
Conservation agriculture features a soil surface covered with crop residues, which brings benefits of improving soil health and saving water. However, one significant challenge in conservation agriculture lies in precisely controlling the seeding depth on the soil covered with crop residues. This is constrained by the lack of ground distance information, since current distance measurement techniques, like laser, ultrasonic, or mechanical displacement sensors, are incapable of differentiating whether the distance information comes from the residue or the soil. This paper presents an image-based method to get the ground distance information for the crop-residues-covered soil. This method is performed with 3D camera and RGB camera, obtaining depth image and color image at the same time. The color image is used to distinguish the different areas of residues and soil and finally generates a mask image. The mask image is applied to the depth image so that only the soil area depth information can be used to calculate the ground distance, and residue areas can be recognized and excluded from ground distance detection. Experimentation shows that this distance measurement method is feasible for real-time implementation, and the measurement error is within plus or minus 3mm. It can be applied in conservation agriculture machinery for precision depth seeding, as well as other depth-control-demanding applications like transplant or tillage.
Low-altitude economy (LAE) is an emerging technological paradigm that enables continuous airspace coverage at multiple altitudes by providing highly reliable data connectivity for numerous low-altitude applications. However, existing networks cannot sufficiently support LAE development, as current base stations (BSs) are primarily designed for terrestrial users and lack the capability to provide continuous coverage at low altitudes. To overcome these challenges, rotatable antenna system (RAS) is introduced in LAE, enabling flexible beamforming by dynamically adjusting the boresight of directional antennas to extend low-altitude coverage and enhance the stability of data transmission. In this article, we first provide an overview of RAS-empowered LAE applications, including low-altitude communication, sensing, control, and computation. Then, we present two practical RAS deployment strategies for LAE scenarios, namely RAS-aided multi-BS and multi-unmanned aerial vehicle (UAV) cooperative coverages, as well as provide detailed discussions on their system architectures and performance benefits. Additionally, key design issues of RAS in LAE are discussed, including channel modeling and estimation, cellular access and interference cancellation, as well as RAS configuration and boresight optimization. Finally, we demonstrate the performance gains of RAS in LAE networks through experimental and simulation results.
Optimizing the topology of networks is an important challenge across engineering disciplines. In energy systems, network reconfiguration can substantially reduce losses and costs and thus support the energy transition. Unfortunately, many related optimization problems are NP hard, restricting practical applications. In this article, we address the problem of minimizing losses in radial networks, a problem that routinely arises in distribution grid operation. We show that even the computation of approximate solutions is computationally hard and propose quantum optimization as a promising alternative. We derive two quantum algorithmic primitives based on the Quantum Alternating Operator Ansatz (QAOA) that differ in the sampling of network topologies: a tailored sampling of radial topologies and simple sampling with penalty terms to suppress non-radial topologies. We show how to apply these algorithmic primitives to distribution grid reconfiguration and quantify the necessary quantum resources.
Aerosol Jet (AJ) printing is a versatile additive manufacturing technique capable of producing high-resolution interconnects on both 2D and 3D substrates. The AJ process is complex and dynamic with many hidden and unobservable states that influence the machine performance, including aerosol particle diameter, aerosol carrier density, vial level, and ink deposition in the tube and nozzle. Despite its promising potential, the widespread adoption of AJ printing is limited by inconsistencies in print quality that often stem from variability in these hidden states. To address these challenges, we develop a digital twin model of the AJ process that offers real-time insights into the machine's operations. The digital twin is built around a physics-based macro-model created through simulation and experimentation. The states and parameters of the digital model are continuously updated using probabilistic sequential estimation techniques to closely align with real-time measurements extracted from the AJ system's sensor and video data. The result is a digital model of the AJ process that continuously evolves over a physical machine's lifecycle. The digital twin enables accurate monitoring of unobservable physical characteristics, detects and predicts anomalous behavior, and forecasts the effect of control adjustments. This work presents a comprehensive end-to-end digital twin framework that integrates customized computer vision techniques, physics-based macro-modeling, and advanced probabilistic estimation methods to construct an evolving digital representation of the AJ equipment and process. While the methodologies are customized for aerosol jet printing, the process for constructing the digital twin can be applied for other advanced manufacturing techniques.
Parameter identification for electrochemical battery models has always been challenging due to the multitude of parameters involved, most of which cannot be directly measured. This paper evaluates the efficiency and optimality of three widely-used parameter identification methods for electrochemical battery models: Least Squares Method (LS), Particle Swarm Optimization (PSO), and Genetic Algorithm (GA). Therefore, a Single Particle Model (SPM) of a battery was developed and discretized. Battery parameter grouping was then performed to reduce the number of parameters required. Using a set of parameters previously identified from a real battery as a benchmark, we generated fitting and validation datasets to assess the methods' runtime and accuracy. The comparative analysis reveals that PSO outperforms the other methods in terms of accuracy and stability, making it highly effective for parameter identification when there is no prior knowledge of the battery's internal parameters. In contrast, LS is better suited for minor adjustments in parameters, particularly for aging batteries, whereas GA lags behind in both computational efficiency and optimality with respect to PSO.
Registration of optical and synthetic aperture radar (SAR) remote sensing images serves as a critical foundation for image fusion and visual navigation tasks. This task is particularly challenging because of their modal discrepancy, primarily manifested as severe nonlinear radiometric differences (NRD), geometric distortions, and noise variations. Under large geometric transformations, existing classical template-based and sparse keypoint-based strategies struggle to achieve reliable registration results for optical-SAR image pairs. To address these limitations, we propose GDROS, a geometry-guided dense registration framework leveraging global cross-modal image interactions. First, we extract cross-modal deep features from optical and SAR images through a CNN-Transformer hybrid feature extraction module, upon which a multi-scale 4D correlation volume is constructed and iteratively refined to establish pixel-wise dense correspondences. Subsequently, we implement a least squares regression (LSR) module to geometrically constrain the predicted dense optical flow field. Such geometry guidance mitigates prediction divergence by directly imposing an estimated affine transformation on the final flow predictions. Extensive experiments have been conducted on three representative datasets WHU-Opt-SAR dataset, OS dataset, and UBCv2 dataset with different spatial resolutions, demonstrating robust performance of our proposed method across different imaging resolutions. Qualitative and quantitative results show that GDROS significantly outperforms current state-of-the-art methods in all metrics. Our source code will be released at: this https URL.
We consider the problem of high-dimensional channel estimation in fast time-varying millimeter-wave MIMO systems with a hybrid architecture. By exploiting the low-rank and sparsity properties of the channel matrix, we propose a two-phase compressed sensing framework consisting of observation matrix completion and channel matrix sparse recovery, respectively. First, we formulate the observation matrix completion problem as a low-rank matrix completion (LRMC) problem and develop a robust rank-one matrix completion (R1MC) algorithm that enables the matrix and its rank to iteratively update. This approach achieves high-precision completion of the observation matrix and explicit rank estimation without prior knowledge. Second, we devise a rank-aware batch orthogonal matching pursuit (OMP) method for achieving low-latency sparse channel recovery. To handle abrupt rank changes caused by user mobility, we establish a discrete-time autoregressive (AR) model that leverages the temporal rank correlation between continuous-time instances to obtain a complete observation matrix capable of perceiving rank changes for more accurate channel estimates. Simulation results confirm the effectiveness of the proposed channel estimation frame and demonstrate that our algorithms achieve state-of-the-art performance in low-rank matrix recovery with theoretical guarantees.
Data centers play an increasingly critical role in societal digitalization, yet their rapidly growing energy demand poses significant challenges for sustainable operation. To enhance the energy efficiency of geographically distributed data centers, this paper formulates a multi-period optimization model that captures the interdependence of electricity, heat, and data flows. The optimization of such multicast flows inherently involves mixed-integer formulations and the access to proprietary or sensitive datasets, which correspondingly exacerbate computational complexity and raise data-privacy concerns. To address these challenges, an adaptive federated learning-to-optimization approach is proposed, accounting for the heterogeneity of datasets across distributed data centers. To safeguard privacy, cryptography techniques are leveraged in both the learning and optimization processes. A model acceptance criterion with convergence guarantee is developed to improve learning performance and filter out potentially contaminated data, while a verifiable double aggregation mechanism is further proposed to simultaneously ensure privacy and integrity of shared data during optimization. Theoretical analysis and numerical simulations demonstrate that the proposed approach preserves the privacy and integrity of shared data, achieves near-optimal performance, and exhibits high computational efficiency, making it suitable for large-scale data center optimization under privacy constraints.
This paper compares the impact of different conventional and emerging technologies and control strategies on frequency quality. We study, in particular, the long-term dynamic performance of grid-forming (GFM) and grid-following (GFL) inverter-based resources (IBRs) as well as conventional synchronous machines. Extensive simulations and several realistic scenarios consider both short-term and long-term aspects of frequency quality. It is shown that, while overall GFM IBRs significantly improve frequency quality, a combination of GFL IBRs providing frequency support such as wind and batteries, and synchronous condensers, might be enough to meet similar frequency quality standards. Another result of the paper is that the need for automatic generation control (AGC) becomes less clear in GFM IBR-dominated grids from a frequency quality perspective.
An autonomous vehicle can generate several terabytes of sensor data per day. A significant portion of this data consists of 3D point clouds produced by depth sensors such as LiDARs. This data must be transferred to cloud storage, where it is utilized for training machine learning models or conducting analyses, such as forensic investigations in the event of an accident. To reduce network and storage costs, this paper introduces DejaView. Although prior work uses interframe redundancies to compress data, DejaView searches for and uses redundancies on larger temporal scales (days and months) for more effective compression. We designed DejaView with the insight that the operating area of autonomous vehicles is limited and that vehicles mostly traverse the same routes daily. Consequently, the 3D data they collect daily is likely similar to the data they have captured in the past. To capture this, the core of DejaView is a diff operation that compactly represents point clouds as delta w.r.t. 3D data from the past. Using two months of LiDAR data, an end-to-end implementation of DejaView can compress point clouds by a factor of 210 at a reconstruction error of only 15 cm.
Accurately simulating rare but safety-critical driving behaviors is essential for the evaluation and certification of autonomous vehicles (AVs). However, current models often fail to reproduce realistic collision rates when calibrated on real-world data, largely due to inadequate representation of long-tailed behavioral distributions. Here, we uncover a simple yet unifying shifted power law that robustly characterizes the stochasticity of both human-driven vehicle (HV) and AV behaviors, especially in the long-tail regime. The model adopts a parsimonious analytical form with only one or two parameters, enabling efficient calibration even under data sparsity. Analyzing large-scale, micro-level trajectory data from global HV and AV datasets, the shifted power law achieves an average R2 of 0.97 and a nearly identical tail distribution, uniformly fits both frequent behaviors and rare safety-critical deviations, significantly outperforming existing Gaussian-based baselines. When integrated into an agent-based traffic simulator, it enables forward-rolling simulations that reproduce realistic crash patterns for both HVs and AVs, achieving rates consistent with real-world statistics and improving the fidelity of safety assessment without post hoc correction. This discovery offers a unified and data-efficient foundation for modeling high-risk behavior and improves the fidelity of simulation-based safety assessments for mixed AV/HV traffic. The shifted power law provides a promising path toward simulation-driven validation and global certification of AV technologies.
In this paper, we investigate the integration of simultaneously transmitting and reflecting reconfigurable intelligent surfaces (STAR-RIS) with rate-splitting multiple access (RSMA) for improving physical layer security (PLS) in integrated sensing and communication (ISAC) systems. Specifically, we consider a multi-user, multi-sensing target scenario, where each sensing target is treated as a potential eavesdropper, reflecting realistic deployment conditions. To enable fairness-aware secure communication among users while maintaining sensing performance, we formulate a joint optimization problem that designs the base station beamforming vectors and STAR-RIS coefficients, aiming to maximize the minimum secrecy rate under a minimum beampattern gain constraint. To solve the resulting non-convex problem, we propose an efficient algorithm based on alternating optimization (AO) and the majorization-minimization (MM) method. Simulation results verify the fast convergence of the proposed algorithm and demonstrate significant improvements in secure communication performance.
Resource scheduling is critical in many industries, especially in power systems. The Unit Commitment problem determines the on/off status and output levels of generators under many constraints. Traditional exact methods, such as mathematical programming methods or dynamic programming, remain the backbone of UC solution techniques, but they often rely on linear approximations or exhaustive search, leading to high computational burdens as system size grows. Metaheuristic approaches, such as genetic algorithms, particle swarm optimization, and other evolutionary methods, have been explored to mitigate this complexity; however, they typically lack optimality guarantees, exhibit sensitivity to initial conditions, and can become prohibitively time-consuming for large-scale systems. In this paper, we introduce a quantum-classical hybrid algorithm for UC and, by extension, other resource scheduling problems, that leverages Benders decomposition to decouple binary commitment decisions from continuous economic dispatch. The binary master problem is formulated as a quadratic unconstrained binary optimization model and solved on a quantum annealer. The continuous subproblem, which minimizes generation costs, with Lagrangian cuts feeding back to the master until convergence. We evaluate our hybrid framework on systems scaled from 10 to 1,000 generation units. Compared against a classical mixed-integer nonlinear programming baseline, the hybrid algorithm achieves a consistently lower computation-time growth rate and maintains an absolute optimality gap below 1.63%. These results demonstrate that integrating quantum annealing within a hybrid quantum-classical Benders decomposition loop can significantly accelerate large-scale resource scheduling without sacrificing solution quality, pointing toward a viable path for addressing the escalating complexity of modern power grids.
The power grid is the foundation of modern society, however extreme weather events have increasingly caused widespread outages. Enhancing grid resilience is therefore critical to maintaining secure and reliable operations. In disaster relief and restoration, vehicle-to-grid (V2G) technology allows electric vehicles (EVs) to serve as mobile energy resources by discharging to support critical loads or regulating grid frequency as needed. Effective V2G operation requires coordinated charging and discharging of many EVs through optimization. Similarly, in grid restoration, EVs must be strategically routed to affected areas, forming the mobile charging station placement (CSP) problem, which presents another complex optimization challenge. This work reviews state-of-the-art optimization methods for V2G and mobile CSP applications, outlines their limitations, and explores how quantum computing (QC) could overcome current computational bottlenecks. A QC-focused perspective is presented on enhancing grid resilience and accelerating restoration as extreme weather events grow more frequent and severe.
Several novel methods, including magnetogenetics and magnetoelectric stimulation, use high frequency alternating magnetic fields to precisely manipulate neural activity. To quantify the behavioral effects of such interventions in a freely moving mouse, we developed a dual-channel magnetic chamber, specifically designed for rate-sensitive magnetothermal-genetic stimulation, and adaptable for other uses of alternating magnetic fields. Through an optimized coil design, the system allows independent control of two spatially orthogonal uniform magnetic fields delivered at different frequencies within a 10 cm x 10 cm x 6 cm chamber. The two channels have nominal frequencies of 50 and 550 kHz with peak magnetic field strengths of 88 and 12.5 mT, achieved with resonant coil drives having peak voltages of 1.6 and 1.8 kV and currents of 1.0 and 0.26 kA, respectively. Additionally, a liquid cooling system enables magnetic field generation for second-level duration, and an observation port and camera allow video capture of the animal's behavior within the chamber. The system generates high-amplitude magnetic fields across two widely separated frequency channels with negligible interference (< 1%). Relatively uniform magnetic field distribution (+/-10% across 94% of the chamber volume) is maintained throughout the chamber, and temperature increase of the inner side of the coil enclosure during the operation is limited to < 0.35 °C/s to ensure in vivo safety. Using cobalt-doped and undoped iron oxide nanoparticles, we demonstrate channel-specific heating rates of 3.5 °C/s and 1.5 °C/s, respectively, validating frequency-selectivity. Both channels can run continuously for four seconds stably.
This paper presents a Deep Q-Network (DQN)- based algorithm for NOMA-aided resource allocation in smart factories, addressing the stringent requirements of Ultra-Reliable Low-Latency Communication (URLLC). The proposed algorithm dynamically allocates sub-channels and optimizes power levels to maximize throughput while meeting strict latency constraints. By incorporating a tunable parameter {\lambda}, the algorithm balances the trade-off between throughput and latency, making it suitable for various devices, including robots, sensors, and controllers, each with distinct communication needs. Simulation results show that robots achieve higher throughput, while sensors and controllers meet the low-latency requirements of URLLC, ensuring reliable communication for real-time industrial applications.
This paper presents analysis for target detection using tightly-coupled antenna (TCA) arrays with high mutual coupling (MC). We show that the wide operational bandwidth of TCAs is advantageous for target detection. We assume a sensing receiver equipped with a TCA array that collects joint time and frequency samples of the target's echo signals. Echoes are assumed to be unknown wideband signals, and noise at the TCA array follows a frequency-varying correlation model due to MC. We also assume that the echo signals are time varying, with no assumption on the temporal variation. We consider three regimes in frequency as constant, slowly or rapidly varying, to capture all possible spectral dynamics of the echoes. We propose a novel detector for the slowly-varying regime, and derive detectors based on maximum likelihood estimation (MLE) for the other regimes. For the rapidly-varying regime, we derive an extended energy detector for correlated noise with frequency and time samples. We analyze the performance of all the detectors. We also derive and analyze an ideal detector giving an upper bound on performance. We validate our analysis with simulations and demonstrate that our proposed detector outperforms the MLE-based detectors in terms of robustness to frequency variation. Also, we highlight that TCA arrays offer clear advantages over weakly-coupled antenna arrays in target detection.
Unmanned Aerial Vehicles (UAVs) play a crucial role in Maritime Search and Rescue (MSAR), contributing to the improvement of rescue efficiency and reduction of casualties. Typically, UAVs equipped with cameras collect data from disaster areas and transmit it to the shore-based rescue command centers. By deploying Mobile Edge Computing (MEC) servers, UAVs can pre-process video footage to reduce data transmission volume, thus reducing transmission delays. However, the limited computational capacity and energy of UAVs pose significant challenges to the efficiency of UAV-assisted MSAR systems. To address these problems, in this paper, we investigate a multi-UAV assisted MSAR system consisting of multiple Surveillance UAVs (S-UAVs) and a Relay UAV (R-UAV). Then, we formulate a joint optimization problem to minimize the maximum total latency among all S-UAVs via jointly making the computing offloading decisions, R-UAV deployment, and the association between a S-UAV and rescue targets while ensuring that all targets are monitored by S-UAVs. Since the formulated optimization problem is typically hard to solve due to its non-convexity, we propose an effective iterative algorithm by breaking it into three sub-problems. Numerical simulation results show the effectiveness of the proposed algorithm with various performance parameters.
Spoken Dialogue Models (SDMs) have advanced rapidly, yet their ability to sustain genuinely interactive multi-turn conversations remains underexplored, as most benchmarks focus on single-turn exchanges. We introduce Multi-Bench, the first benchmark explicitly designed to evaluate SDMs in multi-turn interactive dialogue with an emphasis on emotional intelligence. Multi-Bench employs a hierarchical structure with a basic track for emotion understanding and reasoning and an advanced track for emotion support and application. It comprises five carefully designed tasks and about 3.2K samples, ranging from emotion recognition to complex reasoning and interactive dialogue, supported by a reproducible evaluation framework. We evaluate six representative SDMs on eight subsets of Multi-Bench. Results show that while current SDMs achieve good performance on basic understanding tasks, they still have room for improvement in advanced multi-turn interactive dialogue and reasoning-related tasks, particularly in emotion awareness and application.
Stacked intelligent metasurfaces (SIMs) have recently emerged as an effective solution for next-generation wireless networks. A SIM comprises multiple metasurface layers that enable signal processing directly in the wave domain. Moreover, recent advances in flexible metamaterials have highlighted the potential of flexible intelligent metasurfaces (FIMs), which can be physically morphed to enhance communication performance. In this paper, we propose a stacked flexible intelligent metasurface (SFIM)-based communication system for the first time, where each metasurface layer is deformable to improve the system's performance. We first present the system model, including the transmit and receive signal models as well as the channel model, and then formulate an optimization problem to maximize the system sum rate under constraints on the transmit power budget, morphing distance, and the unit-modulus condition of the meta-atom responses. To solve this problem, we develop an alternating optimization framework based on the gradient projection method. Simulation results demonstrate that the proposed SFIM-based system achieves significant performance gains compared to its rigid SIM counterpart.
Purpose: To evaluate deep learning (DL) models for enhancing vitreous optical coherence tomography (OCT) image quality and reducing acquisition time. Methods: Conditional Denoising Diffusion Probabilistic Models (cDDPMs), Brownian Bridge Diffusion Models (BBDMs), U-Net, Pix2Pix, and Vector-Quantised Generative Adversarial Network (VQ-GAN) were used to generate high-quality spectral-domain (SD) vitreous OCT images. Inputs were SD ART10 images, and outputs were compared to pseudoART100 images obtained by averaging ten ART10 images per eye location. Model performance was assessed using image quality metrics and Visual Turing Tests, where ophthalmologists ranked generated images and evaluated anatomical fidelity. The best model's performance was further tested within the manually segmented vitreous on newly acquired data. Results: U-Net achieved the highest Peak Signal-to-Noise Ratio (PSNR: 30.230) and Structural Similarity Index Measure (SSIM: 0.820), followed by cDDPM. For Learned Perceptual Image Patch Similarity (LPIPS), Pix2Pix (0.697) and cDDPM (0.753) performed best. In the first Visual Turing Test, cDDPM ranked highest (3.07); in the second (best model only), cDDPM achieved a 32.9% fool rate and 85.7% anatomical preservation. On newly acquired data, cDDPM generated vitreous regions more similar in PSNR to the ART100 reference than true ART1 or ART10 B-scans and achieved higher PSNR on whole images when conditioned on ART1 than ART10. Conclusions: Results reveal discrepancies between quantitative metrics and clinical evaluation, highlighting the need for combined assessment. cDDPM showed strong potential for generating clinically meaningful vitreous OCT images while reducing acquisition time fourfold. Translational Relevance: cDDPMs show promise for clinical integration, supporting faster, higher-quality vitreous imaging. Dataset and code will be made publicly available.
We investigate how fully-passive electromagnetic skins (EMSs) can be engineered to enhance channel charting (CC) in dense urban environments. We employ two complementary state-of-the-art CC techniques, semi-supervised t-distributed stochastic neighbor embedding (t-SNE) and a semi-supervised Autoencoder (AE), to verify the consistency of results across nonparametric and parametric mappings. We show that the accuracy of CC hinges on a balance between signal-to-noise ratio (SNR) and spatial dissimilarity: EMS codebooks that only maximize gain, as in conventional Reconfigurable Intelligent Surface (RIS) optimization, suppress location fingerprints and degrade CC, while randomized phases increase diversity but reduce SNR. To address this trade-off, we design static EMS phase profiles via a quantile-driven criterion that targets worst-case users and improves both trustworthiness and continuity. In a 3D ray-traced city at 30 GHz, the proposed EMS reduces the 90th-percentile localization error from > 50 m to < 25 m for both t-SNE and AE-based CC, and decreases severe trajectory dropouts by over 4x under 15% supervision. The improvements hold consistently across the evaluated configurations, establishing static, pre-configured EMS as a practical enabler of CC without reconfiguration overheads.
Dynamic Wireless Electric Vehicle Charging (DWC) on electrified roadways is an emerging technology that can significantly reduce battery sizes, eliminate charging downtime, and alleviate range anxiety, specially for long-haul transportation and fleet operations of electric vehicles (EVs). However, these systems introduce new challenges for power system planning due to their short-duration and high-power demands which can strain the grid if not properly managed. As the energy demands from DWC depend on vehicle speed, density, dwell time in charging zones, and load profiles along road segments, there is a need for integrated planning of such systems, jointly considering both traffic behavior and EV energy consumption. In this paper, we propose a traffic-aware grid planning framework for DWC. We leverage a macroscopic Cell Transmission Model of traffic flow to estimate real-time, spatiotemporal EV charging demand from DWC corridors. The demand model is then integrated into an AC Optimal Power Flow based formulation to optimally size a microgrid that supports DWC under varying traffic conditions while minimizing the cost of operation. Our framework explicitly models how spatiotemporal traffic patterns affect the utilization of grid resources to obtain system designs that achieve lower costs and are easier to operationalize as compared to planning models that rely on worst-case traffic data. We demonstrate the framework on data from a 14-mile segment of the I-210W highway in California, USA, evaluating multiple traffic scenarios like free-flow, severe congestion, accidents of varying severity, and natural disasters like forest fires. Our results demonstrate that traffic-aware grid planning significantly reduces infrastructure costs as compared to worst-scenario based modeling, while ensuring reliability of service in terms of meeting charging demands under diverse traffic conditions.
With the growing application of deep learning in wearable devices, lightweight and efficient models are critical to address the computational constraints in resource-limited platforms. The performance of these approaches can be potentially improved by using various preprocessing methods. This study proposes a lightweight ResNet-based deep learning framework with Squeeze-and-Excitation (SE) modules for photoplethysmography (PPG) signal quality assessment (SQA) and compares different input configurations, including the PPG signal alone, its first derivative (FDP), its second derivative (SDP), the autocorrelation of PPG (ATC), and various combinations of these channels. Experimental evaluations on the Moore4Medical (M4M) and MIMIC-IV datasets demonstrate the model's performance, achieving up to 96.52% AUC on the M4M test dataset and up to 84.43% AUC on the MIMIC-IV dataset. The novel M4M dataset was collected to explore PPG-based monitoring for detecting atrial fibrillation (AF) and AF burden in high-risk patients. Compared to the five reproduced existing studies, our models achieves over 99% reduction in parameters and more than 60% reduction in floating-point operations (FLOPs).
This article investigates the security issue caused by false data injection attacks in distributed estimation, wherein each sensor can construct two types of residues based on local estimates and neighbor information, respectively. The resource-constrained attacker can select partial channels from the sensor network and arbitrarily manipulate the transmitted data. We derive necessary and sufficient conditions to reveal system vulnerabilities, under which the attacker is able to diverge the estimation error while preserving the stealthiness of all residues. We propose two defense strategies with mechanisms of exploiting the Euclidean distance between local estimates to detect attacks, and adopting the coding scheme to protect the transmitted data, respectively. It is proven that the former has the capability to address the majority of security loopholes, while the latter can serve as an additional enhancement to the former. By employing the time-varying coding matrix to mitigate the risk of being cracked, we demonstrate that the latter can safeguard against adversaries injecting stealthy sequences into the encoded channels. Hence, drawing upon the security analysis, we further provide a procedure to select security-critical channels that need to be encoded, thereby achieving a trade-off between security and coding costs. Finally, some numerical simulations are conducted to demonstrate the theoretical results.
Early and reliable detection of heart murmurs is essential for the timely diagnosis of cardiovascular diseases, yet traditional auscultation remains subjective and dependent on expert interpretation. This work investigates artificial intelligence (AI)-based murmur detection using the CirCor Heart Sound dataset, with a focus on enabling uncertainty-aware, resource-efficient deployment on edge devices. Three convolutional neural network (CNN) architectures of increasing complexity (Light, Baseline, and Heavy) were compared in terms of classification performance, computational cost, and suitability for on-device inference. Additionally, Monte Carlo Dropout was applied for uncertainty estimation, providing confidence measures to improve prediction sensitivity. Results show that lightweight models can achieve accuracy comparable to deeper networks (91%) while requiring two orders of magnitude fewer parameters. Incorporating uncertainty-based selective classification further improved sensitivity by 3%, enhancing robustness and clinical reliability. The findings highlight the feasibility of developing computationally efficient, uncertainty-aware AI systems for heart murmur screening in low-resource and remote healthcare settings.
With neural video codecs (NVCs) emerging as promising alternatives for traditional compression methods, it is increasingly important to determine whether existing quality metrics remain valid for evaluating their performance. However, few studies have systematically investigated this using well-designed subjective tests. To address this gap, this paper presents a subjective quality assessment study using two traditional (AV1 and VVC) and two variants of a neural video codec (DCVC-FM and DCVC-RT). Six source videos (8-10 seconds each, 4K/UHD-1, 60 fps) were encoded at four resolutions (360p to 2160p) using nine different QP values, resulting in 216 sequences that were rated in a controlled environment by 30 participants. These results were used to evaluate a range of full-reference, hybrid, and no-reference quality metrics to assess their applicability to the induced quality degradations. The objective quality assessment results show that VMAF and AVQBits|H0|f demonstrate strong Pearson correlation, while FasterVQA performed best among the tested no-reference metrics. Furthermore, PSNR shows the highest Spearman rank order correlation for within-sequence comparisons across the different codecs. Importantly, no significant performance differences in metric reliability are observed between traditional and neural video codecs across the tested metrics. The dataset, consisting of source videos, encoded videos, and both subjective and quality metric scores will be made publicly available following an open-science approach (this https URL).
We establish structural properties of optimal stopping problems under time-consistent dynamic (coherent) risk measures, focusing on value function monotonicity and the existence of control limit (threshold) optimal policies. While such results are well developed for risk-neutral (expected-value) models, they remain underexplored in risk-averse settings. Coherent risk measures typically lack the tower property and are subadditive rather than additive, complicating structural analysis. We show that value function monotonicity mirrors the risk-neutral case. Moreover, if the risk envelope associated with each coherent risk measure admits a minimal element, the risk-averse optimal stopping problem reduces to an equivalent risk-neutral formulation. We also develop a general procedure for identifying control limit optimal policies and use it to derive practical, verifiable conditions on the risk measures and MDP structure that guarantee their existence. We illustrate the theory and verify these conditions through optimal stopping problems arising in operations, marketing, and finance.
We analyze subliminal transfer in Transformer models, where a teacher embeds hidden traits that can be linearly decoded by a student without degrading main-task performance. Prior work often attributes transferability to global representational similarity, typically quantified with Centered Kernel Alignment (CKA). Using synthetic corpora with disentangled public and private labels, we distill students under matched and independent random initializations. We find that transfer strength hinges on alignment within a trait-discriminative subspace: same-seed students inherit this alignment and show higher leakage {\tau \approx} 0.24, whereas different-seed students -- despite global CKA > 0.9 -- exhibit substantially reduced excess accuracy {\tau \approx} 0.12 - 0.13. We formalize this with subspace-level CKA diagnostic and residualized probes, showing that leakage tracks alignment within the trait-discriminative subspace rather than global representational similarity. Security controls (projection penalty, adversarial reversal, right-for-the-wrong-reasons regularization) reduce leakage in same-base models without impairing public-task fidelity. These results establish seed-induced uniqueness as a resilience property and argue for subspace-aware diagnostics for secure multi-model deployments.
This work proposes a conformal approach for energy storage arbitrage to control the downside risks arose from imperfect price forecasts. Energy storage arbitrage relies solely on predictions of future market prices, while inaccurate price predictions may lead to significant profit losses. Based on conformal decision theory, we describe a controller that dynamically adjusts decision conservativeness through prediction sets without distributional assumptions. To enable online calibration when online profit loss feedback is unobservable, we establish that a temporal difference error serves as a measurable proxy. Building on this insight, we develop two online calibration strategies: prediction error-based adaptation targeting forecast accuracy, and value error-based calibration focusing on decision quality. Analysis of the conformal controller proves bounded long-term risk with convergence guarantees in temporal difference error, which further effectively manages risk exposure in potential profit losses. Case studies demonstrate superior performance in balancing risk and opportunity compared to benchmarks under varying forecast conditions.
In this paper, we propose a non-myopic sensor management algorithm for multi-target tracking, with multiple sensors operating in the same surveillance area. The algorithm is based on multi-Bernoulli filtering and selects the actions that solve a non-myopic minimisation problem, where the cost function is the mean square generalised optimal sub-pattern assignment (GOSPA) error, over a future time window. For tractability, the sensor management algorithm actually uses an upper bound of the GOSPA error and is implemented via Monte Carlo Tree Search (MCTS). The sensors have the ability to jointly optimise and select their actions with the considerations of all other sensors in the surveillance area. The benefits of the proposed algorithm are analysed via simulations.
Whispered speech lacks vocal-fold excitation and exhibits reduced energy and shifted formant frequencies, making natural and intelligible voice reconstruction highly challenging. To address this issue, we propose \emph{WhisperVC}, a three-stage framework for Mandarin whisper-to-speech (W2S) conversion. Stage~1 employs a fine-tuned Content Encoder based on the OpenAI Whisper-large~V3 model and a Conformer-based variational autoencoder with soft-DTW alignment to learn domain-invariant and temporally consistent representations. Stage~2 introduces a deterministic Length--Channel Aligner and a duration-free FastSpeech~2 model conditioned on speaker embeddings for controllable timbre and stable prosody. Stage~3 fine-tunes a HiFi-GAN vocoder on predicted mel-spectrograms to synthesize high-fidelity waveforms. Experiments on the AISHELL6-Whisper corpus demonstrate that WhisperVC achieves near ground-truth quality (\textbf{DNSMOS~3.11}, \textbf{UTMOS~2.52}, \textbf{CER~18.67\%}), while maintaining speaker similarity (\textbf{cosine~0.76}) and robust performance under whisper-only inference.
Feedback control algorithms traditionally rely on periodic execution on digital platforms. While this simplifies design and analysis, it often leads to inefficient resource usage (e.g., CPU, network bandwidth) in embedded control and shared networks. This work investigates self-triggering implementations of linear controllers in sampled-data systems with synchronous measurements. Our approach precomputes the next sampling sequence over a finite horizon based on current state information. We introduce a novel optimal self-triggering scheme that guarantees exponential stability for unperturbed systems and global uniform ultimate boundedness for perturbed systems. This ensures robustness against external disturbances with explicit performance guarantees. Simulations demonstrate the benefits of our approach.
In this paper, we address the problem of synthesizing safe and stabilizing controllers for nonlinear systems subject to complex safety specifications and input constraints. We introduce the Universal Barrier Function (UBF), a single continuously differentiable scalar-valued function that encodes both stability and safety criteria while accounting for input constraints. Using the UBF, we formulate a Quadratic Program (UBF-QP) to generate control inputs that are both safe and stabilizing under input constraints. We demonstrate that the UBF-QP is feasible if a UBF exists. Furthermore, under mild conditions, we prove that a UBF always exists. The proposed framework is then extended to systems with higher relative degrees. Finally, numerical simulations illustrate the effectiveness of our proposed approach.
The Pinching-Antenna System (PASS) reconfigures wireless channels through \emph{pinching beamforming}, in which the active positions of pinching antennas (PAs) along dielectric waveguides are optimized to shape the radiation pattern. This article investigates the performance of PASS-enabled tri-hybrid beamforming, where pinched waveguides are integrated with a hybrid digital-analog beamformer to mitigate path loss and enhance spectral efficiency. The channel capacity of the proposed system is characterized by deriving the optimal tri-hybrid beamformer at both the digital and analog domains, as well as the optimal placement of PAs. Closed-form upper and lower bounds of the channel capacity are obtained, leading to a capacity scaling law with respect to the number of PAs. Numerical results verify the tightness of the derived bounds and demonstrate that applying PASS to tri-hybrid beamforming yields a significant performance gain over conventional hybrid beamforming under the same number of radio-frequency chains.
Allocating costs, benefits, and emissions fairly among power system participant entities represents a persistent challenge. The Shapley value provides an axiomatically fair solution, yet computational barriers have limited its adoption beyond small-scale applications. This paper presents SurroShap, a scalable Shapley value approximation framework combining efficient coalition sampling with deep learning surrogate models that accelerate characteristic function evaluations. Exemplified through carbon emission responsibility allocation in power networks, SurroShap enables Shapley-based fair allocation for power systems with thousands of entities for the first time. We derive theoretical error bounds proving that time-averaged SurroShap allocations converge to be $\varepsilon$-close to exact Shapley values. Experiments on nine systems ranging from 26 to 1,951 entities demonstrate completion within the real-time operational window even at maximum scale, achieving 10^4-10^5 speedups over other sampling-based methods while maintaining tight error bounds. The resulting Shapley-based carbon allocations possess six desirable properties aligning individual interests with decarbonization goals. Year-long simulations on the Texas 2000-bus system validate real-world applicability, with regional analysis revealing how renewable-rich areas offset emission responsibility through exports while load centers bear responsibility for driving system-wide generation.
Transformers have become state-of-the-art (SOTA) for time-series classification, with models like PatchTST demonstrating exceptional performance. These models rely on patching the time series and learning relationships between raw temporal data blocks. We argue that this approach is blind to critical, non-obvious high-frequency information that is complementary to the temporal dynamics. In this letter, we propose Hi-WaveTST, a novel Hybrid architecture that augments the original temporal patch with a learnable, High-Frequency wavelet feature stream. Our wavelet stream uses a deep Wavelet Packet Decomposition (WPD) on each patch and extracts features using a learnable Generalized Mean (GeM) pooling layer. On the UCI-HAR benchmark dataset, our hybrid model achieves a mean accuracy of 93.38 percent plus-minus 0.0043, significantly outperforming the SOTA PatchTST baseline (92.59 percent plus-minus 0.0039). A comprehensive ablation study proves that every component of our design-the hybrid architecture, the deep high-frequency wavelet decomposition, and the learnable GeM pooling-is essential for this state-of-the-art performance.
In the era of large language models (LLMs) and artificial general intelligence (AGI), computer audition must evolve beyond traditional paradigms to fully leverage the capabilities of foundation models, towards more comprehensive understanding, more natural generation and more human-like interaction. Audio, as a modality rich in semantic, emotional, and contextual cues, plays a vital role in achieving naturalistic and embodied machine intelligence. This survey provides a comprehensive review of recent progress in integrating audio into LLMs, with a focus on four key areas: audio comprehension, audio generation, speech-based interaction, and audio-visual understanding. We analyze how LLMs are reshaping audio perception and reasoning, enabling systems to understand sound at a deeper semantic level, generate expressive audio outputs, and engage in human-like spoken interaction. Furthermore, we explore how the fusion of audio and visual modalities enhances situational awareness and cross-modal reasoning, pushing the boundaries of multimodal intelligence. This survey not only synthesizes existing research but also identifies critical challenges and future directions for building audio-native AGI systems capable of perceiving, understanding, and interacting through sound as naturally as humans do.
Model augmentation is a promising approach for integrating first-principles-based models with machine learning components. Augmentation can result in better model accuracy and faster convergence compared to black-box system identification methods, while maintaining interpretability of the models in terms of how the original dynamics are complemented by learning. A widely used augmentation structure in the literature is based on the parallel connection of the physics-based and learning components, for both of which the corresponding parameters are jointly optimized. However, due to overlap in representation of the system dynamics by such an additive structure, estimation often leads to physically unrealistic parameters, compromising model interpretability. To overcome this limitation, this paper introduces a novel orthogonal-by-construction model augmentation structure for input-output models, that guarantees recovery of the physically true parameters under appropriate identifiability conditions.
The most commonly used electrical rotary machines in the field are induction machines. In this paper, we propose an antenna based approach for the classification of motor faults in induction motors using the reflection coefficient S11 and the transmission coefficient S21 of the antenna. The spectrograms of S11 and S21 are seen to possess unique signatures for various fault conditions that are used for the classification. To learn the required characteristics and classification boundaries, deep convolution neural network (DCNN) is applied to the spectrogram of the S-parameter. DCNN has been found to reach classification accuracy 93% using S11, 98.1% using S21 and 100% using both S11 and S21. The effect of antenna operating frequency, its location and duration of signal on the classification accuracy is also presented and discussed.
This work presents a supervised deep hashing method for retrieving similar audio events. The proposed method, named AudioNet, is a deep-learning-based system for efficient hashing and retrieval of similar audio events using an audio example as a query. AudioNet achieves high retrieval performance on multiple standard datasets by generating binary hash codes for similar audio events, setting new benchmarks in the field, and highlighting its efficacy and effectiveness compare to other hashing methods. Through comprehensive experiments on standard datasets, our research represents a pioneering effort in evaluating the retrieval performance of similar audio events. A novel loss function is proposed which incorporates weighted contrastive and weighted pairwise loss along with hashcode balancing to improve the efficiency of audio event retrieval. The method adopts discrete gradient propagation, which allows gradients to be propagated through discrete variables during backpropagation. This enables the network to optimize the discrete hash codes using standard gradient-based optimization algorithms, which are typically used for continuous variables. The proposed method showcases promising retrieval performance, as evidenced by the experimental results, even when dealing with imbalanced datasets. The systematic analysis conducted in this study further supports the significant benefits of the proposed method in retrieval performance across multiple datasets. The findings presented in this work establish a baseline for future studies on the efficient retrieval of similar audio events using deep audio embeddings.
Compared to real-valued signals, complex-valued signals provide a unique and intuitive representation of the phase of real physical systems and processes, which holds fundamental significance and is widely applied across many fields of science and engineering. In this paper, we propose a robust modal decomposition (RMD) in the complex domain as a natural and general extension of the original real-valued RMD. We revisit and derive the mathematical principles of RMD in the complex domain, and develop an algorithmic version tailored for this domain. Extensive experiments are conducted on synthetic simulation datasets and real-world datasets from diverse fields, including a millimeter-wave radar physiological signal detection dataset, a faulty bearing dataset, a radio-frequency unmanned aerial vehicle identification dataset, and a WiFi CSI-based respiration detection dataset. The results demonstrate that the proposed complex-domain robust modal decomposition significantly improves performance across these various applications.
This paper presents the design, development, and on vehicle implementation and validation of a safety critical controller for autonomous driving under sensing and communication uncertainty. Cooperative sensing, fused via a Wasserstein barycenter (WB), is used to optimize the distribution of the dynamic obstacle locations. The Conditional Value at Risk (CVaR) is introduced to form a risk aware control-barrier-function (CBF) framework with the optimized distribution samplings. The proposed WB CVaR CBF safety filter improves control inputs that minimize tail risk while certifying forward invariance of the safe set. A model predictive controller (MPC) performs path tracking, and the safety filter modulates the nominal control inputs to enforce risk aware constraints. We detail the software architecture and integration with vehicle actuation and cooperative sensing. The approach is evaluated on a full-scale autonomous vehicle (AV) in scenarios with measurement noise, communication perturbations, and input disturbances, and is compared against a baseline MPC CBF design. Results demonstrate improved safety margins and robustness, highlighting the practicality of deploying the risk-aware safety filter on an actual AV.
Accurate reconstruction of static and rapidly moving targets demands three-dimensional imaging solutions with high temporal and spatial resolution. Radar sensors are a promising sensing modality because of their fast capture rates and their independence from lighting conditions. To achieve high spatial resolution, MIMO radars with large apertures are required. Yet, they are infrequently used for dynamic scenarios due to significant limitations in signal processing algorithms. These limitations impose substantial hardware constraints due to their computational intensity and reliance on large signal bandwidths, ultimately restricting the sensor's capture rate. One solution of previous work is to use few frequencies only, which enables faster capture and requires less computation; however, this requires coarse knowledge of the target's position and works in a limited depth range only. To address these challenges, we extend previous work into the multimodal domain with MM-2FSK, which leverages an assistive optical depth sensing modality to obtain a depth prior, enabling high framerate capture with only few frequencies. We evaluate our method using various target objects with known ground truth geometry that is spatially registered to real millimeter-wave MIMO radar measurements. Our method demonstrates superior performance in terms of depth quality, being able to compete with the time- and resource-intensive measurements with many frequencies.
Using environmental sensory data can enhance communications beam training and reduce its overhead compared to conventional methods. However, the availability of fresh sensory data during inference may be limited due to sensing constraints or sensor failures, necessitating a realistic model for multimodal sensing. This paper proposes a joint multimodal sensing and beam prediction framework that operates under a constraint on the average sensing rate, i.e., how often fresh sensory data should be obtained. The proposed method combines deep reinforcement learning, i.e., a deep Q-network (DQN), with a neural network (NN)-based beam predictor. The DQN determines the sensing decisions, while the NN predicts the best beam from the codebook. To capture the effect of limited fresh data during inference, the age of information (AoI) is incorporated into the training of both the DQN and the beam predictor. Lyapunov optimization is employed to design a reward function that enforces the average sensing constraint. Simulation results on a real-world dataset show that AoI-aware training improves top-1 and top-3 inference accuracy by 44.16% and 52.96%, respectively, under a strict sensing constraint. The performance gain, however, diminishes as the sensing constraint is relaxed.
The robust estimation of the mounting angle for millimeter-wave automotive radars installed on moving vehicles is investigated. We propose a novel signal processing pipeline that combines radar and inertial measurement unit (IMU) data to achieve accurate and reliable performance in realistic driving scenarios. Unlike previous studies, the method employs neural networks to process sparse and noisy radar measurements, reject detections from moving objects, and estimate radar motion. In addition, a measurement model is introduced to correct IMU bias and scale factor errors. Using vehicle kinematics, the radar mounting angle is then computed from the estimated radar motion and the vehicle's yaw rate. To benchmark performance, the proposed approach is comprehensively compared with two problem formulations and four estimation techniques reported in the literature. Validation is carried out on the challenging RadarScenes dataset, covering over 79 km of real-world driving. Results show that the proposed method achieves state-of-the-art accuracy and robustness, with reliable estimates obtained within approximately 25 seconds of driving. To the best of our knowledge, this is the first study to demonstrate that automotive radar mounting angles can be accurately estimated in complex, real traffic conditions, without requiring controlled environments, dedicated targets, or specially designed driving routes.
We study a dynamic game with a large population of players who choose actions from a finite set in continuous time. Each player has a state in a finite state space that evolves stochastically with their actions. A player's reward depends not only on their own state and action but also on the distribution of states and actions across the population, capturing effects such as congestion in traffic networks. While prior work in evolutionary game theory has primarily focused on static games without individual player state dynamics, we present the first comprehensive evolutionary analysis of such dynamic games. We propose an evolutionary model together with a mean field approximation of the finite-population game and establish strong approximation guarantees. We show that standard solution concepts for dynamic games lack an evolutionary interpretation, and we propose a new concept - the Mixed Stationary Nash Equilibrium (MSNE) - which admits one. We analyze the relationship between MSNE and the rest points of the mean field evolutionary model and study the evolutionary stability of MSNE.
Large multiple antenna arrays coupled with accurate beamforming are essential in terahertz (THz) communications to ensure link reliability. However, as the number of antennas increases, beam alignment (focusing) and beam tracking in mobile networks incur prohibitive overhead. Additionally, the near-field region expands both with the size of antenna arrays and the carrier frequency, calling for adjustments in the beamforming to account for spherical wavefront instead of the conventional planar wave assumption. In this letter, we introduce a novel beam coherence time for mobile THz networks, to drastically reduce the rate of beam updates. Then, we propose a deep learning model, relying on a simple feedforward neural network with a time-dependent input, to predict the beam coherence time and adjust the beamforming on the fly with minimal overhead. Our numerical results demonstrate the effectiveness of the proposed approach by enabling higher data rates while reducing the overhead, especially at high (i.e., vehicular) mobility.
Movable antennas (MA) and transmissive reconfigurable intelligent surfaces (TRIS) represent two innovative technologies that significantly enhance the flexibility of wireless communication systems. In this paper, we propose a novel and compact base station architecture that synergistically integrates a movable antenna with a transmissive RIS in the near field, enabling joint optimization of antenna positioning and TRIS phase adjustments. The proposed model compensates for phase quantization loss and significantly enhances signal strength, even with low-resolution (1-2 bit) phase shifters. Leveraging this framework, we systematically evaluate system performance as a function of TRIS size and antenna placement. Our results indicate that antenna mobility provides an additional degree of freedom to enhance the desired signal and achieve a higher SNR, particularly when combined with TRIS capabilities. These findings demonstrate that MA-TRIS integration offers a cost-effective and energy-efficient pathway toward compact 6G base stations, combining hardware simplicity with strong performance gains.
The coexistence of radar and communications in wireless systems marks a paradigm shift for the sixth-generation (6G) networks. As 6G systems are expected to operate at higher frequencies and employ larger antenna arrays than fifth-generation (5G) systems, they can also enable more accurate sensing capabilities. To this end, the integrated sensing and communication (ISAC) paradigm aims to unify the physical and radio frequency (RF) domains by introducing the sensing functionality into the communication network. However, the clutter poses a challenge, as it can significantly degrade the sensing accuracy in ISAC systems. This paper presents a novel two-dimensional root multiple signal classification (2D-rootMUSIC)-based algorithm for static background clutter suppression. Computer simulation results indicate that the proposed method effectively mitigates the strong background clutter, yields accurate parameter estimation performance, and offers a notable improvement in the signal-to-clutter-and-noise ratio (SCNR), while outperforming the prior-art benchmark methods.
Image downscaling is a fundamental operation in image processing, crucial for adapting high-resolution content to various display and storage constraints. While classic methods often introduce blurring or aliasing, recent learning-based approaches offer improved adaptivity. However, achieving maximal fidelity against ground-truth low-resolution (LR) images, particularly by accounting for channel-specific characteristics, remains an open challenge. This paper introduces ADK-Net (Adaptive Downscaling Kernel Network), a novel deep convolutional neural network framework for high-fidelity supervised image downscaling. ADK-Net explicitly addresses channel interdependencies by learning to predict spatially-varying, adaptive resampling kernels independently for each pixel and uniquely for each color channel (RGB). The architecture employs a hierarchical design featuring a ResNet-based feature extractor and parallel channel-specific kernel generators, themselves composed of ResNet-based trunk and branch sub-modules, enabling fine-grained kernel prediction. Trained end-to-end using an L1 reconstruction loss against ground-truth LR data, ADK-Net effectively learns the target downscaling transformation. Extensive quantitative and qualitative experiments on standard benchmarks, including the RealSR dataset, demonstrate that ADK-Net establishes a new state-of-the-art in supervised image downscaling, yielding significant improvements in PSNR and SSIM metrics compared to existing learning-based and traditional methods.
This paper investigates the idea of designing data-driven partial estimators for nonlinear systems showing parametric uncertainties using sparse multivariate polynomial relationships. A general framework is first presented and then validated on two illustrative examples with comparison to different possible Machine/Deep-Learning based alternatives. The results suggests the superiority of the proposed sparse identification scheme, at least when the learning data is small.
Target Language Extraction aims to extract speech in a specific language from a mixture waveform that contains multiple speakers speaking different languages. The human auditory system is adept at performing this task with the knowledge of the particular language. However, the performance of the conventional extraction systems is limited by the lack of this prior knowledge. Speech pre-trained models, which capture rich linguistic and phonetic representations from large-scale in-the-wild corpora, can provide this missing language knowledge to these systems. In this work, we propose a novel end-to-end framework to leverage language knowledge from speech pre-trained models. This knowledge is used to guide the extraction model to better capture the target language characteristics, thereby improving extraction quality. To demonstrate the effectiveness of our proposed approach, we construct the first publicly available multilingual dataset for Target Language Extraction. Experimental results show that our method achieves improvements of 1.22 dB and 1.12 dB in SI-SNR for English and German extraction, respectively, from mixtures containing both languages.
Background: Photoplethysmography (PPG) offers a noninvasive and accessible modality for health monitoring beyond clinical settings. However, existing studies are limited by the scale and diversity of labeled data, constraining model accuracy, generalizability, and the exploration of broader applications. This study investigates the potential of PPG for holistic health profiling through the integration of foundation model techniques. Methods: We present AnyPPG, a PPG foundation model pretrained on large-scale, multi-source synchronized PPG-ECG data. By aligning PPG and ECG representations within a shared space, AnyPPG learns physiologically meaningful features from unlabeled signals. Its capability was further evaluated across a diverse set of downstream tasks, encompassing both conventional physiological analysis and comprehensive multi-organ disease diagnosis. Results: Across eleven physiological analysis tasks spanning six independent datasets, AnyPPG achieved state-of-the-art performance, with average improvements of 12.8% in regression and 9.1% in classification tasks over the next-best model. In multi-organ disease diagnosis, AnyPPG demonstrated broad cross-system diagnostic potential. Among 1,014 ICD-10 three-digit disease categories, 13 achieved an AUC above 0.8 and 137 exceeded 0.7. Beyond strong performance in cardiovascular diseases such as heart failure, valvular disorders, and hypertension, AnyPPG also showed substantial diagnostic value for non-cardiovascular conditions, exemplified by Parkinson's disease (AUC = 0.78) and chronic kidney disease (AUC = 0.74). Conclusions: AnyPPG demonstrates that a PPG foundation model trained through physiological alignment with ECG can produce accurate and robust signal representations. Building on this capability, it underscores the potential of PPG as a modality for comprehensive assessment of systemic and multi-organ health.
Holographic multiple-input multiple-output (MIMO) has emerged as a key enabler for 6G networks, yet conventional planar implementations suffer from spatial correlation and mutual coupling at sub-wavelength spacing, which fundamentally limit the effective degrees of freedom (EDOF) and channel capacity. Three-dimensional (3-D) holographic MIMO offers a pathway to overcome these constraints by exploiting volumetric array configurations that enlarge the effective aperture and unlock additional spatial modes. This work presents the first systematic evaluation that jointly incorporates electromagnetic (EM) characteristics, such as mutual coupling and radiation efficiency, into the analysis of 3-D arrays under Clarke, Kronecker, and standardized 3rd Generation Partnership Project (3GPP) channel models. Analytical derivations and full-wave simulations demonstrate that 3-D architectures achieve higher EDOF, narrower beamwidths, and notable capacity improvements compared with planar baselines. In 3GPP urban macro channels with horizontal element spacing of 0.3 lambda, 3-D configurations yield approximately 20% capacity improvement over conventional 2-D arrays, confirming the robustness and scalability of volumetric designs under realistic conditions. These findings bridge the gap between theoretical feasibility and practical deployment, offering design guidance for next-generation 6G base station arrays.
The surge in AI workloads and escalating data center requirements have created demand for ultra-high-speed interconnects exceeding 200 Gb/s. As unit intervals (UI) shrink, even a few picoseconds of intra-pair skew can significantly degrade serializer-deserializer (SerDes) performance. To quantify the impact of intra-pair skew, conventional time-domain methods are often unreliable for coupled interconnects due to skew variations across voltage levels, while frequency-domain approaches frequently fail to address reciprocity and symmetry issues. This can result in channels that meet skew specifications in one direction but not the other, despite the inherently reciprocal nature of skew impact. To address these limitations, we introduce two new reciprocal parameters for quantifying intra-pair skew effects: Skew-Induced Insertion Loss Deviation (SILD) and its complementary Figure of Merit (FOM SILD). Measurements conducted using 224 Gb/s SerDes IP and a variety of channels with different intra-pair skews demonstrate a strong correlation between FOM SILD and bit error rate (BER). Results show that when FOM SILD is below 0.2-0.3 dB, BER remains stable, indicating minimal signal integrity degradation; however, BER increases noticeably as FOM SILD exceeds 0.3 dB. Statistical analysis across more than 3,000 high-speed twinax cables reveals that the majority exhibit FOM SILD values less than 0.1 dB, underscoring the practical relevance of the proposed metrics for high-speed interconnect assessment.
LiDAR-based roadside perception is a cornerstone of advanced Intelligent Transportation Systems (ITS). While considerable research has addressed optimal LiDAR placement for infrastructure, the profound impact of differing LiDAR scanning patterns on perceptual performance remains comparatively under-investigated. The inherent nature of various scanning modes - such as traditional repetitive (mechanical/solid-state) versus emerging non-repetitive (e.g. prism-based) systems - leads to distinct point cloud distributions at varying distances, critically dictating the efficacy of object detection and overall environmental understanding. To systematically investigate these differences in infrastructure-based contexts, we introduce the "InfraLiDARs' Benchmark," a novel dataset meticulously collected in the CARLA simulation environment using concurrently operating infrastructure-based LiDARs exhibiting both scanning paradigms. Leveraging this benchmark, we conduct a comprehensive statistical analysis of the respective LiDAR scanning abilities and evaluate the impact of these distinct patterns on the performance of various leading 3D object detection algorithms. Our findings reveal that non-repetitive scanning LiDAR and the 128-line repetitive LiDAR were found to exhibit comparable detection performance across various scenarios. Despite non-repetitive LiDAR's limited perception range, it's a cost-effective option considering its low price. Ultimately, this study provides insights for setting up roadside perception system with optimal LiDAR scanning patterns and compatible algorithms for diverse roadside applications, and publicly releases the "InfraLiDARs' Benchmark" dataset to foster further research.
This paper explores a deliberately over-engineered approach to the classical problem of parity detection -- determining whether a number is odd or even -- by combining wavelet-based feature extraction with unsupervised clustering. Instead of relying on modular arithmetic, integers are transformed into wavelet-domain representations, from which multi-scale statistical features are extracted and clustered using the k-means algorithm. The resulting feature space reveals meaningful structural differences between odd and even numbers, achieving a classification accuracy of approximately 69.67% without any label supervision. These results suggest that classical signal-processing techniques, originally designed for continuous data, can uncover latent structure even in purely discrete symbolic domains. Beyond parity detection, the study provides an illustrative perspective on how feature engineering and clustering may be repurposed for unconventional machine learning problems, potentially bridging symbolic reasoning and feature-based learning.
Robotic systems have become integral to smart environments, enabling applications ranging from urban surveillance and automated agriculture to industrial automation. However, their effective operation in dynamic settings - such as smart cities and precision farming - is challenged by continuously evolving topographies and environmental conditions. Traditional control systems often struggle to adapt quickly, leading to inefficiencies or operational failures. To address this limitation, we propose a novel framework for autonomous and dynamic reconfiguration of robotic controllers using Digital Twin technology. Our approach leverages a virtual replica of the robot's operational environment to simulate and optimize movement trajectories in response to real-world changes. By recalculating paths and control parameters in the Digital Twin and deploying the updated code to the physical robot, our method ensures rapid and reliable adaptation without manual intervention. This work advances the integration of Digital Twins in robotics, offering a scalable solution for enhancing autonomy in smart, dynamic environments.
The optimization-based damage detection and damage state digital twinning capabilities are examined here of a novel conditional-labeled generative adversarial network methodology. The framework outperforms current approaches for fault anomaly detection as no prior information is required for the health state of the system: a topic of high significance for real-world applications. Specifically, current artificial intelligence-based digital twinning approaches suffer from the uncertainty related to obtaining poor predictions when a low number of measurements is available, physics knowledge is missing, or when the damage state is unknown. To this end, an unsupervised framework is examined and validated rigorously on the benchmark structural health monitoring measurements of Z24 Bridge: a post-tensioned concrete highway bridge in Switzerland. In implementing the approach, firstly, different same damage-level measurements are used as inputs, while the model is forced to converge conditionally to two different damage states. Secondly, the process is repeated for a different group of measurements. Finally, the convergence scores are compared to identify which one belongs to a different damage state. The process for both healthy-to-healthy and damage-to-healthy input data creates, simultaneously, measurements for digital twinning purposes at different damage states, capable of pattern recognition and machine learning data generation. Further to this process, a support vector machine classifier and a principal component analysis procedure is developed to assess the generated and real measurements of each damage category, serving as a secondary new dynamics learning indicator in damage scenarios. Importantly, the approach is shown to capture accurately damage over healthy measurements, providing a powerful tool for vibration-based system-level monitoring and scalable infrastructure resilience.
The dynamic structural load identification capabilities of the gated recurrent unit, long short-term memory, and convolutional neural networks are examined herein. The examination is on realistic small dataset training conditions and on a comparative view to the physics-based residual Kalman filter (RKF). The dynamic load identification suffers from the uncertainty related to obtaining poor predictions when in civil engineering applications only a low number of tests are performed or are available, or when the structural model is unidentifiable. In considering the methods, first, a simulated structure is investigated under a shaker excitation at the top floor. Second, a building in California is investigated under seismic base excitation, which results in loading for all degrees of freedom. Finally, the International Association for Structural Control-American Society of Civil Engineers (IASC-ASCE) structural health monitoring benchmark problem is examined for impact and instant loading conditions. Importantly, the methods are shown to outperform each other on different loading scenarios, while the RKF is shown to outperform the networks in physically parametrized identifiable cases.
Spectral Graph Neural Networks (GNNs) have achieved state-of-the-art results by defining graph convolutions in the spectral domain. A common approach, popularized by ChebyNet, is to use polynomial filters based on continuous orthogonal polynomials (e.g., Chebyshev). This creates a theoretical disconnect, as these continuous-domain filters are applied to inherently discrete graph structures. We hypothesize this mismatch can lead to suboptimal performance and fragility to hyperparameter settings. In this paper, we introduce MeixnerNet, a novel spectral GNN architecture that employs discrete orthogonal polynomials -- specifically, the Meixner polynomials $M_k(x; \beta, c)$. Our model makes the two key shape parameters of the polynomial, beta and c, learnable, allowing the filter to adapt its polynomial basis to the specific spectral properties of a given graph. We overcome the significant numerical instability of these polynomials by introducing a novel stabilization technique that combines Laplacian scaling with per-basis LayerNorm. We demonstrate experimentally that MeixnerNet achieves competitive-to-superior performance against the strong ChebyNet baseline at the optimal K = 2 setting (winning on 2 out of 3 benchmarks). More critically, we show that MeixnerNet is exceptionally robust to variations in the polynomial degree K, a hyperparameter to which ChebyNet proves to be highly fragile, collapsing in performance where MeixnerNet remains stable.
Liquid cooling is critical for thermal management in high-density data centers with the rising AI workloads. However, machine learning-based controllers are essential to unlock greater energy efficiency and reliability, promoting sustainability. We present LC-Opt, a Sustainable Liquid Cooling (LC) benchmark environment, for reinforcement learning (RL) control strategies in energy-efficient liquid cooling of high-performance computing (HPC) systems. Built on the baseline of a high-fidelity digital twin of Oak Ridge National Lab's Frontier Supercomputer cooling system, LC-Opt provides detailed Modelica-based end-to-end models spanning site-level cooling towers to data center cabinets and server blade groups. RL agents optimize critical thermal controls like liquid supply temperature, flow rate, and granular valve actuation at the IT cabinet level, as well as cooling tower (CT) setpoints through a Gymnasium interface, with dynamic changes in workloads. This environment creates a multi-objective real-time optimization challenge balancing local thermal regulation and global energy efficiency, and also supports additional components like a heat recovery unit (HRU). We benchmark centralized and decentralized multi-agent RL approaches, demonstrate policy distillation into decision and regression trees for interpretable control, and explore LLM-based methods that explain control actions in natural language through an agentic mesh architecture designed to foster user trust and simplify system management. LC-Opt democratizes access to detailed, customizable liquid cooling models, enabling the ML community, operators, and vendors to develop sustainable data center liquid cooling control solutions.
The increasing energy demands and carbon footprint of large-scale AI require intelligent workload management in globally distributed data centers. Yet progress is limited by the absence of benchmarks that realistically capture the interplay of time-varying environmental factors (grid carbon intensity, electricity prices, weather), detailed data center physics (CPUs, GPUs, memory, HVAC energy), and geo-distributed network dynamics (latency and transmission costs). To bridge this gap, we present DCcluster-Opt: an open-source, high-fidelity simulation benchmark for sustainable, geo-temporal task scheduling. DCcluster-Opt combines curated real-world datasets, including AI workload traces, grid carbon intensity, electricity markets, weather across 20 global regions, cloud transmission costs, and empirical network delay parameters with physics-informed models of data center operations, enabling rigorous and reproducible research in sustainable computing. It presents a challenging scheduling problem where a top-level coordinating agent must dynamically reassign or defer tasks that arrive with resource and service-level agreement requirements across a configurable cluster of data centers to optimize multiple objectives. The environment also models advanced components such as heat recovery. A modular reward system enables an explicit study of trade-offs among carbon emissions, energy costs, service level agreements, and water use. It provides a Gymnasium API with baseline controllers, including reinforcement learning and rule-based strategies, to support reproducible ML research and a fair comparison of diverse algorithms. By offering a realistic, configurable, and accessible testbed, DCcluster-Opt accelerates the development and validation of next-generation sustainable computing solutions for geo-distributed data centers.
Accurate downhole depth measurement is essential for oil and gas well operations, directly influencing reservoir contact, production efficiency, and operational safety. Collar correlation using a casing collar locator (CCL) is fundamental for precise depth calibration. While neural network-based CCL signal recognition has achieved significant progress in collar identification, preprocessing methods for such applications remain underdeveloped. Moreover, the limited availability of real well data poses substantial challenges for training neural network models that require extensive datasets. This paper presents a system integrated into downhole tools for CCL signal acquisition to facilitate dataset construction. We propose comprehensive preprocessing methods for data augmentation and evaluate their effectiveness using our AlexNet-based neural network models. Through systematic experimentation across various configuration combinations, we analyze the contribution of each augmentation method. Results demonstrate that standardization, label distribution smoothing (LDS), and random cropping are fundamental requirements for model training, while label smoothing regularization (LSR), time scaling, and multiple sampling significantly enhance model generalization capability. The F1 scores of our two benchmark models trained with the proposed augmentation methods maximumly improve from 0.937 and 0.952 to 1.0 and 1.0, respectively. Performance validation on real CCL waveforms confirms the effectiveness and practical applicability of our approach. This work addresses the gaps in data augmentation methodologies for training casing collar recognition models in CCL data-limited environments.
This paper presents a proof-of-concept supply chain attack against the Secure ROS 2 (SROS 2) framework, demonstrated on a Quanser QCar2 autonomous vehicle platform. A Trojan-infected Debian package modifies core ROS 2 security commands to exfiltrate newly generated keystore credentials via DNS in base64-encoded chunks to an attacker-controlled nameserver. Possession of these credentials enables the attacker to rejoin the SROS 2 network as an authenticated participant and publish spoofed control or perception messages without triggering authentication failures. We evaluate this capability on a secure ROS 2 Humble testbed configured for a four-stop-sign navigation routine using an Intel RealSense camera for perception. Experimental results show that control-topic injections can cause forced braking, sustained high-speed acceleration, and continuous turning loops, while perception-topic spoofing can induce phantom stop signs or suppress real detections. The attack generalizes to any data distribution service (DDS)-based robotic system using SROS 2, highlighting the need for both supply chain integrity controls and runtime semantic validation to safeguard autonomous systems against insider and impersonation threats.
This paper addresses the multivariable gradient-based extremum seeking control (ESC) subject to saturation. Two distinct saturation scenarios are investigated here: saturation acting on the input of the function to be optimized, which is addressed using an anti-windup compensation strategy, and saturation affecting the gradient estimate. In both cases, the unknown Hessian matrix is represented using a polytopic uncertainty description, and sufficient conditions in the form of linear matrix inequalities (LMIs) are derived to design a stabilizing control gain. The proposed conditions guarantee exponential stability of the origin for the average closed-loop system under saturation constraints. With the proposed design conditions, non-diagonal control gain matrices can be obtained, generalizing conventional ESC designs that typically rely on diagonal structures. Stability and convergence are rigorously proven using the Averaging Theory for dynamical systems with Lipschitz continuous right-hand sides. Numerical simulations illustrate the effectiveness of the proposed ESC algorithms, confirming the convergence even in the presence of saturation.
The increasing adoption of satellite Internet with low-Earth-orbit (LEO) satellites in mega-constellations allows ubiquitous connectivity to rural and remote areas. However, weather events have a significant impact on the performance and reliability of satellite Internet. Adverse weather events such as snow and rain can disturb the performance and operations of satellite Internet's essential ground terminal components, such as satellite antennas, significantly disrupting the space-ground link conditions between LEO satellites and ground stations. This challenge calls for not only region-based weather forecasts but also fine-grained detection capability on ground terminal components of fine-grained weather conditions. Such a capability can assist in fault diagnostics and mitigation for reliable satellite Internet, but its solutions are lacking, not to mention the effectiveness and generalization that are essential in real-world deployments. This paper discusses an efficient transfer learning (TL) method that can enable a ground component to locally detect representative weather-related conditions. The proposed method can detect snow, wet, and other conditions resulting from adverse and typical weather events and shows superior performance compared to the typical deep learning methods, such as YOLOv7, YOLOv9, Faster R-CNN, and R-YOLO. Our TL method also shows the advantage of being generalizable to various scenarios.
An automated, standoff acoustic leak detection scheme has been designed, built, and tested. It merges the principles of glass breakage and smoke detection to alert for the presence of leaks emanating from pressurized plumbing. A simulated water leak flowing at 0.15 l/min has been reliably detected at a standoff distance of more than 10 m. The device is also effective at identifying the presence of leaks located behind surfaces such as walls, doors, floors, and ceilings. The anticipated application is as an autonomous, battery-powered, remote wireless node. All signal processing and analysis takes place on the edge with no need to stream audio data to the cloud. Sensor status is conveyed on-demand with only a few bytes of information, requiring minimal bandwidth. Power consumption is the range of 20--200 micro-Watts, depending on the amount of environmental noise and desired sensor latency. To attain optimum sensitivity and reliability, the hardware operates at acoustic frequencies well above the range of human conversations, making eavesdropping impossible. Development has been done with water escaping from pressurized plumbing, but the sensor concept can be used effectively to detect gas leaks.
Symbol decoding in multiple-input multiple-output (MIMO) wireless communication systems requires the deployment of fast, energy-efficient computing hardware deployable at the edge. The brute-force, exact maximum likelihood (ML) decoder, solved on conventional classical digital hardware, has exponential time complexity. Approximate classical solvers implemented on the same hardware have polynomial time complexity at the best. In this article, we design an alternative ring-oscillator-based coupled oscillator array to act as an oscillator Ising machine (OIM) and heuristically solve the ML-based MIMO detection problem. Complementary metal oxide semiconductor (CMOS) technology is used to design the ring oscillators, and ferroelectric field effect transistor (FeFET) technology is chosen as the coupling element (X) between the oscillators in this CMOS + X OIM design. For this purpose, we experimentally report high linear range of conductance variation (1 micro-S to 60 micro-S) in a FeFET device fabricated at 28 nm high-K/ metal gate (HKMG) CMOS technology node. We incorporate the conductance modulation characteristic in SPICE simulation of the ring oscillators connected in an all-to-all fashion through a crossbar array of these FeFET devices. We show that the above range of conductance variation of the FeFET device is suitable to obtain optimum OIM performance with no significant performance drop up to a MIMO size of 100 transmitting and 100 receiving antennas, thereby making FeFET a suitable device for this application. Our simulations and associated analysis using the Kuramoto model of oscillators also predict that this designed classical analog OIM, if implemented experimentally, will offer logarithmic scaling of computation time with MIMO size, thereby offering a huge improvement (in terms of computation speed) over aforementioned MIMO decoders run on conventional digital hardware.
Optimal transport (OT) theory provides a principled framework for modeling mass movement in applications such as mobility, logistics, and economics. Classical formulations, however, generally ignore capacity limits that are intrinsic in applications, in particular in traffic flow problems. We address this limitation by incorporating fundamental diagrams into a dynamic continuous-flow OT model on graphs, thereby including empirical relations between local density and maximal flux. We adopt an Eulerian kinetic action on graphs that preserves displacement interpolation in direct analogy with the continuous theory. Momentum lives on edges and density on nodes, mirroring road-network semantics in which segments carry speed and intersections store mass. The resulting fundamental-diagram-constrained OT problem preserves mass conservation and admits a convex variational discretization, yielding optimal congestion-aware traffic flow over road networks. We establish the existence and uniqueness of the optimal flow with sources and sinks, and develop an efficient convex optimization method. Numerical studies begin with a single-lane line network and scale to a city-level road network.
This paper investigates Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360° Field of View (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras generalize unsatisfactorily under these conditions. To address panoramic distortion, large search space, and identity ambiguity under a 360° FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, we establish the EmboTrack benchmark, a comprehensive dataset for panoramic MOT that includes QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot. Together, these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack demonstrate that OmniTrack++ achieves state-of-the-art performance, yielding substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at this https URL.
Bipedal balance is challenging due to its multi-phase, hybrid nature and high-dimensional state space. Traditional balance control approaches for bipedal robots rely on low-dimensional models for locomotion planning and reactive control, constraining the full robot to behave like these simplified models. This involves tracking preset reference paths for the Center of Mass and upper body obtained through low-dimensional models, often resulting in inefficient walking patterns with bent knees. However, we observe that bipedal balance is inherently low-dimensional and can be effectively described with simple state and action descriptors in a low-dimensional state space. This allows the robot's motion to evolve freely in its high-dimensional state space, only constraining its projection in the low-dimensional state space. In this work, we propose a novel control approach that avoids prescribing a low-dimensional model to the full model. Instead, our control framework uses a descriptive model with the minimum degrees of freedom necessary to maintain balance, allowing the remaining degrees of freedom to evolve freely in the high-dimensional space. This results in an efficient human-like walking gait and improved robustness.
Improvisation-the art of spontaneous creation that unfolds moment-to-moment without a scripted outcome-requires practitioners to continuously sense, adapt, and create anew. It is a fundamental mode of human creativity spanning music, dance, and everyday life. The open-ended nature of improvisation produces a stream of novel, unrepeatable moments-an aspect highly valued in artistic creativity. In parallel, open-endedness (OE)-a system's capacity for unbounded novelty and endless "interestingness"-is exemplified in natural or cultural evolution and has been considered "the last grand challenge" in artificial life (ALife). The rise of generative AI now raises the question in computational creativity (CC) research: What makes a "good" improvisation for AI? Can AI learn to improvise in a genuinely open-ended way? In this work-in-progress paper, we report insights from in-depth interviews with 6 experts in improvisation across dance, music, and contact improvisation. We draw systemic connections between human improvisational arts and the design of future experiential AI agents that could improvise alone or alongside humans-or even with other AI agents-embodying qualities of improvisation drawn from practice: active listening (umwelt and awareness), being in the time (mindfulness and ephemerality), embracing the unknown (source of randomness and serendipity), non-judgmental flow (acceptance and dynamical stability, balancing structure and surprise (unpredictable criticality at edge of chaos), imaginative metaphor (synaesthesia and planning), empathy, trust, boundary, and care (mutual theory of mind), and playfulness and intrinsic motivation (maintaining interestingness).
Accurate prediction of the remaining useful life (RUL) of industrial machinery is essential for reducing downtime and optimizing maintenance schedules. Existing approaches, such as long short-term memory (LSTM) networks and convolutional neural networks (CNNs), often struggle to model both global temporal dependencies and fine-grained degradation trends in multivariate sensor data. We propose a hybrid model, FTT-GRU, which combines a Fast Temporal Transformer (FTT) -- a lightweight Transformer variant using linearized attention via fast Fourier transform (FFT) -- with a gated recurrent unit (GRU) layer for sequential modeling. To the best of our knowledge, this is the first application of an FTT with a GRU for RUL prediction on NASA CMAPSS, enabling simultaneous capture of global and local degradation patterns in a compact architecture. On CMAPSS FD001, FTT-GRU attains RMSE 30.76, MAE 18.97, and $R^2=0.45$, with 1.12 ms CPU latency at batch=1. Relative to the best published deep baseline (TCN--Attention), it improves RMSE by 1.16\% and MAE by 4.00\%. Training curves averaged over $k=3$ runs show smooth convergence with narrow 95\% confidence bands, and ablations (GRU-only, FTT-only) support the contribution of both components. These results demonstrate that a compact Transformer-RNN hybrid delivers accurate and efficient RUL predictions on CMAPSS, making it suitable for real-time industrial prognostics.
With the surging demand for ultra-reliable, low-latency, and ubiquitous connectivity in Sixth-Generation (6G) networks, Non-Terrestrial Networks (NTNs) emerge as a key complement to terrestrial networks by offering flexible access and global coverage. Despite the significant potential, NTNs still face critical challenges, including dynamic propagation environments, energy constraints, and dense interference. As a key 6G technology, Fluid Antennas (FAs) can reshape wireless channels by reconfiguring radiating elements within a limited space, such as their positions and rotations, to provide higher channel diversity and multiplexing gains. Compared to fixed-position antennas, FAs can present a promising integration path for NTNs to mitigate dynamic channel fading and optimize resource allocation. This paper provides a comprehensive review of FA-assisted NTNs. We begin with a brief overview of the classical structure and limitations of existing NTNs, the fundamentals and advantages of FAs, and the basic principles of FA-assisted NTNs. We then investigate the joint optimization solutions, detailing the adjustments of FA configurations, NTN platform motion modes, and resource allocations. We also discuss the combination with other emerging technologies and explore FA-assisted NTNs as a novel network architecture for intelligent function integrations. Furthermore, we delve into the physical layer security and covert communication in FA-assisted NTNs. Finally, we highlight the potential future directions to empower broader applications of FA-assisted NTNs.
Various coils for transcranial magnetic stimulation (TMS) are widely available for clinical and research use. These coils are almost all designed as air coils, which require large levels of energy to achieve a given magnetic flux density and in turn electric field strength, whereas in other sectors, such as power electronics or electrical machines, magnetic materials have been used for a long time to achieve higher efficiencies. We tested the impact on the electric and magnetic properties of different soft magnetic materials, including various ferrite cores, laminated sheet materials of nonisotropic corn-oriented silicon-steel, non-oriented silicon-steel, as well as cobalt-iron, and soft magnetic compound powder cores with insulated particles. Every material led to a reduction in coil current and voltage for the same target electric field strength. For the same field energy, every material yielded lower losses. Most common materials saturated already at very low currents. More material in thicker layers could shift the saturation point but at the cost of high weight. Due to their low saturation flux density, ferrites appear unsuitable for the high amplitude requirements of TMS. Laminated sheet materials and powder cores reduce the pulse energy, but the laminated sheet material adds more weight for the same effect than powder cores. Thus, appropriate magnetic materials can reduce the required pulse energy. Saturation flux density is the most relevant parameter, whereas the permeability beyond a certain base level is practically irrelevant. Most importantly, the weight of a magnetic-core coil may always be increased compared to an air coil for the same target field.
Underwater multi-robot cooperative coverage remains challenging due to partial observability, limited communication, environmental uncertainty, and the lack of access to global localization. To address these issues, this paper presents a semantics-guided fuzzy control framework that couples Large Language Models (LLMs) with interpretable control and lightweight coordination. Raw multimodal observations are compressed by the LLM into compact, human-interpretable semantic tokens that summarize obstacles, unexplored regions, and Objects Of Interest (OOIs) under uncertain perception. A fuzzy inference system with pre-defined membership functions then maps these tokens into smooth and stable steering and gait commands, enabling reliable navigation without relying on global positioning. Then, we further coordinate multiple robots by introducing semantic communication that shares intent and local context in linguistic form, enabling agreement on who explores where while avoiding redundant revisits. Extensive simulations in unknown reef-like environments show that, under limited sensing and communication, the proposed framework achieves robust OOI-oriented navigation and cooperative coverage with improved efficiency and adaptability, narrowing the gap between semantic cognition and distributed underwater control in GPS-denied, map-free conditions.
Autonomous systems often must predict the motions of nearby agents from partial and noisy data. This paper asks and answers the question: "can we learn, in real-time, a nonlinear predictive model of another agent's motions?" Our online framework denoises and forecasts such dynamics using a modified sliding-window Hankel Dynamic Mode Decomposition (Hankel-DMD). Partial noisy measurements are embedded into a Hankel matrix, while an associated Page matrix enables singular-value hard thresholding (SVHT) to estimate the effective rank. A Cadzow projection enforces structured low-rank consistency, yielding a denoised trajectory and local noise variance estimates. From this representation, a time-varying Hankel-DMD lifted linear predictor is constructed for multi-step forecasts. The residual analysis provides variance-tracking signals that can support downstream estimators and risk-aware planning. We validate the approach in simulation under Gaussian and heavy-tailed noise, and experimentally on a dynamic crane testbed. Results show that the method achieves stable variance-aware denoising and short-horizon prediction suitable for integration into real-time control frameworks.
Markov Chain Monte Carlo (MCMC) algorithms are standard approaches to solve imaging inverse problems and quantify estimation uncertainties, a key requirement in absence of ground-truth data. To improve estimation quality, Plug-and-Play MCMC algorithms, such as PnP-ULA, have been recently developed to accommodate priors encoded by a denoising neural network. Designing scalable samplers for high-dimensional imaging inverse problems remains a challenge: drawing and storing high-dimensional samples can be prohibitive, especially for high-resolution images. To address this issue, this work proposes a distributed sampler based on approximate data augmentation and PnP-ULA to solve very large problems. The proposed sampler uses lightweight denoising convolutional neural network, to efficiently exploit multiple GPUs on a Single Program Multiple Data architecture. Reconstruction performance and scalability are evaluated on several imaging problems. Communication and computation overheads due to the denoiser are carefully discussed. The proposed distributed approach noticeably combines three very precious qualities: it is scalable, enables uncertainty quantification, for a reconstruction performance comparable to other PnP methods.
This paper aims to discuss the impacts of low-cost airlines on the air transport market and, in particular, to present the most recent findings from the specialized literature in this field. To this end, several papers published on the topic since 2015 were selected and analyzed. Based on this analysis, the main subjects addressed in the studies were categorized into five groups: (i) impacts of low-cost airlines on competing carriers; (ii) impacts on airports; (iii) general effects on air transport demand; (iv) effects on passengers' choice processes; and (v) broader effects on geographical regions.
This paper presents an efficient parallel Cholesky factorization and triangular solve algorithm for the Karush-Kuhn-Tucker (KKT) systems arising in multistage optimization problems, with a focus on model predictive control and trajectory optimization for racing. The proposed approach directly parallelizes solving the KKT systems with block-tridiagonal-arrow KKT matrices on the linear algebra level arising in interior-point methods. The algorithm is implemented as a new backend of the PIQP solver and released as open source. Numerical experiments on the chain-of-masses benchmarks and a minimum curvature race line optimization problem demonstrate substantial performance gains compared to other state-of-the-art solvers.
We introduce Model-Bound Latent Exchange (MoBLE), a decoder-binding property in Transformer autoencoders formalized as Zero-Shot Decoder Non-Transferability (ZSDN). In identity tasks using iso-architectural models trained on identical data but differing in seeds, self-decoding achieves more than 0.91 exact match and 0.98 token accuracy, while zero-shot cross-decoding collapses to chance without exact matches. This separation arises without injected secrets or adversarial training, and is corroborated by weight-space distances and attention-divergence diagnostics. We interpret ZSDN as model binding, a latent-based authentication and access-control mechanism, even when the architecture and training recipe are public: encoder's hidden state representation deterministically reveals the plaintext, yet only the correctly keyed decoder reproduces it in zero-shot. We formally define ZSDN, a decoder-binding advantage metric, and outline deployment considerations for secure artificial intelligence (AI) pipelines. Finally, we discuss learnability risks (e.g., adapter alignment) and outline mitigations. MoBLE offers a lightweight, accelerator-friendly approach to secure AI deployment in safety-critical domains, including aviation and cyber-physical systems.
Medical imaging relies heavily on large, labeled datasets. But, unfortunately, they are not always easily accessible in clinical settings. Additionally, many practitioners often face various structural obstacles like limited data availability, fragmented data systems, and unbalanced datasets. These barriers often lead to the increased diagnostic uncertainty, underrepresentation of certain conditions, reduced model robustness, and biased diagnostic decisions. In response to these challenges, approaches such as transfer learning, meta-learning, and multimodal fusion have made great strides. However, they still need a solid theoretical justification for why they succeed or fail in situations where data is scarce. To address this gap, we propose a unified theoretical framework that characterizes learning and inference under low-resource medical imaging conditions. We first formalize the learning objective under few-shot conditions and compute sample complexity constraints to estimate the smallest quantity of data needed to achieve clinically reliable accuracy. Then based on ideas from PAC-learning and PAC-Bayesian theory, we explain how multimodal integration encourages generalization and quantifies uncertainty under sparse supervision. We further propose a formal metric for explanation stability, offering interpretability guarantees under low-data conditions. Taken together, the proposed framework establishes a principled foundation for constructing dependable, data-efficient diagnostic systems by jointly characterizing sample efficiency, uncertainty quantification, and interpretability in a unified theoretical setting.
Deep learning has emerged as a leading approach for Automatic Modulation Classification (AMC), demonstrating superior performance over traditional methods. However, vulnerability to adversarial attacks and susceptibility to data distribution shifts hinder their practical deployment in real-world, dynamic environments. To address these threats, we propose a novel, unified framework that integrates meta-learning with domain adaptation, making AMC systems resistant to both adversarial attacks and environmental changes. Our framework utilizes a two-phase strategy. First, in an offline phase, we employ a meta-learning approach to train the model on clean and adversarially perturbed samples from a single source domain. This method enables the model to generalize its defense, making it resistant to a combination of previously unseen attacks. Subsequently, in the online phase, we apply domain adaptation to align the model's features with a new target domain, allowing it to adapt without requiring substantial labeled data. As a result, our framework achieves a significant improvement in modulation classification accuracy against these combined threats, offering a critical solution to the deployment and operational challenges of modern AMC systems.
Neural receivers have demonstrated strong performance in wireless communication systems. However, their effectiveness typically depends on access to large-scale, scenario-specific channel data for training, which is often difficult to obtain in practice. Recently, generative artificial intelligence (AI) models, particularly diffusion models (DMs), have emerged as effective tools for synthesizing high-dimensional data. This paper presents a scenario-specific channel generation method based on conditional DMs, which accurately model channel distributions conditioned on user location and velocity information. The generated synthetic channel data are then employed for data augmentation to improve the training of a neural receiver designed for superimposed pilot-based transmission. Experimental results show that the proposed method generates high-fidelity channel samples and significantly enhances neural receiver performance in the target scenarios, outperforming conventional data augmentation and generative adversarial network-based techniques.
To fulfill their promise, quantum networks must transform from isolated testbeds into scalable infrastructures for distributed quantum applications. In this paper, we present a prototype orchestrator for the Argonne Quantum Network (ArQNet) testbed that leverages design principles of software-defined networking (SDN) to automate typical quantum communication experiments across buildings in the Argonne campus connected over deployed, telecom fiber. Our implementation validates a scalable architecture supporting service-level abstraction of quantum networking tasks, distributed time synchronization, and entanglement verification across remote nodes. We present a prototype service of continuous, stable entanglement distribution between remote sites that ran for 12 hours, which defines a promising path towards scalable quantum networks.
Role-play has become a key testbed for generative models, expanding from text-only dialogue to multimodal interaction. Extending role-play to speech captures prosody, emotion, and delivery, but also poses new evaluation challenges. Current pipelines often use audio large language models (ALLMs) as zero-shot judges, which miss paralinguistic cues, collapse multiple aspects into coarse scores, and rely on synthetic speech references that fail to reflect real-world roles. We present Speech-DRAME, a unified framework that contributes at three levels: (i) Speech-DRAME-EvalBench, an evaluation benchmark with bilingual human-annotated data and protocols for training and testing speech evaluation models (SEMs), (ii) DRAME-Eval, a fine-tuned evaluation model, which substantially outperforms zero-shot and few-shot ALLMs, and (iii) Speech-DRAME-RoleBench, a speech role-play benchmark that leverages DRAME-Eval as an automatic judge to compare speech foundation models (SFMs). Speech-DRAME distinguishes between two complementary evaluation strategies: Archetype Evaluation, a top-down approach measuring adherence to broad role archetypes, and Realism Evaluation, a bottom-up approach grounded in real human speech that emphasizes nuanced role quality. Compared to zero-shot ALLM judges, DRAME-Eval achieves stronger agreement with human ratings (Pearson correlation from 0.480 to 0.629 in archetypes, and 0.390 to 0.625 in realism). By integrating transparent benchmark resources, modeling approaches, and system-level evaluation, Speech-DRAME provides the first comprehensive, reproducible foundation for assessing spoken role-play.
The application of machine learning (ML) to communication systems is expected to play a pivotal role in future artificial intelligence (AI)-based next-generation wireless networks. While most existing works focus on ML techniques for static wireless environments, they often face limitations when applied to highly dynamic environments, such as flying ad hoc networks (FANETs). This paper explores the use of data-driven Koopman approaches to address these challenges. Specifically, we investigate how these approaches can model UAV trajectory dynamics within FANETs, enabling more accurate predictions and improved network performance. By leveraging Koopman operator theory, we propose two possible approaches -- centralized and distributed -- to efficiently address the challenges posed by the constantly changing topology of FANETs. To demonstrate this, we consider a FANET performing surveillance with UAVs following pre-determined trajectories and predict signal-to-interference-plus-noise ratios (SINRs) to ensure reliable communication between UAVs. Our results show that these approaches can accurately predict connectivity and isolation events that lead to modelled communication outages. This capability could help UAVs schedule their transmissions based on these predictions.
This paper designs a new spherical robot structure capable of supporting high-speed motion at up to 10 m/s. Building upon a single-pendulum-driven spherical robot, the design incorporates a momentum wheel with an axis aligned with the secondary pendulum, creating a novel spherical robot structure. Practical experiments with the physical prototype have demonstrated that this new spherical robot can achieve stable high-speed motion through simple decoupled control, which was unattainable with the original structure. The spherical robot designed for high-speed motion not only increases speed but also significantly enhances obstacle-crossing performance and terrain robustness.
We introduce the notion of descriptor embedding for nonlinear systems and use it for the data-driven design of stabilizing controllers. Specifically, we provide sufficient data-dependent LMI conditions which, if feasible, return a stabilizing nonlinear controller of the form $u=K(x)Z(x)$ where $K(x)$ belongs to a polytope and $Z$ is a user-defined function. The proposed method is then extended to account for the presence of uncertainties and noisy data. Furthermore, a method to estimate the resulting region of attraction is given using only data. Simulation examples are used to illustrate the results and compare them to existing methods from the literature.
Numerically efficient and stable algorithms are essential for kernel-based regularized system identification. The state of art algorithms exploit the semiseparable structure of the kernel and are based on the generator representation of the kernel matrix. However, as will be shown from both the theory and the practice, the algorithms based on the generator representation are sometimes numerically unstable, which limits their application in practice. This paper aims to address this issue by deriving and exploiting an alternative Givens-vector representation of some widely used kernel matrices. Based on the Givens-vector representation, we derive algorithms that yield more accurate results than existing algorithms without sacrificing efficiency. We demonstrate their usage for the kernel-based regularized system identification. Monte Carlo simulations show that the proposed algorithms admit the same order of computational complexity as the state-of-the-art ones based on generator representation, but without issues with numerical stability.
The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.
This article presents a Multi-Modal Bipedal Intelligent Urban Scout robot (MOBIUS) capable of walking, crawling, climbing, and rolling. MOBIUS features four limbs--two 6-DoF arms with two-finger grippers for manipulation and climbing, and two 4-DoF legs for locomotion--enabling smooth transitions across diverse terrains without reconfiguration. A hybrid control architecture combines reinforcement learning-based locomotion with model-based predictive and admittance control enhanced for safety by a Reference Governor toward compliant contact interactions. A high-level MIQCP planner autonomously selects locomotion modes to balance stability and energy efficiency. Hardware experiments demonstrate robust gait transitions, dynamic climbing, and full-body load support via pinch grasp. Overall, MOBIUS demonstrates the importance of tight integration between morphology, high-level planning, and control to enable mobile loco-manipulation and grasping, substantially expanding its interaction capabilities, workspace, and traversability.
This paper proposes passive WiFi indoor localization. Instead of using WiFi signals received by mobile devices as fingerprints, we use signals received by routers to locate the mobile carrier. Consequently, software installation on the mobile device is not required. To resolve the data insufficiency problem, flow control signals such as request to send (RTS) and clear to send (CTS) are utilized. In our model, received signal strength indicator (RSSI) and channel state information (CSI) are used as fingerprints for several algorithms, including deterministic, probabilistic and neural networks localization algorithms. We further investigated localization algorithms performance through extensive on-site experiments with various models of phones at hundreds of testing locations. We demonstrate that our passive scheme achieves an average localization error of 0.8 m when the phone is actively transmitting data frames and 1.5 m when it is not transmitting data frames.
A novel over-the-air computation (AirComp) framework empowered by movable antennas (MAs) is proposed to significantly enhance computation accuracy. Within this framework, the joint optimization of transmit power control, antenna positioning, and receive beamforming is investigated. Two design strategies are developed: (i) a dynamic design, where MA positions are optimized based on fast-varying instantaneous channel state information (CSI); and (ii) a static design, where antenna positions are optimized using only slow-varying statistical CSI. Numerical results validate the superior MSE performance of the proposed MA-enabled AirComp framework and demonstrate its clear advantage over benchmark systems employing conventional fixed-position antennas (FPAs).
Code-switching (CS) refers to the switching of languages within a speech signal and results in language confusion for automatic speech recognition (ASR). To address language confusion, we propose a language alignment loss (LAL) that aligns acoustic features to pseudo-language labels learned from the ASR decoder during ASR training. This approach enables frame-level language identification without the need for frame-level language annotations. To further tackle the complex token alternatives for language modeling in bilingual scenarios, we propose to employ large language models via a generative error correction method. A linguistic hint, derived from LAL outputs and decoded hypotheses, is introduced to guide the prompting and enhance the LLM-based generative error correction for CS-ASR. The proposed methods are evaluated on the SEAME dataset and data from the ASRU 2019 Mandarin-English code-switching speech recognition challenge. The incorporation of the proposed language alignment loss improves CS-ASR performance for both hybrid CTC/attention and Whisper models on both datasets, with only a negligible increase in the number of parameters. This work also highlights the efficacy of language alignment loss in balancing primary-language-dominant bilingual data during training, with an 8.6% relative improvement on the ASRU dataset compared to the baseline model. Performance evaluation using large language models reveals the advantage of the linguistic hint by achieving 14.1% and 5.5% relative improvement on test sets of the ASRU and SEAME datasets, respectively.
Unlike fixed- or variable-rate image coding, progressive image coding (PIC) aims to compress various qualities of images into a single bitstream, increasing the versatility of bitstream utilization and providing high compression efficiency compared to simulcast compression. Research on neural network (NN)-based PIC is in its early stages, mainly focusing on applying varying quantization step sizes to the transformed latent representations in a hierarchical manner. These approaches are designed to compress only the progressively added information as the quality improves, considering that a wider quantization interval for lower-quality compression includes multiple narrower sub-intervals for higher-quality compression. However, the existing methods are based on handcrafted quantization hierarchies, resulting in sub-optimal compression efficiency. In this paper, we propose an NN-based progressive coding method that firstly utilizes learned quantization step sizes via learning for each quantization layer. We also incorporate selective compression with which only the essential representation components are compressed for each quantization layer. We demonstrate that our method achieves significantly higher coding efficiency than the existing approaches with decreased decoding time and reduced model size. The source code is publicly available at this https URL
PieceWise Affine (PWA) approximations for nonlinear functions have been extensively used for tractable, computationally efficient control of nonlinear systems. However, reaching a desired approximation accuracy without prior information about the behavior of the nonlinear systems remains a challenge in the function approximation and control literature. As the name suggests, PWA approximation aims at approximating a nonlinear function or system by dividing the domain into multiple subregions where the nonlinear function or dynamics is approximated locally by an affine function also called local mode. Without prior knowledge of the form of the nonlinearity, the required number of modes, the locations of the subregions, and the local approximations need to be optimized simultaneously, which becomes highly complex for large-scale systems with multi-dimensional nonlinear functions. This paper introduces a novel approach for PWA approximation of multi-dimensional nonlinear systems, utilizing a hinging hyperplane formalism for cut-based partitioning of the domain. The complexity of the PWA approximation is iteratively increased until reaching the desired accuracy level. Further, the tractable cut definitions allow for different forms of subregions, as well as the ability to impose continuity constraints on the PWA approximation. The methodology is explained via multiple examples and its performance is compared to two existing approaches through case studies, showcasing its efficacy.
In this work, we propose using a unified representation, termed Factorized Features, for low-level vision tasks, where we test on Single Image Super-Resolution (SISR) and \textbf{Image Compression}. Motivated by the shared principles between these tasks, they require recovering and preserving fine image details, whether by enhancing resolution for SISR or reconstructing compressed data for Image Compression. Unlike previous methods that mainly focus on network architecture, our proposed approach utilizes a basis-coefficient decomposition as well as an explicit formulation of frequencies to capture structural components and multi-scale visual features in images, which addresses the core challenges of both tasks. We replace the representation of prior models from simple feature maps with Factorized Features to validate the potential for broad generalizability. In addition, we further optimize the compression pipeline by leveraging the mergeable-basis property of our Factorized Features, which consolidates shared structures on multi-frame compression. Extensive experiments show that our unified representation delivers state-of-the-art performance, achieving an average relative improvement of 204.4% in PSNR over the baseline in Super-Resolution (SR) and 9.35% BD-rate reduction in Image Compression compared to the previous SOTA. Project page: this https URL
Feasibility of the promising large intelligent surface (LIS) concept, as well as its scalability, relies on the use of low-cost hardware components, raising concerns about the effects of hardware distortion. We analyze LIS systems with receive-chain (RX-chain) hardware distortion, showing how it may limit performance gains when scaling up these systems. In particular, using the memory-less polynomial model, analytical expressions are derived for the signal to noise plus distortion ratio (SNDR) after applying maximum ratio combining (MRC). We also study the effect of back-off and automatic gain control on the RX-chains. The derived expressions enable us to evaluate the scalability of LIS when hardware impairments are present. The cost of assuming ideal hardware is further analyzed by quantifying the minimum scaling required to achieve the same performance with non-ideal hardware. The analytical expressions derived in this work are also used to propose practical antenna selection schemes for LIS, and we show that such schemes can improve the performance significantly leading to increased energy efficiency. Specifically, by turning off RX-chains with lower contribution to the post-MRC SNDR, we can reduce the energy consumption while maintaining performance. We also consider a more practical scenario where the LIS is deployed as a grid of multi-antenna panels, and we propose panel selection schemes to optimize the complexity-performance trade-offs and improve the system overall efficiency.
Estimating the Doppler frequency shift caused by moving targets is one of the key objectives of Integrated Sensing And Communication (ISAC) systems, as it enables applications such as target classification, human activity recognition, and gait analysis. In practical scenarios, Doppler estimation is hindered by the movement of transmitter and receiver devices, and by the phase offsets caused by their clock asynchrony. Existing approaches have separately addressed these two aspects, either assuming clock-synchronous moving devices or asynchronous static ones. In fact, jointly tackling device motion and clock asynchrony is extremely challenging, as the Doppler shift from device movement differs for each propagation path and the phase offsets are time-varying. In this work, we present AsyMov, a method to estimate the bistatic Doppler frequency of a target and its velocity in ISAC setups featuring mobile and asynchronous devices. It leverages the channel impulse response at the receiver, by originally exploiting the invariance of phase offsets across propagation paths and the bistatic geometry, where the target Doppler and the device velocity are jointly estimated by a newly proposed alternating minimization algorithm. Moreover, it can be seamlessly integrated with device velocity measurements obtained from onboard sensors (if available), for enhanced reliability. Here, AsyMov is thoroughly characterized by way of theory (Cramér-Rao bound), simulation, and experiments, implementing it on an IEEE 802.11ay testbed and testing it on multiple setups in the 60 GHz and 28 GHz bands, including moving human subjects. Numerical and experimental results show superior performance against state-of-the-art methods and are on par with scenarios featuring static ISAC devices.
In this paper, the problem of distributively seeking the equilibria of aggregative games with bilevel structures is studied. Different from the traditional aggregative games, here the aggregation is determined by the minimizer of a virtual leader's objective function in the inner level, which depends on the actions of the players in the outer level. Moreover, the global objective function of the virtual leader is formed by the sum of some local functions with two arguments, each of which is strongly convex with respect to the second argument. When making decisions, each player in the outer level only has access to a local part of the virtual leader's objective function. To handle this problem, first, we propose a second order gradient-based distributed algorithm, where the Hessian matrices associated with the objective functions of the leader are involved. By the algorithm, players update their actions while cooperatively minimizing the objective function of the virtual leader to estimate the aggregation by communicating with their neighbors via a connected graph. Under mild assumptions on the graph and cost functions, we prove that the actions of players asymptotically converge to the Nash equilibrium point. Then, for the case where the Hessian matrices associated with the objective functions of the virtual leader are not available, we propose a first order gradient-based distributed algorithm, where a distributed two-point estimate strategy is developed to estimate the gradients of players' cost functions in the outer level. Under the same conditions, we prove that the convergence errors of players' actions to the Nash equilibrium point are linear with respect to the estimate parameters. Finally, simulations are provided to demonstrate the effectiveness of our theoretical results.
We investigate the problem of synthesizing distributionally robust control policies for stochastic systems under safety and reach-avoid specifications. Using a game-theoretical framework, we consider the setting where the probability distribution of the disturbance at each time step is selected from an ambiguity set defined by the Wasserstein distance. The goal is to synthesize a distributionally robust control policy that ensures the satisfaction probability exceeds a specified threshold under any distribution within the ambiguity set. First, for both safety and reach-avoid specifications, we establish the existence of optimal policies by leveraging the dynamic programming principles. Then we demonstrate how the associated optimization problem can be efficiently solved using the dual representation of Wasserstein distributionally robust optimization. Furthermore, for safety specifications in particular, we introduce a novel concept of distributionally robust control barrier certificates and show how these enable the efficient synthesis of controllers through sum-of-squares programming techniques. Finally, our experimental results reveal that incorporating distributional robustness during the synthesis phase significantly improves the satisfaction probability during online execution, even with limited statistical knowledge of the disturbance distribution.
Fast and accurate waveform simulation is critical for understanding fiber channel characteristics, developing digital signal processing (DSP) technologies, optimizing optical network configurations, and advancing the optical fiber transmission system towards wideband. Deep learning (DL) has emerged as a powerful tool for waveform modeling, offering high accuracy and low complexity compared to traditional split-step Fourier method (SSFM), due to its strong nonlinear fitting capabilities and efficient parallel computation. However, most DL methods are designed for few-channel and low-rate WDM systems, leaving their scalability to wideband systems uncertain. Moreover, the lack of a standardized accuracy evaluation method and the inconsistent results between waveform errors and transmission performance errors, hinders fair comparisons of various DL schemes. In this paper, we introduce a DSP-assisted accuracy evaluation method integrated with nonlinear DSP, providing a fair benchmark for evaluating the accuracy of DL models. Using this method, we conduct a comprehensive comparison of DL schemes, ranging from simple configurations to more complex wideband setups. The feature decoupled distributed method combining with bidirectional long short-term memory achieves the better performance compared to other DL schemes. Furthermore, in scenarios with more-channel and higher-rate, the performance advantages of FDD-BiLSTM will be further improved. However, as the number of channels and symbol rates increase, the performance of FDD-BiLSTM still gradually deteriorate. We analyze these challenges from three perspectives: the more intricate linear and nonlinear effects, the higher sampling rate required for SSFM. To address these challenges, we discuss potential solutions from two aspects: incorporating more prior physical knowledge and optimizing the structure of DL models.
Diffusion models excel at joint pixel sampling for image generation but lack efficient training-free methods for partial conditional sampling (e.g., inpainting with known pixels). Prior work typically formulates this as an intractable inverse problem, relying on coarse variational approximations, heuristic losses requiring expensive backpropagation, or slow stochastic sampling. These limitations preclude: (1) accurate distributional matching in inpainting results, (2) efficient inference modes without gradient, (3) compatibility with fast ODE-based samplers. To address these limitations, we propose LanPaint: a training-free, asymptotically exact partial conditional sampling methods for ODE-based and rectified flow diffusion models. By leveraging carefully designed Langevin dynamics, LanPaint enables fast, backpropagation-free Monte Carlo sampling. Experiments demonstrate that our approach achieves superior performance with precise partial conditioning and visually coherent inpainting across diverse tasks.
Intravoxel incoherent motion (IVIM) is a diffusion-weighted magnetic resonance imaging (MRI) method that may be applied to the placenta to help diagnose abnormal pregnancies. IVIM requires prolonged scan times, followed by a model-based estimation procedure. Maternal or fetal motion during the scan affects the accuracy of this estimation. In this work, we proposed to address this challenging motion correction and data fitting problem by using additional anatomical information that is routinely collected at the beginning of the examination. Super-resolution reconstruction (SRR) was applied to these anatomical data, to provide a patient-specific, 3D isotropic, anatomic reference. Our first contribution is a novel framework with a two-step motion correction that uses both IVIM and the SRR anatomic data, accounting for both intra- and inter-scan, non-rigid motion. Our second contribution is an automation and acceleration of the IVIM data fitting, using a state-of-the-art Bayesian-type algorithm, modified with a preconditioned Crank-Nicholson (pCN) sampling strategy. The accuracy of the IVIM parameter fitting was improved by the proposed motion correction strategy, as assessed by the mean absolute fitting error in the region of interest, which was 4.14 before and 3.02 after correction (arbitrary units of signal intensity). The novel sampling strategy accelerated parameter estimation by 39% in average, with the same accuracy as that of the conventional Bayesian approach. In conclusion, the proposed method may be applied to obtain fast and reliable IVIM parameter estimates in challenging scenarios such as prenatal MRI.
Glaucoma is a major cause of irreversible blindness, with significant diagnostic subjectivity. This inherent uncertainty, combined with the overconfidence of models optimized solely for accuracy can lead to fatal issues such as overdiagnosis or missing critical diseases. To ensure clinical trust, model calibration is essential for reliable predictions, yet study in this field remains limited. Existing calibration study have overlooked glaucoma's systemic associations and high diagnostic subjectivity. To overcome these limitations, we propose V-ViT (Voting-based ViT), a framework that enhances calibration by integrating a patient's binocular information and metadata. Furthermore, to mitigate diagnostic subjectivity, V-ViT utilizes an iterative dropout-based Voting System to maximize calibration performance. The proposed framework achieved state-of-the-art performance across all metrics, including the primary calibration metrics. Our results demonstrate that V-ViT effectively resolves the issue of overconfidence in predictions in glaucoma diagnosis, providing highly reliable predictions for clinical use. Our source code is available at this https URL.
Objective: Young children and infants, especially newborns, are highly susceptible to seizures, which, if undetected and untreated, can lead to severe long-term neurological consequences. Early detection typically requires continuous electroencephalography (cEEG) monitoring in hospital settings, involving costly equipment and highly trained specialists. This study presents a low-cost, active dry-contact electrode-based, adjustable electroencephalography (EEG) headset, combined with an explainable deep learning model for seizure detection from reduced-montage EEG, and a multimodal artifact removal algorithm to enhance signal quality. Methods: EEG signals were acquired via active electrodes and processed through a custom-designed analog front end for filtering and digitization. The adjustable headset was fabricated using three-dimensional printing and laser cutting to accommodate varying head sizes. The deep learning model was trained to detect neonatal seizures in real time, and a dedicated multimodal algorithm was implemented for artifact removal while preserving seizure-relevant information. System performance was evaluated in a representative clinical setting on a pediatric patient with absence seizures, with simultaneous recordings obtained from the proposed device and a commercial wet-electrode cEEG system for comparison. Results: Signals from the proposed system exhibited a correlation coefficient exceeding 0.8 with those from the commercial device. Signal-to-noise ratio analysis indicated noise mitigation performance comparable to the commercial system. The deep learning model achieved accuracy and recall improvements of 2.76% and 16.33%, respectively, over state-of-the-art approaches. The artifact removal algorithm effectively identified and eliminated noise while preserving seizure-related EEG features.
The Continuous Boostlet Transform (CBT) is introduced as a powerful tool for analyzing spatiotemporal signals, particularly acoustic wavefields. Overcoming the limitations of classical wavelets, the CBT leverages the Poincaré group and isotropic dilations to capture sparse features of natural acoustic fields. This paper presents the mathematical framework of the CBT, including its definition, fundamental properties, and associated uncertainty principles, such as Heisenberg's, logarithmic, Pitt's, and Nazarov's inequalities. These results illuminate the trade-offs between time and frequency localization in the boostlet domain. Practical examples with constant and exponential functions highlight the CBT's adaptability. With applications in radar, communications, audio processing, and seismic analysis, the CBT offers flexible time-frequency resolution, making it ideal for non-stationary and transient signals, and a valuable tool for modern signal processing.
This paper investigates unmanned aerial vehicle (UAV)-mounted intelligent reflecting surfaces (IRS) to leverage the benefits of this technology for future communication networks, such as 6G. Key advantages include enhanced spectral and energy efficiency, expanded network coverage, and flexible deployment. One of the main challenges in employing UAV-mounted IRS (UMI) technology is the random fluctuations of hovering UAVs. Focusing on this challenge, this paper explores the capabilities of UMI with passive/active elements affected by UAV fluctuations in both horizontal and vertical angles, considering the three-dimensional (3D) radiation pattern of the IRS. The relationship between UAV fluctuations and IRS pattern is investigated by taking into account the random angular vibrations of UAVs. A tractable and closed-form distribution function for the IRS pattern is derived, using linear approximation and by dividing it into several sectors. In addition, closed-form expressions for outage probability (OP) are obtained using central limit theorem (CLT) and Gamma approximation. The theoretical expressions are validated through Monte Carlo simulations. The findings indicate that the random fluctuations of hovering UAVs have a notable impact on the performance of UMI systems. To avoid link interruptions due to UAV instability, IRS should utilize fewer elements, even though this leads to a decrease in directivity. As a result, unlike terrestrial IRS, incorporating more elements into aerial IRS systems does not necessarily improve performance due to the fluctuations in UAV. Numerical results show that the OP can be minimized by selecting the optimal number of IRS elements and using active elements.
Manual annotation of volumetric medical images, such as magnetic resonance imaging (MRI) and computed tomography (CT), is a labor-intensive and time-consuming process. Recent advancements in foundation models for video object segmentation, such as Segment Anything Model 2 (SAM 2), offer a potential opportunity to significantly speed up the annotation process by manually annotating one or a few slices and then propagating target masks across the entire volume. However, the performance of SAM 2 in this context varies. Our experiments show that relying on a single memory bank and attention module is prone to error propagation, particularly at boundary regions where the target is present in the previous slice but absent in the current one. To address this problem, we propose Short-Long Memory SAM 2 (SLM-SAM 2), a novel architecture that integrates distinct short-term and long-term memory banks with separate attention modules to improve segmentation accuracy. We evaluate SLM-SAM 2 on four public datasets covering organs, bones, and muscles across MRI, CT, and ultrasound videos. We show that the proposed method markedly outperforms the default SAM 2, achieving an average Dice Similarity Coefficient improvement of 0.14 and 0.10 in the scenarios when 5 volumes and 1 volume are available for the initial adaptation, respectively. SLM-SAM 2 also exhibits stronger resistance to over-propagation, reducing the time required to correct propagated masks by 60.575% per volume compared to SAM 2, making a notable step toward more accurate automated annotation of medical images for segmentation model development.
Data-driven soft sensors have been widely applied in complex industrial processes. However, the interpretable spatio-temporal features extraction by soft sensors remains a challenge. In this light, this work introduces a novel method termed spatio-temporal consistent and interpretable model (STCIM). First, temporal and spatial features are captured and aligned by a far topological spatio-temporal consistency extraction block. Then, the features are mapped into an interpretable latent space for further prediction by explicitly giving physical meanings to latent variables. The efficacy of the proposed STCIM is demonstrated through the modeling of two generated datasets and a real-life dataset of coal-fired power plants. The corresponding experiments show: 1) The generalization of STCIM outperforms other methods, especially in different operation situations. 2) The far topological spatio-temporal consistency is vital for feature alignment. 3) The hyper-parameters of physics-informed interpretable latent space loss decide the performance of STCIM.
Speech editing systems aim to naturally modify speech content while preserving acoustic consistency and speaker identity. However, previous studies often struggle to adapt to unseen and diverse acoustic conditions, resulting in degraded editing performance in real-world scenarios. To address this, we propose an instance-specific test-time training method for speech editing in the wild. Our approach employs direct supervision from ground-truth acoustic features in unedited regions and indirect supervision in edited regions via auxiliary losses based on duration constraints and phoneme prediction. This strategy mitigates the bandwidth discontinuity problem in speech editing, ensuring smooth acoustic transitions between unedited and edited regions. Additionally, it enables precise control over speech rate by adapting the model to target durations via mask length adjustment during test-time training. Experiments on in-the-wild benchmark datasets demonstrate that our method outperforms existing speech editing systems in both objective and subjective evaluations.
Medical image segmentation is a key task in the imaging workflow, influencing many image-based decisions. Traditional, fully-supervised segmentation models rely on large amounts of labeled training data, typically obtained through manual annotation, which can be an expensive, time-consuming, and error-prone process. This signals a need for accurate, automatic, and annotation-efficient methods of training these models. We propose ADA-SAM (automated, domain-specific, and adaptive segment anything model), a novel multitask learning framework for medical image segmentation that leverages class activation maps from an auxiliary classifier to guide the predictions of the semi-supervised segmentation branch, which is based on the Segment Anything (SAM) framework. Additionally, our ADA-SAM model employs a novel gradient feedback mechanism to create a learnable connection between the segmentation and classification branches by using the segmentation gradients to guide and improve the classification predictions. We validate ADA-SAM on real-world clinical data collected during rehabilitation trials, and demonstrate that our proposed method outperforms both fully-supervised and semi-supervised baselines by double digits in limited label settings. Our code is available at: this https URL.
The CHiME-7 and 8 distant speech recognition (DASR) challenges focus on multi-channel, generalizable, joint automatic speech recognition (ASR) and diarization of conversational speech. With participation from 9 teams submitting 32 diverse systems, these challenges have contributed to state-of-the-art research in the field. This paper outlines the challenges' design, evaluation metrics, datasets, and baseline systems while analyzing key trends from participant submissions. From this analysis it emerges that: 1) Most participants use end-to-end (e2e) ASR systems, whereas hybrid systems were prevalent in previous CHiME challenges. This transition is mainly due to the availability of robust large-scale pre-trained models, which lowers the data burden for e2e-ASR. 2) Despite recent advances in neural speech separation and enhancement (SSE), all teams still heavily rely on guided source separation, suggesting that current neural SSE techniques are still unable to reliably deal with complex scenarios and different recording setups. 3) All best systems employ diarization refinement via target-speaker diarization techniques. Accurate speaker counting in the first diarization pass is thus crucial to avoid compounding errors and CHiME-8 DASR participants especially focused on this part. 4) Downstream evaluation via meeting summarization can correlate weakly with transcription quality due to the remarkable effectiveness of large-language models in handling errors. On the NOTSOFAR-1 scenario, even systems with over 50% time-constrained minimum permutation WER can perform roughly on par with the most effective ones (around 11%). 5) Despite recent progress, accurately transcribing spontaneous speech in challenging acoustic environments remains difficult, even when using computationally intensive system ensembles.
Recently, large language models (LLMs) have driven promising progress in lossless image compression. However, directly adopting existing paradigms for medical images suffers from an unsatisfactory trade-off between compression performance and efficiency. Moreover, existing LLM-based compressors often overlook the security of the compression process, which is critical in modern medical scenarios. To this end, we propose a novel joint lossless compression and steganography framework. Inspired by bit plane slicing (BPS), we find it feasible to securely embed privacy messages into medical images in an invisible manner. Based on this insight, an adaptive modalities decomposition strategy is first devised to partition the entire image into two segments, providing global and local modalities for subsequent dual-path lossless compression. During this dual-path stage, we innovatively propose a segmented message steganography algorithm within the local modality path to ensure the security of the compression process. Coupled with the proposed anatomical priors-based low-rank adaptation (A-LoRA) fine-tuning strategy, extensive experimental results demonstrate the superiority of our proposed method in terms of compression ratios, efficiency, and security. The source code will be made publicly available.
An important limitation of Inverter-Based Resources (IBR) is their reduced contribution to Short-Circuit Current (SCC), as compared to that of Synchronous Generators (SGs). With increasing penetration of IBR in most power systems, the reducing SCC poses challenges to a secure system operation, as line protections may not trip when required. In order to address this issue, the SCC ancillary service could be procured via an economic mechanism, aiming at securing adequate SCC on all buses. However, the suitability of markets for SCC services is not well understood, given that these could be prone to market-power issues: since the SCC contributions from various SGs to a certain bus are determined by the electrical topology of the grid, this is a highly local service. It is necessary to understand if SGs at advantageous electrical locations could exert market power and, if so, how it could be mitigated. In order to fill this gap, this paper adopts an SCC-constrained bilevel model to investigate strategic behaviors of SGs. To address the non-convexity due to unit commitment variables, the model is restructured through a primal-dual formulation. Based on a modified IEEE 30-bus system, cases with strategic SGs placed at different buses are analyzed. These studies demonstrate that agents exerting market power could achieve up to triple revenues from SCC provision, highlighting the need to carefully design these markets.
Human Activity Recognition (HAR) plays a critical role in numerous applications, including healthcare monitoring, fitness tracking, and smart environments. Traditional deep learning (DL) approaches, while effective, often require extensive parameter tuning and may lack interpretability. In this work, we investigate the use of a single three-axis accelerometer and the Kolmogorov--Arnold Network (KAN) for HAR tasks, leveraging its ability to model complex nonlinear relationships with improved interpretability and parameter efficiency. The MotionSense dataset, containing smartphone-based motion sensor signals across various physical activities, is employed to evaluate the proposed approach. Our methodology involves preprocessing and normalization of accelerometer and gyroscope data, followed by KAN-based feature learning and classification. Experimental results demonstrate that the KAN achieves competitive or superior classification performance compared to conventional deep neural networks, while maintaining a significantly reduced parameter count. This highlights the potential of KAN architectures as an efficient and interpretable alternative for real-world HAR systems. The open-source implementation of the proposed framework is available at the Project's GitHub Repository.
Recent advances in pre-trained language models (PLMs) have demonstrated their capabilities in capturing universal knowledge, making them promising applications for radar signal processing. Nevertheless, directly fine-tuning PLMs on radar signals is both computationally expensive and prone to overfitting, particularly in low signal-to-clutter ratio (SCR) environments. In this paper, we propose a novel fine-tuning framework for PLM-based marine radar target detection. First, we design a lightweight adaptation module, enabling parameter-efficient fine-tuning while preserving the pretrained model's general knowledge. Second, a novel preference-aware loss is developed to selectively optimize different feature patches based on their online evaluated learning values, guiding the model to concentrate on the most generalizable feature patterns during optimization. Extensive experiments on real-world marine radar datasets demonstrate that the proposed finetuning framework achieves an average performance improvement of 9.9% over the standard approach under low SCR conditions. Furthermore, the fine-tuned model, RadarPLM, consistently outperforms state-of-the-art detectors, particularly when training data are limited.
Reasoning about spatial audio with large language models requires a spatial audio encoder as an acoustic front-end to obtain audio embeddings for further processing. Such an encoder needs to capture all information required to detect the type of sound events, as well as the direction and distance of their corresponding sources. Accomplishing this with a single audio encoder is demanding as the information required for each of these tasks is mostly independent of each other. As a result, the performance obtained with a single encoder is often worse than when using task-specific audio encoders. In this work, we present DSpAST, a novel audio encoder based on SpatialAST that learns disentangled representations of spatial audio while having only 0.2% additional parameters. Experiments on SpatialSoundQA with the spatial audio reasoning system BAT demonstrate that DSpAST significantly outperforms SpatialAST.
Mutual information (MI) is a promising candidate measure for the assessment and optimization of localization systems, as it captures nonlinear dependencies between random variables. However, the high cost of computing MI, especially for high-dimensional problems, prohibits its application for many real-world localization systems. We evaluate an algorithm from a new class of neural MI estimators called Mutual Information Neural Estimation (MINE) to approximate the MI between the set of feasible user element (UE) locations and the corresponding set of measurements from said UE locations used for positioning. We apply this estimator to a simulated multilateration (MLAT) system, where the true MI for benchmarking can be approximated by Monte Carlo simulation. The estimator is experimentally evaluated w.r.t. its convergence and consistency and we investigate the usefulness of MI for assessing simple MLAT systems.
Passive intermodulation (PIM) has emerged as a critical source of self-interference in modern MIMO-OFDM systems, especially under the stringent requirements of 5G and beyond. Conventional cancellation methods often rely on complex nonlinear models with limited scalability and high computational cost. In this work, we propose a lightweight deep learning framework for PIM cancellation that leverages depthwise separable convolutions and dilated convolutions to efficiently capture nonlinear dependencies across antennas and subcarriers. To further enhance convergence, we adopt a cyclic learning rate schedule and gradient clipping. In a controlled MIMO experimental setup, the method effectively suppresses third-order passive intermodulation (PIM) distortion, achieving up to 29dB of average power error (APE) with only 11k trainable parameters. These results highlight the potential of compact neural architectures for scalable interference mitigation in future wireless communication systems.
As video transmission increasingly serves machine vision systems (MVS) instead of human vision systems (HVS), video coding for machines (VCM) has become a critical research topic. Existing VCM methods often bind codecs to specific downstream models, requiring retraining or supervised data, thus limiting generalization in multi-task scenarios. Recently, unified VCM frameworks have employed visual backbones (VB) and visual foundation models (VFM) to support multiple video understanding tasks with a single codec. They mainly utilize VB/VFM to maintain semantic consistency or suppress non-semantic information, but seldom explore how to directly link video coding with understanding under VB/VFM guidance. Hence, we propose a Symmetric Entropy-Constrained Video Coding framework for Machines (SEC-VCM). It establishes a symmetric alignment between the video codec and VB, allowing the codec to leverage VB's representation capabilities to preserve semantics and discard MVS-irrelevant information. Specifically, a bi-directional entropy-constraint (BiEC) mechanism ensures symmetry between the process of video decoding and VB encoding by suppressing conditional entropy. This helps the codec to explicitly handle semantic information beneficial to MVS while squeezing useless information. Furthermore, a semantic-pixel dual-path fusion (SPDF) module injects pixel-level priors into the final reconstruction. Through semantic-pixel fusion, it suppresses artifacts harmful to MVS and improves machine-oriented reconstruction quality. Experimental results show our framework achieves state-of-the-art~(SOTA) in rate-task performance, with significant bitrate savings over VTM on video instance segmentation (37.4%), video object segmentation (29.8%), object detection (46.2%), and multiple object tracking (44.9%). We will release our code soon.
Accurate and robust surrogate modeling is essential for the real-time control and optimization of large-scale subsurface systems, such as geological CO2 storage and waterflood management. This study investigates the limits of classical Dynamic Mode Decomposition with control (DMDc) and introduces CCKM, as a robust alter-native, in enforcing control in pressure and water saturation reservoir dynamics under challenging prediction scenarios. We introduced a control-coherent incremental ({\Delta}) CCKM formulation, in which the field update is driven by actuator changes rather than rather than actuator levels as in the original level formulation and compared them both against DMDc and a Hybrid B-only surrogate that re-uses DMDcs bottom-B (same-step feed-through), showing that only CCKM remains stable and accurate under regime shifts. Two representative cases are considered: (i) an out-of-distribution shut-in and restart case, and (ii) an in-distribution bottomhole pressure (BHP) drawdown. Results show that only CCKM consistently maintains stability and accuracy across both scenarios, achieving sub-bar mean absolute error and sub-percent Frobenius norm percent change error (FPCE) even under regime shifts, while DMDc exhibit large unphysical errors during control transients. The findings demonstrate that strict control-coherence is critical for reliable surrogate modeling, particularly in settings with abrupt changes in control strategy. The proposed framework is broadly applicable to real-time reservoir optimization and can be integrated seamlessly into existing optimization and monitoring workflows, enabling fast and trustworthy deci-sion support in the presence of both expected and unexpected actuation regimes.
As 6G wireless communication systems evolve toward intelligence and high reconfigurability, the limitations of traditional fixed antenna (TFA) have become increasingly prominent. As a remedy, spatially movable antenna (SMA) and electromagnetically reconfigurable antenna (ERA) have respectively emerged as key technologies to break through this bottleneck. SMA activates spatial degree of freedom (DoF) by dynamically adjusting antenna positions, ERA regulates radiation characteristics using tunable metamaterials, thereby introducing DoF in the electromagnetic domain. However, the ``spatial-electromagnetic dual reconfiguration" paradigm formed by their integration poses severe challenges of high-dimensional hybrid optimization to signal processing. To address this issue, we integrate the spatial optimization of SMA and the electromagnetic reconfiguration of ERA, propose a unified modeling framework termed movable and reconfigurable antenna (MARA) and investigate the channel modeling and spectral efficiency (SE) optimization for MARA. Besides, we systematically review artificial intelligence (AI)-based solutions, focusing on analyzing the advantages of AI over traditional algorithms in solving high-dimensional non-convex optimization problems. This paper fills the gap in existing literature regarding the lack of a comprehensive review on the AI-driven signal processing paradigm under spatial-electromagnetic dual reconfiguration and provides theoretical guidance for the design and optimization of 6G wireless systems with advanced MARA.
Large language models (LLMs) have recently advanced auditory speech recognition (ASR), visual speech recognition (VSR), and audio-visual speech recognition (AVSR). However, understanding of their internal dynamics under fine-tuning remains limited. In natural language processing, recent work has revealed attention sinks, tokens that attract disproportionately high attention, and associated massive activations in which some features of sink tokens exhibit huge activation in LLMs. In this work, we are the first to study these phenomena in multimodal speech recognition. Through a detailed analysis of audio-visual LLMs, we identify attention sinks and massive activations not only at the BOS token but also at intermediate low-semantic tokens across ASR, VSR, and AVSR. We show that massive activations originate in the MLP layers and correspond to fixed feature indices across all sink tokens. We further show that intermediate sink tokens exhibit high cosine similarity to the BOS token, thereby amplifying attention and activation. Building on these insights, we introduce a simple decorrelation loss that reduces cosine similarity between BOS and other tokens, effectively mitigating intermediate sinks and massive activations. Furthermore, our method improves word error rate (WER) under high audio-visual feature downsampling while remaining stable at lower downsampling rates.
The AC Optimal Power Flow (AC-OPF) problem is central to power system operation but challenging to solve efficiently due to its nonconvex and nonlinear nature. Neural networks (NNs) offer fast surrogates, yet their black-box behavior raises concerns about constraint violations that can compromise safety. We propose a verification-informed NN framework that incorporates worst-case constraint violations directly into training, producing models that are both accurate and provably safer. Through post-hoc verification, we achieve substantial reductions in worst-case violations and, for the first time, verify all operational constraints of large-scale AC-OPF proxies. Practical feasibility is further enhanced via restoration and warm-start strategies for infeasible operating points. Experiments on systems ranging from 57 to 793 buses demonstrate scalability, speed, and reliability, bridging the gap between ML acceleration and safe, real-time deployment of AC-OPF solutions - and paving the way toward data-driven optimal control.
Tractography fiber clustering using diffusion MRI (dMRI) is a crucial method for white matter (WM) parcellation to enable analysis of brains structural connectivity in health and disease. Current fiber clustering strategies primarily use the fiber geometric characteristics (i.e., the spatial trajectories) to group similar fibers into clusters, while neglecting the functional and microstructural information of the fiber tracts. There is increasing evidence that neural activity in the WM can be measured using functional MRI (fMRI), providing potentially valuable multimodal information for fiber clustering to enhance its functional coherence. Furthermore, microstructural features such as fractional anisotropy (FA) can be computed from dMRI as additional information to ensure the anatomical coherence of the clusters. In this paper, we develop a novel deep learning fiber clustering framework, namely Deep Multi-view Fiber Clustering (DMVFC), which uses joint multi-modal dMRI and fMRI data to enable functionally consistent WM parcellation. DMVFC can effectively integrate the geometric and microstructural characteristics of the WM fibers with the fMRI BOLD signals along the fiber tracts. DMVFC includes two major components: (1) a multi-view pretraining module to compute embedding features from each source of information separately, including fiber geometry, microstructure measures, and functional signals, and (2) a collaborative fine-tuning module to simultaneously refine the differences of embeddings. In the experiments, we compare DMVFC with two state-of-the-art fiber clustering methods and demonstrate superior performance in achieving functionally meaningful and consistent WM parcellation results.
Efficient truck platooning is a key strategy for reducing freight costs, lowering fuel consumption, and mitigating emissions. Deadlines are critical in this context, as trucks must depart within specific time windows to meet delivery requirements and avoid penalties. In this paper, we investigate the optimal formation and dispatch of truck platoons at a highway station with finite capacity \(L\) and deadline constraints \(T\). The system operates in discrete time, with each arriving truck assigned a deadline of \(T\) slot units. The objective is to leverage the efficiency gains from forming large platoons while accounting for waiting costs and deadline violations. We formulate the problem as a Markov decision process and analyze the structure of the optimal policy \(\pi^\star\) for \(L = 3\), extending insights to arbitrary \(L\). We prove certain monotonicity properties of the optimal policy in the state space \(\mathcal{S}\) and identify classes of unreachable states. Moreover, since the size of \(\mathcal{S}\) grows exponentially with \(L\) and \(T\), we propose heuristics--including conditional and deep-learning based approaches--that exploit these structural insights while maintaining low computational complexity.
According to the Paris Climate Change Agreement, all nations are required to submit reports on their greenhouse gas emissions and absorption every two years by 2024. Consequently, forests play a crucial role in reducing carbon emissions, which is essential for meeting these obligations. Recognizing the significance of forest conservation in the global battle against climate change, Article 5 of the Paris Agreement emphasizes the need for high-quality forest data. This study focuses on enhancing methods for mapping aboveground biomass in tropical dry forests. Tropical dry forests are considered one of the least understood tropical forest environments; therefore, there is a need for accurate approaches to estimate carbon pools. We employ a comparative analysis of AGB estimates, utilizing different discrete and full-waveform laser scanning datasets in conjunction with Ordinary Least Squares and Bayesian approaches SVM. Airborne Laser Scanning, Unmanned Laser Scanning, and Space Laser Scanning were used as independent variables for extracting forest metrics. Variable selection, SVM regression tuning, and cross-validation via a machine-learning approach were applied to account for overfitting and underfitting. The results indicate that six key variables primarily related to tree height: Elevminimum, ElevL3, this http URL, Elevmode, ElevMADmedian, and Elevskewness, are important for AGB estimation using ALSD and ULSD, while Leaf Area Index, canopy coverage and height, terrain elevation, and full-waveform signal energy emerged as the most vital variables. AGB values estimated from ten permanent tropical dry forest plots in Costa Rica Guanacaste province ranged from 26.02 Mg/ha to 175.43 Mg/ha. The SVM regressions demonstrated a 17.89 error across all laser scanning systems, with SLSF W exhibiting the lowest error 17.07 in estimating total biomass per plot.
While large machine learning models have shown remarkable performance in various domains, their training typically requires iterating for many passes over the training data. However, due to computational and memory constraints and potential privacy concerns, storing and accessing all the data is impractical in many real-world scenarios where the data arrives in a stream. In this paper, we investigate the problem of one-pass learning, in which a model is trained on sequentially arriving data without retraining on previous datapoints. Motivated by the demonstrated effectiveness of overparameterized models and the phenomenon of benign overfitting, we propose Orthogonal Recursive Fitting (ORFit), an algorithm for one-pass learning which seeks to perfectly fit each new datapoint while minimally altering the predictions on previous datapoints. ORFit updates the parameters in a direction orthogonal to past gradients, similar to orthogonal gradient descent (OGD) in continual learning. We show that, interestingly, ORFit's update leads to an operation similar to the recursive least-squares (RLS) algorithm in adaptive filtering but with significantly improved memory and computational efficiency, i.e., linear, instead of quadratic, in the number of parameters. To further reduce memory usage, we leverage the structure of the streaming data via an incremental principal component analysis (IPCA). We show that using the principal components is minimax optimal, i.e., it minimizes the worst-case forgetting of previous predictions for unknown future updates. Further, we prove that, for overparameterized linear models, the parameter vector obtained by ORFit matches what the standard multi-pass stochastic gradient descent (SGD) would converge to. Finally, we extend our results to the nonlinear setting for highly overparameterized models, relevant for deep learning.
The Low Earth Orbit (LEO) satellite industry is undergoing rapid expansion, with operators competitively launching satellites due to the first-come, first-served principle governing orbital rights. This has led to the formation of increasingly large-scale, volumetric constellation where satellites operate across a diverse range of altitudes. To address the need for analyzing such complex networks, this paper establishes a new analytical framework for LEO constellations by leveraging a 3D Poisson point process (PPP). Specifically, we introduce a random height model (RHM) that can capture various altitude distributions by applying a random radial displacement to points generated by a homogeneous PPP on a nominal shell. Building on this, we derive an analytical expression for the downlink coverage probability. To motivate our model, we show that the altitude distributions of several leading satellite constellations, including Starlink, align with our model's assumptions. We then demonstrate through Monte Carlo simulations that the coverage probability of our RHM closely matches that of these real-world networks. Finally, we confirm the accuracy of our analytical expressions by showing their agreement with simulation results. Our work thereby provides a powerful tool for understanding and predict how the statistical distribution of satellite altitudes impacts network performance.
The application of artificial intelligence (AI) models in fields such as engineering is limited by the known difficulty of quantifying the reliability of an AI's decision. A well-calibrated AI model must correctly report its accuracy on in-distribution (ID) inputs, while also enabling the detection of out-of-distribution (OOD) inputs. A conventional approach to improve calibration is the application of Bayesian ensembling. However, owing to computational limitations and model misspecification, practical ensembling strategies do not necessarily enhance calibration. This paper proposes an extension of variational inference (VI)-based Bayesian learning that integrates calibration regularization for improved ID performance, confidence minimization for OOD detection, and selective calibration to ensure a synergistic use of calibration regularization and confidence minimization. The scheme is constructed successively by first introducing calibration-regularized Bayesian learning (CBNN), then incorporating out-of-distribution confidence minimization (OCM) to yield CBNN-OCM, and finally integrating also selective calibration to produce selective CBNN-OCM (SCBNN-OCM). Selective calibration rejects inputs for which the calibration performance is expected to be insufficient. Numerical results illustrate the trade-offs between ID accuracy, ID calibration, and OOD calibration attained by both frequentist and Bayesian learning methods. Among the main conclusions, SCBNN-OCM is seen to achieve best ID and OOD performance as compared to existing state-of-the-art approaches at the cost of rejecting a sufficiently large number of inputs.
In mathematics, a super-resolution problem can be formulated as acquiring high-frequency data from low-frequency measurements. This extrapolation problem in the frequency domain is well-known to be unstable. We propose a model-based super-resolution framework (Model-SR) for solving the super-resolution problem and analyzing its stability, aiming to narrow the gap between limited theory and the broad empirical success of super-resolution methods. The key rationale is that, to be determined by its low-frequency components, the target signal must possess a low-dimensional structure. Instead of assuming that the signal itself lies on a low-dimensional manifold in the signal space, we assume that it is generated from a model with a low-dimensional parameter space. This shift of perspective allows us to analyze stability directly through the model parameters. Within this framework, we can recover the signal by solving a nonlinear least square problem and achieve super-resolution by extracting its high-frequency components. Theoretically, the resolution-enhancing map is proven to have Lipschitz continuity, with a constant that depends crucially on parameter separation conditions\. This separation condition can be effectively enforced via sparsity modeling, which requires using the minimal number of parameters to represent the measured signal, thereby highlighting the role of sparsity in the stability of super-resolution. Moreover, the Lipschitz constant grows with the high-frequency cutoff, ultimately rendering extrapolation ineffective beyond a certain threshold. We apply the general theory to three concrete models and give the stability estimates for each model. Numerical experiments are conducted to show the super-resolution behavior of the proposed framework. The model-based mathematical framework can be extended to problems with similar structures.
Deep neural networks are increasingly used as an effective parameterization of control policies in various learning-based control paradigms. For continuous-time optimal control problems (OCPs), which are central to many decision-making tasks, control policy learning can be cast as a neural ordinary differential equation (NODE) problem wherein state and control constraints are naturally accommodated. This paper presents a NODE approach to solving continuous-time OCPs for the case of stabilizing a known constrained nonlinear system around a target state. The approach, termed Lyapunov-NODE control (L-NODEC), uses a novel Lyapunov loss formulation that incorporates an exponentially-stabilizing control Lyapunov function to learn a state-feedback neural control policy, bridging the gap of solving continuous-time OCPs via NODEs with stability guarantees. The proposed Lyapunov loss allows L-NODEC to guarantee exponential stability of the controlled system, as well as its adversarial robustness to perturbations to the initial state. The performance of L-NODEC is illustrated in two problems, including a dose delivery problem in plasma medicine. In both cases, L-NODEC effectively stabilizes the controlled system around the target state despite perturbations to the initial state and reduces the inference time necessary to reach the target.
Parallel to rapid advancements in foundation model research, the past few years have witnessed a surge in music AI applications. As AI-generated and AI-augmented music become increasingly mainstream, many researchers in the music AI community may wonder: what research frontiers remain unexplored? This paper outlines several key areas within music AI research that present significant opportunities for further investigation. We begin by examining foundational representation models and highlight emerging efforts toward explainability and interpretability. We then discuss the evolution toward multimodal systems, provide an overview of the current landscape of music datasets and their limitations, and address the growing importance of model efficiency in both training and deployment. Next, we explore applied directions, focusing first on generative models. We review recent systems, their computational constraints, and persistent challenges related to evaluation and controllability. We then examine extensions of these generative approaches to multimodal settings and their integration into artists' workflows, including applications in music editing, captioning, production, transcription, source separation, performance, discovery, and education. Finally, we explore copyright implications of generative music and propose strategies to safeguard artist rights. While not exhaustive, this survey aims to illuminate promising research directions enabled by recent developments in music foundation models.
Bioacoustics data from Passive acoustic monitoring (PAM) poses a unique set of challenges for classification, particularly the limited availability of complete and reliable labels in datasets due to annotation uncertainty, biological complexity due the heterogeneity in duration of cetacean vocalizations, and masking of target sounds due to environmental and anthropogenic noise. This means that data is often weakly labelled, with annotations indicating presence/absence of species over several minutes. In order to effectively capture the complex temporal patterns and key features of lengthy continuous audio segments, we propose an interdisciplinary framework comprising dataset standardisation, feature extraction via Variational Autoencoders (VAE) and classification via Temporal Convolutional Networks (TCN). This approach eliminates the necessity for manual threshold setting or time-consuming strong labelling. To demonstrate the effectiveness of our approach, we use sperm whale (<i>Physeter macrocephalus</i>) click trains in 4-minute recordings as a case study, from a dataset comprising diverse sources and deployment conditions to maximise generalisability. The value of feature extraction via the VAE is demonstrated by comparing classification performance against the traditional and explainable approach of expert handpicking of features. The TCN demonstrated robust classification capabilities achieving AUC scores exceeding 0.9.
Current approaches to dichotomous image segmentation (DIS) treat image matting and object segmentation as fundamentally different tasks. As improvements in image segmentation become increasingly challenging to achieve, combining image matting and grayscale segmentation techniques offers promising new directions for architectural innovation. Inspired by the possibility of aligning these two model tasks, we propose a new architectural approach for DIS called Confidence-Guided Matting (CGM). We created the first CGM model called Background Erase Network (BEN). BEN consists of two components: BEN Base for initial segmentation and BEN Refiner for confidence-based refinement. Our approach achieves substantial improvements over current state-of-the-art methods on the DIS5K validation dataset, demonstrating that matting-based refinement can significantly enhance segmentation quality. This work introduces a new paradigm for integrating matting and segmentation techniques, improving fine-grained object boundary prediction in computer vision.
We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Based on a masked generative model, AnyEnhance is capable of handling both speech and singing voices, supporting a wide range of enhancement tasks including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. In this way, it could boost enhancement performance when a reference audio is available and enable the target speaker extraction task without altering the underlying architecture. Moreover, we also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate AnyEnhance outperforms existing methods in terms of both objective metrics and subjective listening tests. Demo audios are publicly available at this https URL. An open-source implementation is provided at this https URL.
A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at this https URL.
Implicit neural representations (INR) have gained prominence for efficiently encoding multimedia data, yet their applications in audio signals remain limited. This study introduces the Kolmogorov-Arnold Network (KAN), a novel architecture using learnable activation functions, as an effective INR model for audio representation. KAN demonstrates superior perceptual performance over previous INRs, achieving the lowest Log-SpectralDistance of 1.29 and the highest Perceptual Evaluation of Speech Quality of 3.57 for 1.5 s audio. To extend KAN's utility, we propose FewSound, a hypernetwork-based architecture that enhances INR parameter updates. FewSound outperforms the state-of-the-art HyperSound, with a 33.3% improvement in MSE and 60.87% in SI-SNR. These results show KAN as a robust and adaptable audio representation with the potential for scalability and integration into various hypernetwork frameworks. The source code can be accessed at this https URL.
Multilingual speech translation (ST) and machine translation (MT) in the medical domain enhances patient care by enabling efficient communication across language barriers, alleviating specialized workforce shortages, and facilitating improved diagnosis and treatment, particularly during pandemics. In this work, we present the first systematic study on medical ST, to our best knowledge, by releasing MultiMed-ST, a large-scale ST dataset for the medical domain, spanning all translation directions in five languages: Vietnamese, English, German, French, and Simplified/Traditional Chinese, together with the models. With 290,000 samples, this is the largest medical MT dataset and the largest many-to-many multilingual ST among all domains. Secondly, we present the most comprehensive ST analysis in the field's history, to our best knowledge, including: empirical baselines, bilingual-multilingual comparative study, end-to-end vs. cascaded comparative study, task-specific vs. multi-task sequence-to-sequence comparative study, code-switch analysis, and quantitative-qualitative error analysis. All code, data, and models are available online: this https URL
This paper presents a control strategy based on dual quaternions for the coordinated formation flying of small UAV groups. A virtual structure is employed to define the desired formation, enabling unified control of its position, orientation, and shape. This abstraction makes formation management easier by allowing a low-level controller to compute individual UAV commands efficiently. The proposed controller integrates a pose control module with a geometry-based adaptive strategy, ensuring precise and robust task execution. The effectiveness of the approach is demonstrated through both simulation and experimental results.
Oversight and control, which we collectively call supervision, are often discussed as ways to ensure that AI systems are accountable, reliable, and able to fulfill governance and management requirements. However, the requirements for "human oversight" risk codifying vague or inconsistent interpretations of key concepts like oversight and control. This ambiguous terminology could undermine efforts to design or evaluate systems that must operate under meaningful human supervision. This matters because the term is used by regulatory texts such as the EU AI Act. This paper undertakes a targeted critical review of literature on supervision outside of AI, along with a brief summary of past work on the topic related to AI. We next differentiate control as ex-ante or real-time and operational rather than policy or governance, and oversight as performed ex-post, or a policy and governance function. Control aims to prevent failures, while oversight focuses on detection, remediation, or incentives for future prevention. Building on this, we make three contributions. 1) We propose a framework to align regulatory expectations with what is technically and organizationally plausible, articulating the conditions under which each mechanism is possible, where they fall short, and what is required to make them meaningful in practice. 2) We outline how supervision methods should be documented and integrated into risk management, and drawing on the Microsoft Responsible AI Maturity Model, we outline a maturity model for AI supervision. 3) We explicitly highlight boundaries of these mechanisms, including where they apply, where they fail, and where it is clear that no existing methods suffice. This foregrounds the question of whether meaningful supervision is possible in a given deployment context, and can support regulators, auditors, and practitioners in identifying both present and future limitations.
Previous work has suggested that preprocessing images through lossy compression can defend against adversarial perturbations, but comprehensive attack evaluations have been lacking. In this paper, we construct strong white-box and adaptive attacks against various compression models and identify a critical challenge for attackers: high realism in reconstructed images significantly increases attack difficulty. Through rigorous evaluation across multiple attack scenarios, we demonstrate that compression models capable of producing realistic, high-fidelity reconstructions are substantially more resistant to our attacks. In contrast, low-realism compression models can be broken. Our analysis reveals that this is not due to gradient masking. Rather, realistic reconstructions maintaining distributional alignment with natural images seem to offer inherent robustness. This work highlights a significant obstacle for future adversarial attacks and suggests that developing more effective techniques to overcome realism represents an essential challenge for comprehensive security evaluation.
Deciphering how visual stimuli are transformed into cortical responses is a fundamental challenge in computational neuroscience. This visual-to-neural mapping is inherently a one-to-many relationship, as identical visual inputs reliably evoke variable hemodynamic responses across trials, contexts, and subjects. However, existing deterministic methods struggle to simultaneously model this biological variability while capturing the underlying functional consistency that encodes stimulus information. To address these limitations, we propose SynBrain, a generative framework that simulates the transformation from visual semantics to neural responses in a probabilistic and biologically interpretable manner. SynBrain introduces two key components: (i) BrainVAE models neural representations as continuous probability distributions via probabilistic learning while maintaining functional consistency through visual semantic constraints; (ii) A Semantic-to-Neural Mapper acts as a semantic transmission pathway, projecting visual semantics into the neural response manifold to facilitate high-fidelity fMRI synthesis. Experimental results demonstrate that SynBrain surpasses state-of-the-art methods in subject-specific visual-to-fMRI encoding performance. Furthermore, SynBrain adapts efficiently to new subjects with few-shot data and synthesizes high-quality fMRI signals that are effective in improving data-limited fMRI-to-image decoding performance. Beyond that, SynBrain reveals functional consistency across trials and subjects, with synthesized signals capturing interpretable patterns shaped by biological neural variability. Our code is available at this https URL.
We derive closed-form extensions of Riccati's recursions (both sequential and parallel) for solving dual-regularized LQR problems. We show how these methods can be used to solve general constrained, non-convex, discrete-time optimal control problems via a regularized interior point method, while guaranteeing that each primal step is a descent direction of an Augmented Barrier-Lagrangian merit function. We provide MIT-licensed implementations of our methods in C++ and JAX.
The increasing size and complexity of medical imaging datasets, particularly in 3D formats, present significant barriers to collaborative research and transferability. This study investigates whether the ZFP compression technique can mitigate these challenges without compromising the performance of automated cerebrovascular segmentation, a critical first step in intracranial aneurysm detection. We apply ZFP in both its error tolerance and fixed-rate modes to a large scale, and one of the most recent, datasets in the literature, 3D medical dataset containing ground-truth vascular segmentations. The segmentation quality on the compressed volumes is rigorously compared to the uncompressed baseline (Dice approximately equals 0.8774). Our findings reveal that ZFP can achieve substantial data reduction--up to a 22.89:1 ratio in error tolerance mode--while maintaining a high degree of fidelity, with the mean Dice coefficient remaining high at 0.87656. These results demonstrate that ZFP is a viable and powerful tool for enabling more efficient and accessible research on large-scale medical datasets, fostering broader collaboration across the community.